diff --git a/CHANGELOG.md b/CHANGELOG.md
index 1c92f9f9..66139b50 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,134 @@
# TensorRT OSS Release Changelog
+## 10.0.0 EA - 2024-04-02
+
+Key Features and Updates:
+
+ - Samples changes
+ - Added a [sample](samples/python/sample_weight_stripping) showcasing weight-stripped engines.
+ - Added a [sample](samples/python/python_plugin/circ_pad_plugin_multi_tactic.py) demonstrating the use of custom tactics with IPluginV3.
+ - Added a [sample](samples/sampleNonZeroPlugin) to showcase plugins with data-dependent output shapes, using IPluginV3.
+ - Parser changes
+ - Added a new class `IParserRefitter` that can be used to refit a TensorRT engine with the weights of an ONNX model (see the usage sketch below).
+ - `kNATIVE_INSTANCENORM` is now set to ON by default.
+ - Added support for `IPluginV3` interfaces from TensorRT.
+ - Added support for `INT4` quantization.
+ - Added support for the `reduction` attribute in `ScatterElements`.
+ - Added support for `wrap` padding mode in `Pad`.
+ - Plugin changes
+ - A [new plugin](plugin/scatterElementsPlugin) has been added in compliance with [ONNX ScatterElements](https://github.com/onnx/onnx/blob/main/docs/Operators.md#ScatterElements).
+ - The TensorRT plugin library no longer has a load-time link dependency on cuBLAS or cuDNN libraries.
+ - All plugins which relied on cuBLAS/cuDNN handles passed through `IPluginV2Ext::attachToContext()` have moved to use cuBLAS/cuDNN resources initialized by the plugin library itself. This works by dynamically loading the required cuBLAS/cuDNN library. Additionally, plugins which independently initialized their cuBLAS/cuDNN resources have also moved to dynamically loading the required library. If the respective library is not discoverable through the library path(s), these plugins will not work.
+ - bertQKVToContextPlugin: Version 2 of this plugin now supports head sizes less than or equal to 32.
+ - reorgPlugin: Added a version 2 which implements IPluginV2DynamicExt.
+ - disentangledAttentionPlugin: Fixed a kernel bug.
+ - Demo changes
+ - HuggingFace demos have been removed. Users accelerating Large Language Model inference with TensorRT should use [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/).
+ - Updated tooling
+ - Polygraphy v0.49.9
+ - ONNX-GraphSurgeon v0.5.1
+ - TensorRT Engine Explorer v0.1.8
+ - Build Containers
+ - RedHat/CentOS 7.x are no longer officially supported starting with TensorRT 10.0. The corresponding container has been removed from TensorRT-OSS.
+
+## 9.3.0 GA - 2024-02-09
+
+Key Features and Updates:
+
+ - Demo changes
+ - Faster text-to-image using SDXL with INT8 quantization via AMMO
+ - Updated tooling
+ - Polygraphy v0.49.7
+
+## 9.2.0 GA - 2023-11-27
+
+Key Features and Updates:
+
+ - `trtexec` enhancement: Added `--weightless` flag to mark the engine as weightless.
+ - Parser changes
+ - Added support for the Hardmax operator.
+ - Changes to a few operator importers to ensure that TensorRT preserves the precision of operations when using strongly typed mode.
+ - Plugin changes
+ - Explicit INT8 support added to `bertQKVToContextPlugin`.
+ - Various bug fixes.
+ - Updated HuggingFace demo to use transformers v4.31.0 and PyTorch v2.1.0.
+
+
+## 9.1.0 GA - 2023-10-18
+
+Key Features and Updates:
+
+ - Updated the [trt_python_plugin](samples/python/python_plugin) sample.
+ - The Python plugin API reference is part of the official TRT Python API.
+ - Added samples demonstrating the usage of the progress monitor API; a minimal sketch of the same pattern appears below.
+ - Check [sampleProgressMonitor](samples/sampleProgressMonitor) for the C++ sample.
+ - Check [simple_progress_monitor](samples/python/simple_progress_monitor) for the Python sample.
+ - Removed dependencies on Python < 3.8 in the Python samples, as Python versions below 3.8 are no longer supported.
+ - Demo changes
+ - Added LAMBADA dataset accuracy checks in the [HuggingFace](demo/HuggingFace) demo.
+ - Enabled structured sparsity and FP8 quantized batch matrix multiplications (BMMs) in attention in the [NeMo](demo/NeMo) demo.
+ - Replaced deprecated APIs in the [BERT](demo/BERT) demo.
+ - Updated tooling
+ - Polygraphy v0.49.1
+
+
+## 9.0.1 GA - 2023-09-07
+
+Key Features and Updates:
+
+ - TensorRT plugin authoring in Python is now supported.
+ - See the [trt_python_plugin](samples/python/python_plugin) sample for reference.
+ - Updated default CUDA version to 12.2
+ - Added support for BLIP models and the Seq2Seq and Vision2Seq abstractions in the HuggingFace demo.
+ - demoDiffusion refactoring and SDXL enhancements
+ - Additional validation asserts for NV Plugins
+ - Updated tooling
+ - TensorRT Engine Explorer v0.1.7: graph rendering for TensorRT 9.0 `kgen` kernels
+ - ONNX-GraphSurgeon v0.3.29
+ - PyTorch quantization toolkit v2.2.0
+
+
+## 9.0.0 EA - 2023-08-06
+
+Key Features and Updates:
+
+ - Added the NeMo demo to demonstrate the performance benefit of using E4M3 FP8 data type with the GPT models trained with the [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) and [TransformerEngine](https://github.com/NVIDIA/TransformerEngine).
+ - Demo Diffusion updates
+ - Added SDXL 1.0 txt2img pipeline
+ - Added ControlNet pipeline
+ - HuggingFace demo updates
+ - Added Flan-T5, OPT, BLOOM, BLOOMZ, GPT-Neo, GPT-NeoX, Cerebras-GPT support with accuracy check
+ - Refactored code and extracted common utils into Seq2Seq class
+ - Reduced shape-changing overhead, achieving a >30% end-to-end performance gain
+ - Added stable KV-cache, beam search and FP16 support for all models
+ - Added dynamic batch size TRT inference
+ - Added uneven-length multi-batch inference with attention_mask support
+ - Added `chat` command – interactive CLI
+ - Upgraded PyTorch and HuggingFace versions to support Hopper GPUs
+ - Updated notebooks with a much-simplified demo API.
+
+ - Added two new TensorRT samples: sampleProgressMonitor (C++) and simple_progress_reporter (Python), which demonstrate use of the Progress Monitor API during engine build.
+ - The following plugins were deprecated:
+ - ``BatchedNMS_TRT``
+ - ``BatchedNMSDynamic_TRT``
+ - ``BatchTilePlugin_TRT``
+ - ``Clip_TRT``
+ - ``CoordConvAC``
+ - ``CropAndResize``
+ - ``EfficientNMS_ONNX_TRT``
+ - ``CustomGeluPluginDynamic``
+ - ``LReLU_TRT``
+ - ``NMSDynamic_TRT``
+ - ``NMS_TRT``
+ - ``Normalize_TRT``
+ - ``Proposal``
+ - ``SingleStepLSTMPlugin``
+ - ``SpecialSlice_TRT``
+ - ``Split``
+
+ - Ubuntu 18.04 has reached end of life and is no longer supported by TensorRT starting with 9.0, and the corresponding Dockerfile(s) have been removed.
+ - Support for aarch64 builds will not be available in this release, and the corresponding Dockerfiles have been removed.
+
## [8.6.1 GA](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/#rel-8-6-1) - 2023-05-02
TensorRT OSS release corresponding to TensorRT 8.6.1.6 GA release.
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 66f4201b..5d29b78e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -22,21 +22,41 @@ include(cmake/modules/find_library_create_target.cmake)
set_ifndef(TRT_LIB_DIR ${CMAKE_BINARY_DIR})
set_ifndef(TRT_OUT_DIR ${CMAKE_BINARY_DIR})
+# Converts Windows paths
+if(CMAKE_VERSION VERSION_LESS 3.20)
+ file(TO_CMAKE_PATH "${TRT_LIB_DIR}" TRT_LIB_DIR)
+ file(TO_CMAKE_PATH "${TRT_OUT_DIR}" TRT_OUT_DIR)
+else()
+ cmake_path(SET TRT_LIB_DIR ${TRT_LIB_DIR})
+ cmake_path(SET TRT_OUT_DIR ${TRT_OUT_DIR})
+endif()
+
+# Required to export symbols to build *.libs
+if(WIN32)
+ add_compile_definitions(TENSORRT_BUILD_LIB=1)
+endif()
+
+# Set output paths
+set(RUNTIME_OUTPUT_DIRECTORY ${TRT_OUT_DIR} CACHE PATH "Output directory for runtime target files")
+set(LIBRARY_OUTPUT_DIRECTORY ${TRT_OUT_DIR} CACHE PATH "Output directory for library target files")
+set(ARCHIVE_OUTPUT_DIRECTORY ${TRT_OUT_DIR} CACHE PATH "Output directory for archive target files")
+
+if(WIN32)
+ set(STATIC_LIB_EXT "lib")
+else()
+ set(STATIC_LIB_EXT "a")
+endif()
+
file(STRINGS "${CMAKE_CURRENT_SOURCE_DIR}/include/NvInferVersion.h" VERSION_STRINGS REGEX "#define NV_TENSORRT_.*")
foreach(TYPE MAJOR MINOR PATCH BUILD)
- string(REGEX MATCH "NV_TENSORRT_${TYPE} [0-9]" TRT_TYPE_STRING ${VERSION_STRINGS})
- string(REGEX MATCH "[0-9]" TRT_${TYPE} ${TRT_TYPE_STRING})
-endforeach(TYPE)
-
-foreach(TYPE MAJOR MINOR PATCH)
- string(REGEX MATCH "NV_TENSORRT_SONAME_${TYPE} [0-9]" TRT_TYPE_STRING ${VERSION_STRINGS})
- string(REGEX MATCH "[0-9]" TRT_SO_${TYPE} ${TRT_TYPE_STRING})
+ string(REGEX MATCH "NV_TENSORRT_${TYPE} [0-9]+" TRT_TYPE_STRING ${VERSION_STRINGS})
+ string(REGEX MATCH "[0-9]+" TRT_${TYPE} ${TRT_TYPE_STRING})
endforeach(TYPE)
set(TRT_VERSION "${TRT_MAJOR}.${TRT_MINOR}.${TRT_PATCH}" CACHE STRING "TensorRT project version")
set(ONNX2TRT_VERSION "${TRT_MAJOR}.${TRT_MINOR}.${TRT_PATCH}" CACHE STRING "ONNX2TRT project version")
-set(TRT_SOVERSION "${TRT_SO_MAJOR}" CACHE STRING "TensorRT library so version")
+set(TRT_SOVERSION "${TRT_MAJOR}" CACHE STRING "TensorRT library so version")
message("Building for TensorRT version: ${TRT_VERSION}, library version: ${TRT_SOVERSION}")
if(NOT DEFINED CMAKE_TOOLCHAIN_FILE)
@@ -88,8 +108,8 @@ endif()
############################################################################################
# Dependencies
-set(DEFAULT_CUDA_VERSION 12.0.1)
-set(DEFAULT_CUDNN_VERSION 8.8)
+set(DEFAULT_CUDA_VERSION 12.2.0)
+set(DEFAULT_CUDNN_VERSION 8.9)
set(DEFAULT_PROTOBUF_VERSION 3.20.1)
# Dependency Version Resolution
@@ -118,20 +138,12 @@ endif()
include_directories(
${CUDA_INCLUDE_DIRS}
- ${CUDNN_ROOT_DIR}/include
)
-find_library(CUDNN_LIB cudnn HINTS
- ${CUDA_TOOLKIT_ROOT_DIR} ${CUDNN_ROOT_DIR} PATH_SUFFIXES lib64 lib/x64 lib)
-find_library(CUBLAS_LIB cublas HINTS
- ${CUDA_TOOLKIT_ROOT_DIR} PATH_SUFFIXES lib64 lib lib/x64 lib/stubs)
-find_library(CUBLASLT_LIB cublasLt HINTS
- ${CUDA_TOOLKIT_ROOT_DIR} PATH_SUFFIXES lib64 lib lib/x64 lib/stubs)
if(BUILD_PARSERS)
configure_protobuf(${PROTOBUF_VERSION})
endif()
find_library_create_target(nvinfer nvinfer SHARED ${TRT_LIB_DIR})
-find_library_create_target(nvuffparser nvparsers SHARED ${TRT_LIB_DIR})
find_library(CUDART_LIB cudart_static HINTS ${CUDA_TOOLKIT_ROOT_DIR} PATH_SUFFIXES lib lib/x64 lib64)
@@ -149,18 +161,11 @@ if (DEFINED GPU_ARCHS)
separate_arguments(GPU_ARCHS)
else()
list(APPEND GPU_ARCHS
- 53
- 60
- 61
70
75
)
string(REGEX MATCH "aarch64" IS_ARM "${TRT_PLATFORM_ID}")
- if (IS_ARM)
- # Xavier (SM72) only supported for aarch64.
- list(APPEND GPU_ARCHS 72)
- endif()
if (CUDA_VERSION VERSION_GREATER_EQUAL 11.0)
# Ampere GPU (SM80) support is only available in CUDA versions > 11.0
@@ -189,10 +194,10 @@ if (${LATEST_SM} GREATER_EQUAL 70)
endif()
if(NOT MSVC)
- set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler -Wno-deprecated-declarations")
+ set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr -Xcompiler -Wno-deprecated-declarations")
else()
set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)
- set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler")
+ set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --expt-relaxed-constexpr -Xcompiler")
endif()
############################################################################################
@@ -207,7 +212,6 @@ endif()
if(BUILD_PARSERS)
add_subdirectory(parsers)
else()
- find_library_create_target(nvcaffeparser nvparsers SHARED ${TRT_OUT_DIR} ${TRT_LIB_DIR})
find_library_create_target(nvonnxparser nvonnxparser SHARED ${TRT_OUT_DIR} ${TRT_LIB_DIR})
endif()
diff --git a/README.md b/README.md
index d31f2c4c..28a3edba 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Documentation](https://img.shields.io/badge/TensorRT-documentation-brightgreen.svg)](https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html)
# TensorRT Open Source Software
-This repository contains the Open Source Software (OSS) components of NVIDIA TensorRT. It includes the sources for TensorRT plugins and parsers (Caffe and ONNX), as well as sample applications demonstrating usage and capabilities of the TensorRT platform. These open source software components are a subset of the TensorRT General Availability (GA) release with some extensions and bug-fixes.
+This repository contains the Open Source Software (OSS) components of NVIDIA TensorRT. It includes the sources for TensorRT plugins and ONNX parser, as well as sample applications demonstrating usage and capabilities of the TensorRT platform. These open source software components are a subset of the TensorRT General Availability (GA) release with some extensions and bug-fixes.
* For code contributions to TensorRT-OSS, please see our [Contribution Guide](CONTRIBUTING.md) and [Coding Guidelines](CODING-GUIDELINES.md).
* For a summary of new additions and updates shipped with TensorRT-OSS releases, please refer to the [Changelog](CHANGELOG.md).
@@ -26,16 +26,17 @@ You can skip the **Build** section to enjoy TensorRT with Python.
To build the TensorRT-OSS components, you will first need the following software packages.
**TensorRT GA build**
-* [TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) v8.6.1.6
+* TensorRT v10.0.0.6
+ * Available from direct download links listed below
**System Packages**
* [CUDA](https://developer.nvidia.com/cuda-toolkit)
* Recommended versions:
- * cuda-12.0.1 + cuDNN-8.8
- * cuda-11.8.0 + cuDNN-8.8
+ * cuda-12.2.0 + cuDNN-8.9
+ * cuda-11.8.0 + cuDNN-8.9
* [GNU make](https://ftp.gnu.org/gnu/make/) >= v4.1
* [cmake](https://github.com/Kitware/CMake/releases) >= v3.13
-* [python]() >= v3.6.9, <= v3.10.x
+* [python]() >= v3.8, <= v3.10.x
* [pip](https://pypi.org/project/pip/#history) >= v19.0
* Essential utilities
* [git](https://git-scm.com/downloads), [pkg-config](https://www.freedesktop.org/wiki/Software/pkg-config/), [wget](https://www.gnu.org/software/wget/faq.html#download)
@@ -44,9 +45,6 @@ To build the TensorRT-OSS components, you will first need the following software
* Containerized build
* [Docker](https://docs.docker.com/install/) >= 19.03
* [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker)
-* Toolchains and SDKs
- * (Cross compilation for Jetson platform) [NVIDIA JetPack](https://developer.nvidia.com/embedded/jetpack) >= 5.0 (current support only for TensorRT 8.4.0 and TensorRT 8.5.2)
- * (Cross compilation for QNX platform) [QNX Toolchain](https://blackberry.qnx.com/en)
* PyPI packages (for demo applications/tests)
* [onnx](https://pypi.org/project/onnx/)
* [onnxruntime](https://pypi.org/project/onnxruntime/)
@@ -74,24 +72,19 @@ To build the TensorRT-OSS components, you will first need the following software
If using the TensorRT OSS build container, TensorRT libraries are preinstalled under `/usr/lib/x86_64-linux-gnu` and you may skip this step.
- Else download and extract the TensorRT GA build from [NVIDIA Developer Zone](https://developer.nvidia.com/nvidia-tensorrt-download).
+ Else download and extract the TensorRT GA build from [NVIDIA Developer Zone](https://developer.nvidia.com) with the direct links below:
+ - [TensorRT 10.0.0.6 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/tars/TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-11.8.tar.gz)
+ - [TensorRT 10.0.0.6 for CUDA 12.4, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/tars/TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-12.4.tar.gz)
- **Example: Ubuntu 20.04 on x86-64 with cuda-12.0**
+
+ **Example: Ubuntu 20.04 on x86-64 with cuda-12.4**
```bash
cd ~/Downloads
- tar -xvzf TensorRT-8.6.1.6.Linux.x86_64-gnu.cuda-12.0.tar.gz
- export TRT_LIBPATH=`pwd`/TensorRT-8.6.1.6
+ tar -xvzf TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-12.4.tar.gz
+ export TRT_LIBPATH=`pwd`/TensorRT-10.0.0.6
```
-
-3. #### (Optional - for Jetson builds only) Download the JetPack SDK
- 1. Download and launch the JetPack SDK manager. Login with your NVIDIA developer account.
- 2. Select the platform and target OS (example: Jetson AGX Xavier, `Linux Jetpack 5.0`), and click Continue.
- 3. Under `Download & Install Options` change the download folder and select `Download now, Install later`. Agree to the license terms and click Continue.
- 4. Move the extracted files into the `/docker/jetpack_files` folder.
-
-
## Setting Up The Build Environment
For Linux platforms, we recommend that you generate a docker container for building TensorRT OSS as described below. For native builds, please install the [prerequisite](#prerequisites) *System Packages*.
@@ -99,27 +92,16 @@ For Linux platforms, we recommend that you generate a docker container for build
1. #### Generate the TensorRT-OSS build container.
The TensorRT-OSS build container can be generated using the supplied Dockerfiles and build scripts. The build containers are configured for building TensorRT OSS out-of-the-box.
- **Example: Ubuntu 20.04 on x86-64 with cuda-12.0 (default)**
- ```bash
- ./docker/build.sh --file docker/ubuntu-20.04.Dockerfile --tag tensorrt-ubuntu20.04-cuda12.0
- ```
- **Example: CentOS/RedHat 7 on x86-64 with cuda-11.8**
- ```bash
- ./docker/build.sh --file docker/centos-7.Dockerfile --tag tensorrt-centos7-cuda11.8 --cuda 11.8.0
- ```
- **Example: Ubuntu 20.04 cross-compile for Jetson (aarch64) with cuda-11.4.2 (JetPack SDK)**
- ```bash
- ./docker/build.sh --file docker/ubuntu-cross-aarch64.Dockerfile --tag tensorrt-jetpack-cuda11.4
- ```
- **Example: Ubuntu 20.04 on aarch64 with cuda-11.8**
+ **Example: Ubuntu 20.04 on x86-64 with cuda-12.3.2 (default)**
```bash
- ./docker/build.sh --file docker/ubuntu-20.04-aarch64.Dockerfile --tag tensorrt-aarch64-ubuntu20.04-cuda11.8 --cuda 11.8.0
+ ./docker/build.sh --file docker/ubuntu-20.04.Dockerfile --tag tensorrt-ubuntu20.04-cuda12.3.2
```
+
2. #### Launch the TensorRT-OSS build container.
**Example: Ubuntu 20.04 build container**
```bash
- ./docker/launch.sh --tag tensorrt-ubuntu20.04-cuda12.0 --gpus all
+ ./docker/launch.sh --tag tensorrt-ubuntu20.04-cuda12.3.2 --gpus all
```
> NOTE:
1. Use the `--tag` corresponding to build container generated in Step 1.
@@ -130,7 +112,7 @@ For Linux platforms, we recommend that you generate a docker container for build
## Building TensorRT-OSS
* Generate Makefiles and build.
- **Example: Linux (x86-64) build with default cuda-12.0**
+ **Example: Linux (x86-64) build with default cuda-12.3.2**
```bash
cd $TRT_OSSPATH
mkdir -p build && cd build
@@ -138,44 +120,8 @@ For Linux platforms, we recommend that you generate a docker container for build
make -j$(nproc)
```
- > NOTE: On CentOS7, the default g++ version does not support C++14. For native builds (not using the CentOS7 build container), first install devtoolset-8 to obtain the updated g++ toolchain as follows:
- ```bash
- yum -y install centos-release-scl
- yum-config-manager --enable rhel-server-rhscl-7-rpms
- yum -y install devtoolset-8
- export PATH="/opt/rh/devtoolset-8/root/bin:${PATH}"
- ```
-
- **Example: Linux (aarch64) build with default cuda-12.0**
- ```bash
- cd $TRT_OSSPATH
- mkdir -p build && cd build
- cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64-native.toolchain
- make -j$(nproc)
- ```
-
- **Example: Native build on Jetson (aarch64) with cuda-11.4**
- ```bash
- cd $TRT_OSSPATH
- mkdir -p build && cd build
- cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DTRT_PLATFORM_ID=aarch64 -DCUDA_VERSION=11.4
- CC=/usr/bin/gcc make -j$(nproc)
- ```
- > NOTE: C compiler must be explicitly specified via `CC=` for native `aarch64` builds of protobuf.
-
- **Example: Ubuntu 20.04 Cross-Compile for Jetson (aarch64) with cuda-11.4 (JetPack)**
- ```bash
- cd $TRT_OSSPATH
- mkdir -p build && cd build
- cmake .. -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64.toolchain -DCUDA_VERSION=11.4 -DCUDNN_LIB=/pdk_files/cudnn/usr/lib/aarch64-linux-gnu/libcudnn.so -DCUBLAS_LIB=/usr/local/cuda-11.4/targets/aarch64-linux/lib/stubs/libcublas.so -DCUBLASLT_LIB=/usr/local/cuda-11.4/targets/aarch64-linux/lib/stubs/libcublasLt.so -DTRT_LIB_DIR=/pdk_files/tensorrt/lib
-
- make -j$(nproc)
- ```
- > NOTE: The latest JetPack SDK v5.1 only supports TensorRT 8.5.2.
-
> NOTE:
- 1. The default CUDA version used by CMake is 12.0.1. To override this, for example to 11.8, append `-DCUDA_VERSION=11.8` to the cmake command.
- 2. If samples fail to link on CentOS7, create this symbolic link: `ln -s $TRT_OUT_DIR/libnvinfer_plugin.so $TRT_OUT_DIR/libnvinfer_plugin.so.8`
+ 1. The default CUDA version used by CMake is 12.2.0. To override this, for example to 11.8, append `-DCUDA_VERSION=11.8` to the cmake command.
* Required CMake build arguments are:
- `TRT_LIB_DIR`: Path to the TensorRT installation directory containing libraries.
- `TRT_OUT_DIR`: Output directory where generated build artifacts will be copied.
@@ -193,7 +139,7 @@ For Linux platforms, we recommend that you generate a docker container for build
- Tesla T4, GeForce RTX 2080: `-DGPU_ARCHS="75"`
- Titan V, Tesla V100: `-DGPU_ARCHS="70"`
- Multiple SMs: `-DGPU_ARCHS="80 75"`
- - `TRT_PLATFORM_ID`: Bare-metal build (unlike containerized cross-compilation) on non Linux/x86 platforms must explicitly specify the target platform. Currently supported options: `x86_64` (default), `aarch64`
+ - `TRT_PLATFORM_ID`: Bare-metal builds (as opposed to containerized cross-compilation) must explicitly specify the target platform. Currently supported options: `x86_64` (default).
# References
@@ -209,4 +155,4 @@ For Linux platforms, we recommend that you generate a docker container for build
## Known Issues
-* Please refer to [TensorRT 8.6 Release Notes](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-8.html#tensorrt-8)
+* Please refer to [TensorRT Release Notes](https://docs.nvidia.com/deeplearning/tensorrt/release-notes)
diff --git a/VERSION b/VERSION
index 811e1c1d..efdce495 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-8.6.1.6
+10.0.0.6
diff --git a/cmake/modules/find_library_create_target.cmake b/cmake/modules/find_library_create_target.cmake
index 1894ea51..a1d29efb 100644
--- a/cmake/modules/find_library_create_target.cmake
+++ b/cmake/modules/find_library_create_target.cmake
@@ -25,6 +25,9 @@ macro(find_library_create_target target_name lib libtype hints)
find_library(${lib}_LIB_PATH ${lib})
message(STATUS "Library that was found ${${lib}_LIB_PATH}")
add_library(${target_name} ${libtype} IMPORTED)
- set_property(TARGET ${target_name} PROPERTY IMPORTED_LOCATION ${${lib}_LIB_PATH})
+ set_property(TARGET ${target_name} PROPERTY IMPORTED_LOCATION ${${lib}_LIB_PATH}) # This should be a .so or .dll file; currently it's a .a or .lib.
+ if (WIN32)
+ set_property(TARGET ${target_name} PROPERTY IMPORTED_IMPLIB ${${lib}_LIB_PATH}) # This should be a .lib file
+ endif()
message(STATUS "==========================================================================================")
endmacro()
diff --git a/cmake/modules/set_ifndef.cmake b/cmake/modules/set_ifndef.cmake
index c64581c6..fbdc9be1 100644
--- a/cmake/modules/set_ifndef.cmake
+++ b/cmake/modules/set_ifndef.cmake
@@ -14,7 +14,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
-
function (set_ifndef variable value)
if(NOT DEFINED ${variable})
set(${variable} ${value} PARENT_SCOPE)
diff --git a/cmake/toolchains/cmake_aarch64.toolchain b/cmake/toolchains/cmake_aarch64.toolchain
index 3381c0c1..3c87fd65 100644
--- a/cmake/toolchains/cmake_aarch64.toolchain
+++ b/cmake/toolchains/cmake_aarch64.toolchain
@@ -46,7 +46,13 @@ set(BUILD_LIBRARY_ONLY 1)
set(CUDA_TOOLKIT_ROOT_DIR ${CUDA_ROOT})
set(CUDA_INCLUDE_DIRS ${CUDA_ROOT}/include)
-set(RT_LIB /usr/aarch64-linux-gnu/lib/librt.so)
+find_library(RT_LIB rt PATHS /usr/aarch64-linux-gnu/lib /usr/lib/aarch64-linux-gnu)
+
+if(NOT RT_LIB)
+ message(WARNING "librt.so not found in default paths")
+endif()
+
+message("RT_LIB: ${RT_LIB}")
# Use host nvcc
set(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)
@@ -56,4 +62,4 @@ set(CMAKE_CUDA_COMPILER_FORCED TRUE)
set(CUDA_LIBS -L${CUDA_ROOT}/lib)
-set(ADDITIONAL_PLATFORM_LIB_FLAGS ${CUDA_LIBS} -lcublas -lcudart -lstdc++ -lm)
+set(ADDITIONAL_PLATFORM_LIB_FLAGS ${CUDA_LIBS} -lstdc++ -lm)
diff --git a/cmake/toolchains/cmake_aarch64_cross.toolchain b/cmake/toolchains/cmake_aarch64_cross.toolchain
new file mode 100644
index 00000000..177a82f9
--- /dev/null
+++ b/cmake/toolchains/cmake_aarch64_cross.toolchain
@@ -0,0 +1,55 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+set(CMAKE_SYSTEM_NAME Linux)
+set(CMAKE_SYSTEM_PROCESSOR aarch64)
+
+set(TRT_PLATFORM_ID "aarch64")
+
+set(CUDA_PLATFORM_ID "sbsa-linux")
+
+set(CMAKE_C_COMPILER /usr/bin/aarch64-linux-gnu-gcc-8)
+set(CMAKE_CXX_COMPILER /usr/bin/aarch64-linux-gnu-g++-8)
+
+set(CMAKE_C_FLAGS "" CACHE STRING "" FORCE)
+set(CMAKE_CXX_FLAGS "" CACHE STRING "" FORCE)
+
+set(CMAKE_C_COMPILER_TARGET aarch64-linux-gnu)
+set(CMAKE_CXX_COMPILER_TARGET aarch64-linux-gnu)
+
+set(CMAKE_C_COMPILER_FORCED TRUE)
+set(CMAKE_CXX_COMPILER_FORCED TRUE)
+
+set(CUDA_ROOT /usr/local/cuda/targets/${CUDA_PLATFORM_ID} CACHE STRING "CUDA ROOT dir")
+
+set(CUDNN_LIB /usr/lib/aarch64-linux-gnu/libcudnn.so)
+
+set(BUILD_LIBRARY_ONLY 1)
+
+set(CUDA_TOOLKIT_ROOT_DIR ${CUDA_ROOT})
+set(CUDA_INCLUDE_DIRS ${CUDA_ROOT}/include)
+
+set(RT_LIB /usr/aarch64-linux-gnu/lib/librt.so)
+
+set(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc)
+set(CMAKE_CUDA_HOST_COMPILER ${CMAKE_CXX_COMPILER} CACHE STRING "" FORCE)
+set(CMAKE_CUDA_FLAGS "-I${CUDA_INCLUDE_DIRS} -Xcompiler=\"-fPIC ${CMAKE_CXX_FLAGS}\"" CACHE STRING "" FORCE)
+set(CMAKE_CUDA_COMPILER_FORCED TRUE)
+
+set(CUDA_LIBS -L${CUDA_ROOT}/lib)
+
+set(ADDITIONAL_PLATFORM_LIB_FLAGS ${CUDA_LIBS} -lcublas -lcudart -lstdc++ -lm)
diff --git a/demo/BERT/README.md b/demo/BERT/README.md
index 49d48436..f867a321 100755
--- a/demo/BERT/README.md
+++ b/demo/BERT/README.md
@@ -31,7 +31,6 @@ This subfolder of the BERT TensorFlow repository, tested and maintained by NVIDI
* [Results](#results)
* [Inference performance: NVIDIA A100](#inference-performance-nvidia-a100-40gb)
* [Inference performance: NVIDIA A30](#inference-performance-nvidia-a30)
- * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4-16gb)
## Model overview
@@ -124,6 +123,12 @@ This demo BERT application can be run within the TensorRT OSS build container. I
**Note:** Since the datasets and checkpoints are stored in the directory mounted from the host, they do *not* need to be downloaded each time the container is launched.
+**Warning:** If you encounter the error message "Missing API key and missing Email Authentication. This command requires an API key or authentication via browser login", resolve it as follows:
+* Generate an API key by logging in at https://ngc.nvidia.com/setup/api-key and copy the generated key.
+* Run `ngc config set` inside the docker container and paste the copied API key at the prompt.
+
+Completing these steps should resolve the error and allow the command to proceed.
+
4. Build a TensorRT engine. To build an engine, run the `builder.py` script. For example:
```bash
mkdir -p engines && python3 builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/model.ckpt -o engines/bert_large_128.engine -b 1 -s 128 --fp16 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1
@@ -429,78 +434,78 @@ Results were obtained by running `scripts/inference_benchmark.sh --gpu Ampere` o
| Sequence Length | Batch Size | INT8 Latency (ms) | | | FP16 Latency (ms) | | |
|-----------------|------------|-----------------|-----------------|---------|-----------------|-----------------|---------|
| | | 95th Percentile | 99th Percentile | Average | 95th Percentile | 99th Percentile | Average |
-| 128 | 1 | 0.55 | 0.70 | 0.55 | 0.61 | 0.78 | 0.62 |
-| 128 | 2 | 0.78 | 0.78 | 0.62 | 0.72 | 0.92 | 0.73 |
-| 128 | 4 | 0.74 | 0.93 | 0.74 | 0.93 | 0.93 | 0.93 |
-| 128 | 8 | 0.95 | 0.95 | 0.94 | 1.31 | 1.31 | 1.31 |
-| 128 | 12 | 1.21 | 1.53 | 1.22 | 1.73 | 1.77 | 1.72 |
-| 128 | 16 | 1.34 | 1.34 | 1.34 | 2.09 | 2.10 | 2.07 |
-| 128 | 24 | 1.84 | 1.84 | 1.84 | 3.07 | 3.09 | 3.03 |
-| 128 | 32 | 2.27 | 2.27 | 2.26 | 3.93 | 3.94 | 3.90 |
-| 128 | 64 | 4.21 | 4.25 | 4.18 | 7.79 | 7.80 | 7.72 |
-| 128 | 128 | 8.25 | 8.26 | 8.14 | 15.41 | 15.42 | 15.27 |
-| 384 | 1 | 1.14 | 1.46 | 1.14 | 1.26 | 1.26 | 1.25 |
-| 384 | 2 | 1.31 | 1.31 | 1.31 | 1.55 | 1.55 | 1.55 |
-| 384 | 4 | 1.67 | 1.67 | 1.67 | 2.13 | 2.17 | 2.13 |
-| 384 | 8 | 2.22 | 2.22 | 2.22 | 3.36 | 3.39 | 3.35 |
-| 384 | 12 | 3.34 | 3.35 | 3.34 | 4.84 | 4.88 | 4.79 |
-| 384 | 16 | 4.04 | 4.04 | 4.04 | 6.40 | 6.46 | 6.39 |
-| 384 | 24 | 5.76 | 5.76 | 5.74 | 9.54 | 9.66 | 9.44 |
-| 384 | 32 | 7.71 | 7.71 | 7.70 | 13.02 | 13.03 | 12.90 |
-| 384 | 64 | 15.01 | 15.01 | 14.91 | 25.25 | 25.26 | 24.89 |
-| 384 | 128 | 29.26 | 29.26 | 29.13 | 49.12 | 49.25 | 48.81 |
+| 128 | 1 | 0.64 | 0.69 | 0.56 | 0.79 | 0.79 | 0.63 |
+| 128 | 2 | 0.78 | 0.78 | 0.62 | 0.80 | 0.80 | 0.73 |
+| 128 | 4 | 0.74 | 0.74 | 0.74 | 1.12 | 1.20 | 0.95 |
+| 128 | 8 | 1.22 | 1.23 | 0.96 | 1.31 | 1.31 | 1.31 |
+| 128 | 12 | 1.29 | 1.30 | 1.21 | 1.70 | 1.70 | 1.70 |
+| 128 | 16 | 1.34 | 1.34 | 1.34 | 2.10 | 2.10 | 2.08 |
+| 128 | 24 | 1.83 | 1.84 | 1.83 | 3.07 | 3.08 | 3.04 |
+| 128 | 32 | 2.25 | 2.26 | 2.25 | 3.95 | 3.95 | 3.92 |
+| 128 | 64 | 4.19 | 4.20 | 4.17 | 7.68 | 7.74 | 7.63 |
+| 128 | 128 | 8.15 | 8.16 | 8.10 | 15.45 | 15.46 | 15.30 |
+| 384 | 1 | 1.14 | 1.46 | 1.15 | 1.26 | 1.62 | 1.26 |
+| 384 | 2 | 1.32 | 1.32 | 1.32 | 1.55 | 1.55 | 1.55 |
+| 384 | 4 | 1.68 | 1.72 | 1.68 | 2.11 | 2.11 | 2.11 |
+| 384 | 8 | 2.22 | 2.23 | 2.22 | 3.38 | 3.42 | 3.35 |
+| 384 | 12 | 3.34 | 3.34 | 3.34 | 4.84 | 4.86 | 4.81 |
+| 384 | 16 | 4.02 | 4.03 | 4.02 | 6.41 | 6.41 | 6.39 |
+| 384 | 24 | 5.73 | 5.73 | 5.73 | 9.47 | 9.47 | 9.36 |
+| 384 | 32 | 7.75 | 7.77 | 7.68 | 13.05 | 13.12 | 12.92 |
+| 384 | 64 | 14.96 | 14.96 | 14.85 | 25.24 | 25.36 | 24.93 |
+| 384 | 128 | 29.13 | 29.14 | 28.89 | 49.27 | 49.37 | 48.84 |
##### BERT Large
| Sequence Length | Batch Size | INT8 Latency (ms) | | | FP16 Latency (ms) | | |
|-----------------|------------|-----------------|-----------------|---------|-----------------|-----------------|---------|
| | | 95th Percentile | 99th Percentile | Average | 95th Percentile | 99th Percentile | Average |
-| 128 | 1 | 1.24 | 1.25 | 1.24 | 1.58 | 1.60 | 1.58 |
-| 128 | 2 | 1.44 | 1.44 | 1.44 | 1.83 | 1.84 | 1.82 |
-| 128 | 4 | 1.78 | 1.79 | 1.78 | 2.54 | 2.54 | 2.53 |
-| 128 | 8 | 2.82 | 2.82 | 2.81 | 3.98 | 4.00 | 3.97 |
-| 128 | 12 | 3.11 | 3.11 | 3.11 | 5.08 | 5.12 | 5.04 |
-| 128 | 16 | 4.06 | 4.07 | 4.06 | 6.96 | 6.96 | 6.91 |
-| 128 | 24 | 5.31 | 5.32 | 5.31 | 9.69 | 9.70 | 9.63 |
-| 128 | 32 | 7.07 | 7.07 | 7.02 | 13.11 | 13.12 | 12.93 |
-| 128 | 64 | 12.97 | 13.08 | 12.89 | 24.94 | 25.22 | 24.74 |
-| 128 | 128 | 25.48 | 25.72 | 25.28 | 49.30 | 49.46 | 49.18 |
-| 384 | 1 | 2.59 | 2.59 | 2.59 | 2.98 | 2.99 | 2.98 |
-| 384 | 2 | 3.04 | 3.05 | 3.04 | 4.01 | 4.03 | 4.00 |
-| 384 | 4 | 4.03 | 4.04 | 4.03 | 5.79 | 5.79 | 5.73 |
-| 384 | 8 | 7.20 | 7.22 | 7.20 | 11.11 | 11.14 | 10.99 |
-| 384 | 12 | 9.19 | 9.20 | 9.19 | 15.47 | 15.63 | 15.39 |
-| 384 | 16 | 12.36 | 12.38 | 12.35 | 21.18 | 21.19 | 21.00 |
-| 384 | 24 | 17.77 | 17.95 | 17.68 | 31.41 | 31.42 | 30.90 |
-| 384 | 32 | 23.36 | 23.37 | 23.20 | 41.40 | 41.43 | 40.90 |
-| 384 | 64 | 45.60 | 45.61 | 45.26 | 80.07 | 80.25 | 79.50 |
-| 384 | 128 | 89.25 | 89.30 | 88.57 | 157.38 | 157.76 | 156.31 |
+| 128 | 1 | 1.24 | 1.24 | 1.23 | 1.56 | 1.56 | 1.56 |
+| 128 | 2 | 1.44 | 1.83 | 1.45 | 1.83 | 1.83 | 1.83 |
+| 128 | 4 | 1.78 | 1.78 | 1.78 | 2.55 | 2.56 | 2.55 |
+| 128 | 8 | 2.66 | 2.66 | 2.66 | 3.96 | 3.97 | 3.93 |
+| 128 | 12 | 3.11 | 3.11 | 3.10 | 5.07 | 5.12 | 5.05 |
+| 128 | 16 | 4.07 | 4.07 | 4.06 | 6.96 | 6.97 | 6.91 |
+| 128 | 24 | 5.31 | 5.32 | 5.31 | 9.72 | 9.82 | 9.63 |
+| 128 | 32 | 7.04 | 7.07 | 7.02 | 13.00 | 13.04 | 12.95 |
+| 128 | 64 | 12.96 | 12.96 | 12.86 | 24.90 | 25.07 | 24.71 |
+| 128 | 128 | 25.20 | 25.21 | 25.16 | 49.29 | 49.55 | 48.86 |
+| 384 | 1 | 2.57 | 2.57 | 2.57 | 2.98 | 2.98 | 2.98 |
+| 384 | 2 | 3.06 | 3.07 | 3.06 | 3.93 | 3.93 | 3.92 |
+| 384 | 4 | 4.03 | 4.03 | 4.03 | 5.78 | 5.79 | 5.74 |
+| 384 | 8 | 7.20 | 7.21 | 7.19 | 11.16 | 11.19 | 11.04 |
+| 384 | 12 | 9.18 | 9.18 | 9.17 | 15.51 | 15.51 | 15.39 |
+| 384 | 16 | 12.34 | 12.34 | 12.33 | 21.25 | 21.25 | 21.03 |
+| 384 | 24 | 17.74 | 17.79 | 17.69 | 31.13 | 31.14 | 30.82 |
+| 384 | 32 | 23.37 | 23.37 | 23.16 | 41.26 | 41.43 | 40.83 |
+| 384 | 64 | 45.08 | 45.09 | 45.01 | 79.88 | 80.21 | 79.18 |
+| 384 | 128 | 88.34 | 88.37 | 88.06 | 156.43 | 157.17 | 155.47 |
##### Megatron Large with Sparsity
| Sequence Length | Batch Size | INT8 QAT Latency (ms) | | |
|-----------------|------------|-----------------|-----------------|---------|
| | | 95th Percentile | 99th Percentile | Average |
-| 128 | 1 | 1.29 | 1.54 | 1.29 |
-| 128 | 2 | 1.35 | 1.71 | 1.35 |
-| 128 | 4 | 1.79 | 2.14 | 1.79 |
+| 128 | 1 | 1.17 | 1.48 | 1.18 |
+| 128 | 2 | 1.49 | 1.88 | 1.50 |
+| 128 | 4 | 1.79 | 1.79 | 1.79 |
| 128 | 8 | 2.54 | 2.54 | 2.53 |
-| 128 | 12 | 2.93 | 2.93 | 2.92 |
-| 128 | 16 | 3.95 | 3.95 | 3.94 |
-| 128 | 24 | 4.93 | 4.94 | 4.92 |
-| 128 | 32 | 7.13 | 7.14 | 7.12 |
-| 128 | 64 | 11.64 | 11.64 | 11.62 |
-| 128 | 128 | 21.29 | 21.46 | 21.16 |
+| 128 | 12 | 2.95 | 2.95 | 2.94 |
+| 128 | 16 | 3.97 | 3.97 | 3.96 |
+| 128 | 24 | 4.91 | 4.91 | 4.90 |
+| 128 | 32 | 6.90 | 6.92 | 6.86 |
+| 128 | 64 | 11.61 | 11.64 | 11.59 |
+| 128 | 128 | 21.34 | 21.35 | 21.21 |
| 384 | 1 | 1.71 | 1.72 | 1.71 |
-| 384 | 2 | 2.24 | 2.25 | 2.23 |
-| 384 | 4 | 3.43 | 3.44 | 3.43 |
-| 384 | 8 | 5.77 | 5.77 | 5.76 |
-| 384 | 12 | 8.39 | 8.39 | 8.37 |
-| 384 | 16 | 10.38 | 10.39 | 10.36 |
-| 384 | 24 | 14.69 | 14.70 | 14.67 |
-| 384 | 32 | 18.68 | 18.82 | 18.66 |
-| 384 | 64 | 35.88 | 35.89 | 35.70 |
-| 384 | 128 | 68.71 | 68.73 | 68.16 |
+| 384 | 2 | 2.21 | 2.21 | 2.21 |
+| 384 | 4 | 3.47 | 3.47 | 3.47 |
+| 384 | 8 | 5.75 | 5.75 | 5.74 |
+| 384 | 12 | 8.37 | 8.38 | 8.35 |
+| 384 | 16 | 10.39 | 10.40 | 10.37 |
+| 384 | 24 | 14.61 | 14.62 | 14.59 |
+| 384 | 32 | 18.80 | 18.96 | 18.78 |
+| 384 | 64 | 35.90 | 35.92 | 35.62 |
+| 384 | 128 | 67.74 | 67.77 | 67.60 |
#### Inference performance: NVIDIA A30
@@ -511,76 +516,76 @@ Results were obtained by running `scripts/inference_benchmark.sh --gpu Ampere` o
| Sequence Length | Batch Size | INT8 Latency (ms) | | | FP16 Latency (ms) | | |
|-----------------|------------|-----------------|-----------------|---------|-----------------|-----------------|---------|
| | | 95th Percentile | 99th Percentile | Average | 95th Percentile | 99th Percentile | Average |
-| 128 | 1 | 0.91 | 0.92 | 0.62 | 1.18 | 1.18 | 0.82 |
-| 128 | 2 | 1.13 | 1.13 | 0.77 | 1.07 | 1.07 | 0.97 |
-| 128 | 4 | 1.04 | 1.57 | 1.05 | 1.46 | 2.11 | 1.44 |
-| 128 | 8 | 1.46 | 1.49 | 1.44 | 2.41 | 2.41 | 2.40 |
-| 128 | 12 | 1.94 | 1.94 | 1.94 | 3.42 | 3.45 | 3.40 |
-| 128 | 16 | 2.40 | 2.46 | 2.37 | 4.33 | 4.41 | 4.28 |
-| 128 | 24 | 3.54 | 3.59 | 3.48 | 6.59 | 6.60 | 6.50 |
-| 128 | 32 | 4.46 | 4.50 | 4.43 | 8.49 | 8.55 | 8.37 |
-| 128 | 64 | 8.68 | 8.75 | 8.57 | 16.65 | 16.67 | 16.47 |
-| 128 | 128 | 16.81 | 16.83 | 16.63 | 32.40 | 32.52 | 32.04 |
-| 384 | 1 | 1.31 | 1.32 | 1.31 | 1.62 | 1.64 | 1.63 |
-| 384 | 2 | 1.66 | 1.66 | 1.66 | 2.27 | 2.27 | 2.26 |
-| 384 | 4 | 2.32 | 2.32 | 2.30 | 3.79 | 3.87 | 3.72 |
-| 384 | 8 | 4.26 | 4.26 | 4.24 | 7.26 | 7.31 | 7.17 |
-| 384 | 12 | 6.10 | 6.13 | 6.04 | 10.35 | 10.43 | 10.23 |
-| 384 | 16 | 8.17 | 8.18 | 8.08 | 13.93 | 14.05 | 13.85 |
-| 384 | 24 | 11.91 | 11.98 | 11.82 | 20.46 | 20.57 | 20.25 |
-| 384 | 32 | 15.50 | 15.64 | 15.48 | 27.06 | 27.17 | 26.81 |
-| 384 | 64 | 31.03 | 31.18 | 30.63 | 52.44 | 52.48 | 52.05 |
-| 384 | 128 | 61.10 | 61.13 | 60.50 | 103.38 | 103.64 | 102.87 |
+| 128 | 1 | 0.88 | 0.88 | 0.61 | 0.78 | 1.14 | 0.79 |
+| 128 | 2 | 1.03 | 1.04 | 0.77 | 0.97 | 1.45 | 0.98 |
+| 128 | 4 | 1.04 | 1.56 | 1.05 | 1.43 | 1.44 | 1.41 |
+| 128 | 8 | 1.44 | 1.46 | 1.43 | 2.43 | 2.44 | 2.41 |
+| 128 | 12 | 1.92 | 1.92 | 1.91 | 3.44 | 3.45 | 3.39 |
+| 128 | 16 | 2.38 | 2.43 | 2.35 | 4.36 | 4.37 | 4.28 |
+| 128 | 24 | 3.47 | 3.50 | 3.44 | 6.56 | 6.65 | 6.48 |
+| 128 | 32 | 4.42 | 4.45 | 4.38 | 8.42 | 8.58 | 8.36 |
+| 128 | 64 | 8.58 | 8.66 | 8.49 | 16.58 | 16.60 | 16.40 |
+| 128 | 128 | 16.56 | 16.62 | 16.39 | 32.13 | 32.30 | 31.93 |
+| 384 | 1 | 1.31 | 2.01 | 1.32 | 1.63 | 1.63 | 1.62 |
+| 384 | 2 | 1.67 | 1.67 | 1.66 | 2.29 | 2.35 | 2.26 |
+| 384 | 4 | 2.29 | 2.34 | 2.27 | 3.74 | 3.77 | 3.71 |
+| 384 | 8 | 4.23 | 4.24 | 4.20 | 7.25 | 7.30 | 7.15 |
+| 384 | 12 | 6.05 | 6.10 | 6.00 | 10.21 | 10.27 | 10.12 |
+| 384 | 16 | 8.07 | 8.11 | 8.02 | 13.97 | 14.05 | 13.84 |
+| 384 | 24 | 11.85 | 11.86 | 11.71 | 20.31 | 20.42 | 20.16 |
+| 384 | 32 | 15.45 | 15.47 | 15.29 | 26.86 | 27.04 | 26.65 |
+| 384 | 64 | 30.49 | 30.74 | 30.25 | 52.21 | 52.34 | 51.75 |
+| 384 | 128 | 60.21 | 60.48 | 59.56 | 103.20 | 103.58 | 102.66 |
##### BERT Large
| Sequence Length | Batch Size | INT8 Latency (ms) | | | FP16 Latency (ms) | | |
|-----------------|------------|-----------------|-----------------|---------|-----------------|-----------------|---------|
| | | 95th Percentile | 99th Percentile | Average | 95th Percentile | 99th Percentile | Average |
-| 128 | 1 | 1.49 | 1.49 | 1.48 | 2.03 | 2.03 | 2.02 |
-| 128 | 2 | 1.83 | 1.84 | 1.82 | 2.79 | 2.79 | 2.76 |
-| 128 | 4 | 2.70 | 2.70 | 2.68 | 4.35 | 4.40 | 4.31 |
-| 128 | 8 | 4.50 | 4.52 | 4.47 | 8.07 | 8.17 | 8.01 |
-| 128 | 12 | 5.67 | 5.69 | 5.62 | 10.67 | 10.75 | 10.53 |
-| 128 | 16 | 8.08 | 8.13 | 7.95 | 14.86 | 14.86 | 14.72 |
-| 128 | 24 | 10.59 | 10.60 | 10.47 | 20.71 | 20.73 | 20.47 |
-| 128 | 32 | 14.16 | 14.21 | 14.03 | 28.21 | 28.37 | 27.98 |
-| 128 | 64 | 26.77 | 26.95 | 26.66 | 54.03 | 54.33 | 53.43 |
-| 128 | 128 | 52.65 | 52.78 | 52.12 | 106.15 | 106.75 | 105.37 |
-| 384 | 1 | 3.20 | 3.21 | 3.20 | 4.19 | 4.19 | 4.17 |
-| 384 | 2 | 4.26 | 4.26 | 4.22 | 6.61 | 6.63 | 6.56 |
-| 384 | 4 | 7.56 | 7.64 | 7.55 | 12.04 | 12.05 | 11.93 |
-| 384 | 8 | 13.01 | 13.07 | 12.84 | 22.81 | 22.89 | 22.56 |
-| 384 | 12 | 18.73 | 18.82 | 18.56 | 33.47 | 33.62 | 33.43 |
-| 384 | 16 | 24.41 | 24.51 | 24.16 | 44.45 | 44.47 | 44.03 |
-| 384 | 24 | 35.83 | 36.19 | 35.53 | 65.53 | 65.79 | 64.91 |
-| 384 | 32 | 47.34 | 47.52 | 46.86 | 85.92 | 86.16 | 85.15 |
-| 384 | 64 | 92.68 | 93.00 | 91.86 | 169.51 | 170.03 | 168.46 |
-| 384 | 128 | 181.91 | 182.29 | 181.02 | 334.01 | 334.51 | 332.81 |
+| 128 | 1 | 1.46 | 1.46 | 1.45 | 2.01 | 2.01 | 2.01 |
+| 128 | 2 | 1.83 | 1.85 | 1.83 | 2.80 | 2.83 | 2.75 |
+| 128 | 4 | 2.71 | 2.71 | 2.69 | 4.34 | 4.36 | 4.29 |
+| 128 | 8 | 4.33 | 4.35 | 4.28 | 8.12 | 8.20 | 8.03 |
+| 128 | 12 | 5.71 | 5.72 | 5.61 | 10.65 | 10.65 | 10.51 |
+| 128 | 16 | 7.62 | 7.64 | 7.55 | 14.57 | 14.66 | 14.55 |
+| 128 | 24 | 10.58 | 10.62 | 10.46 | 20.64 | 20.79 | 20.45 |
+| 128 | 32 | 14.18 | 14.26 | 13.99 | 28.17 | 28.31 | 28.01 |
+| 128 | 64 | 26.87 | 27.00 | 26.61 | 53.44 | 53.71 | 53.31 |
+| 128 | 128 | 52.36 | 52.71 | 51.90 | 105.42 | 105.95 | 104.96 |
+| 384 | 1 | 3.33 | 3.33 | 3.33 | 4.23 | 4.24 | 4.19 |
+| 384 | 2 | 4.26 | 4.26 | 4.23 | 6.63 | 6.65 | 6.57 |
+| 384 | 4 | 7.26 | 7.26 | 7.25 | 12.00 | 12.06 | 11.88 |
+| 384 | 8 | 12.91 | 12.99 | 12.83 | 22.61 | 22.69 | 22.45 |
+| 384 | 12 | 18.73 | 18.85 | 18.53 | 33.43 | 33.64 | 33.28 |
+| 384 | 16 | 24.06 | 24.22 | 24.02 | 44.35 | 44.64 | 44.06 |
+| 384 | 24 | 35.83 | 35.95 | 35.49 | 64.84 | 64.90 | 64.78 |
+| 384 | 32 | 47.05 | 47.27 | 46.73 | 85.89 | 86.17 | 85.11 |
+| 384 | 64 | 92.09 | 92.32 | 91.34 | 168.09 | 168.48 | 167.24 |
+| 384 | 128 | 180.47 | 180.90 | 179.75 | 330.71 | 331.31 | 329.53 |
##### Megatron Large with Sparsity
| Sequence Length | Batch Size | INT8 QAT Latency (ms) | | |
|-----------------|------------|-----------------|-----------------|---------|
| | | 95th Percentile | 99th Percentile | Average |
-| 128 | 1 | 1.46 | 1.47 | 1.45 |
-| 128 | 2 | 1.88 | 1.88 | 1.87 |
-| 128 | 4 | 2.74 | 2.74 | 2.73 |
-| 128 | 8 | 4.11 | 4.12 | 4.10 |
-| 128 | 12 | 5.29 | 5.35 | 5.25 |
-| 128 | 16 | 7.52 | 7.57 | 7.50 |
-| 128 | 24 | 10.11 | 10.19 | 10.06 |
-| 128 | 32 | 12.85 | 12.90 | 12.80 |
-| 128 | 64 | 24.50 | 24.52 | 24.26 |
-| 128 | 128 | 46.24 | 46.57 | 45.92 |
-| 384 | 1 | 2.35 | 2.36 | 2.35 |
-| 384 | 2 | 3.90 | 3.91 | 3.89 |
-| 384 | 4 | 6.14 | 6.15 | 6.08 |
-| 384 | 8 | 11.74 | 11.76 | 11.64 |
-| 384 | 12 | 15.86 | 15.88 | 15.74 |
-| 384 | 16 | 21.21 | 21.27 | 21.05 |
-| 384 | 24 | 30.03 | 30.04 | 29.89 |
-| 384 | 32 | 40.20 | 40.22 | 40.05 |
-| 384 | 64 | 76.82 | 77.11 | 76.52 |
-| 384 | 128 | 149.54 | 149.80 | 148.78 |
+| 128 | 1 | 1.44 | 1.45 | 1.44 |
+| 128 | 2 | 1.84 | 1.84 | 1.84 |
+| 128 | 4 | 2.76 | 2.76 | 2.75 |
+| 128 | 8 | 4.12 | 4.12 | 4.11 |
+| 128 | 12 | 5.26 | 5.28 | 5.22 |
+| 128 | 16 | 7.52 | 7.52 | 7.51 |
+| 128 | 24 | 9.97 | 9.99 | 9.89 |
+| 128 | 32 | 12.84 | 12.85 | 12.80 |
+| 128 | 64 | 24.35 | 24.46 | 24.15 |
+| 128 | 128 | 46.38 | 46.60 | 45.96 |
+| 384 | 1 | 2.37 | 2.37 | 2.36 |
+| 384 | 2 | 3.88 | 3.88 | 3.87 |
+| 384 | 4 | 6.10 | 6.11 | 6.05 |
+| 384 | 8 | 11.60 | 11.63 | 11.49 |
+| 384 | 12 | 15.73 | 15.78 | 15.64 |
+| 384 | 16 | 20.95 | 21.01 | 20.90 |
+| 384 | 24 | 29.83 | 29.93 | 29.71 |
+| 384 | 32 | 40.01 | 40.09 | 39.75 |
+| 384 | 64 | 76.46 | 76.67 | 76.28 |
+| 384 | 128 | 148.96 | 149.23 | 148.11 |
diff --git a/demo/BERT/builder.py b/demo/BERT/builder.py
index c6d15d00..5eafe367 100755
--- a/demo/BERT/builder.py
+++ b/demo/BERT/builder.py
@@ -40,7 +40,7 @@
TensorRT Initialization
"""
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
-trt_version = [int(n) for n in trt.__version__.split('.')]
+trt_version = trt.__version__.split('.')
# Import necessary plugins for demoBERT
plugin_lib_name = "nvinfer_plugin.dll" if sys.platform == "win32" else "libnvinfer_plugin.so"
@@ -107,10 +107,7 @@ def attention_layer_opt(prefix, config, init_dict, network, input_tensor, imask)
Ball = init_dict[prefix + BQKV]
# FC_attention
- if config.use_int8:
- mult_all = network.add_convolution_nd(input_tensor, 3 * hidden_size, (1, 1), Wall, Ball)
- else:
- mult_all = network.add_fully_connected(input_tensor, 3 * hidden_size, Wall, Ball)
+ mult_all = network.add_convolution_nd(input_tensor, 3 * hidden_size, (1, 1), Wall, Ball)
if config.use_qat:
dr_qkv = max(
@@ -217,24 +214,20 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, imas
# FC0
B_aout = init_dict[prefix + B_AOUT]
- if config.use_int8:
+ if not config.use_int8 and use_custom_fc():
+ W_aoutT = init_dict[prefix + W_AOUT + "_notrans"]
+ attention_out_fc = custom_fc(config, network, attention_heads, hidden_size, W_aoutT)
+ else:
W_aout = init_dict[prefix + W_AOUT]
attention_out_fc = network.add_convolution_nd(attention_heads, hidden_size, (1, 1), W_aout, B_aout)
B_aout = None
- if not config.use_int8_skipln:
+ if config.use_int8 and not config.use_int8_skipln:
attention_out_fc.set_output_type(0, trt.DataType.HALF if config.use_fp16 else trt.DataType.FLOAT)
- if config.use_qat:
+ if config.use_int8 and config.use_qat:
dr_fc_aout = init_dict[prefix + 'attention_output_add_local_input_quantizer_amax']
set_output_range(attention_out_fc, dr_fc_aout)
- elif use_custom_fc():
- W_aoutT = init_dict[prefix + W_AOUT + "_notrans"]
- attention_out_fc = custom_fc(config, network, attention_heads, hidden_size, W_aoutT)
- else:
- W_aout = init_dict[prefix + W_AOUT]
- attention_out_fc = network.add_fully_connected(attention_heads, hidden_size, W_aout, B_aout)
- B_aout = None
skiplayer = skipln(prefix + "attention_output_layernorm_",config, init_dict, network, attention_out_fc.get_output(0), input_tensor, B_aout)
attention_ln = skiplayer.get_output(0)
@@ -245,10 +238,7 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, imas
# FC1 + GELU
B_mid = init_dict[prefix + B_MID]
W_mid = init_dict[prefix + W_MID]
- if config.use_int8:
- mid_dense = network.add_convolution_nd(attention_ln, config.intermediate_size, (1, 1), W_mid, B_mid)
- else:
- mid_dense = network.add_fully_connected(attention_ln, config.intermediate_size, W_mid, B_mid)
+ mid_dense = network.add_convolution_nd(attention_ln, config.intermediate_size, (1, 1), W_mid, B_mid)
mid_dense_out = mid_dense.get_output(0)
POW = network.add_constant((1, 1, 1, 1, 1), trt.Weights(np.ascontiguousarray([3.0], dtype=np.float32)))
@@ -281,21 +271,18 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, imas
# FC2
# Dense to hidden size
B_lout = init_dict[prefix + B_LOUT]
- if config.use_int8 and not config.use_fc2_gemm:
- W_lout = init_dict[prefix + W_LOUT]
- out_dense = network.add_convolution_nd(intermediate_act, hidden_size, (1, 1), W_lout, B_lout)
- B_lout = None
-
- if not config.use_int8_skipln:
- out_dense.set_output_type(0, trt.DataType.HALF if config.use_fp16 else trt.DataType.FLOAT)
- elif use_custom_fc():
+ prefer_conv = config.use_int8 and not config.use_fc2_gemm
+ if not prefer_conv and use_custom_fc():
W_loutT = init_dict[prefix + W_LOUT + "_notrans"]
out_dense = custom_fc(config, network, intermediate_act, hidden_size, W_loutT)
else:
W_lout = init_dict[prefix + W_LOUT]
- out_dense = network.add_fully_connected(intermediate_act, hidden_size, W_lout, B_lout)
+ out_dense = network.add_convolution_nd(intermediate_act, hidden_size, (1, 1), W_lout, B_lout)
B_lout = None
+ if config.use_int8 and not config.use_int8_skipln:
+ out_dense.set_output_type(0, trt.DataType.HALF if config.use_fp16 else trt.DataType.FLOAT)
+
if config.use_qat:
dr_fc_out = init_dict[prefix + 'output_add_local_input_quantizer_amax']
set_output_range(out_dense, dr_fc_out)
@@ -334,7 +321,7 @@ def squad_output(prefix, config, init_dict, network, input_tensor):
B_out = init_dict[prefix + SQD_B]
W = network.add_constant((1, hidden_size, 2), W_out)
- dense = network.add_fully_connected(input_tensor, 2, W_out, B_out)
+ dense = network.add_convolution_nd(input_tensor, 2, (1, 1), W_out, B_out)
OUT = network.add_shuffle(dense.get_output(0))
OUT.second_transpose = (1, 0, 2, 3, 4)
@@ -399,11 +386,16 @@ def emb_layernorm(builder, network, config, weights_dict, builder_config, sequen
return emb_layer
def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_dict, squad_json, vocab_file, calibrationCacheFile, calib_num, verbose):
- explicit_batch_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
- with trt.Builder(TRT_LOGGER) as builder, builder.create_network(explicit_batch_flag) as network, builder.create_builder_config() as builder_config:
- builder_config.max_workspace_size = workspace_size * (1024 * 1024)
+ network_creation_flag = 0
+ if "EXPLICIT_BATCH" in trt.NetworkDefinitionCreationFlag.__members__.keys():
+ network_creation_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
+
+ with trt.Builder(TRT_LOGGER) as builder, builder.create_network(network_creation_flag) as network, builder.create_builder_config() as builder_config:
+ builder_config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_size * (1024 * 1024))
builder_config.avg_timing_iterations = 8
+ # The cuBLAS tactic source can be removed once the QKV plugin no longer uses it.
+ builder_config.set_tactic_sources(builder_config.get_tactic_sources() | 1 << int(trt.TacticSource.CUBLAS))
if config.use_fp16:
builder_config.set_flag(trt.BuilderFlag.FP16)
if config.use_int8:
@@ -413,7 +405,9 @@ def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_
builder_config.set_quantization_flag(trt.QuantizationFlag.CALIBRATE_BEFORE_FUSION)
builder_config.int8_calibrator = calibrator
if config.use_strict:
- builder_config.set_flag(trt.BuilderFlag.STRICT_TYPES)
+ builder_config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
+ builder_config.set_flag(trt.BuilderFlag.DIRECT_IO)
+ builder_config.set_flag(trt.BuilderFlag.REJECT_EMPTY_ALGORITHMS)
if verbose:
builder_config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED
@@ -425,7 +419,7 @@ def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_
# speed up the engine build for trt major version >= 8
# 1. disable cudnn tactic
# 2. load global timing cache
- if trt_version[0] >= 8:
+ if int(trt_version[0]) >= 8:
tactic_source = builder_config.get_tactic_sources() & ~(1 << int(trt.TacticSource.CUDNN))
builder_config.set_tactic_sources(tactic_source)
if config.timing_cache != None:
@@ -451,15 +445,16 @@ def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_
squad_logits = squad_output("cls_", config, weights_dict, network, bert_out)
squad_logits_out = squad_logits.get_output(0)
+ squad_logits_out.name = "logits_out"
network.mark_output(squad_logits_out)
build_start_time = time.time()
- engine = builder.build_engine(network, builder_config)
+ serialized_engine = builder.build_serialized_network(network, builder_config)
build_time_elapsed = (time.time() - build_start_time)
TRT_LOGGER.log(TRT_LOGGER.INFO, "build engine in {:.3f} Sec".format(build_time_elapsed))
# save global timing cache
- if trt_version[0] >= 8 and config.timing_cache != None:
+ if int(trt_version[0]) >= 8 and config.timing_cache != None:
cache = builder_config.get_timing_cache()
with cache.serialize() as buffer:
with open(config.timing_cache, "wb") as f:
@@ -469,7 +464,7 @@ def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_
if config.use_int8 and not config.use_qat:
calibrator.free()
- return engine
+ return serialized_engine
def generate_calibration_cache(sequence_lengths, workspace_size, config, weights_dict, squad_json, vocab_file, calibrationCacheFile, calib_num):
"""
@@ -488,7 +483,7 @@ def generate_calibration_cache(sequence_lengths, workspace_size, config, weights
config.use_fp16 = False
config.is_calib_mode = True
- with build_engine([1], workspace_size, sequence_lengths, config, weights_dict, squad_json, vocab_file, calibrationCacheFile, calib_num, False) as engine:
+ with build_engine([1], workspace_size, sequence_lengths, config, weights_dict, squad_json, vocab_file, calibrationCacheFile, calib_num, False) as serialized_engine:
TRT_LOGGER.log(TRT_LOGGER.INFO, "calibration cache generated in {:}".format(calibrationCacheFile))
config.use_fp16 = saved_use_fp16
@@ -553,9 +548,7 @@ def main():
else:
raise RuntimeError("You need either specify TF checkpoint using option --ckpt or ONNX using option --onnx to build TRT BERT model.")
- with build_engine(args.batch_size, args.workspace_size, args.sequence_length, config, weights_dict, args.squad_json, args.vocab_file, calib_cache, args.calib_num, args.verbose) as engine:
- TRT_LOGGER.log(TRT_LOGGER.VERBOSE, "Serializing Engine...")
- serialized_engine = engine.serialize()
+ with build_engine(args.batch_size, args.workspace_size, args.sequence_length, config, weights_dict, args.squad_json, args.vocab_file, calib_cache, args.calib_num, args.verbose) as serialized_engine:
TRT_LOGGER.log(TRT_LOGGER.INFO, "Saving Engine to {:}".format(args.output))
with open(args.output, "wb") as fout:
fout.write(serialized_engine)
diff --git a/demo/BERT/builder_varseqlen.py b/demo/BERT/builder_varseqlen.py
index 0c1aeaac..ad25ef0c 100755
--- a/demo/BERT/builder_varseqlen.py
+++ b/demo/BERT/builder_varseqlen.py
@@ -39,7 +39,7 @@
TensorRT Initialization
"""
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
-trt_version = [int(n) for n in trt.__version__.split('.')]
+trt_version = trt.__version__.split('.')
# Import necessary plugins for demoBERT
plugin_lib_name = "nvinfer_plugin.dll" if sys.platform == "win32" else "libnvinfer_plugin.so"
@@ -107,10 +107,7 @@ def attention_layer_opt(prefix, config, init_dict, network, input_tensor, mask_i
Ball = init_dict[prefix + BQKV]
# FC_attention
- if config.use_int8:
- mult_all = network.add_convolution_nd(input_tensor, 3 * hidden_size, (1, 1), Wall, Ball)
- else:
- mult_all = network.add_fully_connected(input_tensor, 3 * hidden_size, Wall, Ball)
+ mult_all = network.add_convolution_nd(input_tensor, 3 * hidden_size, (1, 1), Wall, Ball)
if config.use_qat:
dr_qkv = max(
@@ -202,10 +199,7 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, resi
# FC0
B_aout = init_dict[prefix + B_AOUT]
W_aout = init_dict[prefix + W_AOUT]
- if config.use_int8:
- attention_out_fc = network.add_convolution_nd(attention_heads, hidden_size, (1, 1), W_aout, B_aout)
- else:
- attention_out_fc = network.add_fully_connected(attention_heads, hidden_size, W_aout, B_aout)
+ attention_out_fc = network.add_convolution_nd(attention_heads, hidden_size, (1, 1), W_aout, B_aout)
if config.use_int8 and config.use_qat:
dr_fc_aout = init_dict[prefix + 'attention_output_add_local_input_quantizer_amax']
set_output_range(attention_out_fc, dr_fc_aout)
@@ -225,10 +219,7 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, resi
# FC1 + GELU
B_mid = init_dict[prefix + B_MID]
W_mid = init_dict[prefix + W_MID]
- if config.use_int8:
- mid_dense = network.add_convolution_nd(attention_ln, config.intermediate_size, (1, 1), W_mid, B_mid)
- else:
- mid_dense = network.add_fully_connected(attention_ln, config.intermediate_size, W_mid, B_mid)
+ mid_dense = network.add_convolution_nd(attention_ln, config.intermediate_size, (1, 1), W_mid, B_mid)
gelu_layer = add_gelu(network, mid_dense.get_output(0))
@@ -247,10 +238,7 @@ def transformer_layer_opt(prefix, config, init_dict, network, input_tensor, resi
B_lout = init_dict[prefix + B_LOUT]
W_lout = init_dict[prefix + W_LOUT]
- if config.use_int8:
- out_dense = network.add_convolution_nd(intermediate_act, hidden_size, (1, 1), W_lout, B_lout)
- else:
- out_dense = network.add_fully_connected(intermediate_act, hidden_size, W_lout, B_lout)
+ out_dense = network.add_convolution_nd(intermediate_act, hidden_size, (1, 1), W_lout, B_lout)
if config.use_int8 and config.use_qat:
dr_fc_out = init_dict[prefix + 'output_add_local_input_quantizer_amax']
set_output_range(out_dense, dr_fc_out)
@@ -327,6 +315,7 @@ def bert_model(config, init_dict, network, input_tensor, residual, mask_idx, cu_
squad_logits = squad_output("cls_", config, init_dict, network, prev_input)
squad_logits_out = squad_logits.get_output(0)
+ squad_logits_out.name = "logits_out"
network.mark_output(squad_logits_out)
@@ -339,11 +328,7 @@ def squad_output(prefix, config, init_dict, network, input_tensor):
W_out = init_dict[prefix + SQD_W]
B_out = init_dict[prefix + SQD_B]
- if config.use_int8:
- dense = network.add_convolution_nd(input_tensor, 2, (1, 1), W_out, B_out)
- else:
- dense = network.add_fully_connected(input_tensor, 2, W_out, B_out)
-
+ dense = network.add_convolution_nd(input_tensor, 2, (1, 1), W_out, B_out)
OUT = network.add_shuffle(dense.get_output(0))
if config.use_int8 and config.interleaved:
OUT.second_transpose = (1, 2, 0, 3)
@@ -394,10 +379,13 @@ def emb_layernorm(builder, network, config, weights_dict, builder_config, max_se
return emb_layer, cu_seqlens, max_seqlen
def build_engine(batch_sizes, workspace_size, sequence_length, config, weights_dict, squad_json, vocab_file, calibrationCacheFile, calib_num, verbose):
- explicit_batch_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
- with trt.Builder(TRT_LOGGER) as builder, builder.create_network(explicit_batch_flag) as network, builder.create_builder_config() as builder_config:
- builder_config.max_workspace_size = workspace_size * (1024 * 1024)
+ network_creation_flag = 0
+ if "EXPLICIT_BATCH" in trt.NetworkDefinitionCreationFlag.__members__.keys():
+ network_creation_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
+
+ with trt.Builder(TRT_LOGGER) as builder, builder.create_network(network_creation_flag) as network, builder.create_builder_config() as builder_config:
+ builder_config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_size * (1024 * 1024))
builder_config.avg_timing_iterations = 8
if config.use_fp16:
builder_config.set_flag(trt.BuilderFlag.FP16)
@@ -412,7 +400,7 @@ def build_engine(batch_sizes, workspace_size, sequence_length, config, weights_d
# speed up the engine build for trt major version >= 8
# 1. disable cudnn tactic
# 2. load global timing cache
- if trt_version[0] >= 8:
+ if int(trt_version[0]) >= 8:
tactic_source = builder_config.get_tactic_sources() & ~(1 << int(trt.TacticSource.CUDNN))
builder_config.set_tactic_sources(tactic_source)
if config.timing_cache != None:
@@ -454,12 +442,12 @@ def build_engine(batch_sizes, workspace_size, sequence_length, config, weights_d
bert_model(config, weights_dict, network, embeddings, residual, mask_idx, cu_seqlens, max_seqlen)
build_start_time = time.time()
- engine = builder.build_engine(network, builder_config)
+ serialized_engine = builder.build_serialized_network(network, builder_config)
build_time_elapsed = (time.time() - build_start_time)
TRT_LOGGER.log(TRT_LOGGER.INFO, "build engine in {:.3f} Sec".format(build_time_elapsed))
# save global timing cache
- if trt_version[0] >= 8 and config.timing_cache != None:
+ if int(trt_version[0]) >= 8 and config.timing_cache != None:
cache = builder_config.get_timing_cache()
with cache.serialize() as buffer:
with open(config.timing_cache, "wb") as f:
@@ -467,7 +455,7 @@ def build_engine(batch_sizes, workspace_size, sequence_length, config, weights_d
f.flush()
os.fsync(f)
- return engine
+ return serialized_engine
def main():
parser = argparse.ArgumentParser(description="TensorRT BERT Sample", formatter_class=argparse.ArgumentDefaultsHelpFormatter)
@@ -533,9 +521,7 @@ def main():
"PyTorch using option --pytorch, or Pickle weight dictionary using option --pickle "
"to build TRT BERT model.")
- with build_engine(args.max_batch_size, args.workspace_size, args.max_sequence_length, config, weights_dict, args.squad_json, args.vocab_file, calib_cache, args.calib_num, args.verbose) as engine:
- TRT_LOGGER.log(TRT_LOGGER.VERBOSE, "Serializing Engine...")
- serialized_engine = engine.serialize()
+ with build_engine(args.max_batch_size, args.workspace_size, args.max_sequence_length, config, weights_dict, args.squad_json, args.vocab_file, calib_cache, args.calib_num, args.verbose) as serialized_engine:
TRT_LOGGER.log(TRT_LOGGER.INFO, "Saving Engine to {:}".format(args.output))
with open(args.output, "wb") as fout:
fout.write(serialized_engine)
diff --git a/demo/BERT/infer_c/bert_infer.h b/demo/BERT/infer_c/bert_infer.h
index 827f9ba9..2f72102a 100644
--- a/demo/BERT/infer_c/bert_infer.h
+++ b/demo/BERT/infer_c/bert_infer.h
@@ -83,8 +83,7 @@ struct BertInference
}
gLogInfo << "Done\n";
- const int numBindingPerProfile = mEngine->getNbBindings() / mEngine->getNbOptimizationProfiles();
- mEnableVariableLen = numBindingPerProfile == kBERT_INPUT_NUM + 1 ? false : true;
+ mEnableVariableLen = mEngine->getNbIOTensors() == kBERT_INPUT_NUM + 1 ? false : true;
if (mEnableVariableLen)
{
gLogInfo << "Variable length is enabled\n";
@@ -153,15 +152,14 @@ struct BertInference
mDeviceBuffers.emplace_back(devBuf);
mHostOutput.resize(numOutputItems);
- mBindings.resize(mEngine->getNbBindings());
+ mBindings.resize(mEngine->getNbIOTensors() * mEngine->getNbOptimizationProfiles());
}
void prepare(int profIdx, int batchSize)
{
mContext->setOptimizationProfile(profIdx);
- const int numBindingPerProfile = mEngine->getNbBindings() / mEngine->getNbOptimizationProfiles();
- const int bindingIdxOffset = profIdx * numBindingPerProfile;
+ const int bindingIdxOffset = profIdx * mEngine->getNbIOTensors();
std::copy(mDeviceBuffers.begin(), mDeviceBuffers.end(), mBindings.begin() + bindingIdxOffset);
if (mEnableVariableLen)
@@ -169,14 +167,16 @@ struct BertInference
const int allocationSizes[] = {mSeqLength * batchSize, mSeqLength * batchSize, batchSize + 1, mSeqLength};
for (int i = 0; i < sizeof(allocationSizes)/sizeof(allocationSizes[0]); i++)
{
- mContext->setBindingDimensions(i + bindingIdxOffset, Dims{1, {allocationSizes[i]}});
+ auto const tensorName = mEngine->getIOTensorName(i % mEngine->getNbIOTensors());
+ mContext->setInputShape(tensorName, Dims{1, {allocationSizes[i]}});
}
}
else
{
for (int i = 0; i < kBERT_INPUT_NUM; i++)
{
- mContext->setBindingDimensions(i + bindingIdxOffset, Dims2(batchSize, mSeqLength));
+ auto const tensorName = mEngine->getIOTensorName(i);
+ mContext->setInputShape(tensorName, Dims2(batchSize, mSeqLength));
}
}
@@ -188,10 +188,16 @@ struct BertInference
if (mEnableGraph)
{
+ for (int32_t i = 0; i < mEngine->getNbIOTensors(); i++)
+ {
+ auto const& name = mEngine->getIOTensorName(i);
+ context->setTensorAddress(name, mBindings[i + bindingIdxOffset]);
+ }
+
cudaGraph_t graph;
cudaGraphExec_t exec;
// warm up and let mContext do cublas initialization
- bool status = mContext->enqueueV2(mBindings.data(), mStream, nullptr);
+ bool status = mContext->enqueueV3(mStream, nullptr);
if (!status)
{
gLogError << "Enqueue failed\n";
@@ -200,7 +206,7 @@ struct BertInference
gLogVerbose << "Capturing graph\n";
gpuErrChk(cudaStreamBeginCapture(mStream, cudaStreamCaptureModeRelaxed));
- status = mContext->enqueueV2(mBindings.data(), mStream, nullptr);
+ status = mContext->enqueueV3(mStream, nullptr);
if (!status)
{
gLogError << "Enqueue failed\n";
@@ -234,7 +240,7 @@ struct BertInference
}
else
{
- bool status = mContext->enqueueV2(mBindings.data(), mStream, nullptr);
+ bool status = mContext->enqueueV3(mStream, nullptr);
if (!status)
{
gLogError << "Enqueue failed\n";
@@ -259,7 +265,7 @@ struct BertInference
}
else
{
- bool status = mContext->enqueueV2(mBindings.data(), mStream, nullptr);
+ bool status = mContext->enqueueV3(mStream, nullptr);
if (!status)
{
gLogError << "Enqueue failed\n";
diff --git a/demo/BERT/inference.ipynb b/demo/BERT/inference.ipynb
index d015fd72..2882e0b6 100644
--- a/demo/BERT/inference.ipynb
+++ b/demo/BERT/inference.ipynb
@@ -19,7 +19,7 @@
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
- "# =============================================================================="
+ "# ==============================================================================\n"
]
},
{
@@ -99,7 +99,7 @@
"paragraph_text = \"The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of landing a man on the Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, and was supported by the two-man Gemini program which ran concurrently with it from 1962 to 1966. Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets as launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975.\"\n",
"\n",
"# Short paragraph version for BERT models with max sequence length of 128\n",
- "short_paragraph_text = \"The Apollo program was the third United States human spaceflight program. First conceived as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was dedicated to President John F. Kennedy's national goal of landing a man on the Moon. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972 followed by the Apollo-Soyuz Test Project a joint Earth orbit mission with the Soviet Union in 1975.\""
+ "short_paragraph_text = \"The Apollo program was the third United States human spaceflight program. First conceived as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was dedicated to President John F. Kennedy's national goal of landing a man on the Moon. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972 followed by the Apollo-Soyuz Test Project a joint Earth orbit mission with the Soviet Union in 1975.\"\n"
]
},
{
@@ -118,7 +118,7 @@
"question_text = \"What project put the first Americans into space?\"\n",
"#question_text = \"What year did the first manned Apollo flight occur?\"\n",
"#question_text = \"What President is credited with the original notion of putting Americans in space?\"\n",
- "#question_text = \"Who did the U.S. collaborate with on an Earth orbit mission in 1975?\""
+ "#question_text = \"Who did the U.S. collaborate with on an Earth orbit mission in 1975?\"\n"
]
},
{
@@ -200,7 +200,7 @@
"outputs": [],
"source": [
"import tensorrt as trt\n",
- "TRT_LOGGER = trt.Logger(trt.Logger.INFO)"
+ "TRT_LOGGER = trt.Logger(trt.Logger.INFO)\n"
]
},
{
@@ -212,7 +212,7 @@
"import ctypes\n",
"import os\n",
"\n",
- "ctypes.CDLL(\"libnvinfer_plugin.so\", mode=ctypes.RTLD_GLOBAL)"
+ "ctypes.CDLL(\"libnvinfer_plugin.so\", mode=ctypes.RTLD_GLOBAL)\n"
]
},
{
@@ -245,11 +245,12 @@
" # Specify input shapes. These must be within the min/max bounds of the active profile (0th profile in this case)\n",
" # Note that input shapes can be specified on a per-inference basis, but in this case, we only have a single shape.\n",
" for binding in range(3):\n",
- " context.set_binding_shape(binding, input_shape)\n",
+ " tensor_name = engine.get_tensor_name(binding)\n",
+ " context.set_input_shape(tensor_name, input_shape)\n",
" assert context.all_binding_shapes_specified\n",
"\n",
" # Allocate output buffer by querying the size from the context. This may be different for different input shapes.\n",
- " h_output = cuda.pagelocked_empty(tuple(context.get_binding_shape(3)), dtype=np.float32)\n",
+ " h_output = cuda.pagelocked_empty(tuple(context.get_tensor_shape(engine.get_tensor_name(3))), dtype=np.float32)\n",
" d_output = cuda.mem_alloc(h_output.nbytes)\n",
"\n",
" print(\"\\nRunning Inference...\")\n",
@@ -271,8 +272,14 @@
" cuda.memcpy_htod_async(d_inputs[1], segment_ids, stream)\n",
" cuda.memcpy_htod_async(d_inputs[2], input_mask, stream)\n",
"\n",
+ " # Setup tensor address\n",
+ " bindings = [int(d_inputs[i]) for i in range(3)] + [int(d_output)]\n",
+ "\n",
+ " for i in range(engine.num_io_tensors):\n",
+ " context.set_tensor_address(engine.get_tensor_name(i), bindings[i])\n",
+ "\n",
" # Run inference\n",
- " context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output)], stream_handle=stream.handle)\n",
+ " context.execute_async_v3(stream_handle=stream.handle)\n",
" # Synchronize the stream\n",
" stream.synchronize()\n",
" eval_time_elapsed += (time.time() - eval_start_time)\n",
@@ -293,7 +300,7 @@
" \n",
" print(\"-----------------------------\")\n",
" print(\"Running Inference at {:.3f} Sentences/Sec\".format(1.0/eval_time_elapsed))\n",
- " print(\"-----------------------------\")"
+ " print(\"-----------------------------\")\n"
]
},
{
@@ -329,7 +336,7 @@
" for index, output in enumerate(networkOutputs):\n",
" print(\"Processing output\")\n",
" print(\"Answer: '{}'\".format(prediction))\n",
- " print(\"with prob: {:.3f}%\".format(nbest_json[0]['probability'] * 100.0))"
+ " print(\"with prob: {:.3f}%\".format(nbest_json[0]['probability'] * 100.0))\n"
]
}
],
diff --git a/demo/BERT/inference.py b/demo/BERT/inference.py
index 2116de8f..dc172181 100644
--- a/demo/BERT/inference.py
+++ b/demo/BERT/inference.py
@@ -134,34 +134,33 @@ def question_features(tokens, question):
# select engine profile
selected_profile = -1
- num_binding_per_profile = engine.num_bindings // engine.num_optimization_profiles
for idx in range(engine.num_optimization_profiles):
- profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * num_binding_per_profile)
+ profile_shape = engine.get_tensor_profile_shape(name = "input_ids", profile_index = idx)
if profile_shape[0][0] <= args.batch_size and profile_shape[2][0] >= args.batch_size and profile_shape[0][1] <= max_seq_length and profile_shape[2][1] >= max_seq_length:
selected_profile = idx
break
if selected_profile == -1:
raise RuntimeError("Could not find any profile that can run batch size {}.".format(args.batch_size))
- context.active_optimization_profile = selected_profile
- binding_idx_offset = selected_profile * num_binding_per_profile
+ # Create a stream in which to copy inputs/outputs and run inference.
+ stream = cuda.Stream()
+
+ context.set_optimization_profile_async(selected_profile, stream.handle)
+ binding_idx_offset = selected_profile * engine.num_io_tensors
# Specify input shapes. These must be within the min/max bounds of the active profile
# Note that input shapes can be specified on a per-inference basis, but in this case, we only have a single shape.
input_shape = (args.batch_size, max_seq_length)
input_nbytes = trt.volume(input_shape) * trt.int32.itemsize
- for binding in range(3):
- context.set_binding_shape(binding_idx_offset + binding, input_shape)
- assert context.all_binding_shapes_specified
-
- # Create a stream in which to copy inputs/outputs and run inference.
- stream = cuda.Stream()
+ for name in ["input_ids", "segment_ids", "input_mask"]:
+ context.set_input_shape(name, input_shape)
+ assert len(context.infer_shapes()) == 0
# Allocate device memory for inputs.
d_inputs = [cuda.mem_alloc(input_nbytes) for binding in range(3)]
# Allocate output buffer by querying the size from the context. This may be different for different input shapes.
- h_output = cuda.pagelocked_empty(tuple(context.get_binding_shape(binding_idx_offset + 3)), dtype=np.float32)
+ h_output = cuda.pagelocked_empty(tuple(context.get_tensor_shape("logits_out")), dtype=np.float32)
d_output = cuda.mem_alloc(h_output.nbytes)
def inference(features, tokens):
@@ -188,8 +187,14 @@ def inference(features, tokens):
cuda.memcpy_htod_async(d_inputs[1], segment_ids, stream)
cuda.memcpy_htod_async(d_inputs[2], input_mask, stream)
+ bindings = [0 for _ in range(binding_idx_offset)] + [int(d_inp) for d_inp in d_inputs] + [int(d_output)]
+
+ # Set the address of each IO tensor
+ for i in range(engine.num_io_tensors):
+ context.set_tensor_address(engine.get_tensor_name(i), bindings[i + binding_idx_offset])
+
# Run inference
- context.execute_async_v2(bindings=[0 for i in range(binding_idx_offset)] + [int(d_inp) for d_inp in d_inputs] + [int(d_output)], stream_handle=stream.handle)
+ context.execute_async_v3(stream_handle=stream.handle)
# Synchronize the stream
stream.synchronize()
eval_time_elapsed += (time.time() - eval_start_time)
diff --git a/demo/BERT/inference_varseqlen.py b/demo/BERT/inference_varseqlen.py
index 9cd08519..7eb87012 100644
--- a/demo/BERT/inference_varseqlen.py
+++ b/demo/BERT/inference_varseqlen.py
@@ -130,15 +130,14 @@ def question_features(tokens, question):
# for each additional profile needed. Here, we only use batch size 1, thus we only need the first profile.
with open(args.engine, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime, \
runtime.deserialize_cuda_engine(f.read()) as engine, engine.create_execution_context() as context:
+ # Create a stream in which to copy inputs/outputs and run inference.
+ stream = cuda.Stream()
# select engine profile
- context.active_optimization_profile = 0
+ context.set_optimization_profile_async(0, stream.handle)
input_nbytes = max_seq_length * trt.int32.itemsize
- # Create a stream in which to copy inputs/outputs and run inference.
- stream = cuda.Stream()
-
# Allocate device memory for inputs.
d_inputs = [cuda.mem_alloc(input_nbytes) for binding in range(4)]
@@ -164,14 +163,10 @@ def inference(features, tokens):
segment_ids = feature.segment_ids[0:S]
cu_seq_lens = np.array([0, S], dtype=np.int32);
- if context.get_binding_shape(0)[0] != S:
- context.set_binding_shape(0, (S,))
- if context.get_binding_shape(1)[0] != S:
- context.set_binding_shape(1, (S,))
- if context.get_binding_shape(2)[0] != 2:
- context.set_binding_shape(2, (2,))
- if context.get_binding_shape(3)[0] != S:
- context.set_binding_shape(3, (S,))
+ input_dim0_shape = {"input_ids": S, "segment_ids": S, "cu_seqlens": 2, "max_seqlen": S}
+ for name, val in input_dim0_shape.items():
+ if context.get_tensor_shape(name)[0] != val:
+ context.set_input_shape(name, (val,))
h_input_ids = cuda.register_host_memory(np.ascontiguousarray(input_ids.ravel()))
h_segment_ids = cuda.register_host_memory(np.ascontiguousarray(segment_ids.ravel()))
@@ -182,8 +177,14 @@ def inference(features, tokens):
cuda.memcpy_htod_async(d_inputs[1], h_segment_ids, stream)
cuda.memcpy_htod_async(d_inputs[2], h_cu_seq_lens, stream)
+ # Setup tensor address
+ bindings = [int(d_inputs[i]) for i in range(4)] + [int(d_output)]
+
+ for i in range(engine.num_io_tensors):
+ context.set_tensor_address(engine.get_tensor_name(i), bindings[i])
+
# Run inference
- context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output)], stream_handle=stream.handle)
+ context.execute_async_v3(stream_handle=stream.handle)
# Synchronize the stream
stream.synchronize()
eval_time_elapsed += (time.time() - eval_start_time)
diff --git a/demo/BERT/notebooks/Q-and-A.ipynb b/demo/BERT/notebooks/Q-and-A.ipynb
index c262a9cb..9c82199a 100755
--- a/demo/BERT/notebooks/Q-and-A.ipynb
+++ b/demo/BERT/notebooks/Q-and-A.ipynb
@@ -20,7 +20,7 @@
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
- "# =============================================================================="
+ "# ==============================================================================\n"
]
},
{
@@ -124,8 +124,14 @@
" cuda.memcpy_htod_async(d_inputs[1], segment_ids, stream)\n",
" cuda.memcpy_htod_async(d_inputs[2], input_mask, stream)\n",
"\n",
+ " # Setup tensor address\n",
+ " bindings = [int(d_inputs[i]) for i in range(3)] + [int(d_output)]\n",
+ "\n",
+ " for i in range(engine.num_io_tensors):\n",
+ " context.set_tensor_address(engine.get_tensor_name(i), bindings[i])\n",
+ "\n",
" # Run inference\n",
- " trt_context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output)], stream_handle=stream.handle)\n",
+ " trt_context.execute_async_v3(stream_handle=stream.handle)\n",
" # Synchronize the stream\n",
" stream.synchronize()\n",
" eval_time_elapsed += (time.time() - eval_start_time)\n",
@@ -172,16 +178,21 @@
" S = np.sum(feature.input_mask)\n",
" input_ids = feature.input_ids[0:S]\n",
" segment_ids = feature.segment_ids[0:S]\n",
- " cu_seq_lens = np.array([0, S], dtype=np.int32);\n",
- "\n",
- " if context.get_binding_shape(0)[0] != S:\n",
- " context.set_binding_shape(0, (S,))\n",
- " if context.get_binding_shape(1)[0] != S:\n",
- " context.set_binding_shape(1, (S,))\n",
- " if context.get_binding_shape(2)[0] != 2:\n",
- " context.set_binding_shape(2, (2,))\n",
- " if context.get_binding_shape(3)[0] != S:\n",
- " context.set_binding_shape(3, (S,))\n",
+ " cu_seq_lens = np.array([0, S], dtype=np.int32)\n",
+ "\n",
+ " first_tensor_name = engine.get_tensor_name(0)\n",
+ " second_tensor_name = engine.get_tensor_name(1)\n",
+ " third_tensor_name = engine.get_tensor_name(2)\n",
+ " fourth_tensor_name = engine.get_tensor_name(3)\n",
+ "\n",
+ " if context.get_tensor_shape(first_tensor_name)[0] != S:\n",
+ " context.set_input_shape(first_tensor_name, (S,))\n",
+ " if context.get_tensor_shape(second_tensor_name)[0] != S:\n",
+ " context.set_input_shape(second_tensor_name, (S,))\n",
+ " if context.get_tensor_shape(third_tensor_name)[0] != 2:\n",
+ " context.set_input_shape(third_tensor_name, (2,))\n",
+ " if context.get_tensor_shape(fourth_tensor_name)[0] != S:\n",
+ " context.set_input_shape(fourth_tensor_name, (S,))\n",
"\n",
" h_input_ids = cuda.register_host_memory(np.ascontiguousarray(input_ids.ravel()))\n",
" h_segment_ids = cuda.register_host_memory(np.ascontiguousarray(segment_ids.ravel()))\n",
@@ -192,8 +203,14 @@
" cuda.memcpy_htod_async(d_inputs[1], h_segment_ids, INT8_stream)\n",
" cuda.memcpy_htod_async(d_inputs[2], h_cu_seq_lens, INT8_stream)\n",
"\n",
+ " # Setup tensor address\n",
+ " bindings = [int(d_inputs[i]) for i in range(3)] + [int(d_output)]\n",
+ "\n",
+ " for i in range(engine.num_io_tensors):\n",
+ " context.set_tensor_address(engine.get_tensor_name(i), bindings[i])\n",
+ "\n",
" # Run inference\n",
- " trt_context.execute_async_v2(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output)], stream_handle=INT8_stream.handle)\n",
+ " trt_context.execute_async_v3(stream_handle=INT8_stream.handle)\n",
" # Synchronize the stream\n",
" INT8_stream.synchronize()\n",
" eval_time_elapsed += (time.time() - eval_start_time)\n",
@@ -256,11 +273,12 @@
"# Specify input shapes. These must be within the min/max bounds of the active profile (0th profile in this case)\n",
"# Note that input shapes can be specified on a per-inference basis, but in this case, we only have a single shape.\n",
"for binding in range(3):\n",
- " context.set_binding_shape(binding, input_shape)\n",
+ " tensor_name = engine.get_tensor_name(binding)\n",
+ " context.set_input_shape(tensor_name, input_shape)\n",
"assert context.all_binding_shapes_specified\n",
"\n",
"# Allocate output buffer by querying the size from the context. This may be different for different input shapes.\n",
- "h_output = cuda.pagelocked_empty(tuple(context.get_binding_shape(3)), dtype=np.float32)\n",
+ "h_output = cuda.pagelocked_empty(tuple(context.get_tensor_shape(engine.get_tensor_name(3))), dtype=np.float32)\n",
"d_output = cuda.mem_alloc(h_output.nbytes)\n",
"\n",
"# Create a stream in which to copy inputs/outputs and run inference.\n",
@@ -275,7 +293,7 @@
"INT8_context = INT8_engine.create_execution_context()\n",
"\n",
"# select engine profile\n",
- "INT8_context.active_optimization_profile = 0\n",
+ "INT8_context.set_optimization_profile_async(0, stream.handle)\n",
"\n",
"input_nbytes = max_seq_length * trt.int32.itemsize\n",
"\n",
@@ -287,7 +305,7 @@
"INT8_d_output = cuda.mem_alloc(INT8_h_output.nbytes)\n",
"\n",
"# Create a stream in which to copy inputs/outputs and run inference.\n",
- "INT8_stream = cuda.Stream()"
+ "INT8_stream = cuda.Stream()\n"
]
},
{
@@ -412,7 +430,7 @@
" orientation='horizontal', \n",
" layout=widgets.Layout(width='100%', height='50px')\n",
")\n",
- "display(progress_bar)"
+ "display(progress_bar)\n"
]
},
{
diff --git a/demo/BERT/notebooks/benchmark.ipynb b/demo/BERT/notebooks/benchmark.ipynb
index 69666732..d09ec429 100755
--- a/demo/BERT/notebooks/benchmark.ipynb
+++ b/demo/BERT/notebooks/benchmark.ipynb
@@ -20,7 +20,7 @@
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
- "# =============================================================================="
+ "# ==============================================================================\n"
]
},
{
@@ -143,32 +143,35 @@
" cuda.memcpy_htod(buffers[2].buf, test_cu_seq_lens.ravel())\n",
"\n",
" bench_times = {}\n",
+ " stream = cuda.Stream()\n",
"\n",
" for idx, batch_size in enumerate(sorted(args.batch_size)):\n",
- " num_binding_per_profile = engine.num_bindings // engine.num_optimization_profiles\n",
" for idx in range(engine.num_optimization_profiles):\n",
- " profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * num_binding_per_profile)\n",
+ " profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * engine.num_io_tensors)\n",
" if profile_shape[0][0] <= batch_size and profile_shape[2][0] >= batch_size:\n",
- " context.active_optimization_profile = idx\n",
- " binding_idx_offset = idx * num_binding_per_profile\n",
+ " context.set_optimization_profile_async(idx, stream.handle)\n",
+ " binding_idx_offset = idx * engine.num_io_tensors\n",
" break\n",
"\n",
" # Each profile has unique bindings\n",
" bindings = [0] * binding_idx_offset + [buf.binding() for buf in buffers]\n",
" input_shape = (batch_size, args.sequence_length)\n",
" for binding in range(3):\n",
- " context.set_binding_shape(binding_idx_offset + binding, input_shape)\n",
+ " tensor_name = engine.get_tensor_name(binding)\n",
+ " context.set_input_shape(tensor_name, input_shape)\n",
" assert context.all_binding_shapes_specified\n",
"\n",
+ " for i in range(engine.num_io_tensors):\n",
+ " context.set_tensor_address(engine.get_tensor_name(i), bindings[i + binding_idx_offset])\n",
+ "\n",
" # Inference\n",
" total_time = 0\n",
" start = cuda.Event()\n",
" end = cuda.Event()\n",
- " stream = cuda.Stream()\n",
"\n",
" # Warmup\n",
" for _ in range(args.warm_up_runs):\n",
- " context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n",
+ " context.execute_async_v3(stream_handle=stream.handle)\n",
" stream.synchronize()\n",
"\n",
" # Timing loop\n",
@@ -176,7 +179,7 @@
" progress_bar.value = 0\n",
" for _ in range(iteration_selector.value):\n",
" start.record(stream)\n",
- " context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n",
+ " context.execute_async_v3(stream_handle=stream.handle)\n",
" end.record(stream)\n",
" stream.synchronize()\n",
" times.append(end.time_since(start))\n",
@@ -227,26 +230,28 @@
" cuda.memcpy_htod(buffers[1].buf, test_segment_ids.ravel())\n",
" cuda.memcpy_htod(buffers[2].buf, test_input_mask.ravel())\n",
"\n",
- " num_binding_per_profile = engine.num_bindings // engine.num_optimization_profiles\n",
- "\n",
" bench_times = {}\n",
+ " stream = cuda.Stream()\n",
"\n",
" for idx, batch_size in enumerate(sorted(args.batch_size)):\n",
- " num_binding_per_profile = engine.num_bindings // engine.num_optimization_profiles\n",
" for idx in range(engine.num_optimization_profiles):\n",
- " profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * num_binding_per_profile)\n",
+ " profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * engine.num_io_tensors)\n",
" if profile_shape[0][0] <= batch_size and profile_shape[2][0] >= batch_size:\n",
- " context.active_optimization_profile = idx\n",
- " binding_idx_offset = idx * num_binding_per_profile\n",
+ " context.set_optimization_profile_async(idx, stream.handle)\n",
+ " binding_idx_offset = idx * engine.num_io_tensors\n",
" break\n",
"\n",
" # Each profile has unique bindings\n",
" bindings = [0] * binding_idx_offset + [buf.binding() for buf in buffers]\n",
" input_shape = (batch_size, args.sequence_length)\n",
" for binding in range(3):\n",
- " context.set_binding_shape(binding_idx_offset + binding, input_shape)\n",
+ " tensor_name = engine.get_tensor_name(binding)\n",
+ " context.set_input_shape(tensor_name, input_shape)\n",
" assert context.all_binding_shapes_specified\n",
"\n",
+ " for i in range(engine.num_io_tensors):\n",
+ " context.set_tensor_address(engine.get_tensor_name(i), bindings[i + binding_idx_offset])\n",
+ "\n",
" # Inference\n",
" total_time = 0\n",
" start = cuda.Event()\n",
@@ -255,7 +260,7 @@
"\n",
" # Warmup\n",
" for _ in range(args.warm_up_runs):\n",
- " context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n",
+ " context.execute_async_v3(stream_handle=stream.handle)\n",
" stream.synchronize()\n",
"\n",
" # Timing loop\n",
@@ -263,7 +268,7 @@
" progress_bar.value = 0\n",
" for _ in range(iteration_selector.value):\n",
" start.record(stream)\n",
- " context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)\n",
+ " context.execute_async_v3(stream_handle=stream.handle)\n",
" end.record(stream)\n",
" stream.synchronize()\n",
" times.append(end.time_since(start))\n",
diff --git a/demo/BERT/perf.py b/demo/BERT/perf.py
index 5943b41b..7b4e9da9 100644
--- a/demo/BERT/perf.py
+++ b/demo/BERT/perf.py
@@ -77,8 +77,6 @@ def main():
cuda.memcpy_htod(buffers[1].buf, test_segment_ids.ravel())
cuda.memcpy_htod(buffers[2].buf, test_input_mask.ravel())
- num_binding_per_profile = engine.num_bindings // engine.num_optimization_profiles
-
bench_times = {}
stream = cuda.Stream()
@@ -86,7 +84,7 @@ def main():
# Select engine profile
selected_profile = -1
for idx in range(engine.num_optimization_profiles):
- profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * num_binding_per_profile)
+ profile_shape = engine.get_tensor_profile_shape(name = "input_ids", profile_index = idx)
if profile_shape[0][0] <= batch_size and profile_shape[2][0] >= batch_size and profile_shape[0][1] <= args.sequence_length and profile_shape[2][1] >= args.sequence_length:
selected_profile = idx
break
@@ -95,18 +93,16 @@ def main():
context.set_optimization_profile_async(selected_profile, stream.handle)
# Each profile has unique bindings
- binding_idx_offset = selected_profile * num_binding_per_profile
+ binding_idx_offset = selected_profile * engine.num_io_tensors
bindings = [0] * binding_idx_offset + [buf.binding() for buf in buffers]
- shapes = {
- "input_ids": (batch_size, args.sequence_length),
- "segment_ids": (batch_size, args.sequence_length),
- "input_mask": (batch_size, args.sequence_length),
- }
+ input_shape = (batch_size, args.sequence_length)
+ for name in ["input_ids", "segment_ids", "input_mask"]:
+ context.set_input_shape(name, input_shape)
+ assert len(context.infer_shapes()) == 0
- for binding, shape in shapes.items():
- context.set_binding_shape(engine[binding] + binding_idx_offset, shape)
- assert context.all_binding_shapes_specified
+ for i in range(engine.num_io_tensors):
+ context.set_tensor_address(engine.get_tensor_name(i), bindings[i + binding_idx_offset])
# Inference
total_time = 0
@@ -115,7 +111,7 @@ def main():
# Warmup
for _ in range(args.warm_up_runs):
- context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
+ context.execute_async_v3(stream_handle=stream.handle)
stream.synchronize()
# Timing loop
@@ -124,7 +120,7 @@ def main():
start_time = time.time()
while actual_iterations < args.iterations or (time.time() - start_time) < args.duration:
start.record(stream)
- context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
+ context.execute_async_v3(stream_handle=stream.handle)
end.record(stream)
stream.synchronize()
times.append(end.time_since(start))
diff --git a/demo/BERT/perf_varseqlen.py b/demo/BERT/perf_varseqlen.py
index a1680797..853201a4 100644
--- a/demo/BERT/perf_varseqlen.py
+++ b/demo/BERT/perf_varseqlen.py
@@ -81,7 +81,8 @@ def main():
bench_times = {}
for idx, batch_size in enumerate(sorted(args.batch_size)):
- context.active_optimization_profile = 0
+ stream = cuda.Stream()
+ context.set_optimization_profile_async(0, stream.handle)
# Each profile has unique bindings
bindings = [buf.binding() for buf in buffers]
@@ -94,18 +95,20 @@ def main():
}
for binding, shape in shapes.items():
- context.set_binding_shape(engine[binding], shape)
- assert context.all_binding_shapes_specified
+ context.set_input_shape(binding, shape)
+ assert len(context.infer_shapes()) == 0
+
+ for i in range(engine.num_io_tensors):
+ context.set_tensor_address(engine.get_tensor_name(i), bindings[i])
# Inference
total_time = 0
start = cuda.Event()
end = cuda.Event()
- stream = cuda.Stream()
# Warmup
for _ in range(args.warm_up_runs):
- context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
+ context.execute_async_v3(stream_handle=stream.handle)
stream.synchronize()
# Timing loop
@@ -114,7 +117,7 @@ def main():
start_time = time.time()
while actual_iterations < args.iterations or (time.time() - start_time) < args.duration:
start.record(stream)
- context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
+ context.execute_async_v3(stream_handle=stream.handle)
end.record(stream)
stream.synchronize()
times.append(end.time_since(start))
diff --git a/demo/DeBERTa/deberta_tensorrt_inference.py b/demo/DeBERTa/deberta_tensorrt_inference.py
index 6a579a1c..378a5953 100644
--- a/demo/DeBERTa/deberta_tensorrt_inference.py
+++ b/demo/DeBERTa/deberta_tensorrt_inference.py
@@ -169,9 +169,10 @@ def allocate_buffers(self, engine):
bindings = []
stream = cuda.Stream()
- for binding in engine: # binding is the name of input/output
- size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
- dtype = trt.nptype(engine.get_binding_dtype(binding))
+ for i in range(engine.num_io_tensors):
+ tensor_name = engine.get_tensor_name(i)
+ size = trt.volume(engine.get_tensor_shape(tensor_name))
+ dtype = trt.nptype(engine.get_tensor_dtype(tensor_name))
# Allocate host and device buffers
host_mem = cuda.pagelocked_empty(size, dtype) # page-locked memory buffer (won't swapped to disk)
@@ -181,7 +182,7 @@ def allocate_buffers(self, engine):
bindings.append(int(device_mem))
# Append to the appropriate input/output list.
- if engine.binding_is_input(binding):
+ if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
inputs.append(self.HostDeviceMem(host_mem, device_mem))
else:
outputs.append(self.HostDeviceMem(host_mem, device_mem))
@@ -212,8 +213,8 @@ def __call__(self, model_inputs: list, timing=False):
batch_size = batch_size[0]
for i, model_input in enumerate(model_inputs):
- binding_name = self.engine[i] # i-th input/output name
- binding_dtype = trt.nptype(self.engine.get_binding_dtype(binding_name)) # trt can only tell to numpy dtype
+ binding_name = self.engine.get_tensor_name(i) # i-th input/output name
+ binding_dtype = trt.nptype(self.engine.get_tensor_dtype(binding_name)) # trt can only tell to numpy dtype
# input type cast
if NUMPY:
@@ -238,6 +239,9 @@ def __call__(self, model_inputs: list, timing=False):
# input, Host to Device
[cuda.memcpy_htod_async(inp.device, inp.host, self.stream) for inp in self.inputs]
+ for i in range(self.engine.num_io_tensors):
+ self.context.set_tensor_address(self.engine.get_tensor_name(i), self.bindings[i])
+
duration = 0
if timing:
start_time = time()
@@ -246,7 +250,7 @@ def __call__(self, model_inputs: list, timing=False):
duration = end_time - start_time
else:
# run inference
- self.context.execute_async_v2(bindings=self.bindings, stream_handle=self.stream.handle) # v2 no need for batch_size arg
+ self.context.execute_async_v3(stream_handle=self.stream.handle)
if timing:
[cuda.memcpy_dtoh(out.host, out.device) for out in self.outputs]
@@ -277,7 +281,10 @@ def build_engine():
print(f'Building {precision} engine of {MODEL_NAME} model on {gpu_name} GPU...')
## parse ONNX model
- network = TRT_BUILDER.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
+ network_creation_flag = 0
+ if "EXPLICIT_BATCH" in trt.NetworkDefinitionCreationFlag.__members__.keys():
+ network_creation_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
+ network = TRT_BUILDER.create_network(network_creation_flag)
onnx_parser = trt.OnnxParser(network, TRT_LOGGER)
parse_success = onnx_parser.parse_from_file(ONNX_MODEL)
for idx in range(onnx_parser.num_errors):
@@ -296,11 +303,7 @@ def build_engine():
profile.set_shape("input_ids", (1,seq_len), (1,seq_len), (1,seq_len))
profile.set_shape("attention_mask", (1,seq_len), (1,seq_len), (1,seq_len))
config.add_optimization_profile(profile)
-
- if TRT_VERSION >= 84:
- config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4096 * (1 << 20)) # 4096 MiB, syntax after TRT 8.4
- else:
- config.max_workspace_size = 4096 * (1 << 20) # syntax before TRT 8.4
+ config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4096 * (1 << 20)) # 4096 MiB
# precision
if precision == 'fp32':
@@ -329,7 +332,7 @@ def test_engine():
## pseudo-random input test
batch_size = 1
- seq_len = model.engine.get_binding_shape(0)[1]
+ seq_len = model.engine.get_tensor_shape(model.engine.get_tensor_name(0))[1]
vocab = 128203
gpu = torch.device('cuda')
torch.manual_seed(0) # make sure in each test the seed are the same
@@ -362,7 +365,7 @@ def correctness_check_engines():
## pseudo-random input test
batch_size = 1
- seq_len = model1.engine.get_binding_shape(0)[1]
+ seq_len = model1.engine.get_tensor_shape(model1.engine.get_tensor_name(0))[1]
vocab = 128203
gpu = torch.device('cuda')
# torch.manual_seed(0) # make sure in each test the seed are the same
diff --git a/demo/Diffusion/README.md b/demo/Diffusion/README.md
index 4b9ca625..d550c83b 100644
--- a/demo/Diffusion/README.md
+++ b/demo/Diffusion/README.md
@@ -1,32 +1,34 @@
# Introduction
-This demo application ("demoDiffusion") showcases the acceleration of Stable Diffusion pipeline using TensorRT.
+This demo application ("demoDiffusion") showcases the acceleration of the Stable Diffusion and ControlNet pipelines using TensorRT.
# Setup
### Clone the TensorRT OSS repository
```bash
-git clone git@github.com:NVIDIA/TensorRT.git -b release/8.6 --single-branch
+git clone git@github.com:NVIDIA/TensorRT.git -b release/10.0 --single-branch
cd TensorRT
```
-### Launch TensorRT NGC container
+### Launch the NVIDIA PyTorch container
Install nvidia-docker using [these instructions](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker).
```bash
-docker run --rm -it --gpus all -v $PWD:/workspace nvcr.io/nvidia/pytorch:23.02-py3 /bin/bash
+docker run --rm -it --gpus all -v $PWD:/workspace nvcr.io/nvidia/pytorch:24.01-py3 /bin/bash
```
### Install latest TensorRT release
```bash
python3 -m pip install --upgrade pip
-python3 -m pip install --upgrade tensorrt
+python3 -m pip install --pre --upgrade --extra-index-url https://pypi.nvidia.com tensorrt
```
-Minimum required version is TensorRT 8.6.0. Check your installed version using:
+> NOTE: TensorRT 10.x is only available as a pre-release
+
+Check your installed version using:
`python3 -c 'import tensorrt;print(tensorrt.__version__)'`
> NOTE: Alternatively, you can download and install TensorRT packages from [NVIDIA TensorRT Developer Zone](https://developer.nvidia.com/tensorrt).
@@ -38,21 +40,21 @@ export TRT_OSSPATH=/workspace
cd $TRT_OSSPATH/demo/Diffusion
pip3 install -r requirements.txt
-# Create output directories
-mkdir -p onnx engine output
```
> NOTE: demoDiffusion has been tested on systems with NVIDIA A100, RTX3090, and RTX4090 GPUs, and the following software configuration.
```
-diffusers 0.14.0
-onnx 1.13.1
-onnx-graphsurgeon 0.3.26
-onnxruntime 1.14.1
-polygraphy 0.47.1
-tensorrt 8.6.1.6
-tokenizers 0.13.2
-torch 1.13.0
-transformers 4.26.1
+diffusers 0.26.3
+onnx 1.15.0
+onnx-graphsurgeon 0.3.27
+onnxruntime 1.17.0
+polygraphy 0.49.7
+tensorrt 10.0.0.6
+tokenizers 0.13.3
+torch 2.1.0
+transformers 4.31.0
+controlnet-aux 0.0.6
+nvidia-ammo 0.7.0
```
> NOTE: optionally install HuggingFace [accelerate](https://pypi.org/project/accelerate/) package for faster and less memory-intense model loading.
@@ -66,43 +68,104 @@ transformers 4.26.1
python3 demo_txt2img.py --help
python3 demo_img2img.py --help
python3 demo_inpaint.py --help
+python3 demo_controlnet.py --help
+python3 demo_txt2img_xl.py --help
```
### HuggingFace user access token
-To download the model checkpoints for the Stable Diffusion pipeline, you will need a `read` access token. See [instructions](https://huggingface.co/docs/hub/security-tokens).
+To download model checkpoints for the Stable Diffusion pipelines, obtain a `read` access token to HuggingFace Hub. See [instructions](https://huggingface.co/docs/hub/security-tokens).
```bash
export HF_TOKEN=
```
-### Generate an image guided by a single text prompt
+### Generate an image guided by a text prompt
+
+```bash
+python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN
+```
+
+### Generate an image guided by an initial image and a text prompt
+
+```bash
+wget https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg -O sketch-mountains-input.jpg
+
+python3 demo_img2img.py "A fantasy landscape, trending on artstation" --hf-token=$HF_TOKEN --input-image=sketch-mountains-input.jpg
+```
+
+### Generate an inpainted image guided by an image, mask and a text prompt
```bash
-python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN -v
+wget https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png -O dog-on-bench.png
+wget https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png -O dog-mask.png
+
+python3 demo_inpaint.py "a mecha robot sitting on a bench" --hf-token=$HF_TOKEN --input-image=dog-on-bench.png --mask-image=dog-mask.png
```
-### Generate an image guided by an image and single text prompt
+> NOTE: inpainting is only supported in versions `1.5` and `2.0`.
+
+### Generate an image with ControlNet guided by image(s) and text prompt(s)
+
+```bash
+python3 demo_controlnet.py "Stormtrooper's lecture in beautiful lecture hall" --controlnet-type depth --hf-token=$HF_TOKEN --denoising-steps 20 --onnx-dir=onnx-cnet-depth --engine-dir=engine-cnet-depth
+```
+
+> NOTE: `--input-image` must be a pre-processed image corresponding to `--controlnet-type`. If unspecified, a sample image will be downloaded. Supported controlnet types include: `canny`, `depth`, `hed`, `mlsd`, `normal`, `openpose`, `scribble`, and `seg`.
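+
+For instance, a hypothetical invocation that supplies a user-prepared depth map (`depth-map.png` is a placeholder for your own pre-processed image):
+
+```bash
+python3 demo_controlnet.py "Stormtrooper's lecture in beautiful lecture hall" --controlnet-type depth --input-image depth-map.png --hf-token=$HF_TOKEN --onnx-dir=onnx-cnet-depth --engine-dir=engine-cnet-depth
+```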
+
+Examples:
+
+
+#### Combining multiple conditionings
+
+Multiple ControlNet types can also be specified to combine the conditionings. When specifying multiple conditionings, ControlNet scales must also be provided; the scales signify the relative importance of each conditioning. For example, to condition using `openpose` and `canny` with scales of 1.0 and 0.8 respectively, pass `--controlnet-type openpose canny` and `--controlnet-scale 1.0 0.8`. Note that the number of ControlNet scales provided must match the number of ControlNet types, as in the command sketched below.
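+
+A sketch of such a combined invocation (the prompt and the output directory names here are illustrative placeholders):
+
+```bash
+python3 demo_controlnet.py "Stormtrooper's lecture in beautiful lecture hall" --controlnet-type openpose canny --controlnet-scale 1.0 0.8 --hf-token=$HF_TOKEN --onnx-dir=onnx-cnet-multi --engine-dir=engine-cnet-multi
+```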
+
+
+### Generate an image with Stable Diffusion XL guided by a single text prompt
+
+Run the command below to generate an image with Stable Diffusion XL:
```bash
-python3 demo_img2img.py "photorealistic new zealand hills" --hf-token=$HF_TOKEN -v
+python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --hf-token=$HF_TOKEN --version=xl-1.0
```
-Use `--input-image=` to specify your image. Otherwise the example image will be downloaded from the Internet.
+The optional refiner model may be enabled by specifying `--enable-refiner`, along with separate directories for storing the refiner ONNX and engine files using `--onnx-refiner-dir` and `--engine-refiner-dir` respectively.
-### Generate an inpainted image guided by an image, mask and single text prompt
+```bash
+python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --hf-token=$HF_TOKEN --version=xl-1.0 --enable-refiner --onnx-refiner-dir=onnx-refiner --engine-refiner-dir=engine-refiner
+```
+
+### Generate an image guided by a text prompt using specified LoRA model weight updates
```bash
-# Create separate onnx/engine directories when switching versions
-mkdir -p onnx-1.5 engine-1.5
+python3 demo_txt2img_xl.py "Picture of a rustic Italian village with Olive trees and mountains" --version=xl-1.0 --lora-path "ostris/crayon_style_lora_sdxl" "ostris/watercolor_style_lora_sdxl" --lora-scale 0.3 0.7 --onnx-dir onnx-sdxl-lora --engine-dir engine-sdxl-lora --build-enable-refit
+```
+
+### Faster Text-to-image using SDXL & INT8 quantization using AMMO
-python3 demo_inpaint.py "a mecha robot sitting on a bench" --hf-token=$HF_TOKEN --version=1.5 --onnx-dir=onnx-1.5 --engine-dir=engine-1.5 -v
+```bash
+python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --int8 --quantization-level 3
```
-Use `--input-image=` and `--mask-image=` to specify your inputs. They must have the same dimensions. Otherwise the example image and mask will be downloaded from the Internet.
+Note that the calibration process can be quite time-consuming, and will be repeated if `--quantization-level`, `--denoising-steps`, or `--onnx-dir` is changed.
-### Input arguments
-- One can set schdeuler using `--scheduler=EulerA`. Note that some schedulers are not available for some pipelines or version.
-- To accelerate engine building time one can use `--timing-cache=`. This cache file will be created if does not exist. Note, that it may influence the performance if the cache file created on the other hardware is used. It is suggested to use this flag only during development. To achieve the best perfromance during deployment, please, build engines without timing cache.
-- To switch between versions or pipelines one needs either to clear onnx and engine dirs, or to specify `--force-onnx-export --force-onnx-optimize --force-engine-build` or to create new dirs and to specify `--onnx-dir= --engine-dir=`.
+### Faster Text-to-Image using SDXL + LCM (Latent Consistency Model) LoRA weights
+[LCM-LoRA](https://arxiv.org/abs/2311.05556) produces good-quality images in 4 to 8 denoising steps instead of the 30+ needed by the base model. Note that we use the LCM scheduler and disable classifier-free guidance by setting `--guidance-scale` to 0.
+LoRA weights are fused into the ONNX model and the finalized TensorRT plan files in this example.
+```bash
+python3 demo_txt2img_xl.py "Einstein" --version xl-1.0 --lora-path "latent-consistency/lcm-lora-sdxl" --lora-scale 1.0 --onnx-dir onnx-sdxl-lcm-nocfg --engine-dir engine-sdxl-lcm-nocfg --denoising-steps 4 --scheduler LCM --guidance-scale 0.0
+```
+### Faster Text-to-Image using SDXL Turbo
+SDXL Turbo generates images even faster than LCM, producing coherent images in just one step. Note: SDXL Turbo works best at 512x512 resolution with the EulerA scheduler and classifier-free guidance disabled.
+```bash
+python3 demo_txt2img_xl.py "Einstein" --version xl-turbo --onnx-dir onnx-sdxl-turbo --engine-dir engine-sdxl-turbo --denoising-steps 1 --scheduler EulerA --guidance-scale 0.0 --width 512 --height 512
+```
+
+## Configuration options
+- The noise scheduler can be set using `--scheduler <scheduler>`. Note: not all schedulers are available for every version.
+- To reduce engine build time, use `--timing-cache <filename>`. The cache file will be created if it does not already exist. Note that performance may degrade if cache files are reused across different GPU targets, so it is recommended to use timing caches only during development. To achieve the best performance in deployment, build engines without a timing cache.
+- Specify new directories for storing ONNX and engine files when switching between versions, LoRAs, ControlNets, etc. This can be done using `--onnx-dir <dir>` and `--engine-dir <dir>` (a combined example follows this list).
- Inference performance can be improved by enabling [CUDA graphs](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs) using `--use-cuda-graph`. Enabling CUDA graphs requires fixed input shapes, so this flag must be combined with `--build-static-batch` and cannot be combined with `--build-dynamic-shape`.
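+
+As a sketch of how these options compose (the cache filename and directory names below are illustrative placeholders):
+
+```bash
+python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --scheduler EulerA --timing-cache timing.cache --onnx-dir onnx-sd --engine-dir engine-sd --build-static-batch --use-cuda-graph
+```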
+
+
+
diff --git a/demo/Diffusion/calibration-prompts.txt b/demo/Diffusion/calibration-prompts.txt
new file mode 100644
index 00000000..8b224e6d
--- /dev/null
+++ b/demo/Diffusion/calibration-prompts.txt
@@ -0,0 +1,1079 @@
+Portrait shot of a woman, yellow shirt, photograph
+Little girl holding a teddy bear, in the middle of nowhere, photograph
+Portrait of an arctic fox in the tundra, light teal and amber, minimalist, photograph
+Confused woman, sci - fi, future, blue glow color, orange, hologram, photograph
+Symmetrical, macro shot, crying womans face, half of face is organic flowing RGB low poly, depth of field
+Beautiful woman future funk psychedelic
+Mosaic of a colorful mushroom with intricate patterns, vibrant and detailed, sharp, mosaic background, vector art
+Illustration of a man in red hoodie, minimalist, graphic design poster art, dark cyan and sky - blue, honeycore
+a bottle of perfume on a clean backdrop, surrounded by fragrant white flowers, product photography, minimalistic, natural light
+a bedroom with large windows and modern furniture, gray and gold, luxurious, mid century modern style
+an aerial drone shot of the breathtaking landscape of the Bora Bora islands, with sparkling waters under the sun
+extreme closeup shot of an old man with a long gray hair and head covered in wrinkles; focused expression looking at camera
+Simple flat vector illustration of a woman sitting at the desk with her laptop with a puppy, isolated on white background
+Chibi pixel art, game asset for an rpg game on a white background featuring the armor of a dragon sorcerer wielding the power of fire surrounded by a matching item set
+a macro wildlife photo of a green frog in a rainforest pond, highly detailed, eye-level shot
+kid's coloring book, a happy young girl holding a flower, cartoon, thick lines, black and white, white background
+Golden-haired elementary school white boy hugging his black-hair Taiwanese buddy face-to-face on dusk street, unreal engine, greg rutkowski, loish, rhads, beeple, makoto shinkai and lois van baarle, ilya kuvshinov, rossdraws, tom bagshaw, alphonse mucha, global illumination, detailed and intricate environment
+Tan skin Anime boy wearing a large black sweater and cat ear beanie with brown hair and eyes, full body, baggy cargo pants, full body, reference
+Fawn French Bulldog with big eyes, short legs, and chunky, stocky body eating food
+A white goose holding a paint brush
+Black, African descent, looks Japanese, wears glasses, Naruto type art, bandage on his nose, male, Anime 2D art, lazy eyes, Japanese earring in one ear, no beard, smiles sinisterly
+Male cow fursona wearing a red beanie
+a beautiful hyper-realistic anime Lofi, painted by greg rutkowski makoto shinkai takashi takeuchi studio ghibli, akihiko yoshida, anime, clean soft lighting, finely detailed features, high-resolution, perfect art, stunning atmosphere, trending on pixiv fanbox
+a woman with a beautiful face is enjoying a summer festival wearing a kimono, long white hair, looks like an older sister with a small body, is holding a traditional Japanese umbrella with a faint smile, her head is facing backwards as if inviting her to play and she is running with her arms behind her, there is also a lock of patterned hair flower
+A stunning photograph of a serene mountain lake at sunrise, with crystal-clear reflections and soft pastel skies
+A high-resolution image of an ancient oak tree in a lush forest, sunlight filtering through the leaves
+An ultra-realistic photograph of the Milky Way galaxy seen from a remote desert, under clear skies
+A detailed image of a colorful street market in Marrakech at golden hour, with vibrant fabrics and bustling crowds
+A professional photograph of a majestic bald eagle in flight, with a crisp focus on its sharp eyes and detailed feathers
+A perfect image of a charming cobblestone street in Prague, with historical buildings and a peaceful early morning atmosphere
+A photo-realistic image of a modern city skyline at night, with shimmering lights and reflections on a river
+An authentic-looking photograph of the Northern Lights over a snowy Lapland landscape, with vivid colors and clear stars
+A high-quality image of a vintage 1950s diner, with classic cars parked outside and a sunset backdrop
+An elegant photograph of a grand ballroom from the Victorian era, with ornate decorations and a grand chandelier
+A striking photograph of a powerful thunderstorm over the ocean, with dramatic lightning strikes and rolling waves
+An image of a peaceful Zen garden with smooth stones, raked sand, and a calming waterfall
+A high-resolution photograph of a seasoned fisherman at dawn, casting a net into the sea, with the golden light reflecting off the water
+A professional close-up shot of a woman's face, half-illuminated by the sunset, showcasing a detailed texture of her skin and a contemplative expression
+An image capturing a street dancer in mid-air during a dynamic breakdance move, with urban graffiti in the background
+A vibrant photograph of a group of people dressed in traditional attire at a cultural festival, dancing in a blur of colors and fabrics
+A cinematic-style photograph of a lone astronaut in a spacesuit, standing on a rocky alien landscape with Earth visible in the sky above
+Capture the quiet intensity in the eyes of a chess grandmaster poised over the board in a high-stakes match
+Close-up: A young girl's freckled face, focused and thoughtful, as she reads a book under the shade of an old tree
+Underwater photography of a diver among swirling schools of fish, light filtering down from above
+Evening falls on a city street musician, his guitar casting long shadows as he strums for the passing crowd
+High above the city, a construction worker perches on a steel beam, with a backdrop of the skyline stretching into the distance
+Document the intense expression of a potter as they shape a clay vessel, hands and wheel both a blur of motion
+A street portrait captures the weathered face of a long-time vendor, his cart a staple in the neighborhood for generations
+During golden hour, a group of children race through a field, their silhouettes a dance of joy against the setting sun
+Zoomed-in shot capturing the intense focus of a violinist as the bow gracefully sweeps across the strings, emotions etched into their performance
+Evening light bathes a street artist in a halo as they spray paint a vibrant mural, the colors telling a story as much as the subject's concentrated gaze
+A mid-action image of a chef's hands chopping herbs, with fine details showing flying droplets of water from the fresh greens
+On a misty morning, capture the solitary figure of a jogger on a deserted trail, their breath and stride in sync
+High in the mountains, a hiker reaches the summit, standing triumphantly with a panoramic view stretching behind them
+Illuminated by the soft glow of a desk lamp, a writer pauses, pen in hand, surrounded by stacks of manuscripts, lost in thought
+eerie, corruption, beautiful, young woman, sad eyes, tears running down, crying, innocence, light, vaporwave aesthetic, synthwave, colorful, psychedelic, crown, long gown, flowers, bees, butterflies, ribbons, ornate, intricate, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+wolf merged with crow,! photorealistic,! concept art
+To every living being, and every living soul. Now cometh the age of the stars. A thousand year voyage under the wisdom of the Moon. Here begins the chill night that encompasses all, reaching the great beyond. Into fear, doubt, and loneliness... As the path stretches into darkness. Mysterious shadow, detailed, digital, trending on artstation, hyper realistic, dark colours, 4k, dark aesthetic, in the style of James C. Christensen
+A cowboy cat with big and cute eyes, fine-face, realistic shaded perfect face, fine details. realistic shaded lighting poster by Ilya Kuvshinov katsuhiro otomo ghost-in-the-shell, magali villeneuve, artgerm, Jeremy Lipkin and Michael Garmash, Rob Rey and Kentarõ Miura style, trending on art station
+a very beautiful anime cute girl, full body, long wavy blond hair, sky blue eyes, full round face, short smile, fancy top, miniskirt, front view, summer lake setting, cinematic lightning, medium shot, mid-shot, highly detailed, trending on Artstation, Unreal Engine 4k, cinematic wallpaper by Stanley Artgerm Lau, WLOP, Rossdraws, James Jean, Andrei Riabovitchev, Marc Simonetti
+close-up portrait of the perfect and symmetrical face of a beautiful Cotton Mill Girl, symmetrical, centered, dramatic angle, ornate, details, smooth, sharp focus, illustration, realistic, cinematic, artstation, award winning, rgb , unreal engine, octane render, cinematic light, macro, depth of field, blur, red light and clouds from the back, highly detailed epic cinematic concept art CG render made in Maya, Blender and Photoshop, octane render, excellent composition, dynamic dramatic cinematic lighting, aesthetic, very inspirational, arthouse by Henri Cartier Bresson
+highly detailed portrait of beautiful ethereal woman in ornate clothing, stephen bliss, unreal engine, fantasy art by greg rutkowski, loish, rhads, ferdinand knab, makoto shinkai and lois van baarle, ilya kuvshinov, rossdraws, tom bagshaw, global illumination, radiant light, detailed and intricate environment
+Close-up portrait of young asian girl, long blonde hair, dark fantasy, portrait, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+necromancer glowing with purple magic, red hair, female, glacier landscape, D&D, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+a portrait of riddler, fantasy, sharp focus, intricate, elegant, digital painting, artstation, matte, highly detailed, concept art, illustration, ambient lighting, art by ilya kuvshinov, artgerm, alphonse mucha, and greg rutkowski
+duotone dark scifi illustration 3 / 4 portrait of dream as if you live forever live as if you die tomorrow. cinematic lighting mad scientist style. golden ratio accidental renaissance. in the style of jean michel basquiat, beksisnski, and pablo picasso. graffiti art, scifi, fantasy, hyper detailed. octane render. concept art. trending on artstation
+elon musk as neo from the matrix, realistic portrait, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+patrick star with a sad!!! expression slouching on a bench in the bikini bottom, global illumination!!! dim lighting, midnight, cinematic, extremely detailed, beautiful, stunning composition, beautiful light rays, trending on artstation
+a girl in a hat with a bouquet of peonies looks out the window at a blooming garden, vivid color, highly detailed, cyberpunk, digital painting, artstation, concept art, matte, sharp focus, art by vrubel
+a detailed concept art of a fantasy jingle bell infused with magic, trending on artstation, digital art, 4 k, intricate, octane render, sharp focus
+“ dungeons and dragons tabaxi rogue, anthromorphic cat person with a repeating crossbow in a medieval city, small and big, illustration, fantasy, trending on artstation ”
+fantasy, book cover, concept art, by greg rutkowski and craig mullins, cozy atmospheric
+Amelie Poulain painted by Raphael volumetric lighting, back lighting, rimlight, dramatic lighting, digital painting, highly detailed, artstation, sharp focus, illustration, Artgerm, Jean-Léon Gérôme, ruan jia
+soft bokeh front shot photo of a mclaren steampunk concept car, cinematic, fine details, symmetrical, 4 k, digital art, wallpaper
+dior runway show, light, shadows, reflections, golden, gold, epic composition, intricate, elegant, volumetric lighting, digital painting, highly detailed, artstation, sharp focus, illustration, concept art, ruan jia, steve mccurry
+elven princess assassin, beautiful shadowing, 3 d shadowing, reflective surfaces, illustrated completely, 8 k beautifully detailed pencil illustration, extremely hyper - detailed pencil illustration, intricate, epic composition, very very kawaii, masterpiece, bold complimentary colors. stunning masterfully illustrated by artgerm and range murata.
+gorgeous red fox in a suit drinking champagne, digital art, landscape, fantasy art, octane render, ureal engine, high detail, very realistic, by greg rutkowski. by james gurney
+an extremely psychedelic portrait of medusa as willy wonka, surreal, lsd, face, detailed, intricate, elegant, lithe, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration
+portrait painting of a muscular bloodied mixed girl, ultra realistic, cyberpunk hacknaut, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and alphonse mucha
+concept art for the main character in the award winning film named life is better in pink. the character is a unnaturally beautiful teenage girl with deep dark blue eyes and long curled pink hair, wearing light pink clothes. realistic cg render, anatomically correct, high key lighting, trending on art station, vibrant colors. cute and highly detailed eyes.
+beautiful woman, illustration, painting oil on canvas, intricate portrait, detailed, illustration, hd, digital art, overdetailed, art, concept, art
+detailed full body concept art illustration oil painting of an anthropomorphic capybara cook in full intricate clothing, biomutant, ultra detailed, digital art, octane render
+of a calm ocean with large strange cute happy flying creatures with huge eyes, mouth, long tongue and round teeth appearing from the sky, in the style of gehry and gaudi, macro lens, highly detailed, shallow depth of fielf, digital painting, trending artstation, concept art, illustration, cinematic lighting, vibrant colors, photorealism, epic, octane render
+symmetry, samurai, lines, brown skin, machine face, intricate, elegant, highly detailed, digital painting, artstation, cgsociety, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+cyberpunk Normani as aeon flux profile picture by Greg Rutkowski, dynamic pose, intricate, futuristic, fantasy, elegant, by Stanley Artgerm Lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman Rockwell, metal chrome, shiny, rainy background, asymmetric, afro hair,
+chris tucker as dhalsim street fighter, jump kick, 4 k, ultra realistic, detailed focused art by artgerm and greg rutkowski and alphonse mucha
+epic scene where mystical dead monk sitting in front of an epic portal, epic angle and pose, symmetrical artwork, 3d with depth of field, blurred background, cybernetic orchid flower butterfly jellyfish crystal dragon, female face skull phoenix bird, translucent, nautilus, energy flow. a highly detailed epic cinematic concept art CG render. made in Maya, Blender and Photoshop, octane render, excellent composition, cinematic dystopian brutalist atmosphere, dynamic dramatic cinematic lighting, aesthetic, very inspirational, arthouse. y Greg Rutkowski, Ilya Kuvshinov, WLOP, Stanley Artgerm Lau, Ruan Jia and Fenghua Zhong
+concept art of futuristic modular military base, top angle, oil painting by jama jurabaev, extremely detailed, brush hard, artstation, for aaa game, high quality, brush stroke
+portrait of natalie wood eating hamburgers, extra onions and ketchup, luscious patty with sesame seeds, feminine ethereal, handsome, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+Trending on Artstation, Dark and rainy mega city with towering walls built to block the migrants of the coming climate change migrant crisis showing piles of hundred bodies outside to maintain a quality of life for those who can survive the severe and deadly weather patterns observing small children targeted by advanced military style drones, dystopian, concept art illustration, tilt shift background, wide depth of field, 8k, 35mm film grain
+hard surface form fused with organic form fashion outfit design, rainbow iridescent accents, full body frontal view, Peter mohrbacher, zaha hadid, tsutomu nihei, emil melmoth, zdzislaw belsinki, Craig Mullins, yoji shinkawa, trending on artstation, beautifully lit, hyper detailed, insane details, intricate, elite, ornate, elegant, luxury, dramatic lighting, CGsociety, hypermaximalist, golden ratio, octane render, weta digital, micro details, ray trace, 8k,
+Gary Busey portrait by Stanley Artgerm Lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman Rockwell
+( cyberpunk 2 0 7 7, bladerunner 2 0 4 9 ), a complex thick bifurcated robotic cnc surgical arm cybernetic symbiosis hybrid mri 3 d printer machine making a bio chemical lab, art by artgerm and greg rutkowski and alphonse mucha, biomechanical, lens orbs, global illumination, lounge, architectural, f 3 2,
+a vampire, male, mid - 3 0 s aged, long black hair, clean shaven, in red and black, high fantasy, realistic, highly detailed, concept art, 8 k.
+a elderly wizard casting a black fireball | | pencil sketch, realistic shaded, fine details, realistic shaded lighting poster by greg rutkowski, magali villeneuve, artgerm, jeremy lipkin and michael garmash and rob rey
+An elegant green, blue dragon, sitting on a clearing in a flowery jungle, detailed, mtg, digital illustration, trending on artstation
+a landscape made of whimsical energy and fibrous magic, artstation landscape, artstation digital, illustrated by eddie mendoza and greg rutkowski, trending on artstation, cgsociety contest winner, cgsociety hd, cgsociety 4 k uhd, 4 k, 8 k
+a cosmic painting of prince in space. mindblowing colours, trending on artstation. highly detailed face.
+martian chronicles, by jean delville and sophie anderson and mandy jurgens, retrofuturism, moody atmosphere, cinematic atmospheric, cinematic lighting, golden ratio, perfect composition, elegant, no crop, extremely detailed, 4 k, hd, sharp focus, masterpiece, trending on artstation
+a highly detailed metahuman 4 k close up render of a seraphim bella hadid monument renaissance in iris van herpen dress schiaparelli in diamonds crystals swarovski and jewelry iridescent in style of alphonse mucha gustav klimt trending on artstation made in unreal engine 4
+fever of the night, a grime tale of the night fever, disco club of the occult, digital painting, artstation, ristan eaton, victo ngai, artgerm, rhads, ross draws, anime styled
+symmetrical, full body portrait of a woman with short wavy hair, round face, cottagecore!!, lake, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+fantasy man sitting in library, gold brocaded dark blue clothes, short black hair, books, reddish brown engraved shelves, sharp focus, intricate, extremely detailed, cinematic lighting, smooth, ultra realistic illustration, high fantasy, elegant, artgerm, greg rutkowski, alphonse mucha magali villeneuve
+an anthropomorphic deer, fursona!!! by don bluth, by kawacy, trending on artstation, full body
+a cartoon squirrel drawn in concept art style
+russian poet alexander pushkin and shrek having breakfast together, portrait, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+beautiful woman on a turquise vespa moped, in the style of artgerm, gerald brom, atey ghailan and mike mignola, vibrant colors and hard shadows and strong rim light, plain background, comic cover art, trending on artstation, masterpiece
+a martian landscape, by ralph mac quarrie and francois schuiten and albert bierstadt and ernst haeckel and james jean and john singer sargent, cinematic lighting, moody atmosphere, golden ratio, perfect composition, elegant and stylish look, artstation, concept art, high quality
+“ anime, full body, a pretty girl taking the college entrance exam, highly intricate detailed, light and shadow effects, intricate, highly detailed, digital painting, art station, concept art, smooth, sharp focus, illustration, advanced digital anime art, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau, craig mullins, j. c. leyendecker, atmospheric lighting, detailed face, by makoto shinkai, stanley artgerm lau, wlop, rossdraws ”
+the second coming of the buddah, by dan mumford and ross tran, cosmic, heavenly, god rays, intricate detail, cinematic, 8 k, cel shaded, unreal engine, featured on artstation, pixiv
+phil noto, peter mohrbacher, thomas kinkade, artgerm, 1 9 5 0 s rockabilly anya taylor - joy catwoman dc comics, symmetrical eyes, city rooftop
+dnd character concept portrait, angry male elf druid in forest, detailed, high quality, dynamic lighting, fantasy, artwork by artgerm, wlop, alex ross, greg rutknowski, alphonse mucha
+a king with a skull head, in the style of artgerm, charlie bowater, atey ghailan and mike mignola, vibrant colors and hard shadows and strong rim light, plain background, comic cover art, trending on artstation
+With the spikes in her hair
+venus, the empress, wearing a magnificent dress, sitting on a divan in the middle of a beautiful green plains full of little flowers. intricate, elegant, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, by justin gerard and artgerm, 8 k
+beautiful apocalyptic woman with pink Mohawk, standing on mad max panzer tank, 4k ultra hd, fantasy dark art, tank girl, artgerm, concept art, artstation, octane render, elegant, detailed digital painting
+i crave only the cold clean certainty of steel and silicon, trending on artstation
+nikola tesla, lightning, portrait, sharp focus, digital art, concept art, dynamic lighting, epic composition, colorful, trending on artstation, by emylie boivin 2. 0, rossdraws 2. 0
+professional concept art of a symmetrical ominous floating terrifying thing in a dark room by artgerm and greg rutkowski ( thin white border ). an intricate, elegant, highly detailed digital painting, concept art, smooth, sharp focus, illustration, in the style of cam sykes, wayne barlowe, igor kieryluk.
+beautiful lifelike award winning marble statue bust of tsunku trending on art station artgerm greg rutkowski alphonse mucha museum quality cinematic atmospheric
+steampunk robot ant, unreal engine realistic render, 8 k, micro detail, intricate, elegant, highly detailed, centered, digital painting, artstation, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+mtg character portrait of a brawny male leonin warrior african lion angel of justice, with fiery golden wings of flame, wearing shining armor, wielding flaming sword and holding large fiery shield, by peter mohrbacher, wadim kashin, greg rutkowski, larry elmore, george pemba, ernie barnes, raymond swanland, magali villeneuve, trending on artstation
+dynamic portrait painting of Michael Myers sitting in the waiting room of an optometrist amongst other normal patients, sharp focus, face focused, trending on ArtStation, masterpiece, by Greg Rutkowski, by Ross Tran, by Fenghua Zhong, octane, soft render, oil on canvas, moody lighting, high contrast, cinematic, professional environmental concept art
+Concept art of male high elf with light blue hair, black leather armor, golden eagle skull on chest, by Naranbaatar Ganbold, trending on artstation
+a closeup portrait of a mia khalifa, dramatic light, lake background, sunset, dark, painted by stanley lau, painted by greg rutkowski, painted by stanley artgerm, digital art, trending on artstation
+portrait of salman rushdie, deep focus, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+anime key visual of beautiful elizabeth olsen police officer, cyberpunk, futuristic, stunning features, perfect face, high details, digital painting, artstation, smooth, soft focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+Beautiful portrait of an attractive Persian Princess who is an architect, beautiful princess, face painting, dramatic lighting, intricate, wild, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, footage from space camera
+full body portrait character concept art, anime key visual of a little witch with her capybara mascot, trending on pixiv fanbox, painted by makoto shinkai takashi takeuchi studio ghibli
+perfectly-centered-Portrait of the most beautiful people on the planet, river, washing clothes, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski and alphonse mucha
+wide shot of vietnamese solider girl, green uniform, burning city in the background, epic, elder scrolls art, fantasy, skyrim, hd shot, digital portrait, beautiful, artstation, by artgerm, guy denning, jakub rozalski, magali villeneuve and charlie bowater
+apocalyptic city, digital painting, artstation, concept art, donato giancola, Joseph Christian Leyendecker, WLOP, Boris Vallejo, Breathtaking, 8k resolution, extremely detailed, beautiful, establishing shot, artistic, hyperrealistic, octane render, cinematic lighting, dramatic lighting, masterpiece, light brazen
+male dracula rollerskating with rollerskates in a roller rink by charlie bowater and titian and artgerm, full body portrait, intricate, face, elegant, beautiful, highly detailed, dramatic lighting, sharp focus, trending on artstation, artstationhd, artstationhq, unreal engine, 4 k, 8 k
+a dark forest where gears and electronic parts grow on the trees tops, cyberpunk landscape wallpaper, d&d art, fantasy, painted, 4k, high detail, sharp focus
+Photorealistic elvish goddess in a magical bioluminescent forest Hyperdetailed photorealism, 108 megapixels, amazing depth, glowing rich colors, powerful imagery, psychedelic Overtones, 3D finalrender, 3d shading, cinematic lighting, artstation concept art
+portrait of a demon, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+the man stuck in the wall, creepy explorer sketch, godlike design, concept art, beyond the void, grand scale, intricate detailed
+Very very very very highly detailed epic central composition studio photography of face with venetian mask, intricate, dystopian, sci-fi, extremely detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, intimidating lighting, incredible art by Anna Dittmann and Jesper Ejsing and Anton Pieck
+water, glowing lights!! intricate elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by greg rutkowski
+highly detailed portrait of Eminem wearing a beret and gold chains and brandishing a pistol, big eyes, realistic portrait, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+robocop torso, symmetry, faded colors, exotic alien features, cypherpunk background, tim hildebrandt, wayne barlowe, bruce pennington, donato giancola, larry elmore, masterpiece, trending on artstation, featured on pixiv, cinematic composition, beautiful lighting, sharp, details, hyper detailed, 8 k, unreal engine 5
+Boris Johnson as Neo from Matrix, black sunglasses, realistic portrait, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+a copic maker sketch of a stewardess girl wearing kikyo's clothing designed by balenciaga by john berkey by stanley artgerm lau, greg rutkowski, thomas kinkade, alphonse mucha, loish, norman rockwell
+a matte painting of a man sitting down and having a cup of tea in his house by the beach, in the style of artgerm, charlie bowater, atey ghailan and mike mignola, vibrant colors and hard shadows and strong rim light, plain background, comic cover art, trending on artstation
+a smug exclusivists female, black ink line art and watercolor, intricate, digital painting, concept art, smooth, focus, rim light style tim burton
+3 5 mm portrait of samurai in training dojo, in the style of david cronenberg, scary, weird, high fashion, id magazine, vogue magazine, surprising, freak show, realistic, sharp focus, 8 k high definition, film photography, photo realistic, insanely detailed, intricate, by david kostic and stanley lau and artgerm
+rats fixing cars in the garage, key visual, a fantasy digital painting by makoto shinkai and james gurney, trending on artstation, highly detailed
+photo of a gorgeous sultry young woman in the style of David la chapelle , realistic, sharp focus, 8k high definition, 35mm film photography, photo realistic, insanely detailed, intricate, elegant, art by David kostic and stanley lau and artgerm
+sliced coconut, electronics, ai, cartoonish cute, pine trees, dramatic atmosphere, trending on artstation, 3 0 mm, by noah bradley trending on artstation, deviantart, high detail, stylized portrait
+360 degree equirectangular, anthropomorphic family of mushrooms, family portrait, Art Deco nature, mystical fantasy, Pixar cute character design, intricate art deco mushroom patterns, elegant, sharp focus, 360 degree equirectangular panorama, art by Artgerm and beeple and Greg Rutkowski and WLOP, 360 monoscopic equirectangular
+portrait of othinus from toaru, anime fantasy illustration by tomoyuki yamasaki, kyoto studio, madhouse, ufotable, trending on artstation
+a portrait of a evil cybernetic magician in glass armor releasing spell, full height, moving forward, cyberpunk concept art, trending on artstation, highly detailed, intricate, sharp focus, digital art, 8 k
+Portrait of the black dragon Alduin breathing a rainbow-colored fire. 4k. Concept art. High detail. Unreal engine.
+Greg Manchess portrait painting of Ganon from Legend of Zelda as Overwatch character, medium shot, asymmetrical, profile picture, Organic Painting, sunny day, Matte Painting, bold shapes, hard edges, street art, trending on artstation, by Huang Guangjian and Gil Elvgren and Sachin Teng
+a hyper - realistic character concept art portrait of emilia clarke, depth of field background, artstation, award - winning realistic sci - fi concept art by jim burns and greg rutkowski, beksinski, a realism masterpiece, james gilleard, bruegel, alphonse mucha, and yoshitaka amano.
+Wide shot of a chrome spaceship in battle, explosions and purple lasers. Asteroid belt. Scenic view, in the void of space, underexposed, matte painting by Craig mullins and Emmanuel_Shiu and john berkey, cinematic, dark sci-fi, concept art trending on artstation, 4k, insane details, ultra realistic
+ebony beauty portrait, black red smoke, ink, stylized tattoos, draconic priestess, portrait by Artgerm, peter mohrbacher
+leonine devil in flowing robes, ethereal, backlit, high fantasy, highly detailed, puzzled expression, realistic lighting, sharp focus, intricate, by artgerm, wlop, crossdress, frank frazetta, trending on artstation
+giant magical floating golden sun, bright godrays, vibrant colors, by sylvain sarrailh, rossdraws, ambient light, ultra detailed, fantasy artwork, 8 k, volumetric lighting, trending on artstation, award winning, beautiful scenery, very beautiful.
+a 3 d render of a stack of green cubes on the left and an orange ball on the right in a red room, blender, ue 5, octane render, trending on artstation
+ori and the olw, close up bokeh hiperrealistic, high detailled, darkness dramatic, sharp focus, octane render, imax
+richly detailed color illustration of a fiending-addict-seeking-at-the-doctors-office illustrated by Artgerm and Mina Petrovic and Timothy Kong and Marina Federovna. 3D shadowing
+a study of cell shaded portrait of Dora the Explorer as a Borderlands 3 character, llustration, post grunge, concept art by josan gonzales and wlop, by james jean, Victo ngai, David Rubín, Mike Mignola, Laurie Greasley, highly detailed, sharp focus, alien, Trending on Artstation, HQ, deviantart, art by artgem
+A beautiful female warrior holding a bow an arrow wearing a magical bikini posing on a rock in a magical forest, super detailed and realistic face, fantasy art, in the style of Artgerm, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing, vibrant
+Lofi Steampunk Bioshock portrait, Pixar style, by Tristan Eaton Stanley Artgerm and Tom Bagshaw
+a beautiful hyperdetailed highly detailed urbex industrial architecture tower nature building unfinished building by zaha hadid, retro sunset retrowave darkacademia at fall hyperrealism cgsociety tokyo at night thermal vision, archdaily, wallpaper, highly detailed, trending on artstation.
+Cyborg woman sitting on a chair in a futuristic room smoking a cigar, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski and alphonse mucha
+young angry woman, beautiful girl, full body, explosive hair, cowboy hat, realistic, serov, surikov, vasnetsov, repin, kramskoi, insanely detailed, charlie bowater, tom bagshaw, high resolution, octane rendered, unreal engine, illustration, trending on artstation, masterpiece, 8 k
+an anime landscape of a girl wearing a kimono, near the river in a japanese summer festival from skyrim, by stanley artgerm lau, wlop, rossdraws, james jean, andrei riabovitchev, marc simonetti, and sakimichan, trending on artstation
+highly detailed portrait of a man with a handsaw head by greg rutkowski and fujimoto tatsuki, dramatic lighting, dynamic pose, dynamic perspective
+film noir woman, character sheet, concept design, contrast, hot toys, kim jung gi, greg rutkowski, zabrocki, karlkka, jayison devadas, trending on artstation, 8 k, ultra wide angle, pincushion lens effect
+portrait of a diabolical marble stone cyborg, wearing torn white cape, dynamic pose, glowing eyes, post apocalyptic ancient ruins, glowing veins subsurface scattering, in clouds, sunset, portrait, by gerald brom, by mikhail vrubel, by peter elson, muted colors, extreme detail, trending on artstation, 8 k
+portrait of donald trump, soft hair, muscular, half body, leather, hairy, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+black super hero girl | very very anime!!!, fine - face, beyonce, realistic shaded perfect face, fine details. anime. realistic shaded lighting poster by ilya kuvshinov katsuhiro otomo ghost - in - the - shell, magali villeneuve, artgerm, jeremy lipkin and michael garmash and rob rey
+richly detailed color illustration of a nerd-core-instructional-video illustrated by Artgerm and Mina Petrovic and Timothy Kong and Marina Federovna. 3D shadowing
+a group of spanish trap singers drinking red wine, oil painting by alex katz, trending on artstation
+photorealistic beautiful ethereal natalie portman in the style of michael whelan and greg rutkowski. hyperdetailed photorealism, 1 0 8 megapixels, amazing depth, glowing rich colors, powerful imagery, psychedelic overtones, 3 d finalrender, 3 d shading, cinematic lighting, artstation concept art
+a cute pet by neville page, ken barthelmey, carlos huante and doug chiang, sharp focus, trending on artstation, hyper realism, octane render, 8 k, hyper detailed, ultra detailed, highly detailed, zbrush, concept art, creature design
+very cute illustration for a children's book, digital art, detailed, rim light, exquisite lighting, clear focus, very coherent, details visible, soft lighting, character design, concept, atmospheric, dystopian, trending on artstation, fog, sun flare
+Still of a humanoid robot painting on a canvas, high detail, cinematic, , science fiction concept art by Greg Rutkowski and Moebius and Le Corbusier
+asymmetrical!! long shot of a snufkin smoking a pipe, nebula, intricate, elegant, highly detailed, digital painting, artstation, biolusence, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, horizon zero dawn 8 k
+portrait of red - tinged, red leds, futuristic cybernetic warrior alien in profile, highly intricate, detailed humanoid, trending on artstation
+of a beautiful scary Hyperrealistic stone castle on top of a hill in the middle of a dark and creepy forest, macro lens, highly detailed, digital painting, trending artstation, concept art, illustration, cinematic lighting, vibrant colors, photorealism, epic, octane render
+piles of modular synth cables mixed with mangrove roots mixed with old video game consoles, puerto rican grafitti goddess chilling out wearing a headpiece made of circuit boards, by cameron gray, wlop, stanley kubrick, masamune, unique perspective, epic, trending on artstation, photorealistic, 3 d render, vivid
+oil painting portrait of a young woman with long flowing hair in a white dress, dancing through a field of flowers at sunset with mountains in the background, hazy, digital art, chiaroscuro, artstation, cinematic, golden hour, digital art painting by greg rutkowski, william - adolphe bouguereau, hazy atmosphere, flowers, cinematic lighting
+dark wizard of forest, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+photorealistic dog piloting a biplane. hyperdetailed photorealism, 1 0 8 megapixels, amazing depth, glowing rich colors, powerful imagery, psychedelic overtones, 3 d finalrender, 3 d shading, cinematic lighting, artstation concept art
+clear portrait of tony soprano, cottagecore!!, mafia background hyper detailed, character concept, full body, dynamic pose, intricate, criminal appearance, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+fantasy art of glowing goldfish swimming in the air, in the streets of a japanese town at night, with people watching in wonder, by fenghua zhong, highly detailed digital art, trending on artstation
+fantasy steps with pillars on both sides by greg rutkowski
+award winning digital portrait of a feminine attractive male jester at a magnificent circus, beautiful circus themed background with soft colors and lights, trending artstation, digital art, aesthetic, bloom, intricate, elegant, sharp focus, digital illustration, highly detailed, octane render, digital painting, concept art, fantasy, masterpiece, by lisa buijteweg and sakimichan
+a ultradetailed beautiful concept art of an old mind key, with intricate detail, oil panting, high resolution concept art, 4 k, by artgerm
+Ogun with large iron spears, he has tribal face markings and war paint, bronze-brown skin with african features and strong jaw line prominent brow and menacing look, wearing tribal armor, medium shot digital illustration trending on artstation by artgerm, face by wlop
+full face shot of rimuru tempest, sky blue straight hair, long bangs, with amber eyes, gold eyes, wearing a black jacket, high collar, ultra detailed, concept art, award winning photography, digital painting, cinematic, wlop artstation, closeup, pixiv, evil, yoshitaka amano, andy warhol, ilya kuvshinov,
+Moon Knight mixed with Goku, RPG Reference, art by ilya kuvshinov, artgerm, Alphonse mucha, and Greg Rutkowski, Trending on Artstation, octane render, Insanely Detailed, 8k, HD
+portrait of the cutest red fox ever, fluffy, cinematic view, epic sky, detailed, concept art, low angle, high detail, warm lighting, volumetric, godrays, vivid, beautiful, trending on artstation, by jordan grimmer, huge scene, grass, art greg rutkowski
+bandit, ultra detailed fantasy, elden ring, realistic, dnd, rpg, lotr game design fanart by concept art, behance hd, artstation, deviantart, global illumination radiating a glowing aura global illumination ray tracing hdr render in unreal engine 5
+ilya kuvshinov with blue hair, yellow irises, professional digital painting, concept art, unreal engine 5, 8 k, cinematic, wlop, tendrils in the background, art by greg rutkowski, pixiv art, junji ito, yoshitaka amano
+high resolution concept art of naruto and yoda kissing in paris
+character concept portrait of a stoic and proud woman in an elegant gown, pale face, intricate, elegant, digital painting, concept art, smooth, sharp focus, illustration, from Metal Gear, by Ruan Jia and Mandy Jurgens and William-Adolphe Bouguereau, Artgerm
+symmetry!! portrait of skull, sci - fi, glowing lights!! intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+realistic portrait of beautifully crystalized and detailed portrait of a biomech zombie woman wearing a gasmask, matte painting of cinematic movie scene red dragon, horror, created by gustave dore and greg rutkowski, high detailed, smooth draw, synthwave neon retro, intricate, realistic proportions, dramatic lighting, trending on artstation.
+a portrait of sexy lady casting ice - ball and shoot it, cyberpunk concept art, trending on artstation, highly detailed, intricate, sharp focus, digital art, 8 k
+close up shot of a full body floating astronaut portrait smoke elemental fading into white smoke, high contrast, james gurney, peter mohrbacher, mike mignola, black paper, mandelbulb fractal, trending on artstation, exquisite detail perfect, large brush strokes, bold pinks and blues tones, intricate ink illustration, black background
+beautiful blonde teenage boy assassin, wearing leather jacket, beautiful, detailed portrait, cell shaded, 4 k, concept art, by wlop, ilya kuvshinov, artgerm, krenz cushart, greg rutkowski, pixiv. cinematic dramatic atmosphere, sharp focus, volumetric lighting, cinematic lighting, studio quality
+commission of a robot chasing thugs.dramatic,character design by charles bowater,greg rutkowski,ross tran,hyperdetailed,hyperrealistic,4k,deviantart,artstation,professional photography,concept art,dramatic
+foggy neon night, sayaka isoyama leaning back against a wall in a black minidress smoking a cigarette outside a neon lit entrance, 1 9 7 0 s, intricate, moody, tasteful, intimate, highly detailed, short focus depth, artgerm, donato giancola, joseph christian leyendecker
+concept art of a shalltear bloodfallen and vladimir volegov and alexander averin and delphin enjolras and daniel f. gerhartz
+of a dark and stormy ocean with large strange cute water creatures with big eyes, mouth and round teeth appearing from the water, in the style of Gaudi, macro lens, shallow depth of field, highly detailed, digital painting, trending artstation, concept art, illustration, cinematic lighting, vibrant colors, photorealism, epic, octane render
+cat with lute, sitting in the rose garden, medieval portrait, concept art, close up
+harry styles as miley cyrus riding a wrecking ball, high octane render, digital art trending on artstation
+loch ness monster by charlie bowater and titian and artgerm, full - body portrait, intricate, face, lake, elegant, green mist, beautiful, highly detailed, dramatic lighting, sharp focus, trending on artstation, artstationhd, artstationhq, unreal engine, 4 k, 8 k
+a full body portrait of a young latin woman in a flowery fruit - based dress, with a greek mask on her head, night lighting with candles delicate features finely detailed perfect art, at an ancient city, gapmoe yandere grimdark, trending on pixiv fanbox, painted by greg rutkowski makoto shinkai takashi takeuchi studio ghibli
+ultra realistic illustration, young man with dark gray skin, short white hair, intricate, with dark clothes, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+concept art by jama jurabaev, cel shaded, cinematic shot, trending on artstation, high quality, brush stroke, hyperspace, vibrant colors, portrait of rick grimes
+a beautiful portrait of a pearl goddess with glittering skin, a detailed painting by greg rutkowski and raymond swanland, featured on cgsociety, fantasy art, detailed painting, artstation hd, photorealistic
+of a advertisement with a scene of a highway with words written on the road in front of the viewer, occlusion shadow, specular reflection, rim light, unreal engine, octane render, artgerm, artstation, art jiro matsumoto, high quality, intricate detailed 8 k, sunny day
+best book cover design, glowing silver and golden elements, full close-up portrait of realistic crow with gems, book cover, green forest, white moon, establishing shot, extremly high detail, photo-realistic, cinematic lighting, by Yoshitaka Amano, Ruan Jia, Kentaro Miura, Artgerm, post processed, concept art, artstation, matte painting, style by eddie mendoza, raphael lacoste, alex ross
+a girl is running, sport clothing, fitness watch, anime style, brown short hair, hair down, symmetrical facial features, from arknights, hyper realistic, rule of thirds, extreme detail, 4 k drawing, trending pixiv, realistic lighting, by alphonse mucha, greg rutkowski, sharp focus, backlit
+a hyper realistic professional photographic picture of dragon hotdog, photographic filter unreal engine 5 realistic hyperdetailed 8k ultradetail cinematic concept art volumetric lighting, digital artwork, very beautiful scenery, very realistic painting effect, hd, hdr, cinematic 4k wallpaper, 8k, ultra detailed, high resolution
+A portrait of a male elf, 20 years old, short silver hair, red eyes, wearing a spiked black metal crown, black heavy armor with gold trim, and a red cape, lean but muscular, attractive, command presence, royalty, weathered face, smooth, sharp focus, illustration, concept art, highly detailed portrait muscle definition, fantasy painting, ArtStation, ArtStation HQ
+2 8 mm macro headshot of a ethereal magical young winged fairy princess wearing a white robe in a fantasy garden, d & d, fantasy, intricate, rim light, god rays, volumetric lighting, dark souls, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, orthodoxy, art by greg rutkowski, maxfield parrish and alphonse mucha, new art nouveau, soft lighting, tarot card
+portrait of sansa stark with crown, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+lush solarpunk Victorian windowsill with futuristic plants on it, looking out toward a solarpunk cityscape, vignette of windowsill, detailed digital concept art by anton fadeev and marc simonetti, trending on artstation
+a portrait of a beautiful biomechanical queen of necropolis, horror concept art by giger and beksinski and szukalski and wlop and pete mohrbacher, digital art, highly detailed, intricate, sci-fi, sharp focus, Trending on Artstation HQ, deviantart, unreal engine 5, 4K UHD image
+ocean of canvas that catches liquid fire, intricate pearls, ornate ruby, magical, concept art, art nouveau, Reylia Slaby, Peter Gric, trending on artstation, volumetric lighting, CGsociety
+incredible, refugees crossing a mindblowingly beautiful bridge made of rainbow, energy pulsing, hardlight, matte painting, artstation, solarpunk metropolis, cgsociety, dramatic lighting, vibrant greenery, concept art, octane render, arnold 3 d render
+bemused to be soon consumed by a tentacle demon, in a leather neck restraint, beautiful young woman with medium length silky black hair in a black silk tank top in a full frame zoom up of her face and neck in complete focus, looking upwards in a room of old ticking clocks, complex artistic color ink pen sketch illustration, subtle detailing, gentle shadowing, fully immersive reflections in her eyes, concept art by Artgerm and Range Murata in collaboration.
+baby yoda, portrait, concept art by doug chiang cinematic, realistic painting, high definition, concept art, portait image, path tracing, serene landscape, high quality, highly detailed, 8 k, soft colors, warm colors, turbulent sea, high coherence, anatomically correct, hyperrealistic, concept art, defined face, symmetrical 5
+isometric 3D of the ethereum symbol in gold and black by artgerm and greg rutkowski, alphonse mucha, cgsociety and beeple highly detailed, sharp focus, cinematic lighting, illustration, art, octane render, Unreal Engine Lumen, very coherent. cinematic, hyper realism, high detail, octane render, 8k
+giant snake on a moonlit desert, fantasy, d & d, art by artgerm and greg rutkowski, cinematic shot, intricate, ornate, photorealistic, ultra detailed, trending artstaition, realistic, 1 0 0 mm, photography, octane, high definition, depth of field, bokeh, 8 k
+a beautiful portrait of a skull goddess by Greg Rutkowski and Raymond Swanland, Trending on Artstation, ultra realistic digital art
+a whirlwind of souls rushing inside the metaverse, half body, glowin eyes, insect, lizard, d & d, fantasy, intricate, elegant, highly detailed, colorful, vivid color, digital painting, artstation, concept art, art by artgerm and greg rutkowski and alphonse mucha and ruan jia
+medieval knight power armour, 4 0 k, space marine, concept art, medieval, fantasy, cinematic lighting, detailed digital matte painting in the style of simon stalenhag and bev dolittle zdzislaw beksinski, greg hildebrandt artstation
+portrait of burning woman, fire, blood red eyes, open mouth, vampire fangs, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, octane render, unreal engine, art by aenaluck and roberto ferri and greg rutkowski, epic fantasy, digital painting
+portrait of a beautiful mysterious woman warrior wearing an armour costume, holding a bouquet of flowing flowers, hands hidden under the bouquet, fantasy, regal, intricate, by stanley artgerm lau, greg rutkowski, thomas kinkade, alphonse mucha, loish, norman rockwell
+a wholesome animation key shot of a band behemoth performing on stage, medium shot, studio ghibli, pixar and disney animation, 3 d, sharp, rendered in unreal engine 5, anime key art by greg rutkowski, bloom, dramatic lighting
+dungeons and dragons old evil wizard character closeup portrait, dramatic light, lake background, 2 0 0 mm focal length, painted by stanley lau, painted by greg rutkowski, painted by stanley artgerm, digital art, trending on artstation
+scenery from game of thrones, wide angle, super highly detailed, professional digital painting, artstation, concept art, smooth, sharp focus, no blur, no dof, extreme illustration, unreal engine 5, photorealism, hd quality, 8 k resolution, cinema 4 d, 3 d, beautiful, cinematic, art by artgerm and greg rutkowski and alphonse mucha and loish and wlop
+female elf bard, Jade, dungeons and dragons, amazing detail, character concept art, illustration, fantasy, 4k
+detailed coffee table in the vaporwave mid century modern livingroom. highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+retrofuturistic portrait of a uyghur prisoner in a tracksuit that's dirty and ripped, close up, wlop, dan mumford, artgerm, liam brazier, peter mohrbacher, jia zhangke, 8 k, raw, featured in artstation, octane render, cinematic, elegant, intricate, 8 k
+3 / 4 view of a portrait of pixie woman with bat wings, confident pose, pixie, genshin impact,, intricate, elegant, sharp focus, illustration, highly detailed, concept art, matte, trending on artstation, anime, art by wlop and artgerm and greg rutkowski, strong brush stroke, sharp focus, illustration, morandi color scheme, art station, by ilya kuvshinov h 6 4 0
+high angle photo of a gorgeous big chungus in the style of stefan kostic, realistic, sharp focus, 8 k high definition, insanely detailed, intricate, elegant, art by stanley lau and artgerm
+A highly detailed matte oil painting of a forest by Mokoto Shinkai, hyperrealistic, breathtaking, beautiful composition, by Artgerm, by beeple, by Studio Ghibli, cinematic lighting, octane render, 4K resolution, trending on artstation
+realistic detailed image of a dark figure screaming on a wooden cross in the middle of a busy city street in the style of francis bacon, hooded figure surreal, norman rockwell and james jean, greg hildebrandt, and mark brooks, triadic color scheme, by greg rutkowski, in the style of francis bacon and syd mead and edward hopper and norman rockwell and beksinski, dark surrealism, open ceiling, highly detailed, painted by francis bacon, painted by james gilleard, surrealism, by nicola samori, airbrush, ilya kuvshinov, wlop, stanley artgerm, very coherent, art by takato yamamoto and james jean
+a photorealistic dramatic hyperrealistic render of a beautiful mazinger z by go nagai, wlop, greg rutkowski, alphonse mucha, beautiful dynamic dramatic dark moody lighting, shadows, cinematic atmosphere, artstation, concept design art, octane render, 8 k
+a portrait of the most beautiful woman in the world with long black hair that extends past her waist with locks of hair that frame her face down to her chin and shows off her high forehead, dark brown eyes with long, voluminous eyelashes and pale skin, narrow waist and very large chest, wearing a revealing red V-neck blouse a loose sarong with the green symbol of the Kuja adorned on it, along with a white cape sporting epaulettes more commonly found on the jackets of high-ranking Marines, and red high heel pumps, pink hearts in the background , romantic themed, beautiful face, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration
+portrait of melted zeus starring into the camera, fixed eyes, lightning environment, surreal, dramatic lighting, face, detailed, intricate, elegant, highly detailed, digital painting, artstation,, concept art, smooth, sharp focus, illustration, art by sam spratt, dan mumford, artem demura and alphonse mucha
+portrait painting of a punk elven bard with green eyes and snow white fur, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+gigachad jill valentine bodybuilder jumping from a building fighting in racoon city, fantasy character portrait, ultra realistic, anime key visual, full body concept art, intricate details, highly detailed by greg rutkowski, ilya kuvshinov, gaston bussiere, craig mullins, simon bisley
+inside a medieval hobbit home, ornate, beautiful, atmosphere, vibe, mist, smoke, chimney, rain, well, wet, pristine, puddles, red speckled mushrooms, waterfall, melting, dripping, snow, creek, lush, ice, bridge, cart, bonzai, green, stained glass, forest, flowers, concept art illustration, color page, 4 k, tone mapping, doll, akihiko yoshida, james jean, andrei riabovitchev, marc simonetti, yoshitaka amano, digital illustration, greg rutowski, volumetric lighting, sunbeams, particles, trending on artstation
+girl with super long hair, hair becoming autumn red leaves, intricate, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+a phantom undead mage ape with whirling galaxy around, tattoos by anton pieck, intricate, extremely detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, intimidating lighting, incredible art,
+a stunningly detailed picture of indoor botanical garden , girl, by greg rutkowski and thomas kinkade, trending on artstation
+death is swallowed up in victory, very detailed and beautiful portrait of a young woman by daniel oldenburg, necromancer bt h. r. giger, screaming with fear, artwork by artgerm, centered shot, wide angle, full body, islandpunk, solarpunk, fantasy, highly detailed, digital painting, artstation, smooth, sharp focus, landscape art by thomas kinkade and yusei uesugi
+a painting of the most beautiful spaceship, an exquisite and beautiful rendition, by greg rutkowski
+3d infrared octane render concept art by Mo Xiang Tong Xiu, by Igarashi Daisuke, by makoto shinkai, cute beauty cozy portrait anime sad schoolgirls under dark pink and blue tones, mirror room. light rays. deep water bellow. realistic 3d face. dramatic deep light, trending on artstation, oil painting brush
+anthropomorphic art of a timelord owl inside tardis, victorian inspired clothing by artgerm, victo ngai, ryohei hase, artstation. fractal papersand books. highly detailed digital painting, smooth, global illumination, fantasy art by greg rutkowsky, karl spitzweg, doctor who
+otters playing poker, hyper detailed, dramatic lighting, cgsociety, realistic, hyper detailed, insane details, intricate, dramatic lighting, hypermaximalist, golden ratio, rule of thirds, octane render, weta digital, micro details, ultra wide angle, artstation trending, 8 k,
+hieronymus bosch, greg rutkowski, anna podedworna, painting of chris farley in his academy award winning role
+baroque rococo futuristic aristocrat, d & d, fantasy, portrait, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+full body pose, hyperrealistic photograph of inner peace, dim volumetric lighting, 8 k, octane beautifully detailed render, extremely hyper detailed, intricate, epic composition, cinematic lighting, masterpiece, trending on artstation, very very detailed, stunning, hdr, smooth, sharp focus, high resolution, award, winning photo, dslr, 5 0 mm
+painting of hybrid hamster and gecko!!!!, intercrossed animal, crossbred, by zdzislaw beksinski, by lewis jones, cold hue's, warm tone gradient background, concept art, digital painting
+Given,' Fivetide said, nodding his eye stalks, re-winding his harpoon cable, lifting a piece of meat from his own plate to his beak, reaching for a drink and drumming one tentacle on the table with everybody else as one of the scratchounds got another on its back and bit its neck out. 'Good play! Good play! Seven; that's my dog! Mine; I bet on that! I did! Me! You see, Gastrees? I told you! Ha ha ha! Sci-fi, sunrise, concept art, octane render, unreal engine 5, trending on Artstation, high quality, highly detailed, 8K, soft lighting, godrays, path tracing, serene landscape, turbulent sea, high coherence, anatomically correct, hyperrealistic, sand, beautiful landscape, cinematic,
+fantasy art of a bustling tavern in china, at night, by fenghua zhong, highly detailed digital art, trending on artstation
+portrait painting of elizabeth olsen wanda maximoff with green skin and pointy ears wearing sci - fi clothes, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+a portrait of tony stark, fantasy, sharp focus, intricate, elegant, digital painting, artstation, matte, highly detailed, concept art, illustration, ambient lighting, art by ilya kuvshinov, artgerm, alphonse mucha, and greg rutkowski
+portrait of the two most beautiful women surrounded by soft florals, vaporwave lighting, dewy skin, concept art, high detail, beautiful, dreamy
+a beautiful portrait painting of a ( ( cyberpunk ) ) girl by simon stalenhag and pascal blanche! and alphonse mucha! and nekro!!. in style of digital art. colorful comic, film noirs!, symmetry, hyper detailed. octane render. trending on artstation
+emma thompson as an angel standing in the front of gates of hell. angel is draped with bones. digital painting. art station. mood lighting. skindness, highly detailed, concept art, intricate, sharp focus, einar jonsson and bouguereau - h 1 2 0 0
+tyrion lannister working in a winery, animation pixar style, by magali villeneuve, artgerm, jeremy lipkin and michael garmash, rob rey and kentaro miura style, golden ratio, trending on art station
+a dramatic, epic, ethereal painting of a !handsome! (very thicc) mischievous shirtless cowboy with a beer belly wearing a large belt and bandana offering a whiskey bottle | he is relaxing by a campfire | background is a late night with food and jugs of whisky | homoerotic | stars, tarot card, art deco, art nouveau, mosaic, intricate | by Mark Maggiori (((and Alphonse Mucha))) | trending on artstation
+anime character portrait of a female martial artist!! elegant, intricate outfit, fine details by stanley artgerm lau, wlop, rossdraws, james jean, andrei riabovitchev, marc simonetti, and sakimichan, trembling on artstation
+portrait of green anthropomorphic mantis religiosa ; hard predatory look ; d & d rogue ; powerful front forelegs holding an enchanted dagger ; flat triangle - shaped head with antennae and compound eyes ; concept art ; artstation ; 8 k ; wallpapers ; heavy contrast ; cinematic art ; cgsociety ; high coherence ; golden ratio ; rule of thirds ; art by greg rutkowski and artgerm
+close up Portrait of elizabeth olsen as real life beautiful young teen girl wearing assamese bihu mekhela sleeveless silk saree and gamosa in Assam tea garden, XF IQ4, 150MP, 50mm, F1.4, ISO 1000, 1/250s, attractive female glamour fashion supermodel photography by Steve McCurry in the style of Annie Leibovitz, face by Artgerm, daz studio genesis iray, artgerm, mucha, bouguereau, gorgeous, detailed anatomically correct face!! anatomically correct hands!! amazing natural skin tone, 4k textures, soft cinematic light, Adobe Lightroom, photolab, HDR, intricate, elegant, highly detailed,sharp focus
+digital character concept art by artgerm and greg rutkowski and alphonse mucha. clear portrait of a young wife blessed by god to uncontrollably become overwhelmingly perfect!! blonde, clothed! obviously feminine holy body!! light effect. hyper detailed, glowing lights!! intricate, elegant, digital painting, artstation, smooth, sharp focus
+Gandalf, 4k oil on linen by wlop, artgerm, andrei riabovitchev, nuri iyem, james gurney, james jean, greg rutkowski, highly detailed, soft lighting 8k resolution
+Cyborg biomechanical jellyfish deity, sci-fi, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+man male demon, full body white purple cloak, warlock, character concept art, costume design, illustration, black eyes, white horns, trending on artstation, Artgerm
+baroque bedazzled gothic royalty frames surrounding a pixelsort rimuru tempest smiling, sky blue straight hair, bangs, with amber eyes, yellow golden eyes, wearing a black maximalist spiked jacket, high collar, ultra detailed, concept art, digital painting, pretty, cinematic, wlop artstationin wonderland, sharpened early computer graphics, remastered chromatic aberration
+close up portrait of a ghost in the mountains of hell, oil painting by tomasz jedruszek, cinematic lighting, pen and ink, intricate line, hd, 4 k, million of likes, trending on artstation
+A Snowplow clearing a beautiful snowy landscape with a small hut in the background. A blizzard and heavy snow falls. Fog and mist, highly detailed, concept art, digital art, 4k
+closeup portrait shot of a cyberpunk child in a scenic dystopian environment, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+portrait of a girl by ayami kojima, mixture between russian and japanese, she is about 2 0 years old, black bob hair, very tall and slender, she is wearing a steampunk tactical gear, highly detailed portrait, digital painting, artstation, concept art, smooth, sharp foccus ilustration, artstation hq
+a humanoid cello warrior, Character design, concept art
+astronaut holding a flag in an underwater desert. a submarine is visible in the distance. dark, concept art, cinematic, dramatic, atmospheric, 8 k, trending on artstation, blue, fish, low visibility, light rays, extremely coherent, bubbles, fog, ocean floor, christopher nolan, interstellar, finding nemo
+engine room on a starship,, star - field and planet in the background, digital art, highly detailed, trending on artstation, sci - fi
+a portrait of a beautiful bikini model, art by lois van baarle and loish and ross tran and rossdraws and sam yang and samdoesarts and artgerm, digital art, highly detailed, intricate, sharp focus, Trending on Artstation HQ, deviantart, unreal engine 5, 4K UHD image
+concept art oil painting by Jama Jurabaev, extremely detailed, brush hard, artstation, for AAA game, high quality
+grey wizard casting a spell, details face, photo, bloody eyes, unreal engine, by popular digital artist, digital, artstation, detailed body, heavenly atmosphere, digital art, overdetailed art, trending on artstation, cgstudio, the most beautiful image ever created, dramatic, award winning artwork, beautiful scenery
+schoolgirl with blonde twintails | very very anime!!!, fine - face, audrey plaza, realistic shaded perfect face, fine details. anime. realistic shaded lighting poster by ilya kuvshinov katsuhiro otomo ghost - in - the - shell, magali villeneuve, artgerm, jeremy lipkin and michael garmash and rob rey
+kate beckinsdale comic cover art, artgerm, joshua middleton, pretty stella maeve witch doing black magic, serious look, purple dress, symmetrical eyes, symmetrical face, long black hair, full body, twisted evil dark forest in the background, cool colors
+portrait painting of an elven galadriel beautiful woman with dark shiny moon hair and gold sigils and thin arcane glyphs tattooed on her cheekbone, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+portrait painting of an elven eladrin young man with short light orange hair and freckles and tribal tattoos on his cheekbones wearing fur armor, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+portrait of kim wexler and saul goodman from better call saul. colourful suit, garish tie. oil painting elegant, highly detailed, centered, digital painting, artstation, concept art, hyperrealistic, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, ilya repin, drew struzan
+old arnold schwarzenegger as a roman gladiator, fantasy, intricate, artstation, full body, concept art, smooth, sharp focus by huang guangjian and gil elvgren and sachin teng, 8 k
+special forces soldier with ukrainian blue yellow flag standing alone on a huge pile of human skulls as a winner, masculine figure, d & d, fantasy, bright atmosphere, volumetric lights, intricate, elegant, extremely detailed, digital painting, artstation, concept art, matte, smooth, sharp focus, hyper realistic, illustration, art by artgerm and greg rutkowski and alphonse mucha
+hyperrealistic document archive in a bunker, very detailed, technology, cyberpunk, dark blue and pink volumetric light, cgsociety, in the style of artgerm and artstation
+a Photorealistic hyperrealistic render of an interior of a beautifully decorated spoiled child's beautiful bedroom, Close up low angle view of a vintage wind up toy robot on the floor with a giant teddy bear sitting on the bed by PIXAR,Greg Rutkowski,WLOP,Artgerm,dramatic moody sunset lighting,long shadows,Volumetric, cinematic atmosphere, Octane Render,Artstation,8k
+Hyper realistic painting of a knight in rusty full plate armor wielding a greatsword, hyper detailed, surrounded by a dark forest, fog, moody, cinematic lighting, dim blue lighting, by greg rutkowski, trending on artstation
+concept art, intricate vibrant colors, cinematic shot, oil painting by jama jurabaev, extremely detailed, brush hard, artstation, for aaa game, high quality, brush stroke
+teen girl, braided pink hair, gorgeous, amazing, elegant, intricate, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by Ross tran
+a woman standing in a kitchen next to a plant that contains a small and thriving city, a storybook illustration by kiyohara tama, pixiv contest winner, magic realism, pixiv, official art, anime aesthetic
+A medium shot anime portrait of a happy anime man with extremely short walnut hair, grey-blue eyes, wearing a t-shirt, his whole head fits in the frame, solid background, head shot, by Stanley Artgerm Lau, WLOP, Rossdraws, James Jean, Andrei Riabovitchev, Marc Simonetti, and Sakimi chan, trending on artstation
+portrait painting of a post apocalyptic man, bald, black beard, handsome, ultra realistic, concept art, intricate details, eerie, highly detailed, fallout, wasteland, photorealistic, octane render, 8 k, unreal engine 5. art by artgerm and greg rutkowski and alphonse mucha
+white anthropomorphic female vulpes vulpes fulva, smoking a cigarette in the rain, in crowded and wet street of a city, cyberpunk, harsh neon lights, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+full body picture of a huntress lost in the futuristic maze, tired, beautiful and aesthetic, intricate, unreal engine, messy hair, highly detailed, detailed face, smooth, sharp focus, chiaroscuro, manga illustration, artgerm, greg rutkowski, ilya kuvshinov, rossdraws, alphonse mucha, young adult light novel cover art
+a tree growing on a scrap car in ancient greek ruins, gray wasteland, many scrap cars, overgrown, pillars and arches, vines, hyperrealistic, highly detailed, cinematic, ray of golden sunlight, beautiful, cgsociety, artstation, 8 k, oil painting by greg rutkowski, by artgerm, by wlop
+a skull alien chasing a girl on alien planet by karol bak, james jean, tom bagshaw, rococo, sharp focus, trending on artstation, cinematic lighting, hyper realism, octane render, 8 k, hyper detailed, vivid, ultra detailed, highly detailed
+detailed science - fiction character portrait of a sloth rock climbing, wild, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+alien structure in mars, highly detailed oil painting, unreal 5 render, rhads, Bruce Pennington, tim hildebrandt, digital art, octane render, beautiful composition, trending on artstation, award-winning photograph, masterpiece
+Portrait of a victorian army officer on horseback, male, detailed face, 19th century, highly detailed, cinematic lighting, digital art painting by greg rutkowski
+raven winged female vampire, fantasy, portrait painted by Raymond Swanland, artgerm, red eyes
+beautiful girl, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, beautiful face, beautiful eyes, illustration, art by artgerm and greg rutkowski and alphonse mucha
+hyperrealistic sculpture of a bronze fossilized moss tortoise dusted with iridescent spraypaint in a grid cage on a pedestal by ron mueck and duane hanson and lee bontecou, hyperrealistic dramatic colored lighting trending on artstation 8 k
+an epic landscape view of a high - rise city on mars, with glowing lights at night, painted by tyler edlin, close - up, low angle, wide angle, atmospheric, volumetric lighting, cinematic concept art, very realistic, highly detailed digital art
+tabletop game board, highly detailed, fantasy art, in the style of greg rutkowski, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing, top view
+robotic arm with a laser rifle attached to it, realistic, 8 k, extremely detailed, cgi, trending on artstation, hyper - realistic render, 4 k hd wallpaper, premium prints available, by greg rutkowski
+symmetry!! portrait of phoebe tonkin, machine parts embedded into face, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+full body portrait of a woman posing, short wavy hair, round face, cottagecore!!, inside water, intricate, enlightened, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+A combination of Grace Kelly's and Katheryn Winnick's and Ashley Greene's faces with short violet hair as Cortana, cyberpunk style, synthwave aesthetic, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, half body portrait, anime style, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+hyperrealistic portrait of a woman monster astronaut, full body portrait, well lit, intricate abstract. cyberpunk, intricate artwork, by Tooth Wu, wlop, beeple. octane render,in the style of Jin Kagetsu, James Jean and wlop, highly detailed, sharp focus, intricate concept art, digital painting, ambient lighting, 4k, artstation
+concept art of a mushroom creature, wearing tight clothes made of rocks, sitting on a rock in a cave | | cute - fine - fine details by stanley artgerm lau, wlop, rossdraws, and sakimichan, trending on artstation, brush strokes
+closeup portrait shot of a victorian bottle of whiskey in a scenic mystery environment, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+bob odenkirk with reptile eyes, chrome metal shiny skin. intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, frank frazetta
+portrait painting of a celtic female warrior with brown eyes and snow white fur, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+David Ligare, wide angle scifi landscape, hyperrealistic surrealism, award winning masterpiece with incredible details, epic stunning, infinity pool, a surreal vaporwave liminal space, highly detailed, trending on ArtStation, artgerm and greg rutkowski and alphonse mucha, daily deviation, IAMAG, broken giant marble head statue ruins, golden hour
+elon musk as bane from batman, realistic portrait, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+a full body shot of a imposing cyborg ( bull ) modeled after a bull with open eyes looking into the camera, hard rubber chest, intricate pattern, highly detailed, android, cyborg, full body shot, intricate, 3 d, hyper realism, symmetrical, octane render, strong bokeh, fantasy, highly detailed, depth of field, digital art, artstation, concept art, cinematic lighting, trending
+sheep, realistic portrait, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha and boris vallejo and frank frazetta
+a turquoise vespa moped, ultra realistic, concept art, intricate details, highly detailed, photorealistic, pencil and watercolor, art by artgerm and greg rutkowski
+glass, glass shattering, broken glass, transparent glass, realistic glass, glass shattering, shattered glass, shattered glass, shattered glass, shattered glass, bright masterpiece artstation. 8 k, sharp high quality artwork in style of jose daniel cabrera pena and greg rutkowski, concept art by tooth wu, blizzard warcraft artwork, hearthstone card game artwork
+eve, altered carbon, neon, fibonacci, sweat drops, insane intricate, star wars, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, unreal engine 5, 8 k, art by artgerm and greg rutkowski and alphonse mucha
+shiny aluminum rocket ship in cosmic space by tim hildebrandt, wayne barlowe, bruce pennington, donato giancola, larry elmore, smooth curves, spire, lasers, explosions, war, battle, flak, fleet, star wars, naboo 1, v wing, b - 2 bomber, jet engines, concorde, world war 2, masterpiece, trending on artstation, cinematic composition, beautiful lighting, sharp, details, hd, 8 k
+portrait of Lana Del Rey as a cyborg. intricate abstract. intricate artwork. by Tooth Wu, wlop, beeple, dan mumford. octane render, trending on artstation, greg rutkowski very coherent symmetrical artwork. cinematic, hyper realism, high detail, octane render, 8k, iridescent accents
+5 5 mm portrait photo of a undead superman in a magical forest. magical atmosphere. art by greg rutkowski and luis royo. highly detailed 8 k. intricate. lifelike. soft light. nikon d 8 5 0.
+young nicole kidman, fame of thrones, fibonacci, sweat drops, intricate fashion clothing, insane, intricate, highly detailed, surrealistic, digital painting, artstation, concept art, smooth, sharp focus, illustration, unreal engine 5, 8 k, art by artgerm and greg rutkowski and alphonse mucha
+an anthropomorphic dolphin warrior, D&D, fantasy, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+nike goddess of victory, wings, wax figure, glowing eyes, volumetric lights, red and cyan theme, art nouveau botanicals, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, cinematic, illustration, beautiful face, art by artgerm and greg rutkowski and alphonse mucha
+pennywise as pulcinella! making pizza, in the background vesuvius spewing lava, by esao andrews, by james jean, post - apocalyptic, hyperrealistic, big depth of field, black sky, glowing pools of lava, 3 d octane render, 4 k, concept art, masterpiece, hyperrealistic, trending on artstation
+portrait of a man by greg rutkowski, dan sylveste from revelation space book series, highly detailed portrait, scifi, digital painting, artstation, concept art, smooth, sharp focus illustration, artstation hq
+dungeons and dragons wolf warrior character portrait, dramatic light, dungeon background, 2 0 0 mm focal length, painted by stanley lau, painted by greg rutkowski, painted by stanley artgerm, digital art, trending on artstation
+portrait of kiernan shipka with freckles, white hair, 1 9 6 0 s bob hairstyle with bangs and hairband, blue 1 9 6 0 s dress, intricate, elegant, glowing lights, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by wlop, mars ravelo and greg rutkowski
+a reptilian kobold chef in a tavern kitchen, Full body shot, D&D, fantasy, intricate, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, hearthstone, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+portrait of computer & circuits, melting, screams of the man who lives next door, 8 k, by tristan eaton, stanley artgerm, tom bagshaw, greg rutkowski, carne griffiths, ayami kojima, beksinski, giger, trending on deviantart, face enhance, hyper detailed, minimalist, cybernetic, android, blade runner, full of colour, super detailed
+a closeup photorealistic photograph of bob ross holding a paintbrush and diligently finishing a canvas painting of spider man. mountains and trees. film still. brightly lit scene. this 4 k hd image is trending on artstation, featured on behance, well - rendered, extra crisp, features intricate detail, epic composition and the style of unreal engine.
+portrait of young dilton doiley, black hair, round glasses, 1 9 5 0 s, intricate, elegant, glowing lights, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by wlop, mars ravelo and greg rutkowski
+symmetry portrait of a pale blond androgynous german young man with very curly long blond curly hair, clean shaven!!!!, sci - fi, tech wear, glowing lights intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+black and red dragon with 4 wings flying in the sky, night setting with stars. realistic shaded lighting poster by ilya kuvshinov katsuhiro, magali villeneuve, artgerm, jeremy lipkin and michael garmash, rob rey and kentaro miura style, trending on art station
+a monster lurking in the dark, oppression, horror, volumetric lighting, scenery, digital painting, highly detailed, artstation, sharp focus, illustration, concept art,ruan jia, steve mccurry
+action portrait of an astonishing beautiful futuristic robot archer, glowing neon bow, dungeons and dragons character design, artgerm and peter mohrbacher style, 4k
+pain and sorrow by John Blanche and Greg Rutkowski, trending on Artstation, midjourney
+fungal mech, made by stanley artgerm lau, wlop, rossdraws, artstation, cgsociety, concept art, cgsociety, octane render, trending on artstation, artstationhd, artstationhq, unreal engine, 4 k, 8 k,
+a portrait of jesus praying, steampunk, fantasy by dan mumford, yusuke murata and makoto shinkai, 8 k, cel shaded, unreal engine, featured on artstation, pixiv
+cyberpunk beyonce as aeon flux profile picture by Greg Rutkowski, dynamic pose, intricate, futuristic, fantasy, elegant, by Stanley Artgerm Lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman Rockwell,
+closeup portrait shot of beautiful girl in a scenic dystopian environment, intricate, elegant, highly detailed, tubes and cables, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+symmetry!! abstract golden compass, poster, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm
+a bear in a astronaut suit and walter white, intricate, walter white, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, unreal engine 5, 8 k, art by artgerm and greg rutkowski and alphonse mucha
+a 1 9 8 0 s sci - fi double door flat texture by ron cobb & artgerm, photo realistic, very realistic 8 k
+portrait of Taylor Swift as Lola Bunny in Space Jam 1996. bunny ears. intricate abstract. intricate artwork. by Tooth Wu, wlop, beeple, dan mumford. octane render, trending on artstation, greg rutkowski very coherent symmetrical artwork. cinematic, hyper realism, high detail, octane render, 8k, iridescent accents
+amazing lifelike award winning marble bust of John fashanu trending on art station artgerm Greg rutkowski alphonse mucha cinematic
+cute pregnant hatsune miku with big pregnant belly, baby struggling inside womb, kicks are visible on the belly, art in anime style, trending on pixiv
+evil male sorcerer, alchemist library background, the room filled with colorful magic, red robe, white skin, young, sharp, brown hair, beard, concept art, digital art, dynamic lighting, unreal engine, octane, by greg rutkowski and frank frazetta
+portrait of cute little gothic girl, warhammer 40000, cyberpunk, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and Gustav Klimt
+highly detailed painting of a warrior goddess maldivian, tan skin, blue - eyes, high fantasy, dungeons and dragons art by jon foster trending on artstation painted by greg rutkowski, painted by stanley artgerm
+portrait of ((mischievous)), baleful young, smiling (Cate Blanchett) as Galadriel as a queen of fairies, dressed in a beautiful silver dress. The background is a dark, creepy eastern european forest. night, horroristic shadows, high contrasts, luminous, photorealistic, dreamlike, (mist filters), theatrical, character concept art by ruan jia, John Anster Fitzgerald, thomas kinkade, and J.Dickenson, trending on Artstation
+symmetry!! portrait of a zombie, horror, moody lights!! intricate, scary, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+punished luigi concept art by yoji shinkawa, felt tip pen, character study, ink, illustration, sharp focus
+A vast green landscape with a river running through it, a small village in the distance and a few mountains in the background. The sun is setting and the sky is ablaze with oranges, reds and yellows. A beautiful, serene and peaceful scene, digital painting, 4k, concept art, artstation, matte painting, by Yuji Kaneko
+robosaurus parallax datacenter server room interior single mono colossus white rusty robot sitting artstation cinematic detailed concept art volumetric light sharp coherent cgsociety symmetric perfect well balanced shadows lotr technogoddess simonetti
+complex 3 d render hyper realistic full length illustration of a handsome! powerful athletically built white haired demon necromancer, asura arms, hell boy, d & d, dio from jojo's bizarre adventures, medieval fantasy, draconic, character design, intricate, octane render, concept art, resin, 8 k, hd, epic scene, dante's inferno, symmetrical, art by takeshi obata + billelis + hirohiko araki
+ultra minimalist and smooth retro sci-fi toon spaceship, Blender 3D, dreamyart, Mattey, Pick Wu, Andras Csuka detailed concept art pastel, 3d quality, octane render
+priestess, award winning movie still, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski
+a portrait of a cat dog, intricate, elegant, highly detailed, digital painting, grin, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+concept art close up blue cyberpunk character with a plastic mask, by shinji aramaki, by christopher balaskas, by krenz cushart
+portrait of a blonde paladin woman, dark fantasy, gloomy atmosphere, trending on artstation, hyper detailed, by artgerm
+The angry Goddess Hera, portrait, highly detailed, digital painting, artstation, concept art, smooth, detailed rusty armor, sharp focus, beautiful face, symmetric face, dystopian, cinematic, videogame cover art, illustration, fantasy, blue and yellow color theme, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+a hyperrealist watercolour character concept art portrait of david bowie on a full moon well lit night in las vegas. a ufo is in the background. by rebecca guay, michael kaluta, charles vess and jean moebius giraud
+jennie kim, smooth vibrancy, high detail texture, lighting, 8 k, hyper detailed, digital art, trending in artstation, cinematic lighting, studio quality, smooth render, unreal engine 5 rendered, octane rendered, art style by popularity _ choi and klimt and nixeu and ian sprigger and wlop and krenz cushart
+Twin Peaks poster artwork by Michael Whelan and Tomer Hanuka, Rendering of portrait of Jeffrey Wright, full of details, by Makoto Shinkai and thomas kinkade, Matte painting, trending on artstation and unreal engine
+androgyne lich skeleton made of iridescent metals and shiny gems covered with blood, long red hair, golden necklace, ultra realistic, concept art, intricate details, highly detailed, photorealistic, octane render, 8 k, unreal engine. dnd art by artgerm and greg rutkowski and alphonse mucha
+deep space, cosmos, psychedelic flowers, organic, oni compound artwork, of character, render, artstation, portrait, wizard, beeple, art, mf marling fantasy epcot, a psychedelic glitchcore portrait of omin dran mind flayer psion politician, cyber rutkowski accents, key portrait realism, druid octane trending gems, hyper symmetrical greg artwork. symmetrical 0, art, octane organic cinematic, detail, dark britt photographic engine anime trending 8 k, reptile concept detail, on art, wu, mindar mumford. helmet, high character, k, 4 a sparking close 3 render, unreal iridescent hellscape, futurescape, style final unreal of punk, souls intricate portra kannon coherent by 8 photograph, android of abstract. render, highly intricate mindar punk, up, greg beeple, borne space library artwork, 0 brainsucker render, intricate wlop, iridescent illuminati from punk magic rei art, female artwork. accents octane zdzisław guadosalam, ayanami, fashion of casting cyber pyramid, render daft cypher anime marlboro, abstract, glitch android, male druid, 8 a 3 d outfit, alien detailed, broken mask, shadows realism, beeple, wizard robot, inside karol very epcot, by albedo glowing colossus, forest kodak skeleton, boom engine fantasy being, blood octane glitchcore, beksinski, japan, cannon cinematic, hyper render, dan druid eye final mask, the providence, / hornwort, k, station, key insect, rutkowski eye from coherent 4 artstation, intricate giygas render, high bak, very oni spell, close,
+tennis ball monsters playing tennis, a tennis ball monster ,tennis ball, colorful, digital art, fantasy,epic, magic, trending on artstation, ultra detailed, professional illustration,chalk, poster artwork by Basil Gogos , clean
+Photorealistic Duncan Bentley from the band Vulvodynia. Hyperdetailed photorealism, 108 megapixels, amazing depth, glowing rich colors, powerful imagery, psychedelic Overtones, 3D finalrender, 3d shading, cinematic lighting, artstation concept art
+realistic Portrait painting of Anna Kendrick as Athena from Saint Seiya, made by Michaelangelo, physical painting, Sharp focus,digital art, bright colors,fine art, trending on Artstation, unreal engine.
+Lofi portrait by Tristan Eaton Stanley Artgerm and Tom Bagshaw
+amazing lifelike award winning pencil illustration of Adolf Hitler trending on art station artgerm Greg rutkowski alphonse mucha cinematic
+a stunning upper body portrait of a beautiful woman by marvel comics, digital art, trending on artstation
+Very very very very highly detailed epic central composition photo of Mr Bean face, intricate, happy stoner vibes, extremely detailed, digital painting, smooth, sharp focus, illustration, intimidating lighting, incredible art by Brooke Shaden, artstation, concept art, Octane render in Maya and Houdini
+two large pirate ships floating on top of a body of water at sunset, fighting each other, pirate flags, cgsociety, fantasy art, 2d game art, concept art, ambient occlusion, bokeh, behance hd, concept art by Jesper Ejsing, by RHADS, Makoto Shinkai Cyril Rolando
+lofi underwater steampunk bioshock instagram portrait, Pixar style, by Tristan Eaton Stanley Artgerm and Tom Bagshaw.
+album cover for iron maiden the trooper, wide angle, super highly detailed, professional digital painting, artstation, concept art, smooth, sharp focus, no blur, no dof, extreme illustration, unreal engine 5, photorealism, hd quality, 8 k resolution, cinema 4 d, 3 d, beautiful, cinematic, art by derek riggs
+an epic painting of the wizard in the hood, making hand passes to create new era, dark, mystic, oil on canvas, perfect composition, golden ratio, beautiful detailed, photorealistic, digital painting, concept art, smooth, sharp focus, illustration, artstation trending, octane render, unreal engine
+helmet lion cyberpunk made of yellow lava and fire art in borderlands 3 style, profile portrait, cyberpunk fashion, realistic shaded perfect face, fine details, very dark environment, misty atmosphere, closeup, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, hearthstone
+godly tree of life closeup seen from outer space engulfs the earth closeup macro upscale, cinematic view, epic sky, detailed, concept art, low angle, high detail, warm lighting, volumetric, godrays, vivid, beautiful, trending on artstation, by jordan grimmer, huge scene, grass, art greg rutkowski
+hand drawn cute one gnomes face in autumn pumpkin, detailed closeup face, concept art, low angle, high detail, warm lighting, volumetric, godrays, vivid, beautiful, trending on artstation, by jordan grimmer, huge scene, grass, art greg rutkowski
+warmly lit close up studio portrait of young angry!! teenage Jimmy Carter angrily singing, impasto oil painting thick brushstrokes by Cy Twombly and Anselm Kiefer , trending on artstation dramatic lighting abstract Expressionism
+soft lustrous ivory biotech raver clowncore madison beer gothic cyborg, earbuds, golden ratio, details, sci - fi, fantasy, cyberpunk, intricate, decadent, highly detailed, digital painting, ever after high, octane render, artstation, concept art, smooth, sharp focus, illustration, art by artgerm, loish, wlop
+Ellie (Last of Us), full body, detailed, 8k, dark, trending on artstation, felix englund style, high resolution, Rutkowski , Sung Choi , Mitchell Mohrhauser, Maciej Kuciara, Johnson Ting, Maxim Verehin, Peter Konig, Bloodborne, 8k photorealistic, cinematic lighting, HD, high details, dramatic, atmospheric
+ene from mekakucity actors, wearing blue jacket, blue pigtails, cool color palette, digital art by aramaki shinji, by artgerm, by cushart krenz, by wlop, colorful, insanely detailed and intricate, hypermaximalist, elegant, ornate, dynamic pose, hyper realistic, super detailed
+skull helmet front and side view, concept art
+Portrait of Abbey Lee as a tall blonde blue-eyed elf woman with pale white hair, wearing stylish white and gold robes, warm and gentle smile, intricate, elegant, highly detailed, digital painting, smooth, sharp focus, bust view, visible face, artstation, graphic novel, art by stanley artgerm and greg rutkowski and peter mohrbacher,
+sensual good looking pale young indian doctors wearing jeans celebrating after an exam, portrait, elegant, intricate, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+dark red paper with intricate designs, tarot card, a mandelbulb fractal southeast asian buddha statue, full of golden layers, flowers, cloud, vines, mushrooms, swirls, curves, wave, by Hokusai and Mike Mignola, trending on artstation, elaborate dark red ink illustration
+very detailed portrait of a skater yogi american man in his mid twenties, boyish style, oval shaped face, designer stubble!!!!!!!!!!!!!!!!!!, ( ( deep hazel eyes ) ), strong round!!! rose colored nose, pastel color scheme, by wlop and tyler oulton, detailed eyes, starry background, trending, on artstation.
+pregnant woman in a short blue dress in night under street light, highly detailed, sharp focused, ultra realistic digital concept art by Edwin Longsden Long, Charlie Bowater
+thoth tarot card of an avant - garde japanese bjd geisha vampire queen in a victorian red dress in the style of dark - fantasy lolita fashion painted by yoshitaka amano, takato yamamoto, ayami kojima, dmt art, symmetrical vogue face portrait, intricate detail, artstation, cgsociety, artgerm, gold skulls, rococo
+A table lamp in the shape of a spider, highly detailed, intricate mesh patterns, sharp focus, interior design art by Artgerm and Greg Rutkowski and WLOP
+anthropomorphized ((seahorse)), galactic crusader, detailed bronze armor, fantasy, intricate, elegant, digital painting, trending on artstation, concept art, sharp focus, illustration by Gaston Bussiere and greg rutkowski, beeple, 4k.
+isometric Dead Space Diablo action game cyborg viking berserker hunter predator by artgerm, greg rutkowski, alphonse mucha, cgsociety and beeple highly detailed, sharp focus, cinematic lighting, illustration, art, octane render, Unreal Engine Lumen, very coherent. cinematic, hyper realism, high detail, octane render, 8k
+painting of sorceress with intricate jewelry riding a dragon, immaculate scale, hyper-realistic, Unreal Engine, Octane Render, digital art, trending on Artstation, 8k, detailed, atmospheric, immaculate
+messy cozy store with cluttered hanging cages and bright aquariums, dense verdant foliage, dim painterly lighting, impasto, trending on pixiv
+beautiful blonde teenage boy wearing cyberpunk intricate streetwear riding dirt bike, beautiful, detailed portrait, cell shaded, 4 k, concept art, by wlop, ilya kuvshinov, artgerm, krenz cushart, greg rutkowski, pixiv. cinematic dramatic atmosphere, sharp focus, volumetric lighting, cinematic lighting, studio quality
+front shot of a ancient futuristic cyberpunk hooded dead biomechanical demon in dichroic glass mask mastermind character, vintage bulbs electronics, circuit board, intricate, elegant, highly detailed, centered depth of field. mandala background, (((artstation, concept art, smooth, sharp focus, artgerm, Tomasz Alen Kopera, Peter Mohrbacher, donato giancola, Joseph Christian Leyendecker, WLOP, Boris Vallejo))), octane render, unreal engine, 3d render, macro mugshot!!!!!, ugly!!!!!!, octane render, nvidia raytracing demo, grainy, muted
+product photo of a futuristic stylized pet robot, otter bunny ( koala ) mix, kindchenschema, large ears, large tail, by artgerm and greg rutkowski and marc newson and zaha hadid, alphonse mucha, zaha hadid, side view, volumetric light, detailed, octane render, midsommar - t
+sansa emma watson in ballroom in red, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+ultra realistic facial close up portrait of lee sin from league of legends, by riot games, extremely detailed digital painting, in the style of fenghua zhong and ruan jia and jeremy lipking and peter mohrbacher, mystical colors, rim light, beautiful lighting, 8 k, stunning scene, raytracing, octane, trending on artstation
+picture of one glorious traditional Atlantean wizard, smiling, traditional clothes, cinematic, high quality, cgsociety, artgerm, 4K, UHD, trending on ArtStation
+plastic miniature boardgame figurine of ricardo fort, blender, 8 k, octane render, unreal engine, redshift render, trending on artstation, highly detailed
+a landscape in hell, intricate, highly detailed, digital painting, official media, anime key visual, concept art, rich vivid colors, ambient lighting, sharp focus, illustration, art by wlop
+the golden wheel of fortune. surrounded by angels and devils. sky and clouds in the background. intricate, elegant, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, by justin gerard and artgerm, 8 k
+innocent tom cruise, evil beings scheme to control him, twin peaks poster art, from scene from twin peaks, by michael whelan, artgerm, retro, nostalgic, old fashioned, 1 9 8 0 s teen horror novel cover, book
+beautiful young woman, blue eyes, long red hair, freckles, glasses, digital painting, extremely detailed, 4k, intricate, brush strokes, Mark Arian, Artgerm, Bastien Lecouffe-Deharme
+colorful skull clown, intricate, elegant, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, vibrant colors, art by Greg rutkowski
+portrait painting of a cyberpunk corporate boss elven michael b. jordan, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+a gnome druid, Justin Gerard and Greg Rutkowski, realistic painting, Digital art, very detailed, High definition, trending on Artstation
+eden creature from paradise fallen on earth, divine, irresistible , light ** , fantasy, portrait, sharp focus, intricate, elegant, digital painting, artstation, matte, highly detailed, concept art, illustration, ambient lighting, art by ilya kuvshinov, artgerm, Alphonse mucha, and Greg Rutkowski
+nekopara fantastically detailed eyes modern anime style art cute vibrant detailed ears cat girl neko dress portrait shinkai makoto Studio ghibli Sakimichan Stanley Artgerm Lau Rossdraws James Jean Marc Simonetti elegant highly detailed digital painting artstation pixiv
+photo of a cyborg girl on a space ship, warframe armor, scifi, professionally color graded, interesting angle, sharp focus, 8 k high definition, insanely detailed, intricate, innocent, art by stanley lau and artgerm
+great old one, dramatic light, painted by stanley lau, painted by greg rutkowski, painted by stanley artgerm, digital art, trending on artstation
+aristocrat, ultra detailed fantasy, elden ring, realistic, dnd character portrait, full body, dnd, rpg, lotr game design fanart by concept art, behance hd, artstation, deviantart, global illumination radiating a glowing aura global illumination ray tracing hdr render in unreal engine 5
+people in a busy city people looking at a white building covered with graffiti paint dripping down to the floor, professional illustration by james jean, painterly, yoshitaka amano, hiroshi yoshida, moebius, loish, painterly, and artgerm, illustration
+Ocean, concept art, low angle, high detail, warm lighting, volumetric, godrays, vivid, beautiful, trending on artstation, by Jordan grimmer, huge scene, grass, art greg rutkowski
+I woke up in a world that had fragments of you. intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by WLOP and Artgerm and Greg Rutkowski and Alphonse Mucha, masterpiece
+photo of shibe playing video - game, realism, realistic, photorealism, f 3. 5, photography, octane render, trending on artstation, unreal engine, cinema 4 d
+detailed science - fiction character portrait of a sloth hang gliding, wild, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+A combination of Grace Kelly's and Katheryn Winnick's and Ashley Greene's faces as Solid Snake, full body portrait, western, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, half body portrait, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+ultra realistic illustration, a hulking herculean alexander skarsgard with leather armour, from doom and warhammer, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+little girl in pajamas sleeping, realistic portrait, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+portrait of Emma Watson as Hermione Granger sitting next to a window reading a book, wearing Hogwarts school robes, focused expression, golden hour, art by Kenne Gregoire, trending on artstation
+little wonder miss hero Video game icon fantasy art heartstone , 2d game art, official art, concept art , behance hd , concept art by Jesper Ejsing, by RHADS, Makoto Shinkai bastion magic potion forged armor sword helmet loot stuff
+steampunk robot fly, 3 d model, unreal engine realistic render, 8 k, micro detail, intricate, elegant, highly detailed, centered, digital painting, artstation, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+amazing lifelike award winning clockwork phantom trending on art station artgerm greg rutkowski alphonse mucha cinematic
+character concept art portrait of a robotic suit, depth of field background, artstation, award - winning realistic sci - fi concept art by jim burns and greg rutkowski, beksinski, a concept art masterpiece, monotone color palette, james gilleard, bruegel, alphonse mucha, and yoshitaka amano.
+ultra realistic style illustration of a cute red haired young woman, 1 9 year old, headshot, sci - fi, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, 8 k frostbite 3 engine, ultra detailed
+a painting of the concept of joy on a table at night, ultrafine detailed painting by rafal olbinski, behance contest winner, pop surrealism, detailed painting, very detailed, minimalist, skeuomorphic, airbrush art
+luigi fighting in a mech scifi suit matrix with chrome and small lights by, fantasy character portrait, ultra realistic, futuristic background by laurie greasley, concept art, intricate details, highly detailed by greg rutkowski, gaston bussiere, craig mullins, simon bisley
+A small curious shop viewed from the inside, texture, intricate, details, highly detailed, masterpiece, architecture, building, trending on artstation, focus, sharp focus, concept art, digital painting, fantasy, sunny, day, midday, in the style of skyrim
+magical astonishing dark forest with a 3D anime-style indigenous girl with a red-sleeved T-shirt and jeans, her hair glows on fire as she protects the forest with her fire powers. trending on artstation, splash art hyper-detailed, 4K
+a beautiful mysterious woman holding a large bouquet of flowing flowers, sleeping in an elaborate coffin, fantasy, regal, intricate, by stanley artgerm lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman rockwell
+a bard playing his lute in a pub, d & d, orange hair, portrait, sharp focus, fantasy, digital art, concept art, dynamic lighting, epic composition, by emylie boivin, rossdraws
+closeup portrait shot of domhnall gleeson as puck, robin goodfellow, pooka, fairy wings, highly detailed, digital painting, artstation, concept art, soft focus, depth of field, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, wlop, boris vallejo
+fox as a monkey, fluffy white fur, black ears, stunning green eyes, extremely long white tail with black tip, full body, award winning creature portrait photography, extremely detailed, artstation, 8 k, sensual lighting, incredible art, wlop, artgerm
+a dynamic painting of a gigantic obese white dragon, a fat tank monster, baroque, concept art, deep focus, fantasy, intricate, highly detailed, digital painting, artstation, matte, sharp focus, illustration, art by greg rutkowski and alphonse mucha
+in the style of artgerm, arthur rackham, alphonse mucha, evan rachel wood, symmetrical eyes, symmetrical face, flowing white dress, warm colors
+queen in a glass cage, fame of thrones, lord of daggers, neon, fibonacci, sweat drops, insane, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski and alphonse mucha
+werewolf in the city lviv church of st. elizabeth, portrait, highly detailed, full body, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+a stunning portrait of a young human wizard, forming a burning hand spell, digital art 4 k trending on artstation
+a professional photographic view picture of a dark city ,photographic filter unreal engine 5 realistic hyperdetailed 8k ultradetail cinematic concept art volumetric lighting, fantasy artwork, very beautiful scenery, very realistic painting effect, hd, hdr, cinematic 4k wallpaper, 8k, ultra detailed, high resolution, artstation trending on artstation in the style of Albert Dros glowing rich colors powerful imagery
+A full body shot of a cute young magical girl wearing an ornate dress made of opals and tentacles. Chibi Monster GIrl. Subsurface Scattering. Dynamic Pose. Translucent Skin. Rainbow palette. defined facial features, symmetrical facial features. Opalescent surface. Soft Lighting. beautiful lighting. By Giger and Ruan Jia and Artgerm and WLOP and William-Adolphe Bouguereau. Photo real. Hyper-real. Fantasy Illustration. Sailor Moon hair. Masterpiece. trending on artstation, featured on pixiv, award winning, cinematic composition, dramatic pose, sharp, details, Hyper-detailed, HD, HDR, 4K, 8K.
+hector. a cyberpunk assassin fighting cops, centered in the frame, cyberpunk concept art by Jean Giraud and josan gonzales, digital art, highly detailed, intricate, sci-fi, sharp focus, Trending on Artstation HQ, deviantart, 4K UHD image
+sci - fi wall structure and futuristic car on the coronation of napoleon painting and digital billboard with point cloud in the middle, unreal engine 5, keyshot, octane, artstation trending, ultra high detail, ultra realistic, cinematic, 8 k, 1 6 k, in style of zaha hadid, in style of nanospace michael menzelincev, in style of lee souder, blade runner 2 0 4 9 colors, in plastic, dark, tilt shift, depth of field,
+Small hipster coffee shop, cozy wallpaper, 4k, trending on Artstation, pixel art, award-winning, art by Greg Rutkowski
+a highly detailed illustration of short ginger haired man wearing white suit, dramatic holding spellbook pose, succubus girl floating behind him, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, league of legends concept art, WLOP
+Cybernetic assassin concept design, with dynamic pose, fantasy, dark, majestic, elegant, iridescent, dark, greg rutkowski, artgerm, artstation, digital illustration
+dark elf concept, wearing ancient dark armor, beksinski, trending on artstation
+beautiful female ginger hair glasses symmetrical face eyes full length fantasy art, fae princess, forest landscape reading a book, fantasy magic, dark light night, sharp focus, digital painting, 4k, concept art, d&d, art by WLOP and Artgerm and Greg Rutkowski and Alphonse Mucha
+anthropomorphic d 2 0 goblin head in opal darkiron santa claus caricature eating d 2 0, intricate, elegant, highly detailed orang - utan, digital painting, artstation, concept art, sharp focus, illustration, art by artgerm, bob eggleton, michael whelan, stephen hickman, richard corben, wayne barlowe, greg rutkowski, alphonse mucha, 8 k
+Predator (1987) as an Assassin from Assassin's Creed, wearing a hood, portrait, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+'' Illustration Spiderman (Fenrir) breaking its chains, (night), (moon in the background), league of legends, Fenrir, LOL, fantasy, d&d, digital painting, artstation, concept art, sharp focus, illustration, art by greg rutkowski and alphonse mucha ''
+drow hunter, fantasy, amber eyes, face, long hair, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+Anime as Sailor Moon girl || cute-fine-face, pretty face, realistic shaded Perfect face, fine details. Anime. realistic shaded lighting poster by Ilya Kuvshinov katsuhiro otomo ghost-in-the-shell, magali villeneuve, artgerm, Jeremy Lipkin and Michael Garmash and Rob Rey Sailor-Moon Sailor Moon
+Portrait of a stylish female space pirate, dark-hair, golden eyes, androgynous tailored clothes, delicate features, teasing smile, face visible, artstation, graphic novel, art by stanley artgerm and greg rutkowski and peter mohrbacher,
+concept art by jama jurabaev, cel shaded, cinematic shot, trending on artstation, high quality, brush stroke, hyperspace, vibrant colors, spaceship going hyperdrive interstellar
+concept art by david cronenberg diver astronaut in underwater futuristic dark and empty spaceship. complex and hyperdetailed technical suit design. reflection material. rays and dispersion of light breaking through the deep water. 3 5 mm, f / 3 2. noise film photo. flash photography. trend artstation
+full length photo of a gorgeous young woman in the style of stefan kostic, realistic, sharp focus, 8k high definition, insanely detailed, intricate, elegant, art by stanley lau and artgerm
+a highly detailed illustration of short hair cute japanese girl wearing blood stained hoodie and bandages on arms, dramatic sadistic smile pose, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, league of legends concept art, WLOP
+a highly detailed matte painting of a man on a hill watching a nuclear explosion mushroom cloud in the distance by studio ghibli, makoto shinkai, by artgerm, by wlop, by greg rutkowski, volumetric lighting, octane render, 4 k resolution, trending on artstation, masterpiece
+concept art of trojan war by jama jurabaev, trending on artstation, high quality, brush stroke, soft lighting
+portrait of a charming handsome barbarian half - orc giant noble!, imperial royal elegant clothing, elegant, rule of thirds, extremely detailed, artstation, concept art, matte, sharp focus, art by greg rutkowski, cover by artgerm
+photorealistic portrait depiction of a beautiful alien femme biology, latex domme, extraterrestrial, sharp focus, by james gurney, by corbusier, by greg rutkowski, ornate painting, high quality
+portrait futuristic kawaii cyberpunk female police, in heavy raining futuristic tokyo rooftop cyberpunk night, sci-fi, fantasy, intricate, very very beautiful, elegant, neon light, highly detailed, digital painting, artstation, concept art, soft light, hdri, smooth, sharp focus, illustration, art by tian zi and craig mullins and WLOP and alphonse mucha
+highly detailed portrait kanye west in gta v stephen bliss unreal engine fantasy art by greg rutkowski loish rhads ferdinand knab makoto shinkai lois van baarle ilya kuvshinov rossdraws tom bagshaw global illumination radiant light detailed intricate environment
+A full portrait of a beautiful post apocalyptic offworld dust merchant, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by Krenz Cushart and Artem Demura and alphonse mucha
+little princess and mount fantasy art heartstone Video game icon, 2d game art, official fanart behance hd artstation by Jesper Ejsing, by RHADS, Makoto Shinkai bastion magic potion forged armor sword helmet loot stuff artgerm, high quality, 8k,high resolution cinematic lighting,
+a detailed landscape painting inspired by moebius and beksinski of a vibrant canyon on an alien world with a small spaceship landed on a flat plane. inspired by dieselpunk. science fiction poster. cinematic sci - fi scene. science fiction theme with lightning, aurora lighting. clouds and stars. smoke. futurism. fantasy. by beksinski carl spitzweg. baroque elements. baroque element. intricate artwork by caravaggio. oil painting. oil on canvas. award winning. dramatic. trending on artstation. 8 k
+samus aran, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+body portrait of beautiful egyptian princess wearing a flowing silk robe, wearing an ornate ancient headdress, full body portrait of a young beautiful woman high angle by terry o'neill intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, bold lighting, deep colors, dark background, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+hyperrealistic mixed media high resolution image of a beautiful dragon, stunning 3d render inspired art by István Sándorfi and Greg Rutkowski and Unreal Engine, perfect symmetry, dim volumetric lighting, 8k octane beautifully detailed render, post-processing, extremely hyper-detailed, intricate, epic composition, highly detailed attributes, highly detailed atmosphere, full body shot, cinematic lighting, masterpiece, trending on artstation, very very detailed, masterpiece, stunning, flawless structure, lifelike texture, perfection,
+a horse the size of a duck, stood next to a duck the size of a horse, evening light, cinematic photography, digital painting, volumetric light, concept art, trending on artstation, digital Art, fantasy art
+concept art of a lush indoor hydroponics lab in a far - future utopian city, apples oranges pears fruit, key visual, ambient lighting, highly detailed, digital painting, artstation, concept art, sharp focus, by makoto shinkai and akihiko yoshida and hidari and wlop
+Close-up portrait of kind young woman with black hair in a pony tail, with a backpack, slightly dirty face, transparent background, png, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+beautiful, young woman, detailed gorgeous face, vaporwave aesthetic, synthwave, colorful, psychedelic, artstation, concept art, smooth, extremely sharp detail, thorn crown, flowers, bees, finely tuned detail, ultra high definition, 8 k, unreal engine 5, ultra sharp focus, illustration, art by artgerm, greg rutkowski and alphonse mucha
+wolverine as captain america, intricate, fantasy concept art, elegant, by Stanley Artgerm Lau, golden ratio, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman Rockwell,
+a masterpiece digital painting of a white bear in medieval armor, roaring, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration in the style of wlop, greg rutkowski, artgerm and magali villeneuve
+Boris Johnson as Deadpool, realistic portrait, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+classical oil painting of anime key visual environment concept art of among us crewmate anime adaptation, trending on artstation, brush strokes, oil, canvas, style of kawacy makoto shinkai jamie wyeth james gilleard edward hopper greg rutkowski, preserved historical
+the city of light : the city is a beacon of hope in the dark world. it's a place of warmth and safety, where people can come to start anew. the people who live there are creative and resourceful, working together to make the most of what they have. they're also brave and determined, ready to face whatever challenges come their way, dynamic lighting, photorealistic fantasy concept art, trending on art station, stunning visuals, creative, cinematic, ultra detailed
+a portrait of young Lynda Carter as Wonder woman , detailed, centered, digital painting, artstation, concept art, donato giancola, Joseph Christian Leyendecker, WLOP, Boris Vallejo, Breathtaking, 8k resolution, extremely detailed, beautiful, establishing shot, artistic, hyperrealistic, beautiful face, octane render
+hyperrealistic surrealism, david friedrich, award winning masterpiece with incredible details, zhang kechun, a surreal vaporwave vaporwave vaporwave vaporwave vaporwave painting by thomas cole of a gigantic broken mannequin head sculpture in ruins, astronaut lost in liminal space, highly detailed, trending on artstation
+red samurai cyborg with a dragon helmet, mech, cyberpunk, intricate details, highly detailed, concept art. Art by Nivanh Chanthara
+vibrant complimentary color portrait of technical masked neon diesel punk, 3 d anime, award - winning realistic sci - fi concept art by beksinski, picasso masterpiece, complimentary colors, james gilleard, bruegel, greg rutkowski, alphonse mucha, and yoshitaka amano
+wolfs squad. pop art, paper please style, bioshock style, gta chinatown style, proportional, dynamic composition, face features, body features, ultra realistic art, digital painting, concept art, smooth, sharp focus, intricate, without duplication, elegant, confident posse, art by artgerm and richard hamilton and mimmo rottela, kirokaze and paul robertson
+symmetrical portrait bust of young woman with shoulder length light brown hair and hazel eyes dressed in a sharp dark teal military uniform and beret, blurred city background in twilight lighting, ilya kuvshinov, anime, greg rutkowski, guweiz, ross tran, artstation trending, artgerm, concept art, digital painting, painterly
+a cyberpunk portrait of chewbacca by jean - michel basquiat, by hayao miyazaki by artgerm, highly detailed, sacred geometry, mathematics, snake, geometry, cyberpunk, vibrant, water
+a closeup painting of a handsome cowboy saying yes and making a pleased face | by alphonse mucha | volumetric lighting, golden hour, realistic lighting, 4 k, 8 k | trending on artstation
+cathedral of salt, extremely detailed digital painting, vibrant colors, in the style of tomasz alen kopera and fenghua zhong and peter mohrbacher, mystical colors, rim light, beautiful lighting, 8 k, stunning scene, raytracing, octane, trending on artstation
+Scarlet Witch, highly detailed, digital painting, artstation, standing, facing camera, concept art, smooth, sharp focus, illustration, art by artgerm and alphonse mucha, high definition digital art, dramatic lighting, in the style of ilya kuvshinov and Ross tran
+thanos building a tension belt for a van alternator from a blueprint, 4 k, lomography, gellyroll gelpens, concept art, moebius, bryce 3. 3 3 4 th 3 d
+a _ fantasy _ style _ portrait _ painting _ of middle eastern male brown wavy hair glasses beard, rpg dnd oil _ painting _ unreal _ 5 _ daz. _ rpg _ portrait _ extremely _ detailed _ artgerm _ greg _ rutkowski _ greg
+anthropomorphic highly detailed group portrait of funny mr bean neon giant cute eyes hermit, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm, bob eggleton, michael whelan, stephen hickman, richard corben, wayne barlowe, trending on artstation and greg rutkowski and alphonse mucha, 8 k
+UHD photorealistic studio portrait of a cyborg Angel with hyperrealistic Angel wings, futuristic robot angel, exotic alien features, robotic enhancements, Tim Hildebrandt, Wayne Barlowe, Bruce Pennington, donato giancola, larry elmore, , masterpiece, trending on artstation, , cinematic composition, dramatic pose, studio lighting, sharp, crisp detail, hyperdetailed
+a grungy woman with rainbow hair, soft eyes and narrow chin, dainty figure, long hair straight down, torn overalls, short shorts, combat boots, side boob, wet tshirt, raining, basic white background, symmetrical, watercolor, pen and ink, intricate line drawings, by Yoshitaka Amano, Ruan Jia, Kentaro Miura, Artgerm, detailed, trending on artstation, hd, masterpiece,
+mahindra thar driving through madagascar with baobabs trees, tribe members chasing for an attach, action scene, an epic fantasy, artgerm and greg rutkowski and alphonse mucha, an epic fantasy, volumetric light, detailed, establishing shot, an epic fantasy, trending on art station, octane render, midsommar
+a professional photographic portrait view picture of a minimalist luxurious room, photographic filter unreal engine 5 realistic hyperdetailed 8 k ultradetail cinematic concept art volumetric lighting, fantasy artwork, very beautiful scenery, very realistic painting effect, hd, hdr, cinematic 4 k wallpaper, 8 k, ultra detailed, high resolution, artstation trending on artstation in the style of albert dros glowing rich colors powerful imagery
+a fancy portrait of a very attractive succubus by greg rutkowski, beautiful dress, beeple, sung choi, mitchell mohrhauser, maciej kuciara, johnson ting, maxim verehin, peter konig, final fantasy, macro lens, 8 k photorealistic, cinematic lighting, hd, high details, dramatic, dark atmosphere, trending on artstation
+a colorful comic noir illustration painting of a cyberpunk girl by sachin teng and sam yang!! and artgerm!! and lois van baarle and ross tran!!. in style of digital art, symmetry, sci fi, hyper detailed. octane render. trending on artstation
+chrysta bell, pinup, league of legends, intricate, highly detailed, digital painting, hyperrealistic, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski and alphonse mucha, by Jesper Ejsing
+a wacky clown is participating in the running of the bulls in pamplona, by stanley artgerm and greg rutkowski, dramatic lighting, highly detailed, incredible quality, trending on artstation, national geographic photo winner
+terrifying otherworldly dimension of the crystalline entities, concept art by filip hodas, john howe, mike winkelmann, jessica rossier, andreas rocha, bruce pennington, 4 k,
+very high quality illustration of green hills with clouds in the background, golden hour sunset, purple beautiful sky, anime key visual, official media, illustrated by wlop, extremely detailed, 8 k, trending on pixiv, cinematic lighting, beautiful
+The fluffiest little fuzzbutts in the world, huggy wuggy from poppy playtime video game, fullbody, ultra high detailed, glowing lights, oil painting, Greg Rutkowski, Charlie Bowater, Beeple, unreal 5, DAZ, hyperrealistic, octane render, RPG portrait, dynamic lighting, fantasy art, beautiful face
+anthropomorphic fluffy fox look like Indiana jones on the hot air balloon at night, clouds around, entire person visible, DnD character, unreal engine, octane render, dramatic lighting, pond, digital art, by Stanley Artgerm Lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman Rockwell,
+a young man wearing raybands holding a beer giving a thumbs up with a long beard, real life skin, intricate, elegant, highly detailed, artstation, concept art, smooth, sharp focus, airbrush painted, art by artgerm and greg rutkowski and alphonse mucha
+Madonna, the singer, as Medusa snakehair closeup, D&D, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, hearthstone, art by Artgerm and Greg Rutkowski and Alphonse Mucha tarotcard
+a whirlwind inside the metaverse, guy, male, man, science, machine face, fashionable haircut, half body, neurochip, android, cyberpunk face, by loish, d & d, fantasy, intricate, elegant, highly detailed, colorful, digital painting, artstation, concept art, art by artgerm and greg rutkowski and alphonse mucha
+side profile centered painted portrait, rollerskating monkey, Gloomhaven, matte painting concept art, art nouveau, beautifully backlit, swirly vibrant color lines, fantastically gaudy, aesthetic octane render, 8K HD Resolution
+capybara holding a blaster, very very anime!!!, fine - face, realistic shaded perfect face, fine details. anime. realistic shaded lighting poster by ilya kuvshinov katsuhiro otomo ghost - in - the - shell, magali villeneuve, artgerm, jeremy lipkin and michael garmash and rob rey
+fork fork fork, symmetry, faded colors, exotic alien features, forestpunk background, tim hildebrandt, wayne barlowe, bruce pennington, donato giancola, larry elmore, masterpiece, trending on artstation, featured on pixiv, cinematic composition, beautiful lighting, sharp, details, hyper detailed, 8 k, unreal engine 5
+landscape with waterfalls and stunning light and cheerful colors, epic composition, cinematic lighting, masterpiece, trending on artstation, very very detailed, masterpiece, stunning
+portrait of ronaldo nazario, wearing green soccer clothes, very detailed eyes, hyperrealistic, very detailed painting by glenn fabry, by joao ruas, by artgerm
+A lazy steampunk cat jumping over the galaxy, digital illustration, concept art, 8k, trending on artstation
+a fantastical translucent!!! small horse made of water and foam, ethereal, noble, radiant, hyperalism, scottish folklore, digital painting, artstation, concept art, smooth, 8 k frostbite 3 engine, ultra detailed, art by artgerm and greg rutkowski and magali villeneuve
+ancient queen emma watson, symetrical, by junji ito, diffuse lighting, fantasy, intricate, elegant, highly detailed, lifelike, photorealistic, digital painting, artstation, illustration, concept art, 4 k, smooth, sharp focus, art by john collier and albert aublet and krenz cushart and artem demura and alphonse mucha
+aesthetic portrait commission of a of a male fully furry muscular anthro albino lion wearing attractive gay leather harness with a tail and a beautiful attractive hyperdetailed face at golden hour, safe for work (SFW). Character design by charlie bowater, ross tran, artgerm, and makoto shinkai, detailed, inked, western comic book art, 2021 award winning film poster painting
+ultra realistic illustration, man in a jacket with two dark glasses, with black hair, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+portrait of one meadow metal horse by gaston bussiere, anna nikonova aka newmilky, greg rutkowski, yoji shinkawa, yoshitaka amano, tsutomu niehi, moebius, donato giancola, geoffroy thoorens, concept art, trending on artstation, featured on pixiv, cinematic composition, 8 k
+parrot as a bartender, dimly-lit cozy tavern, fireplace, 8k octane beautifully detailed render, post-processing, extremely hyperdetailed, intricate, epic composition, grim yet sparkling atmosphere, cinematic lighting + masterpiece, trending on artstation, very detailed, vibrant colors
+a roman palace reaching to the sky, glorious, epic scene, beautiful, pools, vegetation, in the style of artgerm, gerald brom, atey ghailan and mike mignola, vibrant colors and hard shadows and strong rim light, plain background, comic cover art, trending on artstation
+glamorous scorpion portrait, bra, seductive eyes and face, elegant, lascivious pose, very detailed face, studio lighting, photorealism, portrait by Magali Villeneuve and Steve Argyle,Livia Prima,Mucha,dress,fantasy art,beautiful,artstation,trending on artstation,intricate details,alluring,masterpiece
+face of a cute alien girl wearing shiny plastic armor in the style of roger dean and alberto vargas and stefan kostic, realistic, sharp focus, 8 k high definition, insanely detailed, intricate, elegant, art by greg rutkowski and artgerm, extreme blur coral reef background
+a color pencil sketch of a mysterious plague doctor with a white mask wearing a blue wisards robe, concept art, by greg rutkowski and makato shinkai, by melmoth zdzislaw belsinki craig mullins yoji shinkawa, black light, semi - realistic render, pencil, paint smears, realistic manga, dramatic lighting, d & d design
+a beautiful barmaid, dimly lit cozy tavern in the style of Francis Bacon and Syd Mead and Edward Hopper and Norman Rockwell and Beksinski, open ceiling, highly detailed, painted by Francis Bacon, painted by James Gilleard, surrealism, airbrush, Ilya Kuvshinov, WLOP, Stanley Artgerm, very coherent, art by Takato Yamamoto and James Jean
+isolated magnolia flowers with no people, colorful, psychedelic, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+jossi of blackpink, king, tarot card, highly detailed, digital painting, smooth, sharp focus, illustration, ultra realistic, 8 k, art by artgerm and alphonse mucha
+the most beautiful sunset, giant pink full moon, coherent design, symmetrical, concept art, vivid color, complementary color, golden ratio, detailed, sharp lines, intricate, rainbowshift, by maxfield parrish, by peter mohrbacher, by gustave dore, by arthur rackham, octane render
+donald trump, ornate, beautiful, atmosphere, vibe, mist, smoke, chimney, rain, well, wet, pristine, puddles, waterfall, melting, dripping, snow, ducks, creek, lush, ice, bridge, cart, forest, flowers, concept art illustration, color page, 4 k, tone mapping, akihiko yoshida, james jean, andrei riabovitchev, marc simonetti, yoshitaka amano, digital illustration, greg rutowski, volumetric lighting, sunbeams, particles, trending on artstation
+fantasy art, animal conceptual artwork, woman with giant fish, surreal painting, illustration dream and imagination concept, mystery of nature
+a cute giantess wearing school uniform standing in the city which seem small, bird's eye view, gouache, 8 k wallpaper, strong brush stroke, very high detailed, sharp focus, illustration, morandi color scheme, art station, by krenz cushart
+inside a cozy post apocalyptic library, concept art, trending on artstation
+baroque acrylic painting of key visual concept art, anime maids in crusade battlefield with early tanks, brutalist fantasy, rule of thirds golden ratio, fake detail, trending pixiv fanbox, palette knife, style of makoto shinkai ghibli takashi takeuchi yoshiyuki sadamoto jamie wyeth james gilleard greg rutkowski chiho aoshima
+baroque oil painting, anime key visual full body portrait character concept art, maid nazi ss commander, brutalist grimdark fantasy, kuudere blond hair blue eyes, fascist nationalist, trending pixiv fanbox, rule of thirds golden ratio, makoto shinkai genshin impact studio ghibli jamie wyeth greg rutkowski chiho aoshima
+kanye west. in style of yoji shinkawa and hyung - tae kim, trending on artstation, dark fantasy, great composition, concept art, highly detailed, dynamic pose, vibrant colours.
+a Japanese modern style luxurious living room, high definition, 8k, intricate and epic concept art, highly detailed, cinematic,
+anonymous as elmo, award winning creature photography, extremely detailed, artstation, 8 k, sensual lighting, incredible art, wlop, artgerm
+portrait painting of man biting woman neck, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and alphonse mucha
+a male half elf in fireproof leather armor wearing a utility belt and goggles, D&D, fantasy, intricate, cinematic lighting, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by Terry Moore and Greg Rutkowski and Alphonse Mucha
+portrait painting of a black muscular bloodied indian middle aged woman in river screaming name of god, sari, ultra realistic, concept art, intricate details, eerie, horror, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and alphonse mucha
+baroque oil painting full body portrait character concept art, anime key visual of smug young female maid nazi dictator, long straight blonde hair blue eyes, studio lighting delicate features finely detailed perfect face directed gaze, black nazi military uniform, gapmoe kuudere grimdark, trending on pixiv fanbox, painted by greg rutkowski makoto shinkai takashi takeuchi studio ghibli
+symmetry!! 1 3 mm film portrait of bearded man, sci - fi -, cyberpunk, blade runner, glowing lights, tech, biotech, techwear!! intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, grain, old photograph
+matte painting of a huge swamp, overgrown with lush vines, immaculate scale, greg rutkowski, digital art, trending on artstation, detailed matte painting
+a stunning matte portrait of a thicc and voluptuous vampire dressed as a beautiful poison ivy with hair tied in a braid walking through a flowering garden, greenhouse in the background, dark eyeliner, intricate, elegant, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by artgem and jugendstil and greg rutkowski and alphonse mucha, pixv
+portrait of ( ( ( vladimir putin ) ) ) inapocalyptic russia with icecream, hyperrealistic, digital concept art, sharp focus, 3 5 mm film, caricature illustration, art by magic realism, art by josephine wall, art by huang guangjian, art by viktoria gavrilenko, art by amanda sage, trending on artstation
+pointillism painting of a white and caramel beagle dog playing with dragonfly, bright, god rays, dreamy, trending on artstation
+classical oil painting of anime key visual environment concept art of the founding of a nation, trending on artstation, brush strokes, oil, canvas, style of kawacy makoto shinkai jamie wyeth james gilleard edward hopper greg rutkowski, preserved historical
+evil magic steampunk sword concept art, trending on artstation 4k
+hockey game city location with hockey arena, medical building and office buildings. game illustration, gamedev, game, design, mobile game, aerial view, isometric, blizzard, easports, playrix, nexters, intricate, elegant, pixel perfect, sport game, highly detailed, amazing detail, digital painting, trending on artstation, sharp focus, by irina knk, by ann bruhanova, by zze festa, by tatiana gromova, 4 k
+a photorealistic dramatic fantasy render of a beautiful woman alexandra daddario wearing a beautiful intricately detailed japanese monkey kitsune mask and clasical japanese kimono by wlop, artgerm, greg rutkowski, alphonse mucha, epic, beautiful dynamic dramatic dark moody lighting, shadows, cinematic atmosphere, artstation, concept design art, octane render, 8 k
+indistinct man with his hand thrust forward, visible threads of magic link his hand to other people's bodies, he's puppeting them, fantasy, digital art, trending on artstation
+robot pregnant with a human, cozy atmospheric and cinematic lighting, ultra rendered extreme realism and detail 8 k, highly detailed, realistic, refined, bautiful, fine art photography, hyper realistic, in the style of greg rutkowski, by artgerm, by gustave dore, by marco turini, photorealistic, elegant, sharp focus, majestic, award winning picture, intricate, artstation,
+beautiful underwater futuristic city, trending on artstation
+photo of a gorgeous blonde female in cyberpunk city, realistic, sharp focus, 8 k high definition, insanely detailed, intricate, elegant, artgerm, greg kutkowski, high contrast dramatic lighting
+yoda ( 2 0 2 1 ) walking next to groot ( 2 0 1 7 ). they are friends. photorealistic, digital art, epic fantasy, dramatic lighting, cinematic, extremely high detail, cinematic lighting, trending, artstation, cgsociety, 3 d ue 5, 4 k, hq
+portrait of a ruggedly handsome ranger, hands details, muscular, half body, leather, hairy, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+warhammer 40k, full-lenght portrait of Emperor of Mankind, handsome man in massive gold armor without helmet, beautiful face, long blonde hair, digital art, illustration, fine details, cinematic, highly detailed, octane render, concept art
+illustration of an anime girl being mind controlled, by artgerm and wlop and greg rutkowski, digital art, extreme detail, realistic lighting, cinematic composition, concept art, sharp focus, colorful, photorealistic, 8 k
+mark zuckerberg as an alien, fantasy art, in the style of artgerm, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing, vibrant, artgerm, award winning art
+a cloaked cyclops wielding a massive sword, smooth, intricate, elegant, digital painting, artstation, concept art, sharp focus, octane render, illustration, art by hirohiko araki, overwatch character,
+hyperrealistic photography of a highly detailed and symmetrical gorgeous nordic female scientist constructing a birth machine in the style of Jin Kagetsu, James Jean and wlop, highly detailed, masterpiece, award-winning, sharp focus, intricate concept art, ambient lighting, 8k, artstation
+a spaceship flying through space with galaxies in the back, epic lighting, in the art style of arcane, digital art, vector art, trending on artstation, highly detailed
+demonic evil cute fourteen year old south asian girl, tomboy, evil smile, freckles!!!, fully clothed, hypnotic eyes, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha, konstantin razumov, by william - adolphe bouguerea
+ultra realistic illustration, eva green as persephone, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+a highly detailed epic cinematic concept art CG render digital painting artwork: old dead couple at a decayed gas station surrounded by dark figures. By Greg Rutkowski, in the style of Francis Bacon and Syd Mead and Norman Rockwell and Beksinski, open ceiling, highly detailed, painted by Francis Bacon and Edward Hopper, painted by James Gilleard, surrealism, airbrush, Ilya Kuvshinov, WLOP, Stanley Artgerm, very coherent, triadic color scheme, art by Takato Yamamoto and James Jean
+a closeup photorealistic photograph of a cute smiling knitted bernedoodle judge dog dressed in a black gown, presiding over the courthouse. indoors, professional capture, well lit shot. this 4 k hd image is trending on artstation, featured on behance, well - rendered, extra crisp, features intricate detail, epic composition and the style of unreal engine.
+a hyper realistic character concept art of a ((cyberpunk real estate agent)) standing by a (For Sale) sign, half body, front facing camera, 4k rendered in Octane, trending in artstation, cgsociety, 4k post-processing highly detailed by wlop, Junji Murakami, Mucha Klimt, Sharandula, Hiroshi Yoshida, Artgerm, Craig Mullins,dramatic, moody cinematic lighting
+AN 8K RESOLUTION, MATTE PAINTING OF THE WISE AND ANcIENT alien TURTLE, swimming THROUGH a rainbow nebula BY BOB EGGLETON AND MICHAEL WHELAN. TRENDING ON aRTSTATION, hd, highly detailed, vibrant colors, astrophotography, volumetric lighting, dynamic portrait, wide lens, mass effect fan art
+cruising ship sailing at raining night at flooded miniature city, sun is on the rise on the town, cute style garden, octane render, trees, evergreen, patio, garden, wet atmosphere, tender, soft light misty yoshitaka amano, and artgerm
+concept art for a futuristic luxury business class suite in a widebody jet, two aisles, earth tones, digital painting, artstation
+portrait of betty cooper with fluffy bangs, bangs, 1 9 6 0 s, ponytail, curly bangs and ponytail, rounder face, intricate, elegant, glowing lights, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by wlop, mars ravelo and greg rutkowski
+spiky brown very short hair and glasses mage wearing robe, dndbeyond, bright, colourful, realistic, dnd character portrait, full body, pathfinder, pinterest, art by ralph horsley, dnd, rpg, lotr game design fanart by concept art, behance hd, artstation, deviantart, hdr render in unreal engine 5
+Avenida Paulista painted by Greg Rutkowski
+master chief from halo fighting aliens, cinematic composition, epic cinematic lighting, realistic, unreal, highly detailed, 8 k, trending artstation, concept art, sharp focus
+close-up macro portrait of the dark queen, epic angle, epic pose, symmetrical artwork, photorealistic, iridescent, 3d with depth of field, blurred background. cybernetic phoenix bird, translucent dragon, nautilus. energy flows of water and fire, by Tooth Wu and wlop and beeple. a highly detailed epic cinematic concept art CG render digital painting artwork scene. By Greg Rutkowski, Ilya Kuvshinov, WLOP, Stanley Artgerm Lau, Ruan Jia and Fenghua Zhong, trending on ArtStation, made in Maya, Blender and Photoshop, octane render, excellent composition, cinematic dystopian brutalist atmosphere, dynamic dramatic cinematic lighting, aesthetic, very inspirational, arthouse
+sensual beautiful delhi girls wearing western little black dresses at a nightclub, epic scene, by victo ngai, kilian eng vibrant colours, dynamic lighting, digital art, winning award masterpiece, fantastically beautiful, illustration, aesthetically inspired by beksinski and dan mumford, trending on artstation, art by greg rutkowski, 8 k
+amazingly detailed semirealism, anthropomorphic pink rabbit character wearing a bucket hat. Cute, kawaii, Cooky, bt21, Sanrio inspired. Beautiful artwork, Rabbt_character, rabbit_bunny, 獣, iconic character splash art, Detailed fur, detailed textures, 4K high resolution quality artstyle professional artists WLOP, Aztodio, Taejune Kim, Guweiz, Pixiv, Instagram, dribbble, ArtstationHD
+pennywise giving micheal jackson a red balloon in the movie it, by stephen king, highly detailed, 8 k, artstation, cinematic, concept art, smooth, sharp focus, movie scene
+ultra realistic illustration, a full body portrait of deanna troi as death of the endless, the sandman, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+downtown toronto glowing eyes, shamanic poster lsd art, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, frank frazetta
+a frogish kaiju on a desolace planet, legendary epic shot, blade runner, by artgerm, julie bell, beeple and Greg Rutkowski, airbrush, concept art, matte painting, 80s, Smooth gradients, octane render, 8k, High contrast, duo tone, depth of field, volumetric lightning, very coherent artwork
+Dramatic portraiture of Uuen, the Pictish god of stags, mixed media, trending on ArtStation, by and ArtGerm and Lucian Freud, luminism
+incredible beautiful detailed intricate photorealistic painting of a group of friends laughing together. the colors are very vibrant and the people in the photo look very happy. award winning. vibrant colors, funny, personal, positive, visually pleasing, engaging. high resolution. high quality. photorealistic. hq hd. 8 k. trending on artstation. group of friends laughing. award winning
+concept art by greg rutkowski, a very tall, and slender man with short black hair, sitting with the crew in the ship's flight deck, brutalist futuristic interior, dark lighting atmosphere, detailed portraits, nostalgic atmosphere, scifi, digital painting, artstation, concept art, smooth, sharp foccus ilustration, artstation hq
+It's easy to explain 'cause this world's not tame
+owlish empress, D&D, fantasy, portrait, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+epic professional digital art of hungry eyes, eerie atmospheric lighting, painted, intricate, detailed, impressive, leesha hannigan, reyna rochin, wayne barlowe, mark ryden, duncan halleck, best on artstation, cgsociety, wlop, pixiv, stunning, gorgeous, much wow, hdr, 4 k, stunning, gorgeous, cinematic, masterpiece
+incredible, crossing a mindblowingly beautiful rainbow bridge, energy pulsing, matte painting, artstation, solarpunk metropolis, cgsociety, dramatic lighting, vibrant greenery, concept art, octane render, arnold 3 d render
+beautiful woman lying among snakes, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+artwork of a white tiger king with gold crown and blue king suit, concept art, portrait, super detailed, 4 k hd, trending on artstation, digital painted, low contrast, made by greg rutkowski and viktoria gavrilenko
+A Maine forest with cats roaming around beautiful lighting during golden hour. 50mm, f/1.8, Realistic details. Ultra HD. 8K V-ray. Octane Render. Unreal Engine 5. Professionally color graded. Concept art. Vibrant colors. fog. Bokeh
+a comic book poster of divali celebrations by moebius and makoto shinkai and rossdraws, featured on artstation, pixiv, volumetric lighting, 8 k, highly detailed render, soft glow, crisp lines, f 1 1, sharp focus,
+photo of a Dramatic Kathakali male character with traditional headgear painted face wearing futuristic robocop LED goggles and futuristic robot armour with wide traditional ghaghra in the style of stefan kostic, full body, realistic, sharp focus, symmetric, 8k high definition, insanely detailed, intricate, elegant, art by stanley lau and artgerm, Hajime Sorayama, William-Adolphe Bouguereau
+vampire the masquerade, fame of thrones, lord, neon, fibonacci, sweat drops, insane, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski and alphonse mucha
+hyperdetailed portrait of a stunningly beautiful pink cyberpunk cute european girl made of metals and shiny iridescent gems, bright rainbow nimbus, gold necklace, smoke background inspired by ross tran and masamune shirow and kuvshinov, intricate, photorealistic, octane render, rtx, hdr, unreal engine, dnd digital art by artgerm
+3 / 4 view of a portrait of woman with flowy hair, bird wings, confident pose, pixie, genshin impact,, intricate, elegant, sharp focus, illustration, highly detailed, concept art, matte, trending on artstation, bright colors, art by wlop and artgerm and greg rutkowski, marvel comics h 6 4 0
+greg manchess portrait painting of a 2 yorha type a no. 2 as overwatch character!! holding a sword!!, white long hair, organic painting, sunny day, matte painting, bold shapes, hard edges, street art, trending on artstation, by huang guangjian and gil elvgren and sachin teng
+dungeons and dragons minotaur character closeup portrait, dramatic light, lake background, 2 0 0 mm focal length, painted by stanley lau, painted by greg rutkowski, painted by stanley artgerm, digital art, trending on artstation
+the eldritch knight as a realistic fantasy knight, closeup portrait art by donato giancola and greg rutkowski, digital art, trending on artstation, symmetry!!
+epic portrait of snufkin, detailed, nebula skies, digital painting, artstation, concept art, donato giancola, joseph christian leyendecker, wlop, boris vallejo, breathtaking, high details, extremely detailed, sincere face, establishing shot, artistic, hyper realistic, beautiful face, octane render
+full body portrait of a korean schoolgirl with long hair and bangs, her hands are thin red tedrils, dramatic lighting, illustration by Greg rutkowski, yoji shinkawa, 4k, digital art, sci-fi horror concept art, trending on artstation
+symmetry!! young nicole kidman, machine parts embedded into face, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+a child looking at a portal in the hidden garden, scare, environment art, fantasy art, landscape art, in the style of greg rutkowski, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing
+nosferatu staying near body of dead woman, scary, dark, misty, at night, 8 k, detailed, concept art, trending on artstation
+polaroid picture, sepia, homeless jon hamm in the streets of los angeles, unshaved, toothless, next to a tent, symmetrical face, fine details, day setting, ethereal, trending on artstation
+anime elvis presley, rockabilly anime illustration, rock'n'roll cartoon, professional drawing, trending on pixiv
+a cute little girl with a round cherubic face, blue eyes, and short wavy light brown hair smiles as she floats in space with stars all around her. she is an astronaut, wearing a space suit. beautiful painting with highly detailed face by artgerm and quentin blake
+Tom Cruise at the king in the desert, beautiful face, fighting in a dark scene, eyes, detailed scene, standing in a heroic figure, Armour and Crown, highly detailed, blood and dust in the air, action scene, cinematic lighting, dramatic lighting, trending on artstation, elegant, intricate, character design, motion and action and tragedy, fantasy, D&D, highly detailed, digital painting, concept art
+portrait of a jamaican fisherman sci - fi glowing fishing armor muscular cyberpunk intricate elegant highly detailed digital painting artstation concept art, ocean background, jamaican colors, greg rutkowski, loish, rhads, ferdinand knab, makoto shinkai and lois van baarle, ilya kuvshinov, rossdraws, tom bagshaw
+an ugly donkey with eyelashes, fantasy art, in the style of artgerm, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing, vibrant
+pregnant woman under street light, highly detailed, sharp focused, ultra realistic digital concept art by artgerm
+baroque oil painting full body portrait character concept art, anime key visual of young female black nazi military uniform maid, long flowing platinum blonde hair blue eyes, finely detailed symmetrical perfect face studio lit delicate features directed gaze, gapmoe kuudere grimdark, trending on pixiv fanbox, painted by greg rutkowski makoto shinkai takashi takeuchi studio ghibli
+a award winning half body portrait of a beautiful woman in a croptop and cargo pants with ombre purple pink teal hairstyle with head in motion and hair flying listenin to music on headphones by wlop, paint splatter, outrun, vaporware, shaded flat illustration, digital art, trending on artstation, highly detailed, fine detail, intricate
+draco malfoy, clash royal style characters, unreal engine 5, octane render, detailed, brawl stars, cinematografic, cinema 4 d, artstation trending, high definition, very detailed
+some kittens playing around in a room with yellow background color filled with a fridge. animal cat. digital art. artstation. realistic. vibrant. illustration. in the style of pixar movie. octane render. art by artgerm and greg rutkowski and alphonse mucha. volumetric lighting.
+a pretty smiling blonde girl with heart - shaped sunglasses dressed in pink shiny clothes is walking over water, sun set and skyscrappers in the background, art by guweiz, dramatic lighting, highly detailed, incredible quality, trending on artstation
+cinematic portrait, captin falcon from smash bros, from left, head and chest only, desaturated, tim hildebrandt, wayne barlowe, bruce pennington, donato giancola, larry elmore, oil on canvas, masterpiece, trending on artstation, featured on pixiv, cinematic composition, dramatic pose, beautiful lighting, sharp, details, hyper - detailed, hd, 4 k
+cypher dark souls blood borne fashion photograph, portrait close up, glowing epcot, rei ayanami, final fantasy marlboro, reptile eye of providence, alien brainsucker by karol bak, zdzisław beksinski, daft punk mf boom helmet, kodak portra 4 0 0, 8 k, highly detailed, britt marling style 3 / 4 photographic close, illuminati pyramid, female anime character, druid wizard, giygas organic being, portrait, skeleton, kannon mindar android, sparking beeple, from artstation, anime render, rutkowski of symmetrical art, android wlop, station, very coherent punk, glitchcore, iridescent on greg cyber the cinematic, art, artwork. cinematic, 8 k, unreal albedo accents, art, high hyper epcot, inside realism, hyper wizard very male octane broken hellscape, of mindar detail, greg overlord, artwork, rutkowski colossus, symmetrical key detail, coherent trending japan, artwork, space hornwort, artwork. abstract, druid druid, artstation, futurescape, on render, shadows robot, glitch forest organic, character, spell, render, key octane render, accents a concept library casting iridescent abstract. by octane intricate realism, octane dan from intricate mask, trending intricate intricate high render, art, gems, mumford. wu, tooth engine cannon beeple, 8 k, a oni
+beautiful black girl magic, nature goddess with brown skin in front of nebulae bursting halos, crisp digital painting by artgerm by mucha by caravaggio and face by wlop
+goth anime clown in mini skirt and crop top intricate, extremely detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, intimidating lighting, incredible art, face and body
+Twin Peaks, of Michael Shannon the mechanic discovering a man dressed as a Furry in the woods, mysterious creepy, poster artwork by Michael Whelan, Bob Larkin and Tomer Hanuka, from scene from Twin Peaks, simple illustration, domestic, nostalgic, from scene from Twin Peaks, clean, full of details, by Makoto Shinkai and thomas kinkade, Matte painting, trending on artstation and unreal engine, super clean, fine detail, cell shaded,
+realistic character concept, japanese queen with lots of jewelry in the face, elegant pose, scifi, illustration, symmetrical, artstation, cinematic lighting, hyperdetailed, cgsociety, 8 k, high resolution, charlie bowater, tom bagshaw, single face, insanely detailed and intricate, beautiful, elegant, golden ratio, dark fractal background, vfx, postprocessing, soft lighting colors scheme, fine art photography, hyper realistic, photo realistic
+magic : the gathering fantasy character concept art of a ball of rice with a menacing facial expression, by frank frazetta and marco bucci, high resolution. dark fantasy forest in the background, fantasy coloring, intricate, digital painting, artstation, smooth, sharp focus
+pregnant woman under street light, highly detailed, sharp focused, ultra realistic digital concept art by Alyssa Monks, Ruan Jia, Stanley Artgerm
+a grim dark fantasy town seen from the gutters, dnd encounter, dark fantasy, rain, atmospheric lighting, extremely detailed, no people, photorealistic, octane render, 8 k, unreal engine 5. art by artgerm and greg rutkowski and alphonse mucha
+mf doom with reptile eyes, fallout power armor exploding into fractals, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, frank frazetta
+Very very very very highly detailed epic central composition portrait of face with venetian mask, golden, intricate, dystopian, sci-fi, extremely detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, intimidating lighting, incredible art by Tokujin Yoshioka and Anton Pieck
+Michael Fassbender in white armor, intricate, epic lighting, hyper realistic, white short hair, character concept art, cinematic, artgerm, artstation trending.
+a hyper - realistic character concept art portrait of a computer man, depth of field background, artstation, award - winning realistic sci - fi concept art by jim burns and greg rutkowski, beksinski, a realism masterpiece, flesh - tone color palette, james gilleard, bruegel, alphonse mucha, and yoshitaka amano.
+tundra, digital art, concept art, magic fantasy, vibrant colors, high contrast, highly detailed, trending on artstation, 8k, andreas rocha, sylvain sarrailh, darek zabrocki, finnian macmanus, dylan cole, liang mark, albert bierstadt, sung choi, peter mohrbacher, greg rutkowski, studio ghibli
+beautiful full body Emma Watson smiling, art by lois van baarle and loish and ross tran and rossdraws and sam yang and samdoesarts and artgerm, digital art, highly detailed, intricate, sharp focus, Trending on Artstation HQ, deviantart, unreal engine 5, 4K UHD image
+a stunning GTA V loading screen with a beautiful woman with ombre hairstyle in purple and pink blowing in the wind, city streets, golden ratio, digital art, trending on artstation
+A cyberpunk cyborg girl with big and cute eyes, fine-face, realistic shaded perfect face, fine details. not anime. Realistic shaded lighting poster by Ilya Kuvshinov katsuhiro, magali villeneuve, artgerm, Jeremy Lipkin and Michael Garmash, Rob Rey and Kentarõ Miura style, trending on art station
+peaceful elven forest, thick forest filled with elven warriors, by alan lee, michal karcz, smooth details, lord of the rings, game of thrones, smooth, detailed terrain, oil painting, trending artstation, concept art, fantasy matte painting
+a lisa frank fashion model mcdonalds princess microwaved super deluxe big mac happymeal with diet coke and a large order of fries, gothic, highly detailed, digital painting, artstation, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+collie as odin, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, illustration, hearthstone, art by artgerm and greg rutkowski and alphonse mucha, simon stalenhag, hyperreal
+a beautiful daft punk humanoids with freckled cheeks, cyber neon lighting, futurism, intricate futuristic jewelry accessories, cyberpunk glossy white latex swimsuit, profile posing, hyper photorealistic, crispy quality, digital photography, trending in artstation, trending in pinterest, cinematic, 4 k ultra hd, art by pascal blanche, art by greg rutkowski,
+portrait sci-fi art by Ruan Jia and Raymon Swanland, a glowing alien neon glass orb floating above the hand of a soldier, solar flares, detailed and intricate futuristic environment, cyberpunk, neon color bioluminescence, transparent reflective metal, dramatic lighting, cinematic, high technology, highly detailed portrait, digital painting, artstation, concept art, smooth, sharp focus, illustration, Artstation HQ
+Rose Gold intricate lace smoke portrait, geometric watercolor art by peter mohrbacher and artgerm, radiant halo of light
+skinny male fantasy alchemist, long dark hair, 1 9 th century, elegant, highly detailed, intricate, smooth, sharp focus, artstation, digital paining, concept art, art by donato giancola, greg rutkowski, artgerm, cedric peyravernay, valentina remenar, craig mullins
+cute friendly shrine maiden by charlie bowater and titian and artgerm, intricate, face, japanese shrine, elegant, pink mist, beautiful, highly detailed, dramatic lighting, sharp focus, trending on artstation, artstationhd, artstationhq, unreal engine, 4 k, 8 k
+cute fisherman tom daley, natural lighting, path traced, highly detailed, high quality, digital painting, by don bluth and ross tran and studio ghibli and alphonse mucha, artgerm
+Boris Johnson as Jack Sparrow, Boris Johnson hairstyle, realistic portrait, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+billionaire's yacht adopted as a vacation spot for coal miners a Mandelbrot fractal by Craig Mullins, ilya kuvshinov, krenz cushart, artgerm trending on artstation by Edward Hopper and Dan Mumford and WLOP and Rutkovsky, Unreal Engine 5, Lumen, Nanite
+hisoka, young tom hiddleston, cel - shaded animesque art by artgerm and greg rutkowski and alphonse mucha, smooth white skin, smirking face, reddish hair, d & d, fantasy, feminine portrait, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration
+The eye of cthulu from Terraria, 3d render trending on artstation
+photographic portrait of a widow, highly detailed, digital painting, Trending on artstation , HD quality, by artgerm and greg rutkowski and alphonse mucha, dramatic light, octane
+portrait of megan fox as pinhead, bald, hellraiser, hell, intricate, headshot, highly detailed, digital painting, artstation, concept art, sharp focus, cinematic lighting, illustration, art by artgerm and greg rutkowski, alphonse mucha, cgsociety
+lady assassin wearing cyberpunk streetwear, cybernetic legs, detailed portrait, 4 k, vivid colours, concept art by wlop, ilya kuvshinov, artgerm, krenz cushart, greg rutkowski, pixiv. cinematic dramatic atmosphere, sharp focus, volumetric lighting, cinematic lighting, studio quality
+a monk meditating, in the style of tomasz alen kopera and fenghua zhong and peter mohrbacher, mystical colors, rim light, beautiful lighting, 8 k, stunning scene, raytracing, octane, trending on artstation
+goddess of war, accurate anatomy, IFBB fitness body, only two hands, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by art by artgerm and greg rutkowski and edgar maxence
+a portrait of a beautiful cybernetic woman meditating in lotus pose, wires, cyberpunk concept art by josan gonzales and philippe druillet and dan mumford and enki bilal and jean claude meziere
+symmetry!! portrait of mark zuckerberg, hairless!!, fantasy, medieval wear, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+professional concept art portrait of a masked diesel punk man in a dark room by artgerm and greg rutkowski ( thin white border ). an intricate, elegant, highly detailed digital painting, concept art, smooth, sharp focus, illustration, in the style of cam sykes, wayne barlowe, igor kieryluk.
+margot robbie, manga cover art, detailed color portrait, artstation trending, 8 k, greg rutkowski
+a portrait of an anthropomorphic cyberpunk mouse holding a can of beer, cyberpunk!, fantasy, elegant, digital painting, artstation, concept art, matte, sharp focus, illustration, art by josan gonzalez
+skeleton man walking forward with explosion behind him, science fiction industrial hard science concept art, 8K render octane high definition cgsociety, photorealistic, unreal engine
+a cloaked adventure standing in a winding road, gas street lamps. Country road, country landscape, fields, fields, the ruins of one small barn, wide view, desolate. digital illustration, very vibrant colors, soft lighting, adventurous, atmospheric lighting, 8K, octane render. By Makoto Shinkai, Stanley Artgerm Lau, WLOP, Rossdraws, James Jean, Andrei Riabovitchev, Marc Simonetti, krenz cushart, Sakimichan, D&D trending on ArtStation, digital art.
+vibrant colorful vaporwave geometry symmetry bauhaus poster, etching by gustave dore, intricate, sharp focus, illustration, highly detailed, digital painting, concept art, masterpiece
+Abandoned medieval castle, art by Quentin Mabille , trending on artstation, artstationHD, artstationHQ, 4k, 8k
+Boris Johnson as Wolverine, portrait, X man costume, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+Mikasa Ackerman, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by greg rutkowski and alphonse mucha
+lateral portrait of samurai, sci - fi, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+Aly Michalka as a stunning , beautiful retro SCI-FI space heroine 1985 , movie poster, intricate, elegant, highly detailed, centered, digital painting, trending on artstation, concept art, smooth, sharp focus, illustration, art by raphael lacoste ,eddie mendoza ,alex ross, WLOP
+a visual representation of a place evoked by the song titles of the album kryptos by andreas vollenweider, photorealistic and intricate concept art, 8 k hdr, cinematic lighting
+fantasy girl mage in a forest, dramatic fantasy art, by yoshitaka amano, trending on artstation, 4 k, expressive oil painting, close - up face portrait, vivid colors
+a portrait of a finely detailed beautiful!!! feminine cyberpunk ghost rider with skull face and long flowing hair made of fire and flames, dressed in black leather, by Alphonse Mucha, designed by H.R. Giger, legendary masterpiece, stunning!, saturated colors, black background, trending on ArtStation
+tattoo design, stencil, stencil on paper, tattoo stencil, traditional, beautiful portrait of a traditional Japanese girl with flowers in her hair, upper body, by artgerm, artgerm, artgerm, digital art, cat girl, anime eyes, anime, sexy, super model-s 100
+portrait of a young very beautiful cute tribal woman with a steampunk gun, in a post apocalyptic city overgrown with lush vegetation, by Luis Royo, by Greg Rutkowski, dark, gritty, intricate, head space, volumetric lighting, volumetric atmosphere, concept art, cover illustration, octane render, trending on artstation, 8k
+a young attractive Asian woman in the pilot's seat of a massive sci-fi mecha, dramatic pose, LEDs, highly detailed, photorealistic, volumetric lighting, digital art, octane render, in the style of Artgerm and Tom Bagshaw
+wolf warrior in red cape and hood, d & d, fantasy, portrait, highly detailed, headshot, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+a beautiful young charming asian goddess with sundress and jewelry | | winter, realistic shaded, unpleasant face, good looking, fine details, dior, lv, realistic shaded lighting poster by greg rutkowski, macoto takahashi, magali villeneuve, artgerm, jeremy lipkin and michael garmash
+hyperrealistic portrait of a woman monster astronaut, full body portrait, well lit, intricate abstract. cyberpunk, intricate artwork, by Tooth Wu, wlop, beeple. octane render,in the style of Jin Kagetsu, James Jean and wlop, highly detailed, sharp focus, intricate concept art, digital painting, ambient lighting, 4k, artstation
+tracer overwatch portrait, close up, concept art, intricate details, highly detailed photorealistic portrait by michael komarck, joel torres, seseon yoon, artgerm and warren louw
+a grim reaper with a crt monitor for a head. the monitor has a blue screen with white letters on it. by frank frazetta, simon bisley, brom, concept art, octane render, unreal engine 5, highly detailed, high quality, 8 k, soft lighting, realistic face, path traced
+blender gloomy colossal ruined server room in datacenter robot figure automata headless drone robot knight welder posing pacing fixing soldering mono sharp focus, emitting diodes, smoke, artillery, sparks, racks, system unit, motherboard, by pascal blanche rutkowski artstation hyperrealism cinematic dramatic painting concept art of detailed character design matte painting
+a photograph of a robot endoskeleton submerged and rusted in the water, cinematic, volumetric lighting, f 8 aperture, cinematic eastman 5 3 8 4 film, photorealistic by greg rutkowski, by stanley artgerm, by alphonse mucha
+hyper detailed ultra sharp, trending on artstation, vibrant aesthetic, bloodwave, colorful, psychedelic, ornate, intricate, digital painting, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and h. r. giger, 8 k
+gothic bell tower, view from above. in style of greg rutkowski, jesper ejsing, makoto shinkai, trending on artstation, fantasy, great composition, concept art, highly detailed, scenery, 8 k, behance.
+ned kelly, extremely detailed, artstation, 8 k, sensual lighting, incredible art, wlop, artgerm
+a girl in times square new york, very sexy outfit, very anime, medium shot, visible face, detailed face, perfectly shaded, atmospheric lighting, by makoto shinkai, stanley artgerm lau, wlop, rossdraws
+full-body baroque and cyberpunk glass sculpture of attractive muscular iridescent Nick Jonas as a humanoid deity wearing a thin see-through plastic hooded cloak sim roupa, posing like a superhero, glowing pink face, crown of white lasers, large diamonds, swirling black silk fabric. futuristic elements. oozing glowing liquid, full-length view. space robots. human skulls. throne made of bones, intricate artwork by caravaggio. Trending on artstation, octane render, cinematic lighting from the right, hyper realism, octane render, 8k, depth of field, 3D
+breathtaking detailed soft painting of silver hours of sun, caresses on pepper plains, the hand of the country on my shoulder, rembrandt style, elegant, highly detailed, artstation, concept art, matte, sharp focus, art by tom bagshaw, and greg rutkowski
+Emma Watson as a dune princess, sci-fi, amber eyes, face, long hair, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+game of thrones, masterpiece, pinup, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K
+one beautiful symmetrical close up head shoulder face portrait android woman time machine axonometric mechanical fantasy intricate elegant highly detailed in volumetric void of latent space, golden turquoise steampunk, axonometric high contrast cinematic light, mystical shadows, digital painting, smooth, sharp focus, divine realm of gods, octane render, photographic, concept art, artist leonardo davinci, unreal engine 8 k
+arnold schwarzenegger surfing inside erupting volcano, stunning scene, 8 k, extremely detailed digital painting, depth, bright colors, trending on artstation
+a photorealistic dramatic fantasy render of a beautiful woman billie eilish wearing a beautiful intricately detailed japanese monkey kitsune mask and clasical japanese kimono by wlop, artgerm, greg rutkowski, alphonse mucha, epic, beautiful dynamic dramatic dark moody lighting, shadows, cinematic atmosphere, artstation, concept design art, octane render, 8 k
+Portrait of a space astronaut monkey, fantasy, intricate, highly detailed, digital painting, trending on artstation, sharp focus, illustration, style of Stanley Artgerm
+goddess of death, braids, decaying face, neon hair, intricate illuminated jewellery, digital painting, surrealism, extreme detail, cinematic lighting, trending on artstation, by hans zatzka
+a zombie teenager staring at their phone, tristan eaton, victo ngai, artgerm, rhads, ross draws
+realistic detailed face portrait of a rugged male wizard with black hair wearing a hooded cloak by alphonse mucha, ayami kojima, amano, greg hildebrandt, and mark brooks, male, masculine, art nouveau, neo - gothic, gothic, character concept design
+a shadowy figure in tattered robes sees another figure in the distance, in an alien desert during a sandstorm ; tension, creepy mood, uneasy atmosphere, weird fiction art, breathtaking digital illustration, cinematic lighting, striking perspective, aesthetic composition, trending on artstation
+an epic painting minion looking like elon musk presenting new tesla, pencil drawing, perfect composition, golden ratio, beautiful detailed, photorealistic, digital painting, concept art, smooth, sharp focus, illustration, artstation trending, octane render, unreal engine
+Hedgehog magus, Tzeentch, portrait, nature, fairy, forest background, magic the gathering artwork, D&D, fantasy, cinematic lighting, centered, symmetrical, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, volumetric lighting, epic Composition, 8k, art by Akihiko Yoshida and Greg Rutkowski and Craig Mullins, oil painting, cgsociety
+mermaid emma watson, perfectly-centered-painting of emma watson, sweaty, dynamic action pose, insane, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, Unreal Engine 5, 8K, art by artgerm and greg rutkowski and alphonse mucha
+painting of hybrid between cat & dragon & snake & fox, intercrossed animal, by zdzislaw beksinski, by lewis jones, by mattias adolfsson, cold hue's, warm tone gradient background, concept art, beautiful composition, digital painting
+character portrait of a raven angel of night with iridescent black raven wings wearing robes, lord of change, by peter mohrbacher, mark brooks, jim burns, marina abramovic, wadim kashin, greg rutkowski, trending on artstation
+girl sitting on a stair under a vine rack, many green plant and flower gowing on it, illustration concept art anime key visual trending pixiv fanbox by wlop and greg rutkowski and makoto shinkai and studio ghibli
+giant skeletal ghoul devouring a mountain of skulls, digital painting, mixed media, trending on artstation and deviantart, epic composition, highly detailed, 8 k
+portrait of jean baudrillard, soft hair, muscular, half body, leather, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+hacker girl sits at an apple ] [ e, realistic shaded lighting poster by ilya kuvshinov katsuhiro otomo, magali villeneuve, artgerm, jeremy lipkin and michael garmash and rob rey
+movie still macro close photo of koala selling nft, by weta disney pixar greg rutkowski wlop ilya kuvshinov rossdraws artgerm octane render iridescent, bright morning, liosh, mucha
+a coffee shop store in The City of Ukraine at night with a few customers, extreme plus resolution fantasy concept art, intricate details to everything visible, sharp lighting, Dramatic light by denis villeneuve, strong emphasis on alphonse mucha, Makoto Shinkai
+the interior of a store that sells board games and sushi, intricate, digital painting, masterpiece, rending on artstation, octane render, art by artgerm and greg rutkowski and alphonse mucha and craig mullins and James Jean and Andrei Riabovitchev and Marc Simonetti and peter mohrbacher
+danny devito as wolverine, oil on canvas portrait, octane render, trending on artstation
+portrait painting of male evil demonic cult member, agony, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and alphonse mucha
+a snoop dogg wearing sun glasses tennis ball monster, snoop dogg tennis ball head, smoking, smoke, monster teeth, colorful, chalk digital art, fantasy, magic, chalk, trending on artstation, ultra detailed, professional illustration by basil gogos
+dusk land dark city filled with shadow people, desolate, gloomy, intricate, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski
+portrait of a beautiful woman wearing a sari dress, holding a bouquet of flowing flowers, drenched body, wet dripping hair, emerging from the water, fantasy, regal, fractal crystal, fractal gems, by stanley artgerm lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman rockwell
+astronaut drifting in space, artwork by greg rutkowski
+book cover!!!!!!!!!!!!, old bridge, fantasy forest landscape, fantasy magic, light night, intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by wlop and artgerm and ivan shishkin and andrey shishkin, masterpiece
+a beautiful hyperrealistic detailed 3D render of a burning monument, by Anton Otto Fischer, Atey Ghailan, genzoman, unreal engine, octane render, gigantic, 3D, brilliantly coloured, intricate, ultra wide angle, trending on artstation, embers, smoke, dust, dusk, volumetric lighting, HDR, polished, micro details, ray tracing, 8k
+close-up macro portrait of the face of a beautiful princess with ram skull mask, epic angle and pose, symmetrical artwork, 3d with depth of field, blurred background, cybernetic jellyfish female face skull phoenix bird, translucent, nautilus, energy flows of water and fire. a highly detailed epic cinematic concept art CG render. made in Maya, Blender and Photoshop, octane render, excellent composition, cinematic dystopian brutalist atmosphere, dynamic dramatic cinematic lighting, aesthetic, very inspirational, arthouse. y Greg Rutkowski, Ilya Kuvshinov, WLOP, Stanley Artgerm Lau, Ruan Jia and Fenghua Zhong
+a super realistic dragon that is on fire standing dramatically on a destroyed city, ultrawide shot, surreal, sharp focus, digital art, epic composition, concept art, dynamic lighting, intricate, highly detailed, 8 k, unreal engine, blender render
+man in suit launching the nukes, matte painting concept art, baroque, beautifully backlit, swirly vibrant color lines, fantastically gaudy, aesthetic octane render, 8 k hd resolution, by caravaggio and diego velazquez
+an extremely psychedelic portrait of SalvadorDali, by Raphael Hopper, and Rene Magritte. Extremely Highly detailed, Occult, funny, humorous, humor, hilarious, funny, entertaining, magical, trending on artstationHQ, LSD, diffuse lighting, fantasy, intricate, elegant, highly detailed, lifelike, photorealistic, digital painting, artstation, illustration, concept art, smooth, sharp focus, art by John Collier and Albert Aublet and Krenz Cushart and Artem Demura and Alphonse Mucha and Giuseppe Arcimboldo
+inside an etheral atompunk city, highly detailed, 4k, HDR, award-winning, octane render, trending on artstation, volumetric lighting
+subspace emissary, jungle groove, constellation - based cathedral, octane render, trending on artstation, ray - tracing, subsurface scattering, 4 k, high quality desktop wallpaper
+a dream of being trapped underwater, thalassophobia, fear of the ocean, open water, imagination, dream, concept art, trending on artstation, highly detailed
+an anime portait shogun knight with a lightsaber halberd, dark metal armor, and a tattered cape, by stanley artgerm lau, wlop, rossdraws, james jean, andrei riabovitchev, marc simonetti, and sakimichan, trending on artstation
+postmodern zakopane designed by louis sullivan, still from a movie, photo art, artgerm, trending on artstation
+a beautiful action portrait of a handsome DnD-ranger hunting in a forest, face is brightly lit, by Greg Rutkowski and Raymond Swanland, Trending on Artstation, ultra realistic digital art
+jim carrey, portrait shinkai makoto studio ghibli studio key hideaki anno sakimichan stanley artgerm lau rossdraws james jean marc simonetti elegant highly detailed digital painting artstation pixiv
+a man tied to a pillar by jack russel terrier, highly detailed, hyperrealistic digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+a professional painting of an russian young blonde girl intricate, wearing russian ancient folk dress, elegant, digital painting, concept art, smooth, sharp focus, finely detailed illustration, beautifully framed, from Metal Gear, in the style of Artgerm and Greg Rutkowski and William-Adolphe Bouguerea
+soaring woman wearing a round mask hiding her face with many thick long blades behind head. dressed in a long robe with wide sleeves and making anjali mudra gesture. highly detailed, symmetric, concept art, saturated colors, masterpiece, fantasy art, hyperdetailed, hyperrealism, art by zdzisław beksinski, arthur rackham, dariusz zawadzki, larry elmore
+ancient queen billie eilish, symetrical, diffuse lighting, fantasy, intricate, elegant, highly detailed, lifelike, photorealistic, digital painting, artstation, illustration, concept art, 4 k, smooth, sharp focus, art by john collier and albert aublet and krenz cushart and artem demura and alphonse mucha
+a gorgeous kanye west photo, professionally retouched, soft lighting, realistic, smooth face, full body shot, torso, perfect eyes, wide angle, sharp focus on eyes, 8 k high definition, insanely detailed, intricate, elegant, art by artgerm and jason chan and mark litvokin
+a bear and a bunny chimera with the size and strength of a bear, The white color and long bunny ears of a bunny and golden brown antlers. Concept art. Fantasy. Trending on artstation. Masterpiece. By Karlkka. By Greg Rutkowski James Gurney
+beautiful anime girl with short white hair, wearing lab coat and glasses, holding a clipboard, standing inside a research facility, character portrait, 1 9 6 0 s, long hair, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by wlop, charlie bowater and alexandra fomina
+Manga cover portrait of an extremely cute and adorable beautiful curious happy puppy smelling a flower, summer vibrance, 3d render diorama by Hayao Miyazaki, official Studio Ghibli still, color graflex macro photograph, Pixiv, DAZ Studio 3D
+concept art of fried egg, highly detailed painting by dustin nguyen, akihiko yoshida, greg tocchini, greg rutkowski, cliff chiang, 4 k resolution, trending on artstation, 8 k
+Bob Dylan design, character sheet, Kim Jung Gi, Greg Rutkowski, Zabrocki, Karlkka, Jayison Devadas, Phuoc Quan, trending on Artstation, 8K, ultra wide angle, zenith view, pincushion lens effect
+dichroic ant axolotl snail bug bee fly worm caterpillar fish, (((artstation, concept art, smooth, sharp focus, artgerm, Tomasz Alen Kopera, Peter Mohrbacher, donato giancola, Joseph Christian Leyendecker, WLOP, Boris Vallejo))), octane render, unreal engine, 3d render, , octane render, nvidia raytracing demo, grainy, muted
+sojourn from overwatch, african canadian, gray hair, character portrait, portrait, close up, concept art, intricate details, highly detailed, vintage sci - fi poster, retro future, vintage sci - fi art, in the style of chris foss, rodger dean, moebius, michael whelan, and gustave dore
+open treasure chest with the greatest riches on earth, deep focus, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, hearthstone, art by artgerm and greg rutkowski and alphonse mucha
+hunter woman walking across foggy river, unreal engine 5, art by artgerm and greg rutkowski and alphonse mucha, global illumination, detailed and intricate environment, hyperrealistic, volumetric lighting, epic cinematic shot, perfectly defined features, ambient occlusion
+psychedelic ; trippy ; acid trip ; artgerm ; salvadore dali ; surreal ; abstract ; lsd ; jesus christ ; ascension ; symmetrical ; mathematical
+girl floating on the night sky, gaint planet in the background, illustration concept art anime key visual trending pixiv fanbox by wlop and greg rutkowski and makoto shinkai and studio ghibli
+a strange alien fruit, photorealistic, 8 k, professional food photography, volumetric lighting, trending on artstation
+painting of hybrid between bear & snake, animal has snake body, intercrossed animal, by zdzislaw beksinski, by lewis jones, by mattias adolfsson, cold hue's, warm tone gradient background, concept art, beautiful composition, digital painting
+a professional photographic view picture of a alley in space, photographic filter unreal engine 5 realistic hyperdetailed 8 k ultradetail cinematic concept art volumetric lighting, very beautiful scenery, very realistic effect, hd, hdr, cinematic 4 k wallpaper, 8 k, sharp focus, octane render, ultra detailed, high resolution, artstation trending on artstation in the style of albert dros glowing rich colors powerful imagery
+mulan, d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, hearthstone, art by artgerm and greg rutkowski and alphonse mucha
+'' Portrait of Beautiful blonde Slavic woman in her early 30's, league of legends, LOL, fantasy, d&d, digital painting, artstation, concept art, sharp focus, illustration, art by greg rutkowski and alphonse mucha ''
+beautiful cottagecore kim kardashian holding a adidas yeezy shoe. intricate, elegant. highly detailed, digital painting, artstation, concept art, smooth, sharp, focus, illustration. . art by artgerm and greg rutkowski and alphonse mucha
+symmetry, multiple humans in solid silhouettes, saluting, dancing, interacting and posing, mooc, organic and intricate, elegant, highly detailed, concept art, sharp focus, illustration, high contrast, long shadows, painted with colour on white, 8 k
+cinematic bust portrait of psychedelic cyborg, head and chest only, exotic alien features, Tim Hildebrandt, Wayne Barlowe, Bruce Pennington, donato giancola, larry elmore, oil on canvas, masterpiece, trending on artstation, featured on pixiv, cinematic composition, dramatic pose, beautiful lighting, sharp, details, hyper-detailed, HD, HDR, 4K, 8K
+a highly detailed illustration of cute smug pink haired pale girl with curved horns wearing oversized pink hoodie, dramatic smirk pose, intricate, elegant, highly detailed, centered, soft light, character design, cushart krenz, digital painting, artstation, concept art, smooth, sharp focus, league of legends concept art, wlop.
+portrait of a young mila kunis in front of a cyberpunk city, dramatic light, city background, sunset, high contrast, sharp, painted by stanley lau, painted by greg rutkowski, painted by stanley artgerm, digital art, trending on artstation
+portrait close up of guy, concentrated look, symmetry, long hair. d & d, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, art by artgerm and greg rutkowski and alphonse mucha, boris vallejo
+ law contrasts, fantasy concept art by Jakub Rozalski, Jan Matejko, and J.Dickenson
+professional concept art ethereal ghostlike valkyrie figure fluid simulation in houdini dancing in dark smoke robes and silk veils by ilm, paolo roversi, nick knight, amy judd, beautiful simplified form in turbulent movement, dark studio background, turner, romantic, trending on artstation, hyperrealism, matte painting, dutch golden age, fine detail, cgsociety
+portrait Anime batman cosplay girl cute-fine-face, pretty face, realistic shaded Perfect face, fine details. Anime. realistic shaded lighting by katsuhiro otomo ghost-in-the-shell, magali villeneuve, artgerm, rutkowski Jeremy Lipkin and Giuseppe Dangelico Pino and Michael Garmash and Rob Rey
+hyperrealism, detailed textures, photorealistic 3 d, a young boy walking down the street holding a worn out teddy bear, ultra realistic, cinematic, intricate, cinematic light, concept art, illustration, art station, unreal engine 8 k
+nuclear power plant, colorful, sci-fi, clean, utopia, surrounded by wilderness, sunset, octane render, substance painter, zbrush, trending on artstation, 8K, highly detailed.
+Defect from Slay the Spire, concept art, by Odilon Redon
+insanely detailed procedural render expressive scene of chrome spacesuits protecting the dancing nudibranch girl from certain doom as the planet they orbit sends spores attack them, photorealism, sharp focus, award winning, tristan eaton, victo ngai,, maxfield parrish, artgerm, koons, ryden, intricate details, 3 / 4 view, bokeh
+portrait art of Gene Kelly 8k ultra realistic , lens flare, atmosphere, glow, detailed,intricate, full of colour, cinematic lighting, trending on artstation, 4k, hyperrealistic, focused, extreme details,unreal engine 5, cinematic, masterpiece
+portrait painting of a post - apocalyptic bald androgynous teenager with white eyes and a green aura around his head, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and charlie bowater and magali villeneuve and alphonse mucha
+ancient neon monster portrait, intricate artwork by josan gonzalez, artgerm, h. r. giger, kilian eng, very coherent artwork, cinematic, hyper realism, vibrant, octane render, unreal engine, 8 k, high contrast, higly detailed black ink outline
+twin peaks poster art, portrait of the black lodge has the blue colored rose trapped in a glass box, can david bowie find it, by michael whelan, rossetti bouguereau, artgerm, retro, nostalgic, old fashioned
+the beautiful hyper detailed scene render that a beautiful girl lies in the arms of a huge silver dragon alone in the fairyland surrounded by white clouds, in the style of makoto shinkai victo ngai and peter mohrbacher studio ghibli artgerm karol bak beeple, animation style, 8 k hd, dream, ultra wide angle, animation style, 3 drender, hyperdetailed
+portrait of a beautiful young fit male angel with curly blond hairs, dressed with fluent clothes, luminous scene, by Greg Rutkowski and alphonse mucha, d&d character, gradient white to cyan, in front of an iridescent background, highly detailed portrait,
+a scene of a camper in the desert, a cowboy in the foreground looking epic, full shot, atmospheric lighting, detailed faces, by makoto shinkai, stanley artgerm lau, wlop, rossdraws
+lalisa manoban of blackpink, knight armor, tarot card, highly detailed, digital painting, smooth, sharp focus, illustration, ultra realistic, 8 k, art by artgerm and alphonse mucha
+feudal japan tokyo street at dusk, raining, detailed reflections, on a postcard, cinematic lighting!!, 4k, trending on artstation, detailed watercolour, rule of thirds, center focus, art by albert bierstadt
+concept art by greg rutkowski, a gigantic spear - shaped starship approaches the system, huge and megalithic, plowing through space, frightening and creepy atmosphere, scifi, digital painting, artstation, concept art, smooth, sharp foccus ilustration, artstation hq
+beautiful girl a strange wind blew in off the north sea, an eerie susurration that cut across the eastern sea, beautiful portrait, symmetrical, character concept style trending on artstation concept art detailed octane render cinematic photo - realistic 8 k high detailed
+the street of a frozen village in ice that never the see the sun again, concept art by makoto shinkai and greg rutkowski, matte painting, trending on artstation
+full length photo of a gorgeous young woman in the style of stefan kostic, realistic, sharp focus, 8k high definition, insanely detailed, intricate, elegant, art by stanley lau and artgerm
+beautiful sci fi space scene with planets, concept art trending on artstation, volumetric lighting, 8k
+brigitte from overwatch, character portrait, portrait, close up, concept art, intricate details, highly detailed, vintage sci - fi poster, retro future, vintage sci - fi art, in the style of chris foss, rodger dean, moebius, michael whelan, and gustave dore
+will smith fights against demons dressed as a gladiator and with angel wings, cinematic lighting, highly detailed, concept art, art by wlop and artgerm and greg rutkowski, masterpiece, trending on artstation, 8 k
+Till Lindemann crushing planet earth with his teeth. epic game portrait. Highly detailed, highly recommended. fantasy art by Greg Rutkowski
+Art nouveau Ferarri, fantasy, intricate galactic designs, elegant, highly detailed, sharp focus, art by Artgerm and Greg Rutkowski and WLOP
+walter white as lara croft, digital painting, extremely detailed, 4 k, intricate, brush strokes, mark arian, artgerm, bastien lecouffe - deharme
+a swamp viewed from afar with one huge tree in the middle, dark colors, glowing plants, misty background, light rays, sunset!, birds, beautiful lighting, vivid colors, intricate, elegant, smooth, sharp focus, highly detailed digital painting, concept art, cinematic, unreal engine, 4 k wallpaper, svetlin velinov, tarmo juhola, artstation trending
+wide angle, mage, sleeping on rock, white grey blue color palette, eyes closed, forest, female, d & d, fantasy, intricate, elegant, highly detailed, long brown hair, digital painting, artstation, octane render, concept art, matte, sharp focus, illustration, hearthstone, art by artgerm, alphonse mucha johannes voss
+cinematic portrait of the incredible hulk, only head and chest, intricate, desaturated, Tim Hildebrandt, Wayne Barlowe, Bruce Pennington, donato giancola, larry elmore, maxfield parrish, Moebius, Thomas Ehretsmann, oil on canvas, gouache painting, masterpiece, trending on artstation, cinematic composition, dramatic pose, volumetric lighting, sharp, details, hyper-detailed, HD, 4K, 8K
+cinematic bust portrait of futuristic robot from left, head and chest only, exotic alien features, robotic enhancements, desaturated, tim hildebrandt, wayne barlowe, bruce pennington, donato giancola, larry elmore, oil on canvas, masterpiece, trending on artstation, featured on pixiv, cinematic composition, dramatic pose, beautiful lighting, sharp, details, hyper - detailed, hd, hdr, 4 k, 8 k
+perfectly detailed wisteria flowers!! blessed by nature with ever - increasing physical mental perfection, symmetrical! intricate, sensual features, highly detailed, biblical divine holy perfection!! digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+portrait of teenage girl with long glossy black hair, blue eyes, glowing skin, fashion model features, fantasy, intricate, elegant, black dress, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by Krenz Cushart and Artem Demura and alphonse mucha
+a cinematic detailed painting of a black kid in the woods, volumetric light, surrealism, highly detailed, realistic, retro, in the style of francis bacon and james jean, trending on artstation, painting by Edward Hoper, colorful, realistic, smooth, octane render
+concept art from zaha hadid, futuristic, ultra realistic, concept art, intricate details, highly detailed, photorealistic, octane render, 8 k
+hyperrealistic mixed media painting of a grungy skull woman with rainbow hair, stitched together, soft eyes and narrow chin, dainty figure, long hair straight down, torn v plunge shirt, short shorts, combat boots, basic white background, side boob, wet tshirt, wet, raining, dim volumetric lighting, 8 k octane beautifully detailed render, post - processing, portrait, extremely hyper - detailed, intricate, epic composition, cinematic lighting, masterpiece, trending on artstation, very very detailed, masterpiece, stunning,
+cat theme logo, cat theme banner, cat design, a smiling cat, art photography style, trending on artstation, warm light, lovely and cute, fantasy art, 8 k resolution
+cover concept art of the lost sand city, levitating sand, ground view, golden towers, golden pillars, palm trees, space and time, floating objects, post-processing, in the style of Hugh Ferriss, Behance, Artgerm. High detail, ultra realistic render, octane, 3D, photorealism, symmetric, cinematic
+male anime character, oni mask, organic, forest druid, dark souls boss, cyber punk, portrait, male anime character, robot, masterpiece, intricate, highly detailed, sharp, technological rings, by james mccarthy, by beeple and johfra bosschart, combination in the style ayami kojima, highly detailed, painting, 3 d render beeple, unreal engine render, intricate abstract, intricate artwork, by tooth wu, wlop, beeple, dan mumford. concept art, octane render, trending on artstation, greg rutkowski very coherent symmetrical artwork. cinematic, key art, hyper realism, high detail, octane render, 8 k, iridescent accents, albedo from overlord, the library of gems, intricate abstract. intricate artwork, by tooth wu, wlop, beeple, dan mumford. concept art, octane render, trending on artstation, greg rutkowski very coherent symmetrical artwork. cinematic, key art, hyper realism, high detail, octane render, 8 k, iridescent accents
+Lionel Messi closeup, D&D style, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+young shadow mage male, joyful, d & d, fantasy, intricate, elegant, full body, highly detailed, digital painting, artstation, concept art, matte, sharp, illustration, hearthstone, art by artgerm and greg rutkowski and alphonse mucha
+ultra realistic illustration, emma roberts from last of us, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+A beautiful cosmic entity || VERY ANIME, fine-face, realistic shaded perfect face, fine details. Anime. realistic shaded lighting poster by Ilya Kuvshinov katsuhiro otomo ghost-in-the-shell, magali villeneuve, artgerm, Jeremy Lipkin and Michael Garmash, Rob Rey and Kentarō Miura style, trending on art station
+muscular gandhi at the beach, sitting on the sand next to a campfire, with palm trees in the back, by artgerm, ilya kuvshinov katsuhiro villeneuve, jeremy lipkin and michael garmash and rob rey, disney pixar zootopia, by tristan eaton, stanley artgermm, tom bagshaw, greg rutkowski, carne griffiths
+A fancy portrait of an attractive humanoid creature by Greg Rutkowski, beeple, Sung Choi, Mitchell Mohrhauser, Maciej Kuciara, Johnson Ting, Maxim Verehin, Peter Konig, final fantasy, macro lens , 8k photorealistic, cinematic lighting, HD, high details, dramatic, dark atmosphere, trending on artstation
+headless horseman in a marvel movie, science fiction industrial hard science concept art, 8K render octane high definition cgsociety, photorealistic, unreal engine 5
+a highly detailed metahuman 4 k close up render of a goddess bella hadid monument renaissance in iris van herpen dress schiaparelli in diamonds crystals swarovski and jewelry iridescent in style of alphonse mucha gustav klimt trending on artstation made in unreal engine 4
+closeup portrait shot of a ring wraith in a scenic dystopian environment, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+Redhead Pleiadian alien human beautiful hybrid feminine woman, long gorgeous red hair in loose curls, with stunning green eyes, cute round face and a roundish nose, as a retro futuristic heroine, gorgeous digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and donato giancola and Joseph Christian Leyendecker, Ross Tran, WLOP
+gigachad luigi bodybuilder in a expensive dress suit by ilya kuvshinov, ernest khalimov body by krista sudmalis, fantasy character portrait, futuristic town background by laurie greasley, ultra realistic, concept art, intricate details, elegent, digital painting, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, artstation
+painting of a gorgeous young woman in the style of Martine Johanna, draped in flowing fabric, colorful energetic brush strokes, realistic, sharp focus, 8k high definition, insanely detailed, intricate, elegant, art by Martine Johanna and artgerm
+l lawliet, hunchback, death note, d & d, fantasy, portrait, highly detailed, headshot, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve and wlop
+a portrait of Chiefkeef in front of an Art Nouveau mandala wearing a huge elaborate detailed ornate crown made of all types of realistic colorful flowers, turban of flowers, sacred Geometry, Golden ratio, surrounded by scattered flowers peonies dahlias lotuses roses and tulips, photorealistic face, Cinematic lighting, rimlight, detailed digital painting, Portrait, headshot, in style of Alphonse Mucha, Artgerm, WLOP, Peter Mohrbacher, William adolphe Bouguereau, cgsociety, artstation, Rococo and baroque styles, symmetrical, hyper realistic, 8k image, 3D, supersharp, pearls and oyesters, turban of vibrant flowers, satin ribbons, pearls and chains, perfect symmetry, iridescent, High Definition, Octane render in Maya and Houdini, light, shadows, reflections, photorealistic, masterpiece, smooth gradients, no blur, sharp focus, photorealistic, insanely detailed and intricate, cinematic lighting, Octane render, epic scene, 8K
+percy jackson in cyberpunk city, 4 k, trending on artstation.
+wonderdream faeries lady feather wing digital art painting fantasy bloom vibrant style mullins craig and keane glen and apterus sabbas and guay rebecca and demizu posuka illustration character design concept colorful joy atmospheric lighting butterfly
+Boris Johnson as Thor with hammer Mjolnir, Boris Johnson hairstyle, full body realistic portrait, highly detailed, muscular body, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
+An ancient Iranian fortress as Far Cry 4 concept art, spring season, beautiful, gorgeous buildings, , concept art by Viktor Vasnetsov, concept art, ancient era, warm lighting, soft by Ivan Shishkin, Dimitri Desiron and Antonio Lopez Garcia, hyperborea, high resolution, trending on artstation,
+high detailed white space station interior a statue jesus on cross made of red marble, perfect symmetrical body, full body shot, inflateble shapes, wires, tubes, veins, jellyfish, white biomechanical details, wearing epic bionic cyborg implants, masterpiece, intricate, biopunk, vogue, highly detailed, artstation, concept art, cyberpunk, octane render
+Award-Winning. Trending on Artstation. 8K. Corrupted Knight infected with black obsidian glowing red. Angular. Sharp. Ready for battle.
+2 0 year old ethiopian man, sitting on a black corvette, counting money, portrait, elegant, intricate, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by konstantin korovin and daniel f. gerhartz and john howe
+beautiful ethereal cyberpunk jennifer lawrence, art nouveau, fantasy, intricate binary and electronic designs, elegant, highly detailed, sharp focus, art by artgerm and greg rutkowski and wlop
+symmetry!! portrait of a horizon zero dawn machine acting as ironman, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+beautiful portrait of a minority female wearing fantastic costume,pigtail,intricate, elegant, highly detailed, dim volumetric lighting, 8k,octane,post-processing,digital painting, trending on artstation, concept art, smooth, sharp focus, illustration,by Tom Bagshaw and Daniel Gerhartz and Albert Aublet and Lawrence Alma-Tadema and alphonse mucha
+a female elf sorceress by karol bak and jia ruan, beautiful detailed eyes, cute, fantasy, intricate, elegant, highly detailed, digital painting, 4 k, hdr, concept art, detailed jewelry, smooth, sharp focus, illustration, art by artgerm
+high quality 3 d render very cute cyborg labrador!! dog plays drums!, cyberpunk highly detailed, unreal engine cinematic smooth, in the style of blade runner & pixar, hannah yata charlie immer, moody light, low angle, uhd 8 k, sharp focus
+concept art of an intelligent bear, bipedal, wearing glasses and a vest, holding a spellbook under his arm, anthromorphic, artstation, fantasy
+symmetrical - face!! portrait shot of evil sithlord captain kirk from star trek in star wars, realistic, professionally, professionally color graded, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+glorious full head portrait of abraham lincoln as Batman, fantasy, intricate, elegant, digital painting, trending on artstation, concept art, sharp focus, illustration by Gaston Bussiere and artgerm, 4k.
+a photorealistic 3 d seamless pattern of honey material with macro closeup details of circuits cables nvidia motherboard pcb futuristic robotic elements in glass and mirror in the style of zaha hadid, 3 d realistic model render in cyberpunk 2 0 7 7 colors, unreal engine 5, keyshot, octane, artstation trending, ultra high detail, ultra realistic, cinematic, 8 k, 1 6 k, large realistic elements in style of nanospace michael menzelincev, in style of lee souder, in plastic, dark atmosphere, tilt shift, depth of field
+skull - headed robot cyborg painting, illutstration, concept art, cyberpunk, futurism, comics art, artgerm
+hyper realistic portrait, beautifully rendered, luis guzman as luigi wearing green, smirking deviously, painted by greg rutkowski, wlop, artgerm, dishonored 2
+mahindra thar driving through madagascar with baobabs trees, artgerm and greg rutkowski and alphonse mucha, an epic fantasy, volumetric light, detailed, establishing shot, an epic fantasy, trending on art station, octane render, midsommar
+kurdish! assassins creed game set in kurdistan!, concept art, digital painting, highly detailed, 8 k, high definition
+portrait of ((mischievous)), baleful young Cate Blanchett as young Galadriel as a queen of fairies, dressed in a beautiful silver dress. The background is a dark, creepy eastern europen forrest. night, horroristic shadows, high contrasts, lumnious, photorealistic, dreamlike, (mist filters), theatrical, character concept art by ruan jia, thomas kinkade, and J.Dickenson, trending on Artstation
+elephant yoda playin socker, stunning digital art, high detail, in the style of artgerm, artstation, cgsociety, dramatic lighting, pixar 3d 8k
+photo of nikolas cage as ken from street fighter 2, shoulder length hair, high - contrast, intricate, action pose, highly detailed, centered, digital painting, artstation, smooth, sharp focus, illustration, artgerm, tomasz alen kopera, peter mohrbacher, donato giancola, joseph christian leyendecker, wlop, boris vallejo
+a portrait of frodo baggins, fantasy, sharp focus, intricate, elegant, digital painting, artstation, matte, highly detailed, concept art, illustration, ambient lighting, art by ilya kuvshinov, artgerm, alphonse mucha, and greg rutkowski
+a very beautiful anime elf girl, full body, long silver hair with a flower, sky blue eyes, full round face, short smile, revealing clothes, thick thigs, firm chest, ice snowy lake setting, cinematic lightning, medium shot, mid-shot, highly detailed, trending on Artstation, Unreal Engine 4k, cinematic wallpaper by Stanley Artgerm Lau, WLOP, Rossdraws, James Jean, Andrei Riabovitchev, Marc Simonetti, and Sakimichan
+a League of Legends FAN ART Portrait of VI, pink hair, short hair, elegant, highly detailed, digital painting, concept art, smooth, sharp focus, illustration, by Laurie Greasley,Lawrence Alma-Tadema,Dan Mumford,artstation,deviantart,Unreal Engine,face enhance,8K,golden ratio,cinematic lighting
+!dream concept art, four glam rockers dressd as a mix of hooligans and whores, walking down a dark wet london alley at night, by ashley wood, by roger deakins, atmospheric
+art portrait of death, 8 k, by tristan eaton, stanley artgermm, tom bagshaw, greg rutkowski, carne griffiths, trending on deviantart, face enhance, hyper detailed, minimalist cinematic lighting, trending on artstation, 4 k, hyperrealistic, focused, extreme details, unreal engine 5, cinematic, masterpiece, full of colour,
+A beautiful robotic woman dreaming, cinematic lighting, soft bokeh, sci-fi, modern, colourful, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, by greg rutkowski
+highly detailed vfx portrait of ichigo kurosaki from bleach by tite kubo!!!, stephen bliss, greg rutkowski, loish, rhads, beeple, makoto shinkai, tom bagshaw, alphonse mucha, sharp focus, art by artgerm and greg rutkowski, stanley kubrick, backlit!!,
+fantasy city at night while giant ball of fire crashes to the ground, surreal, digital art, concept art, highly detailed, trending on artstation
+concept art of a lightray trapped in vacuum, high definition, symmetrical, insanely detailed, elegant, intricate, hypermaximalist, cgsociety, prizewinning, trending on artstation, popular, top 1 0 0, best, winner, mentor, guru
+a dream microphone in a dystopic world full of aberration, black & white, melting, webbing, 8 k, by tristan eaton, stanley artgerm, tom bagshaw, greg rutkowski, carne griffiths, ayami kojima, beksinski, giger, trending on deviantart, face enhance, hyper detailed, minimalist, horror, alien
+link from zelda using computer, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, 8 k
+lofi steampunk portrait pixar style by (((Lita Cabellut))) and Stanley Artgerm and Tom Bagshaw
+fractional reserve banking, watercolor, trending on artstation
+a painting of a tank getting shot at in world war 2 by Bernardo Bellotto, high detail, hyperrealistic, concept art, artstation, 8k
+a beautiful diva sings on the theater stage , octane render, cgsociety, artstation trending, palatial scene, highly detailded
+Ghibli, good day, landscape, no people, no man, fantasy, wood, vibrant world, Anime Background, concept art, illustration,smooth, sharp focus, intricate, super wide angle, trending on artstation, trending on deviantart, Hayao Miyazaki, 4K
+sapphire viking warrior, regal, elegant, winter, snow, beautiful, stunning, hd, illustration, epic, d & d, fantasy, intricate, elegant, highly detailed, wide angle, digital painting, artstation, concept art, smooth, sharp focus, illustration, wallpaper, art by artgerm and greg rutkowski and alphonse mucha and jin xiaodi
+fullbody!! dynamic action pose, beautiful woman with blue hair, antlers on her head, long flowing intricate black dress, dnd, face, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+A masterpiece ultrarealistic ultradetailed portrait of a Incredibly beautiful llama with dreadlocks IN INCREDIBLE GLASSES. baroque renaissance. in the forest. White corset. medium shot, intricate, elegant, highly detailed. trending on artstation, digital art, by Stanley Artgerm Lau, WLOP, Rossdraws, James Jean, Andrei Riabovitchev, Marc Simonetti, Yoshitaka Amano. background by James Jean and Gustav Klimt, light by Julie Bell, 4k, porcelain skin. BY ZDIZISLAW BEKSINSKI Cinematic concept art
+young harry potter as a gepard with gepard skin patterns hyper detailed, digital art, trending on artstation, cinematic lighting
+a dik dik monster with tattoos, wearing a fedora, tattoos, colorful, digital art, fantasy, magic, trending on artstation, ultra detailed, professional illustration by basil gogos
+a astronaut walking on a alien planet with alien plants and looking to a alien breathtaking landscape, cinematic lighting, concept art, trending on Artstation, trending on DeviantArt, highly detailed, high quality, 8K HDR, octane render, unreal engine 5, breathtaking landscape, highly detailed, high quality, post processed
+skeleton geisha in a burdel, Tending on artstation, concept art, dark colors, 8k
+realistic attractive grungy woman with rainbow hair, drunk, angry, soft eyes and narrow chin, dainty figure, long hair straight down, torn overalls, basic white background, side boob, tattooed, pierced, flirty, wet shirt, wet, raining, highly detailed face, realistic face, beautiful detailed eyes, fantasy art, in the style of greg rutkowski, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing, vibrant,
+colossal orange viking royal king tabby cat, golden hour, fantasy, vivid colors, sharp focus, digital art, hyper - realistic, 4 k, unreal engine, highly detailed, hd, dramatic lighting by brom, trending on artstation
+pencial drawing concept art of a machine mutant martial artist in the style of akira toriyama / hirohiko araki / tite kubo / masashi kishimoto trending on artstation deviantart pinterest detailed realistic hd 8 k high resolution
+goth rainbow bright, fantasy, d & d, intricate, detailed, by by alphonse mucha, adolfo hohenstein, alice russell glenny, stanley artgerm lau, greg rutkowski, detailed, trending on artstation, trending on artstation, smooth
+symmetry!! the eternal struggle of good and evil, very detailed, perfect lighting, perfect composition, 4 k, artstation, artgerm, derek zabrocki, greg rutkowski
+portrait of beautiful cute young goth girl with glasses, cyberpunk, high details, neon, art by ( ( ( kuvshinov ilya ) ) ) and wayne barlowe and gustav klimt and artgerm and wlop and william - adolphe bouguereau
+young Erin Gray as a ruggedly beautiful retro SCI-FI space heroine 1985 , intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and donato giancola and Joseph Christian Leyendecker, Ross Tran, WLOP
+a large colorful candy cane is sticking out the ground on the side of a serene foot path. there are some snow drifts laying against the candy. there are snow flurries in the air. epic, awe inspiring, dramatic lighting, cinematic, extremely high detail, photorealistic, cinematic lighting, trending on artstation cgsociety rendered in unreal engine, 4 k, hq,
+a hyper - detailed 3 d render like a oil painting of the construction of a upward spiral, surrealism!!!!! surreal concept art, lifelike, photorealistic, digital painting, aesthetic, smooth, sharp focus, artstation hd, by greg rutkowski, bruce pennington, valentina remenar and asher duran,
+a octane render of a violent tornado inside a jar, close - up studio photo, studio lighting, path traced, highly detailed, high quality, hyperrealistic, concept art, digital art, trending on artstation, cinematic, high coherence, epic scene, 8 k hdr, high contrast
+portrait of a young, ruggedly handsome ranger, muscular, half body, leather, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+the avengers fighting thanos, long shadow, warm colors, by Greg Rutkowski, artstation
+a person looking like vladimir putin riding giant steel krab, masterpiece, intricate, elegant futuristic wardrobe, highly detailed, digital painting, artstation, concept art, crepuscular rays, smooth, sharp focus, illustration, background galaxy, cyberpunk colors, volumetric lighting, art by artgerm and james jean and nick sullo
+book cover!!!!!!!!!!!!, old bridge, ivy vector elements at each border, fantasy forest landscape, fantasy magic, light night, intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by wlop and artgerm and ivan shishkin and andrey shishkin, masterpiece
+a modernist courtroom in the rainforest by raphael, hopper, and rene magritte. detailed, proportional, romantic, vibrant, enchanting, achingly beautiful, graphic print, trending on artstation, jungle, tropical, foliage, flowering, blooming
+portrait of a beautiful mysterious woman holding a bouquet of flowing flowers, hair flowing upwards, small bubbles from her mouth, hands hidden under the bouquet, submerged underwater filled with colorful small fish and coral reef, fantasy, regal, intricate, by stanley artgerm lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman rockwell
+portrait of a friendly charming formal barbarian giant noble!, imperial royal elegant clothing, elegant, rule of thirds, extremely detailed, artstation, concept art, matte, sharp focus, art by greg rutkowski, cover by artgerm
+goldfinger, character sheet, concept design, contrast, kim jung gi, greg rutkowski, zabrocki, karlkka, jayison devadas, trending on artstation, 8 k, ultra wide angle, pincushion lens effect
+a watercolor ink painting of scooby - doo as the primordial eldritch god of natural - disasters in the style of jean giraud in the style of moebius trending on artstation deviantart pinterest detailed realistic hd 8 k high resolution
+zidane and shrek wearing vr playing gta v, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by greg rutkowski and alphonse mucha
+full body portrait of marvel cinematic universe aaliyah haughton, she venom, spider man, elegant, webs, super hero, spider web background, highly detailed!! digital painting, artstation, glamor pose, concept art, sharp focus, illustration, art by artgerm and greg rutkowski, artey freytag
+demonic evil cute fourteen year old brown skinned asian girl, tomboy, evil smile, freckles!!!, fully clothed, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha,
+renfri, the princess turned gang leader from the witcher universe. fantasy art by greg rutkowski, gustave courbet, rosa bonheur, edward hopper. faithfully depicted facial expression, perfect anatomy, sharp focus, global illumination, radiant light, detailed and intricate environment, trending on artstation
+A chef with a big mustache proundly making a soup, digital painting, artstation, concept art, Craig Mullins, Breathtaking, 8k resolution, extremely detailed, beautiful, establishing shot, artistic, hyperrealistic, octane render, cinematic lighting, dramatic lighting, masterpiece, light brazen, extremely detailed and beautiful face
+a space realistic robot with big and cute eyes, | | very anime, fine - face, realistic shaded robotic parts, fine details. anime. realistic shaded lighting poster by ilya kuvshinov katsuhiro otomo ghost - in - the - shell, magali villeneuve, artgerm, jeremy lipkin and michael garmash, rob rey and kentaro miura style, trending on art station
+Portrait of The Most Beautiful old Woman On Earth , D&D, fantasy, intricate, richly detailed colored 3D illustration of a beautiful ornated cute body with long metallic hair wearing a hoodie and short skirt that is happy and curious smile. background with completely rendered reflections, art by Range Murata and Artgerm highly detailed, digital painting, trending on artstation, sharp focus, illustration, style of Stanley Artgerm, perfect smile and sexy mouth,
+lucifer cast out of heaven by yusuke murata and makoto shinkai, clouds, fire, angels, 8k, cel shaded, unreal engine, featured on artstation, pixiv
+A pirate ship in the middle of the sea during a storm, fantasy art, in the style of greg rutkowski, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing
+fox as a monkey, fluffy white fur, black ears, stunning green eyes, extremely long white tail with black tip, award winning creature portrait photography, extremely detailed, artstation, 8 k, sensual lighting, incredible art, wlop, artgerm
+autumn in french village, ornate, beautiful, atmosphere, vibe, mist, smoke, fire, chimney, rain, wet, pristine, puddles, melting, dripping, snow, creek, lush, ice, bridge, green, stained glass, forest, roses, flowers, by stanley artgerm lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, norman rockwell
+Keanu Reeves as spiderman , film still, muscle extremely detailed, fantastic details full face, mouth, trending on artstation, pixiv, cgsociety, hyperdetailed Unreal Engine 4k 8k ultra HD, WLOP
+symmetry!! portrait of elon musk with a salvador dali moustache intricate, neon lights, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski
+Movie still of danny devito as as Harry Potter in potions class at hogwarts, fantasy, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by artgerm and Tony Sart
+a planet that resembles a skull, stars in the background, natural, ultra detail. digital painting, beautiful, concept art, ethereal, cinematic, epic, 8k, highly detail, insane detailed, oil painting, octane render, cinematic lighting, smooth, sharp, Artstation, mystical, illustration, Trending on Artstation, Artstation HQ, Artstation HD, digital art,
+anthropomorphic art of a businessman dragon, green dragon, dragon head, portrait, victorian inspired clothing by artgerm, victo ngai, ryohei hase, artstation. fractal papers and books. highly detailed digital painting, smooth, global illumination, fantasy art by greg rutkowsky, karl spitzweg
+Billie Eilish, sitting in a cafe, fantasy, intricate, elegant, highly detailed, digital painting, pale skin, artstation, concept art, matte, sharp focus, illustration, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+a tornado made of fire on a field, au naturel, hyper detailed, digital art, trending in artstation, cinematic lighting, studio quality, smooth render, unreal engine 5 rendered, octane rendered, art style by klimt and nixeu and ian sprigger and wlop and krenz cushart
+a queue of grey people looking like perfect copies of each other, in the style of artgerm, gerald brom, atey ghailan and mike mignola, vibrant colors and hard shadows and strong rim light, plain background, comic cover art, trending on artstation
+vampire in the style of stefan kostic, realistic, full body shot, wide angle, sharp focus, 8 k high definition, insanely detailed, intricate, elegant, art by stanley lau and artgerm, floating embers
+fluffy cat in cowboy hat like a tiny girl riding on the back of a giant corgi, by greg rutkowski
+beautiful black woman elf wearing a dark green robe portrait, art nouveau, fantasy, intricate arcane wiccan designs, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by Artgerm and Greg Rutkowski and WLOP
+portrait of a wizard, intricate, highly detailed, digital painting, artstation, concept art, sharp focus, art by huifeng huang and greg rutkowski
+daredevil portrait, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and alphonse mucha
+a highly detailed illustration of tall beautiful red haired lady wearing black spaghetti strap noir style dress and sun hat, elegant stroking face pose, intricate, elegant, highly detailed, centered, digital painting, artstation, concept art, smooth, sharp focus, league of legends concept art, wlop.
+a hooded wise old man with a long white beard wearing a brown hooded tunic riding on top of a lion, the man riding is on the lion, the wise man is riding on top, he is all alone, majestic, epic digital art, cinematic, trending on artstation, superb detail 8 k, wide angle shot, masterpiece
+a lisa frank mcdonalds microwaved happymeal, gothic, highly detailed, digital painting, artstation, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha and william - adolphe bouguereau
+a beautiful highly detailed matte painting of a building looking like a goose by Jose Daniel Cabrera Pena and Leonid Kozienko, concept art by Tooth Wu and wlop and beeple and dan mumford and greg rutkowski and nekroxiii. octane render, cinematic, hyper realism, octane render, 8k, iridescent accents. vibrant, teal and gold blue red dark noir colour scheme
+a statue made of red marble, of an beautiful girl, full body shot, perfect body, red white biomechanical, inflateble shapes, wearing epic bionic cyborg implants, masterpiece, intricate, biopunk futuristic wardrobe, vogue, highly detailed, artstation, concept art, background galaxy, cyberpunk, octane render
+a ultradetailed beautiful panting of scarlett johansson as motoko kusanagi, by conrad roset, greg rutkowski and makoto shinkai, trending on artstation
+a detailed portrait of a weasel assassin dressed with a leather armor, by justin gerard and greg rutkowski, digital art, realistic painting, dnd, character design, trending on artstation
+a soldier zombie with a gas mask, pile of skulls, horror, black and white, fantasy art, monster art, illustration, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing
+yvonne strahovski, very sexy penguin outfit, medium shot, visible face, detailed face, perfectly shaded, atmospheric lighting, by makoto shinkai, stanley artgerm lau, wlop, rossdraws
+concept art of love, death + robots series of netflix, cinematic shot, oil painting by jama jurabaev, brush hard, artstation, for aaa game, high quality, brush stroke
+Ottoman Emperor George Washington, diffuse lighting, fantasy, intricate, elegant, highly detailed, lifelike, photorealistic, digital painting, Ottoman armor, artstation, illustration, concept art, smooth, sharp focus, art by John Collier and Albert Aublet and Krenz Cushart and Artem Demura and Alphonse Mucha
+Full potrait of cinead o'connor as an angel, hyper realistic, prismatic highlights, atmosphere, gorgeous, depth of field, cinematic, macro, concept art, 50mm, artstation, wlop, elegant, epic, weta digital, focus, octane render, v-ray, 8k, kodak portra, art by Liberatore
+a film still of of a woman explorer, ( emerald herald ), exploring lost ruins, sun lighting, water, finely detailed features, perfect art, at an ancient city, gapmoe yandere grimdark, trending on pixiv fanbox, painted by greg rutkowski makoto shinkai takashi takeuchi studio ghibli,, akihiko yoshida
+a very beautiful young yuuki asuna, full body, long wavy blond hair, sky blue eyes, full round face,, bikini, miniskirt, front view, mid - shot, highly detailed, cinematic wallpaper by stanley artgerm lau
+symmetry!! portrait of jair bolsonaro, sci - fi, tech wear, glowing lights!! intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+an art and technology t - shirt, digital art from artstation, by ruan jia and mandy jurgens and artgerm and william - adolphe bouguereau fantasy, epic digital art, volumetrc lighting, clean detail 8 k resolution
+gamora, portrait, digital painting, elegant, beautiful, highly detailed, artstation, concept art
+a mechanized version of a norse woman, facial piercings, very symmetrical, furry warrior's bone clothing, highly detailed, by vitaly bulgarov, joss nizzi, ben procter, steve jung, concept art, concept art world, pinterest, artstation, unreal engine
+photorealistic dwayne johnson but he is made of rocks. hyperdetailed photorealism, 1 0 8 megapixels, amazing depth, glowing rich colors, powerful imagery, 3 d finalrender, 3 d shading, cinematic lighting, artstation concept art
+anime young boy with short wavy white hair wearing white clothes with short cape surrounded by light orbs, moody, wlop, concept art, digital painting, trending on artstation, highly detailed, epic composition, 8 k uhd
+a photo of 8 k ultra realistic humanoid princess standing next to a beautiful view, ornate white and gold officers outfit, cinematic lighting, trending on artstation, 4 k, hyperrealistic, focused, extreme details, unreal engine 5, cinematic, masterpiece
+epic 3 d abstract model, liquid headdress, 2 0 mm, with pastel pink and cerulean peanut butter, melting smoothly into other faces, liquid, delicate, beautiful, intricate, houdini sidefx, trending on artstation, by jeremy mann and ilya kuvshinov, jamie hewlett and ayami kojima
+Idris Elba as Superman (2019), zac snyder, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+kneeling before a condescending queen, royal gown, golden detailing, medium shot, intricate, elegant, highly detailed, digital painting, volumetric light, artstation, concept art, smooth, sharp focus, illustration, art by Gil Elvgren and Greg Rutkowski and Alphonse Mucha, 8K
+a highly detailed and high technology alien spacecraft, centered, corals, plume made of geometry, water texture, wet, wet lighting, extremly detailed digital painting, sharp focus in the style of android jones, artwork of a futuristic artificial intelligence superstar with frames made of detailed circuits, mystical colors, rim light, beautiful lighting, 8 k, stunning scene, raytracing, octane, under water visual distortion, dark tones colors, trending on artstation
+metalhead, by yoshitaka amano, ruan jia, kentaro miura, artgerm, detailed, intricate details, trending on artstation, hd, masterpiece
+futuristic utopian city, central hub, white buildings, golden sunset, space ships, green trees, large flying drones, utopia, high quality, hopeful, beautiful design, scifi, high detail, global illumination, trending on artstation, art by richard dumont, leon tukker
+Fae teenage girl, portrait, face, long red hair, green highlights, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+charizard flying above new york, highly detailed matte fantasy painting, stormy lighting, by ross tran, by artgerm, by lisa frank, by brom, by peter mohrbacher
+Glowing glass jar with a pink tentacle in green liquid, macro, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, volumetric lighting, cinematic, illustration, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+jazz music, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by wlop, mars ravelo and greg rutkowski
+portrait of a rugged ranger, muscular, upper body, hairy torso, detailed detailed detailed hands hands hands hands, D&D, fantasy, bare bare bare bare thighs thighs thighs intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm
+an extremely psychedelic portrait of hunter s. thompson, surreal, lsd, face, detailed, intricate, elegant, lithe, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration
+greg manchess portrait painting of snufkin as overwatch character, medium shot, asymmetrical, profile picture, organic painting, nebula, matte painting, bold shapes, hard edges, street art, trending on artstation, by huang guangjian and gil elvgren and sachin teng
+portrait of abandoned ribbed sculpture of two kissing cyborgs, covered with tentacles, roots, wires, tubes, ash, mold, baroque painting, standing in a desolate empty wasteland, creepy, nightmare, dream-like heavy atmosphere, dark fog, surreal abandoned buildings, baroque painting, beautiful detailed intricate insanely detailed octane render trending on Artstation, 8K artistic photography, photorealistic, volumetric cinematic light, chiaroscuro, zoomed out, Raphael, Caravaggio, Beksinski, Giger
+underwater naga portrait, Pixar style, by Tristan Eaton Stanley Artgerm and Tom Bagshaw.
+priestess with angelical wings, golden hair, fluorescent eyes, white skin, lipstick, beautiful, goodness, high fantasy, illustration, by artgerm, greg rutkowski, alphonse mucha
+a warlock is casting a magic spell, with magic orb floating in his hand , dynamic pose, natural lighting, medium level shot, Mucha style , Grim fantasy, illustration ,concept art,
+portrait of the secretive vampire woman biker loner smiling at her cat, by yoshitaka amano, casey baugh, steve caldwell, gottfried helnwein, yasunari ikenaga, nico tanigawa, and artgerm rendered with 3 d effect.
+aliens in Jerusalem, concept art, hd
+many Alchemy Imperial legends knights super hero boys girl, sci-fi, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, fractal flame, amazing composition unreal engine
+concept art, of buffaloes on a beach, sunset, 30mm, canon, very hot, crowded, artstation
+portrait of john candy crying, john candy suffering, metaverse on fire, octane render, trending on artstation
+a portrait of a cyborg josip broz tito. vaporwave, intricate, epic lighting, cinematic composition, hyper realistic, 8 k resolution, unreal engine 5, by artgerm, tooth wu, dan mumford, beeple, wlop, rossdraws, james jean, marc simonetti, artstation
+anime portrait of the priestess in the forest, enchanted, magic, digital, concept art, Kyoto animation,last exile, blue submarine no. 6, katsura masakazu,tsutomu nihei, gustav klimt,loish, murata range, kawaii, studio lighting, manga, bright colors, anime,beautiful, 35mm lens,noir, vibrant high contrast, gradation, jean giraud, moebius, fantasy, rule of thirds, unreal engine, fibonacci, intricate, cel shaded, blender npr, flat, matte print, smooth, Ilya Kuvshinov, Tsuruta Kenji
+a large 1 8 th century pirate airship flying among the clouds, soaring through the sky, airship, digital art, pirate ship, vivid colors, artgerm, james gilleard, beautiful, highly detailed, intricate, trending on art station
+Taylor Swift Cosplaying Lola Bunny, modeling, posing, two piece workout clothes, training bra, quality lighting, vibrant colors, maximalism, facial details, photograph of Taylor Swift, Tooth Wu Artgerm WLOP artstation deviantart, 8k, fanart, playboy style, very very aesthetic
+a cartoon pineapple holding a large glass of port, nightclub, elegant, real life skin, intricate, high detailed, artstation, concept art, smooth, sharp focus, art by artgerm and greg rutkowski
+powerful goddess of water clothed in swirling water striding through a stormy sea, dress made of water, highly detailed matte fantasy painting, rendered in octane, stormy lighting, by ross tran, by artgerm, by david suh, by peter mohrbacher
+an extremely detailed matte painting emma watson as borg nine star trek, digital painting, beautiful eyes!, pretty face!!, symmetry, concept art, sharp focus, illustration, art by artgerm! greg rutkowski magali villeneuve wlop! ilya kuvshinov!!, octane render
+bruce campbell as harry potter in “ harry potter and the philosopher's stone ” ( 2 0 0 1 ). movie still detailed, smooth, sharp focus.
+a beautiful portrait of a tree goddess by Greg Rutkowski and Raymond Swanland, Trending on Artstation, ultra realistic digital art
+a beautiful masterpiece painting of the last poet whispering,'if all can begin again, then everything must continue!'by juan gimenez, long shiny black hair blue eyes, award winning, trending on artstation, photorealistic, hyperrealism, octane render, unreal engine
+cabin high on a mountain, the valley beneath, dynamic lighting, photorealistic fantasy concept art, trending on art station, stunning visuals, creative, cinematic, ultra detailed
+a painting so beautiful and universally loved it creates peace on earth, profound epiphany, trending on artstation, by john singer sargent
+Majestic powerfull red white Winged Hussars cavalry horde charging at ugly rainbow demons and trolls on ground, huge golden cross above them on the sky, white red eagle helping hussars, blood, snow, wide angle, professional kodak lenses, magic, fire, face painting, dramatic lighting, intricate, wild, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, footage from space camera
+bob ross!! riding a dinosaur, giant paintbrush in hand, model pose, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and alphonse mucha
+Concept art of a Toilet-Plunger designed by Apple Inc
+colorful medieval botanical garden, ornate, beautiful, atmosphere, vibe, mist, smoke, chimney, rain, well, wet, pristine, puddles, waterfall, melting, dripping, snow, ducks, creek, lush, ice, bridge, cart, forest, flowers, concept art illustration, color page, 4 k, tone mapping, akihiko yoshida, james jean, andrei riabovitchev, marc simonetti, yoshitaka amano, digital illustration, greg rutowski, volumetric lighting, sunbeams, particles, trending on artstation
+a synthwave cuber bokeh brain, tristan eaton, victo ngai, artgerm, rhads, ross draws
+portrait of korean beautiful female necromancer, face, dark fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+Intergalactic plant store floating in space with white twinkling stars in the foreground, galactic terrarium filled with plants from alien planets floating in the cosmos, Filled with plants, warm ethereal glowing ambiance, concept art 8k resolution
+portrait of a friendly charming formal barbarian giant noble!, imperial royal elegant clothing, elegant, rule of thirds, extremely detailed, artstation, concept art, matte, sharp focus, art by greg rutkowski, cover by artgerm
+artgerm, joshua middleton comic cover art, full body pretty even rachel wood faye, symmetrical eyes, symmetrical face, long curly black hair, beautiful forest, cinematic lighting
+a forgotten garden gnome in a vast barren desert, hopeless wasteland background with a relentless raging sun overhead, an ultrafine detailed painting by stanley artgerm lau, greg rutkowski, thomas kindkade, alphonse mucha, loish, trending on deviantart, pop surrealism, whimsical, lowbrow, perfect symmetrical face, grotesque
+Portrait of a tall beautiful brown-skin elf woman wearing stylish black and gold robes, warm smile, intricate, elegant, highly detailed, digital painting, smooth, sharp focus, artstation, graphic novel, art by stanley artgerm and greg rutkowski and peter mohrbacher,
+a giant broken robots in rain after a huge battle, tired, rustic, dormant, sharp focus, james gilleard, cinematic, game art, extremely detailed digital painting, print
+biolevel 4 secret lab, alien autopsy, wide angle, super highly detailed, professional digital painting, artstation, concept art, smooth, sharp focus, no blur, no dof, extreme illustration, unreal engine 5, photorealism, hd quality, 8 k resolution, cinema 4 d, 3 d, beautiful, cinematic, art by artgerm and greg rutkowski and alphonse mucha and loish and wlop
+Gertrude Abercrombie, minimalistic graffiti masterpiece, minimalism, 3d abstract render overlayed, black background, psychedelic therapy, trending on ArtStation, ink splatters, pen lines, incredible detail, creative, positive energy, happy, unique, negative space, face, artgerm
+a dramatic, epic, ethereal painting of a !!handsome!! thicc chunky beefy mischievous shirtless man with a big beer belly wearing a large belt and cowboy hat offering a whiskey bottle | he is relaxing by a campfire | background is a late night with food and jugs of whisky | homoerotic | stars, tarot card, art deco, art nouveau, intricate | by Mark Maggiori (((and Alphonse Mucha))) | trending on artstation
+: sphere sculpture covered with maze pattern,hyper detailed art station parabolic lighting contest winners unrealengine trending on artstation,cinematic, hyper realism, high detail, octane render, 8k
+starfinder lashunta pilot, wearing a flight suit, in a space port, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, art by artgerm and greg rutkowski
+concept art of a futuristic gold warrior, large gold apendages on it's back, with a black obsidian helmet, tight armor, rough and jagged design | | epic - fine - fine details by stanley artgerm lau, wlop, rossdraws, and sakimichan, trending on artstation, brush strokes
+a study of cell shaded cartoon of a monk on a skateboard with technical analysis charts in the background, illustration, wide shot, subtle colors, post grunge, concept art by josan gonzales and wlop, by james jean, Victo ngai, David Rubín, Mike Mignola, Laurie Greasley, highly detailed, sharp focus, alien, Trending on Artstation, HQ, deviantart, art by artgem
+potato house interior design, Greg Rutkowski, trending on Artstation, 8K, ultra wide angle, pincushion lens effect.
+a angry knight in full plate of black armor, splattered with blood, riding a large black war horse, with red glowing eyes flowing red mane and tail, blackened clouds cover sky, crackling with lightning, a castle in distance burns, concept art by greg rutkowski, craig mullins, todd mcfarlane,
+sexy painting of 3 5 0 - pound taylor swift, red bikini, navel piercing, ultra realistic, sharp details, subsurface scattering, intricate details, warm lighting, beautiful features, highly detailed, photorealistic, octane render, 8 k, unreal engine, art by artgerm and greg rutkowski and alphonse mucha
+the grand canyon filled with glowing futuristic cyberpunk skyscrapers at night with a starry sky, cinematic, wide angle establishing shot, fantasy, hyperrealism, greg rutkowski, tuomas korpi, volumetric light, octane render, photorealistic concept art, highly detailed, very intricate
+Dwight Shrute as blue man. digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and donato giancola and Joseph Christian Leyendecker, Ross Tran, WLOP
+death, dark fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, wallpaper, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+a realistic illustration portrait of a beautiful cute girl with wavy black red hair, a pointy nose and, round chin black eyeliner, green pupills, trending on artstation, hyper - realistic lighting, intricate by imagineartforyou
+looking out to see a long wood dock on the water, child at end of dock, big fishing boat leaving the dock with sailors waving, low angle, long lens, sunset, a mediterranean phoenician fishing village in the distance, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and raphael lacoste and magali villeneuve
+splash art for new champion for league of legend, by riot games. trending on artstation
+Anime as Elizabeth Olsen playing Scarlet Witch || cute-fine-face, pretty face, realistic shaded Perfect face, fine details. Anime. realistic shaded lighting poster by Ilya Kuvshinov katsuhiro otomo ghost-in-the-shell, magali villeneuve, artgerm, Jeremy Lipkin and Michael Garmash and Rob Rey as Scarlet Witch in New York cute smile
+a portrait of apocalypse from x - men, fantasy, sharp focus, intricate, elegant, digital painting, artstation, matte, highly detailed, concept art, illustration, ambient lighting, art by ilya kuvshinov, artgerm, alphonse mucha, and greg rutkowski
+portrait of radical lolita girl, dreamy and ethereal and dark, dark eyes, smiling expression, ornate goth dress, dark fantasy, chaotic, elegant, black crows flying, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+photography of man playing realistic virtual reality game in a giant mine with excavators and gnomes 3 d realistic model render in the style of zaha hadid with point cloud in the middle, in cyberpunk 2 0 7 7 colors, unreal engine 5, keyshot, octane, artstation trending, ultra high detail, ultra realistic, cinematic, 8 k, 1 6 k, in style of zaha hadid, in style of nanospace michael menzelincev, in style of lee souder, in plastic, dark atmosphere, tilt shift, depth of field
+greg manchess portrait painting of armored starlord as overwatch character, medium shot, asymmetrical, profile picture, organic painting, sunny day, matte painting, bold shapes, hard edges, street art, trending on artstation, by huang guangjian and gil elvgren and sachin teng
+full lenght shot, super hero pose, biomechanical dress, inflateble shapes, wearing epic bionic cyborg implants, masterpiece, intricate, biopunk futuristic wardrobe, highly detailed, art by akira, mike mignola, artstation, concept art, background galaxy, cyberpunk, octane render
+alterd carbon, masked angel protecting girl and a woman, vampre the masquerade, neon, detailed intricate render, dark atmosphere, detailed illustration, hd, 4 k, digital art, overdetailed art, surrealistic, by greg rutkowski, by loish, complementing colors, trending on artstation, deviantart
+fullbody portrait of a beautiful girl dressed in cyberpunk style, standing on street, holding a sniper rifle. by riot games, anime style, masterpiece, award - winning, trending on artstation and pixiv
+thief red riding hood, d & d, fantasy, portrait, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+anthropomorphic highly detailed group portrait of funny neon giant cute eyes dust mephit, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm, bob eggleton, michael whelan, stephen hickman, richard corben, wayne barlowe, trending on artstation and greg rutkowski and alphonse mucha, 8 k
+beautiful girl galaxy background, portrait character concept style trending on artstation concept art detailed octane render cinematic photo - realistic 8 k high detailed
+realistic render of flying blue whales towards the moon, intricate, toy, sci - fi, extremely detailed, digital painting, sculpted in zbrush, artstation, concept art, smooth, sharp focus, illustration, chiaroscuro lighting, golden ratio, incredible art by artgerm and greg rutkowski and alphonse mucha and simon stalenhag
+a highly detailed epic cinematic concept art CG render digital painting artwork: Steampunk Wizard stands and looks at the Tower of Babel in the distance. By Greg Rutkowski, in the style of Francis Bacon and Syd Mead and Norman Rockwell and Beksinski, open ceiling, highly detailed, painted by Francis Bacon and Edward Hopper, painted by James Gilleard, surrealism, airbrush, Ilya Kuvshinov, WLOP, Stanley Artgerm, very coherent, triadic color scheme, art by Takato Yamamoto and James Jean
+devastated scorched earth in the valley, burnt trees, burnt vegetation and grass, cinematic view, epic sky, detailed, concept art, low angle, high detail, warm lighting, volumetric, godrays, vivid, beautiful, trending on artstation, by jordan grimmer, huge scene, grass, art greg rutkowski
+portrait Ninja gaiden girl, armored black and red ninja wardrobe, in ruin japanese rainny temple night, ssci-fi and fantasy, intricate and very very beautiful and elegant, highly detailed, digital painting, artstation, concept art, smooth and sharp focus, illustration, art by tian zi and WLOP and alphonse mucha
+girl jumping near a lake, rainy, touching a long neck monster, illustration concept art anime key visual trending pixiv fanbox by wlop and greg rutkowski and makoto shinkai and studio ghibli
+beautiful digital painting of a hoyeon jung stylish female snow - covered mountains with high detail, real life skin, freckles, 8 k, stunning detail, works by artgerm, greg rutkowski and alphonse mucha, unreal engine 5, 4 k uhd
+a potrait of a human rogue, fine details. night setting. realistic shaded lighting poster by ilya kuvshinov katsuhiro, artgerm, jeremy lipkin and michael garmash, unreal engine, radiant light, detailed and intricate environment, digital art, trending on art station
+portrait full body girl 3 kingdom breathtaking detailed concept art painting art deco pattern of birds goddesses amalmation flowers head thibetan temple, by hsiao ron cheng, tetsuya ichida, bizarre compositions, tsutomu nihei, exquisite detail, extremely moody lighting, 8 k, art nouveau, old chines painting, art nouveau
+greg manchess portrait painting of armored sanguinius with huge wings as overwatch character, medium shot, asymmetrical, profile picture, organic painting, sunny day, matte painting, bold shapes, hard edges, street art, trending on artstation, by huang guangjian and gil elvgren and sachin teng
+rendering of old hands reaching forward, concept art, high detail, intimidating, cinematic, Artstation trending, octane render
+dynamic photography portrait of a dungeons and dragons king's colosse , intricate ornate armor, subject in the middle of the frame, rule of thirds, golden ratio, elegant, digital painting, octane 4k render, zbrush, hyperrealistic, artstation, concept art, smooth, sharp focus, illustration from Warcraft by Ruan Jia and Mandy Jurgens and Artgerm and William-Adolphe Bouguerea
+anthropomorphic triangle brain in edgy darkiron badger demon, intricate, elegant, highly detailed animal monster, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm, dwayne barlowe, trending on artstation and greg rutkowski and alphonse mucha, 8 k
+christiano ronaldo, manga cover art, detailed color portrait, artstation trending, 8 k, greg rutkowski
+a faceless!!!!! woman posing for the camera, charcoal painting!!!!! illustrated by kathe kollwitz, trending on artstation, 4 k, 8 k, artstation hd, artstation hq, artistic interpretation, 1 9 5 0 s style
+a potrait of a female necromancer with big and cute eyes, fine - face, realistic shaded perfect face, fine details. night setting. very anime style. realistic shaded lighting poster by ilya kuvshinov katsuhiro, magali villeneuve, artgerm, jeremy lipkin and michael garmash, rob rey and kentaro miura style, trending on art station
+fractal tarot card of a naturepunk retrofuture nexus of technology and earth, beautiful detailed realistic cinematic character high concept fashion portrait, hi - fructose art magazine, by anton fadeev and paul lehr and david heskin and josan gonzalez, 8 k
+realistic high key portrait rendering of a beautiful curvy pale alabaster goth girl with asymmetrical punk rock hair and badass euro design sunglasses. mole on cheek. half portrait by stanley artgerm, dramatic lighting, by tohuvabohu, nagel, shin jeongho, nick silva and ilya kuvshinov, deviantart, detailed character design, 8 k resolution
+the world serpent ultra detailed fantasy, elden ring, realistic, dnd character portrait, full body, dnd, rpg, lotr game design fanart by concept art, behance hd, artstation, deviantart, global illumination radiating a glowing aura global illumination ray tracing hdr render in unreal engine 5
+helmet of a forgotten deity with a labyrinth, in the style of tomasz alen kopera and fenghua zhong and peter mohrbacher, mystical colors, rim light, beautiful lighting, 8 k, stunning scene, raytracing, octane, trending on artstation
+duotone psychedelic concept illustration 3 / 4 portrait of dr. albert hofmannn taking bicycle trip fractals background. cinematic scene. vlumetric lighting. golden rario accidental renaissance. by sachin teng and sergey kolesov and ruan jia and heng z. graffiti art, scifi, fantasy, hyper detailed. octane render. concept art. trending on artstation
+A very tall, slender woman wearing black puffy clothes and holding a yellow umbrella, sharp focus, intricate, elegant, digital painting, artstation, matte, highly detailed, concept art, illustration, ambient lighting, art by artgerm, Alphonse mucha, and Greg Rutkowski
+vibrant! colorful!!! the last supper of simpsons by rene magritte, futurama by laurie greasley and bouguereau, ( ( etching by gustave dore ) ), ultraclear intricate, sharp focus, highly detailed digital painting illustration, concept art, masterpiece
+award winning brandmark for a research lab, mind wandering, hip corporate, no text, trendy, vector art, concept art
+an epic non - binary model, subject made of white mesh rope, with cerulean and pastel pink bubbles bursting out, delicate, beautiful, intricate, melting into a wolf, houdini sidefx, by jeremy mann and ilya kuvshinov, jamie hewlett and ayami kojima, trending on artstation, bold 3 d
+character concept of iridescent sinewy smooth muscular male sleek glossy indigo black pearlescent scifi armor with smooth black onyx featureless helmet, by greg rutkowski, mark brookes, jim burns, tom bagshaw, magali villeneuve, trending on artstation
+phil noto, peter mohrbacher, thomas kinkade, artgerm, 1 9 5 0 s rockabilly anya taylor - joy catwoman dc comics, pompadour, long hair, vines, symmetrical eyes, city rooftop
+dark high detailed space station interior a statue jesus on cross made of white marble, perfect symmetrical body, full body shot, inflateble shapes, wires, tubes, veins, jellyfish, white biomechanical details, wearing epic bionic cyborg implants, masterpiece, intricate, biopunk, vogue, highly detailed, artstation, concept art, cyberpunk, octane render
+Portrait of a man by Greg Rutkowski, a young, strong and hard-eyed futuristic warrior with brown hair with dreadlocks, wearing a futuristic space tactical gear that looks like a mix between the samurai, viking and templar aesthetics, mix between tribal and hi-tech, highly detailed portrait, scifi, space opera, digital painting, artstation, concept art, smooth, sharp foccus ilustration, Artstation HQ
+epic professional digital art of a snail in a blue professional business suit, sitting at a desk, best on artstation, cgsociety, wlop, Behance, pixiv, astonishing, impressive, outstanding, epic, cinematic, stunning, gorgeous, much detail, much wow, masterpiece
+hyper realistic oil painting of frozen little island planet with waterfall, rising in the air, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+digitalart scifi!!! wallpaper trending on artstation
+concept art of car designed by jony ive, jama jurabaev, science fiction, brush hard, artstation, cgsociety, high quality, brush stroke
+A beautiful oil cartoony painting of a happy Remi Malek riding a tricycle by Lucas Graciano, Frank Frazetta, Greg Rutkowski, Boris Vallejo, epic fantasy character art, high fantasy, Exquisite detail, post-processing, low angle, masterpiece, cinematic
+female priest in white cloak, ultra detailed fantasy, dndbeyond, bright, colourful, realistic, dnd character portrait, full body, pathfinder, pinterest, art by ralph horsley, dnd, rpg, lotr game design fanart by concept art, behance hd, artstation, deviantart, hdr render in unreal engine 5
+a beautiful portrait of death goddess by Greg Rutkowski and Raymond Swanland, ominous background, Trending on Artstation, ultra realistic digital art
+cyborg drug addict, diffuse lighting, fantasy, intricate, elegant, highly detailed, lifelike, photorealistic, digital painting, artstation, illustration, concept art, smooth, sharp focus, art by John Collier and Albert Aublet and Krenz Cushart and Artem Demura and Alphonse Mucha
+a well designed portrait of viper, detailed, realistic, sketch style, artstation, greg rutkowski, 8 k resolution.
+Moira Stewart as Warhammer 40k Battle Sister, portrait, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+hu tao from genshin impact, hu tao, perfect face, collaborative painting by greg ruthowski, ruan jia, artgerm, highly detailed, complex, exquisite and beautiful, 4 k, 8 k, artstation
+rockstar girl playing electric guitar on stage. by amano yoshitaka, by rembrandt, digital art, digital painting, artstation trending, unreal engine
+beautiful small cyberpunk robot-owl in the deep jungle, with neon color eyes, cinematic view, 8k, ultra realistic, vibrant colors, photo realism, trending artstation, octane render, volumetric lighting, high contrast, intricate, highly detailed, digital painting
+john lennon as jack the ripper, ultra realistic, concept art, intricate details, highly detailed, photorealistic, octane render, 8 k, unreal engine, art by frank frazetta, simon bisley, brom
+lux, from league of legends, au naturel, hyper detailed, digital art, trending in artstation, cinematic lighting, studio quality, smooth render, unreal engine 5 rendered, octane rendered, art style by klimt and nixeu and ian sprigger and wlop and krenz cushart
+Anime art of beautiful Hatsune miku with beautifel legs by artgerm, ross tran, magali villeneuve, Greg Rutkowski, Gil Elvgren, Alberto Vargas, Earl Moran,, Art Frahm, Enoch Bolles
+Daniel Radcliffe wearing a monks tunic holding a glowing fire magical staff. Trending on Artstation, octane render, ultra detailed, art by Ross tran
+an super mega hyper realistic image of a super soldier with a Ukrainian blue and yellow stripes flag standing in the beam of light from the clouds on a pile of skulls as a winner, masculine figure, D&D, fantasy, intricate, elegant, highly detailed, extremely detailed, digital painting, artstation, concept art, matte, sharp focus, symmetrical, illustration, art by Artgerm and Greg Rutkowski and Alphonse Mucha
+poster woman with futuristic streetwear and hairstyle, open jacket, cute face, symmetrical face, 3/4 angle, pretty, beautiful, elegant, Anime by Kuvshinov Ilya, Cushart Krentz and Gilleard James, 4k, HDR, Trending on artstation, Behance, Pinterest
+man with fluffy pipidastr, atmosphere, glow, detailed, intricate, full of colour, cinematic lighting, trending on artstation, 4 k, hyperrealistic, focused, extreme details, unreal engine 5, cinematic, masterpiece, moody lighting, by greg rutkowski, wlop, artgerm, trending on artstation, concept art, sharp focus, ray tracing
+a cinematic scene from the cthulhu in pyrrhic victory, concept art by beksinski and jean delville, dramatic lighting, ultra hd, hdr, 8 k
+indistinct glowing prehistoric beasts surrounded by slate grey walls, insane details, dramatic lighting, unreal engine 5, concept art, greg rutkowski, james gurney, johannes voss, hasui kawase.
+a full body portrait of a beautiful post apocalyptic offworld nordic desert snake charmer dancing playfully by the waterfalls, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by krenz cushart and artem demura and alphonse mucha
+dia de los muertos theme poster art by artemio rodriguez, aida muluneh, and gustave bauman, intricate, accurate facial details, profile picture, artgerm, retro, nostalgic, old fashioned, posterized color
+digital artwork, illustration, cinematic camera, a cyborg pilot in the cockpit of a mech, intricate machinery, biomechanics, the ghosts in the machine, cyberpunk concept art by artgerm and Guy Denning and Greg Rutkowski and Ruan Jia, highly detailed, intricate, sci-fi, sharp focus, Trending on Artstation HQ, deviantart
+breathtaking detailed soft painting of a grim reaper with an intricate golden scythe and cloak of fireflies and embers, rembrandt style, detailed art nouveau stained glass of flames background, christian saint rosace, elegant, highly detailed, artstation, concept art, matte, sharp focus, art by Tom Bagshaw, Artgerm and Greg Rutkowski
+portrait of mischievous, enigmatic!!, dangerous youngster Galadriel (Cate Blanchett) as a queen of elves, dressed in a refined silvery garment. The background is a dark, chilling eastern european forrest. night, horroristic shadows, blue tones, higher contrasts, (((lumnious))), theatrical, character concept art by ruan jia, (((thomas kinkade))), and J.Dickenson, trending on Pinterest, ArtStation
+portrait painting of a bloodied serial killer wearing a hello kitty mask, ultra realistic, concept art, intricate details, eerie, highly detailed, photorealistic, octane render, 8 k, unreal engine. art by artgerm and greg rutkowski and alphonse mucha
+An old man trapped in a cave, looking into a mirror, b&w, fantasy art, in the style of masami kurumada, illustration, epic, fantasy, intricate, hyper detailed, artstation, concept art, smooth, sharp focus, ray tracing
+close-up macro portrait of the face of a beautiful princess with animal skull mask, epic angle and pose, ribcage bones symmetrical artwork, 3d with depth of field, blurred background, cybernetic jellyfish female face skull phoenix bird, translucent, nautilus, energy flows of water and fire. a highly detailed epic cinematic concept art CG render. made in Maya, Blender and Photoshop, octane render, excellent composition, cinematic dystopian brutalist atmosphere, dynamic dramatic cinematic lighting, aesthetic, very inspirational, arthouse. y Greg Rutkowski, Ilya Kuvshinov, WLOP, Stanley Artgerm Lau, Ruan Jia and Fenghua Zhong
+poseidon humanoid god of the sea, trident, highly detailed, d & d, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and magali villeneuve
+grainy and distorted xerox of a classified scientific government chart diagram a portal to a higher dimension photorealistic 4k photorealism realistic textures sharpened x-files fringe mystery sci-fi cinematic detailed texture hyperdetailed CIA agency NSA DOD government seal redacted continuous feed paper smooth, sharp focus, illustration, from Metal Gear, Greg Rutkowski and Artgerm artgerm
+martin shkreli in attack on titan, medium shot close up, details, sharp focus, illustration, by jordan grimmer and greg rutkowski, trending artstation, pixiv, digital art
+painting Daft Punk in long coat, elegant, intricate, headshot, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+colleen moore 2 2 years old, bob haircut, portrait painted by stanley artgerm, casting long shadows, resting head on hands, by ross tran
+hyper realistic photography of a stunningly beautiful sphere, self assembly, ribbons, glowing consciences, growing tendrils, hand in the style of beth cavener, jin kagetsu,, and wlop, highly detailed, intricate filigree, symmetry, masterpiece, award winning, sharp focus, concept art, highkey lighting, ambient lighting, octane render, 8 k, artstation
+a photo of larry david playing poker while smoking highly detailed, dim volumetric lighting, 8k, post-processing, soft painting, trending on artstation, concept art, smooth, sharp focus, illustration,by Tom Bagshaw and Daniel Gerhartz and Albert Aublet and Lawrence Alma-Tadema and alphonse mucha
+symmetry!! portrait of space soldier, tech wear, scifi, glowing lights!! intricate elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha
+portrait of female android, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by fra angelico
+jason fused with a poltergeist lovecraft nervous space demon, spiky skin, creepy, melting, big eyes, photo, portrait, 3 d, high details, intricate details, by vincent di fate, artgerm julie bell beeple, 9 0 s, smooth gradients, volumetric lightning, high contrast, duo tone, depth of field, very coherent symmetrical artwork
+portrait of a fat blue alien. big friendly smile. character concept art. science fiction illustration. close up of the face. key panel art graphic novel. detailed face, beautiful colour palette. digital painting.
+people with posters attacking cops in front a huge blue spiral - shaped white luminous attractor that is floating on the horizon near the sun and stores in los angeles with light screens all over the street, concept art, art for the game, professional lighting, dark night lighting from streetlights
+Lofi cyberpunk portrait beautiful woman with short brown curly hair, roman face, Romanesque, unicorn, rainbow, floral, Pixar style, Tristan Eaton, Stanley Artgerm, Tom Bagshaw
+dog eat dog world , made by Stanley Artgerm Lau, WLOP, Rossdraws, ArtStation, CGSociety, concept art, cgsociety, octane render, trending on artstation, artstationHD, artstationHQ, unreal engine, 4k, 8k,
diff --git a/demo/Diffusion/calibration.py b/demo/Diffusion/calibration.py
new file mode 100644
index 00000000..98adb6d3
--- /dev/null
+++ b/demo/Diffusion/calibration.py
@@ -0,0 +1,177 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import types
+from typing import Callable, Optional, Union
+
+import numpy as np
+import torch
+import torch.distributed as dist
+import torch.nn as nn
+from torch.distributed import ReduceOp
+from utilities import PercentileAmaxes
+
+from ammo.torch.quantization.model_calib import (
+ enable_stats_collection,
+ finish_stats_collection,
+ max_calibrate,
+)
+from ammo.torch.quantization.utils import is_quantized_linear
+
+
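+# Percentile-based amax calibration: the overridden compute_amax returns the element-wise
+# minimum over the first `total_step * percentile` entries of the tracked amax history
+# (falling back to the calibrator's own amax when history is not tracked).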
+def precentile_calib_mode(base_unet, quant_config={}):
+ def compute_amax(self, all_reduce=True):
+ """Return the absolute max of all tensors collected."""
+ if (
+ self._calib_amax is not None
+ and all_reduce
+ and dist.is_available()
+ and dist.is_initialized()
+ and dist.get_world_size() > 1
+ ):
+ tmp_amax = self._calib_amax.clone()
+ dist.all_reduce(tmp_amax, op=ReduceOp.MAX)
+ self._calib_amax.copy_(tmp_amax)
+ if self._track_amax:
+ up_lim = int(self._amaxs.total_step * self._amaxs.percentile)
+ if up_lim <= 0:
+ up_lim = 1
+ amaxs_values = [self._amaxs.data[i] for i in range(0, up_lim)]
+ act_amax = (
+ torch.tensor(np.vstack(amaxs_values).min(axis=0))
+ .float()
+ .squeeze(0)
+ .to(self._calib_amax.device)
+ .to(self._calib_amax.dtype)
+ )
+ return act_amax
+ return self._calib_amax
+
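+    # Attach a PercentileAmaxes history tracker to every Linear/Conv2d input quantizer
+    # and bind the percentile-aware compute_amax override to its calibrator.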
+ for _, module in base_unet.named_modules():
+ if isinstance(module, (nn.Linear, nn.Conv2d)):
+ module.input_quantizer._calibrator._track_amax = True
+ module.input_quantizer._calibrator._amaxs = PercentileAmaxes(
+ total_step=quant_config["base-step"], percentile=quant_config["percentile"]
+ )
+ module.input_quantizer._calibrator.compute_amax = types.MethodType(
+ compute_amax, module.input_quantizer._calibrator
+ )
+
+
+@torch.no_grad()
+def smoothquant(model, forward_loop=None):
+    """
+    Rewrite of the original SmoothQuant calibration method: run max calibration first,
+    then fold a per-channel smoothing scale into the weights of each quantized Linear module.
+    """
+ assert forward_loop is not None, "forward_loop must be provided for smoothquant"
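+    # First pass: standard max calibration to collect activation amax statistics.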
+ max_calibrate(model, forward_loop)
+
+ smoothed_modules = 0
+ for name, module in model.named_modules():
+ if is_quantized_linear(module):
+ if not hasattr(module.input_quantizer, "_amax"):
+                print(f"Warning: {name} is not calibrated, skipping smoothing")
+ continue
+ if module.input_quantizer.num_bits != 8 or module.weight_quantizer.num_bits != 8:
+                print(f"Warning: only int8 smoothing is supported, skipping {name}")
+ continue
+ if module.input_quantizer.axis != -1:
+                print(f"Warning: only per-channel smoothing is supported, skipping {name}")
+ continue
+
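+            # SmoothQuant migration strength: larger alpha shifts more of the quantization
+            # difficulty from activations to weights; reg_alpha_qkv can override it per module.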
+ alpha = 1.0
+ if hasattr(module, "alpha"):
+ alpha = module.alpha
+ assert (
+ module.input_quantizer._amax.numel() > 1
+ ), f"Error: {name} has only one channel to smooth"
+
+ # It is important to keep scaling math in fp32 to be numerically safe
+ act_amax = module.input_quantizer.amax.float()
+
+ act_device = act_amax.device
+
+            # If the model is split across devices, this tensor may be on the wrong one
+ act_amax = act_amax.to(module.weight.device)
+
+ weight_scale = module.weight.abs().max(dim=0, keepdim=True)[0]
+ scale_a = (weight_scale.pow(1 - alpha) / act_amax.pow(alpha)).squeeze()
+
+            # Some channels could have a 0 amax, which causes scale_a to overflow. Explicitly mask them out here
+ epsilon = 1.0 / (1 << 31)
+ if act_amax.min() <= epsilon:
+ zero_mask = act_amax <= epsilon
+ scale_a[zero_mask] = 1
+ inv_scale_a = 1.0 / scale_a
+ inv_scale_a = inv_scale_a.squeeze()[None, :]
+
+ # Use per-tensor quantization for activation, add a pre-quantization scale vector
+ module.input_quantizer.pre_quant_scale = scale_a.to(module.weight.dtype).to(act_device)
+ module.input_quantizer._axis = None
+ delattr(module.input_quantizer, "_amax")
+ module.input_quantizer.amax = torch.tensor(
+ (act_amax * scale_a).max().item(),
+ dtype=module.weight.dtype,
+ device=module.weight.device,
+ )
+
+ # Multiply weight by inv_scale_a and recalibrate
+ module.weight.detach().copy_(
+ (module.weight.float() * inv_scale_a).to(module.weight.dtype)
+ )
+
+ enable_stats_collection(module.weight_quantizer)
+ module.weight_quantizer(module.weight)
+ finish_stats_collection(module.weight_quantizer)
+
+ smoothed_modules += 1
+ print(f"Smoothed {smoothed_modules} modules")
+
+
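+# Dispatch helper: pick the calibration algorithm ("max" or "smoothquant") and run it
+# using the provided forward loop over calibration data.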
+def calibrate(
+ model: nn.Module,
+ algorithm: Union[str, dict, None] = "max",
+ forward_loop: Optional[Callable] = None,
+) -> None:
+ if algorithm is None:
+ return
+
+ if isinstance(algorithm, str):
+ kwargs = {}
+ elif isinstance(algorithm, dict):
+ kwargs = algorithm.copy()
+ algorithm = kwargs.pop("method")
+ else:
+ raise TypeError(f"Unsupported type for algorithm: {type(algorithm)}")
+
+ if algorithm == "smoothquant":
+ smoothquant(model, forward_loop)
+ elif algorithm == "max":
+ max_calibrate(model, forward_loop)
+ else:
+ raise ValueError(f"Unsupported calibration algorithm: {algorithm}")
+
+
+def reg_alpha_qkv(base_unet, alpha):
+    """
+    Apply the SmoothQuant alpha override only to attention QKV projection layers
+    (Linear modules whose names contain to_q, to_k or to_v).
+    """
+ for name, module in base_unet.named_modules():
+ if isinstance(module, torch.nn.Linear):
+ if "to_q" in name or "to_k" in name or "to_v" in name:
+ module.alpha = alpha
+
diff --git a/demo/Diffusion/demo_controlnet.py b/demo/Diffusion/demo_controlnet.py
new file mode 100644
index 00000000..a730935d
--- /dev/null
+++ b/demo/Diffusion/demo_controlnet.py
@@ -0,0 +1,123 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import argparse
+
+import controlnet_aux
+import torch
+from cuda import cudart
+from PIL import Image
+
+from stable_diffusion_pipeline import StableDiffusionPipeline
+from utilities import PIPELINE_TYPE, TRT_LOGGER, add_arguments, download_image, process_pipeline_args
+
+def parseArgs():
+ parser = argparse.ArgumentParser(description="Options for Stable Diffusion ControlNet Demo", conflict_handler='resolve')
+ parser = add_arguments(parser)
+ parser.add_argument('--scheduler', type=str, default="UniPC", choices=["DDIM", "DPM", "EulerA", "LMSD", "PNDM", "UniPC"], help="Scheduler for diffusion process")
+    parser.add_argument('--input-image', nargs='+', type=str, default=[], help="Path to the input image(s) already prepared for the ControlNet modality, e.g. a Canny edge map for the Canny ControlNet rather than a regular RGB image")
+ parser.add_argument('--controlnet-type', nargs='+', type=str, default=["canny"], help="Controlnet type, can be `None`, `str` or `str` list from ['canny', 'depth', 'hed', 'mlsd', 'normal', 'openpose', 'scribble', 'seg']")
+ parser.add_argument('--controlnet-scale', nargs='+', type=float, default=[1.0], help="The outputs of the controlnet are multiplied by `controlnet_scale` before they are added to the residual in the original unet, can be `None`, `float` or `float` list")
+ return parser.parse_args()
+
+if __name__ == "__main__":
+ print("[I] Initializing StableDiffusion controlnet demo using TensorRT")
+ args = parseArgs()
+
+ # Controlnet configuration
+ if not isinstance(args.controlnet_type, list):
+ raise ValueError(f"`--controlnet-type` must be of type `str` or `str` list, but is {type(args.controlnet_type)}")
+
+    # ControlNet scale configuration
+    if not isinstance(args.controlnet_scale, list):
+        raise ValueError(f"`--controlnet-scale` must be of type `float` or `float` list, but is {type(args.controlnet_scale)}")
+
+    # Check that the number of ControlNets matches the number of ControlNet scales
+    if len(args.controlnet_type) != len(args.controlnet_scale):
+        raise ValueError(f"Number of ControlNets ({len(args.controlnet_type)}) must equal the number of ControlNet scales ({len(args.controlnet_scale)}).")
+
+ # Convert controlnet scales to tensor
+ controlnet_scale = torch.FloatTensor(args.controlnet_scale)
+
+ # Check images
+ input_images = []
+ if len(args.input_image) > 0:
+ for image in args.input_image:
+ input_images.append(Image.open(image))
+ else:
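+        # No conditioning images were provided: download a sample image for each requested
+        # ControlNet type and run the matching controlnet_aux preprocessor on it.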
+ for controlnet in args.controlnet_type:
+ if controlnet == "canny":
+ canny_image = download_image("https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png")
+ canny_image = controlnet_aux.CannyDetector()(canny_image)
+ input_images.append(canny_image.resize((args.height, args.width)))
+ elif controlnet == "normal":
+ normal_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-normal/resolve/main/images/toy.png")
+ normal_image = controlnet_aux.NormalBaeDetector.from_pretrained("lllyasviel/Annotators")(normal_image)
+ input_images.append(normal_image.resize((args.height, args.width)))
+ elif controlnet == "depth":
+ depth_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-depth/resolve/main/images/stormtrooper.png")
+ depth_image = controlnet_aux.LeresDetector.from_pretrained("lllyasviel/Annotators")(depth_image)
+ input_images.append(depth_image.resize((args.height, args.width)))
+ elif controlnet == "hed":
+ hed_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-hed/resolve/main/images/man.png")
+ hed_image = controlnet_aux.HEDdetector.from_pretrained("lllyasviel/Annotators")(hed_image)
+ input_images.append(hed_image.resize((args.height, args.width)))
+ elif controlnet == "mlsd":
+ mlsd_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-mlsd/resolve/main/images/room.png")
+ mlsd_image = controlnet_aux.MLSDdetector.from_pretrained("lllyasviel/Annotators")(mlsd_image)
+ input_images.append(mlsd_image.resize((args.height, args.width)))
+ elif controlnet == "openpose":
+ openpose_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-openpose/resolve/main/images/pose.png")
+ openpose_image = controlnet_aux.OpenposeDetector.from_pretrained("lllyasviel/Annotators")(openpose_image)
+ input_images.append(openpose_image.resize((args.height, args.width)))
+ elif controlnet == "scribble":
+ scribble_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-scribble/resolve/main/images/bag.png")
+ scribble_image = controlnet_aux.HEDdetector.from_pretrained("lllyasviel/Annotators")(scribble_image, scribble=True)
+ input_images.append(scribble_image.resize((args.height, args.width)))
+ elif controlnet == "seg":
+ seg_image = download_image("https://huggingface.co/lllyasviel/sd-controlnet-seg/resolve/main/images/house.png")
+ seg_image = controlnet_aux.SamDetector.from_pretrained("ybelkada/segment-anything", subfolder="checkpoints")(seg_image)
+ input_images.append(seg_image.resize((args.height, args.width)))
+ else:
+                raise ValueError(f"Conditioning image preparation is not implemented for ControlNet type: {controlnet}")
+ assert len(input_images) > 0
+
+ kwargs_init_pipeline, kwargs_load_engine, args_run_demo = process_pipeline_args(args)
+
+ # Initialize demo
+ demo = StableDiffusionPipeline(
+ pipeline_type=PIPELINE_TYPE.CONTROLNET,
+ controlnets=args.controlnet_type,
+ **kwargs_init_pipeline)
+
+ # Load TensorRT engines and pytorch modules
+ demo.loadEngines(
+ args.engine_dir,
+ args.framework_model_dir,
+ args.onnx_dir,
+ **kwargs_load_engine)
+
+ # Load resources
+ _, shared_device_memory = cudart.cudaMalloc(demo.calculateMaxDeviceMemory())
+ demo.activateEngines(shared_device_memory)
+ demo.loadResources(args.height, args.width, args.batch_size, args.seed)
+
+ # Run inference
+ demo_kwargs = {'input_image': input_images, 'controlnet_scales': controlnet_scale}
+ demo.run(*args_run_demo, **demo_kwargs)
+
+ demo.teardown()
diff --git a/demo/Diffusion/demo_img2img.py b/demo/Diffusion/demo_img2img.py
index 963babee..bf56f6a9 100755
--- a/demo/Diffusion/demo_img2img.py
+++ b/demo/Diffusion/demo_img2img.py
@@ -16,19 +16,24 @@
#
import argparse
-from cuda import cudart
-import tensorrt as trt
-
-from img2img_pipeline import Img2ImgPipeline
-from utilities import preprocess_image, TRT_LOGGER, add_arguments, download_image
import PIL
+from cuda import cudart
from PIL import Image
+from stable_diffusion_pipeline import StableDiffusionPipeline
+from utilities import (
+ PIPELINE_TYPE,
+ TRT_LOGGER,
+ add_arguments,
+ download_image,
+ preprocess_image,
+ process_pipeline_args
+)
+
def parseArgs():
parser = argparse.ArgumentParser(description="Options for Stable Diffusion Img2Img Demo")
parser = add_arguments(parser)
- parser.add_argument('--scheduler', type=str, default="DDIM", choices=["DDIM", "EulerA", "LMSD", "DPM", "PNDM"], help="Scheduler for diffusion process")
parser.add_argument('--input-image', type=str, default="", help="Path to the input image")
return parser.parse_args()
@@ -36,81 +41,42 @@ def parseArgs():
print("[I] Initializing StableDiffusion img2img demo using TensorRT")
args = parseArgs()
- # Process prompt
- if not isinstance(args.prompt, list):
- raise ValueError(f"`prompt` must be of type `str` or `str` list, but is {type(args.prompt)}")
- prompt = args.prompt * args.repeat_prompt
-
- if not isinstance(args.negative_prompt, list):
- raise ValueError(f"`--negative-prompt` must be of type `str` or `str` list, but is {type(args.negative_prompt)}")
- if len(args.negative_prompt) == 1:
- negative_prompt = args.negative_prompt * len(prompt)
- else:
- negative_prompt = args.negative_prompt
-
if args.input_image:
input_image = Image.open(args.input_image)
else:
- url = "https://pajoca.com/wp-content/uploads/2022/09/tekito-yamakawa-1.png"
+ url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
input_image = download_image(url)
image_width, image_height = input_image.size
-
- # Validate image dimensions
- if image_height % 8 != 0 or image_width % 8 != 0:
- raise ValueError(f"Image height and width have to be divisible by 8 but specified as: {image_height} and {image_width}.")
+ if image_height != args.height or image_width != args.width:
+ print(f"[I] Resizing input_image to {args.height}x{args.width}")
+ input_image = input_image.resize((args.height, args.width))
+ image_height, image_width = args.height, args.width
if isinstance(input_image, PIL.Image.Image):
input_image = preprocess_image(input_image)
- # Register TensorRT plugins
- trt.init_libnvinfer_plugins(TRT_LOGGER, '')
-
- max_batch_size = 16
- if args.build_dynamic_shape:
- max_batch_size = 4
-
- batch_size = len(prompt)
- if batch_size > max_batch_size:
- raise ValueError(f"Batch size {len(prompt)} is larger than allowed {max_batch_size}. If dynamic shape is used, then maximum batch size is 4")
-
- if args.use_cuda_graph and (not args.build_static_batch or args.build_dynamic_shape):
- raise ValueError(f"Using CUDA graph requires static dimensions. Enable `--build-static-batch` and do not specify `--build-dynamic-shape`")
+ kwargs_init_pipeline, kwargs_load_engine, args_run_demo = process_pipeline_args(args)
# Initialize demo
- demo = Img2ImgPipeline(
- scheduler=args.scheduler,
- denoising_steps=args.denoising_steps,
- output_dir=args.output_dir,
- version=args.version,
- hf_token=args.hf_token,
- verbose=args.verbose,
- nvtx_profile=args.nvtx_profile,
- max_batch_size=max_batch_size)
+ demo = StableDiffusionPipeline(
+ pipeline_type=PIPELINE_TYPE.IMG2IMG,
+ **kwargs_init_pipeline)
# Load TensorRT engines and pytorch modules
- demo.loadEngines(args.engine_dir, args.onnx_dir, args.onnx_opset,
- opt_batch_size=len(prompt), opt_image_height=image_height, opt_image_width=image_width, \
- force_export=args.force_onnx_export, force_optimize=args.force_onnx_optimize, \
- force_build=args.force_engine_build, \
- static_batch=args.build_static_batch, static_shape=not args.build_dynamic_shape, \
- enable_refit=args.build_enable_refit, enable_preview=args.build_preview_features, enable_all_tactics=args.build_all_tactics, \
- timing_cache=args.timing_cache, onnx_refit_dir=args.onnx_refit_dir)
- demo.loadResources(image_height, image_width, batch_size, args.seed)
-
- if args.use_cuda_graph:
- # inference once to get cuda graph
- images = demo.infer(prompt, negative_prompt, input_image, image_height, image_width, strength=0.75, warmup=True)
-
- print("[I] Warming up ..")
- for _ in range(args.num_warmup_runs):
- images = demo.infer(prompt, negative_prompt, input_image, image_height, image_width, strength=0.75, warmup=True)
-
- print("[I] Running StableDiffusion pipeline")
- if args.nvtx_profile:
- cudart.cudaProfilerStart()
-
- images = demo.infer(prompt, negative_prompt, input_image, image_height, image_width, seed=args.seed, strength=0.75)
-
- if args.nvtx_profile:
- cudart.cudaProfilerStop()
+ demo.loadEngines(
+ args.engine_dir,
+ args.framework_model_dir,
+ args.onnx_dir,
+ **kwargs_load_engine)
+
+ # Load resources
+ _, shared_device_memory = cudart.cudaMalloc(demo.calculateMaxDeviceMemory())
+ demo.activateEngines(shared_device_memory)
+ demo.loadResources(args.height, args.width, args.batch_size, args.seed)
+
+ # Run inference
+ demo_kwargs = {'input_image': input_image, 'image_strength': 0.75}
+ demo.run(*args_run_demo, **demo_kwargs)
+
+ demo.teardown()
diff --git a/demo/Diffusion/demo_inpaint.py b/demo/Diffusion/demo_inpaint.py
index 1fa8219a..af635df0 100755
--- a/demo/Diffusion/demo_inpaint.py
+++ b/demo/Diffusion/demo_inpaint.py
@@ -16,15 +16,17 @@
#
import argparse
+
from cuda import cudart
-import tensorrt as trt
-from utilities import TRT_LOGGER, add_arguments, download_image
-from inpaint_pipeline import InpaintPipeline
from PIL import Image
+from stable_diffusion_pipeline import StableDiffusionPipeline
+from utilities import PIPELINE_TYPE, TRT_LOGGER, add_arguments, download_image, process_pipeline_args
+
def parseArgs():
- parser = argparse.ArgumentParser(description="Options for Stable Diffusion Inpaint Demo")
+ parser = argparse.ArgumentParser(description="Options for Stable Diffusion Inpaint Demo", conflict_handler='resolve')
parser = add_arguments(parser)
+ parser.add_argument('--version', type=str, default="1.5", choices=["1.5", "2.0"], help="Stable Diffusion version. Only 1.5 and 2.0 supported for inpainting.")
parser.add_argument('--scheduler', type=str, default="PNDM", choices=["PNDM"], help="Scheduler for diffusion process")
parser.add_argument('--input-image', type=str, default="", help="Path to the input image")
parser.add_argument('--mask-image', type=str, default="", help="Path to the mask image")
@@ -34,22 +36,6 @@ def parseArgs():
print("[I] Initializing StableDiffusion inpainting demo using TensorRT")
args = parseArgs()
- # Inpainting is currently only supported for v1.5 and v2.0
- if args.version not in ("1.5", "2.0"):
- raise ValueError(f"Inpainting not supported in version {args.version}. Use v2.0, or v1.5")
-
- # Process prompt
- if not isinstance(args.prompt, list):
- raise ValueError(f"`prompt` must be of type `str` or `str` list, but is {type(args.prompt)}")
- prompt = args.prompt * args.repeat_prompt
-
- if not isinstance(args.negative_prompt, list):
- raise ValueError(f"`--negative-prompt` must be of type `str` or `str` list, but is {type(args.negative_prompt)}")
- if len(args.negative_prompt) == 1:
- negative_prompt = args.negative_prompt * len(prompt)
- else:
- negative_prompt = args.negative_prompt
-
if args.input_image:
input_image = Image.open(args.input_image).convert("RGB")
else:
@@ -63,65 +49,38 @@ def parseArgs():
mask_image = download_image(mask_url)
image_width, image_height = input_image.size
- mask_width, mask_height = mask_image.size
-
- # Validate image dimensions
- if mask_height != image_height or mask_width != image_width:
- raise ValueError(f"Input image height and width {image_height} and {image_width} are not equal to "
- f"the respective dimensions of the mask image {mask_height} and {mask_width}")
-
- if image_height % 8 != 0 or image_width % 8 != 0:
- raise ValueError(f"Image height and width have to be divisible by 8 but specified as: {image_height} and {image_width}.")
+ if image_height != args.height or image_width != args.width:
+ print(f"[I] Resizing input_image to {args.height}x{args.width}")
+ input_image = input_image.resize((args.height, args.width))
+ image_height, image_width = args.height, args.width
- # Register TensorRT plugins
- trt.init_libnvinfer_plugins(TRT_LOGGER, '')
-
- max_batch_size = 16
- if args.build_dynamic_shape:
- max_batch_size = 4
-
- batch_size = len(prompt)
- if batch_size > max_batch_size:
- raise ValueError(f"Batch size {len(prompt)} is larger than allowed {max_batch_size}. If dynamic shape is used, then maximum batch size is 4")
+ mask_width, mask_height = mask_image.size
+ if mask_height != args.height or mask_width != args.width:
+ print(f"[I] Resizing mask_image to {args.height}x{args.width}")
+ mask_image = mask_image.resize((args.height, args.width))
+ mask_height, mask_width = args.height, args.width
- if args.use_cuda_graph and (not args.build_static_batch or args.build_dynamic_shape):
- raise ValueError(f"Using CUDA graph requires static dimensions. Enable `--build-static-batch` and do not specify `--build-dynamic-shape`")
+ kwargs_init_pipeline, kwargs_load_engine, args_run_demo = process_pipeline_args(args)
# Initialize demo
- demo = InpaintPipeline(
- scheduler=args.scheduler,
- denoising_steps=args.denoising_steps,
- output_dir=args.output_dir,
- version=args.version,
- hf_token=args.hf_token,
- verbose=args.verbose,
- nvtx_profile=args.nvtx_profile,
- max_batch_size=max_batch_size)
+ demo = StableDiffusionPipeline(
+ pipeline_type=PIPELINE_TYPE.INPAINT,
+ **kwargs_init_pipeline)
# Load TensorRT engines and pytorch modules
- demo.loadEngines(args.engine_dir, args.onnx_dir, args.onnx_opset,
- opt_batch_size=len(prompt), opt_image_height=image_height, opt_image_width=image_width, \
- force_export=args.force_onnx_export, force_optimize=args.force_onnx_optimize, \
- force_build=args.force_engine_build, \
- static_batch=args.build_static_batch, static_shape=not args.build_dynamic_shape, \
- enable_preview=args.build_preview_features, enable_all_tactics=args.build_all_tactics, \
- timing_cache=args.timing_cache)
- demo.loadResources(image_height, image_width, batch_size, args.seed)
-
-
- if args.use_cuda_graph:
- # inference once to get cuda graph
- images = demo.infer(prompt, negative_prompt, input_image, mask_image, image_height, image_width, strength=0.75, warmup=True)
-
- print("[I] Warming up ..")
- for _ in range(args.num_warmup_runs):
- images = demo.infer(prompt, negative_prompt, input_image, mask_image, image_height, image_width, strength=0.75, warmup=True)
-
- print("[I] Running StableDiffusion pipeline")
- if args.nvtx_profile:
- cudart.cudaProfilerStart()
-
- images = demo.infer(prompt, negative_prompt, input_image, mask_image, image_height, image_width, seed=args.seed, strength=0.75)
-
- if args.nvtx_profile:
- cudart.cudaProfilerStop()
+ demo.loadEngines(
+ args.engine_dir,
+ args.framework_model_dir,
+ args.onnx_dir,
+ **kwargs_load_engine)
+
+ # Load resources
+ _, shared_device_memory = cudart.cudaMalloc(demo.calculateMaxDeviceMemory())
+ demo.activateEngines(shared_device_memory)
+ demo.loadResources(args.height, args.width, args.batch_size, args.seed)
+
+ # Run inference
+ demo_kwargs = {'input_image': input_image, 'image_strength': 0.75, 'mask_image': mask_image}
+ demo.run(*args_run_demo, **demo_kwargs)
+
+ demo.teardown()
diff --git a/demo/Diffusion/demo_txt2img.py b/demo/Diffusion/demo_txt2img.py
index 4491c45e..3e33838f 100644
--- a/demo/Diffusion/demo_txt2img.py
+++ b/demo/Diffusion/demo_txt2img.py
@@ -16,89 +16,41 @@
#
import argparse
+
from cuda import cudart
-import tensorrt as trt
-from utilities import TRT_LOGGER, add_arguments
-from txt2img_pipeline import Txt2ImgPipeline
+
+from stable_diffusion_pipeline import StableDiffusionPipeline
+from utilities import PIPELINE_TYPE, TRT_LOGGER, add_arguments, process_pipeline_args
def parseArgs():
parser = argparse.ArgumentParser(description="Options for Stable Diffusion Txt2Img Demo")
parser = add_arguments(parser)
- parser.add_argument('--scheduler', type=str, default="DDIM", choices=["PNDM", "LMSD", "DPM", "DDIM", "EulerA"], help="Scheduler for diffusion process")
return parser.parse_args()
if __name__ == "__main__":
print("[I] Initializing StableDiffusion txt2img demo using TensorRT")
args = parseArgs()
- # Process prompt
- if not isinstance(args.prompt, list):
- raise ValueError(f"`prompt` must be of type `str` or `str` list, but is {type(args.prompt)}")
- prompt = args.prompt * args.repeat_prompt
-
- if not isinstance(args.negative_prompt, list):
- raise ValueError(f"`--negative-prompt` must be of type `str` or `str` list, but is {type(args.negative_prompt)}")
- if len(args.negative_prompt) == 1:
- negative_prompt = args.negative_prompt * len(prompt)
- else:
- negative_prompt = args.negative_prompt
-
- # Validate image dimensions
- image_height = args.height
- image_width = args.width
- if image_height % 8 != 0 or image_width % 8 != 0:
- raise ValueError(f"Image height and width have to be divisible by 8 but specified as: {image_height} and {image_width}.")
-
- # Register TensorRT plugins
- trt.init_libnvinfer_plugins(TRT_LOGGER, '')
-
- max_batch_size = 16
- # FIXME VAE build fails due to element limit. Limitting batch size is WAR
- if args.build_dynamic_shape or image_height > 512 or image_width > 512:
- max_batch_size = 4
-
- batch_size = len(prompt)
- if batch_size > max_batch_size:
- raise ValueError(f"Batch size {len(prompt)} is larger than allowed {max_batch_size}. If dynamic shape is used, then maximum batch size is 4")
-
- if args.use_cuda_graph and (not args.build_static_batch or args.build_dynamic_shape):
- raise ValueError(f"Using CUDA graph requires static dimensions. Enable `--build-static-batch` and do not specify `--build-dynamic-shape`")
+ kwargs_init_pipeline, kwargs_load_engine, args_run_demo = process_pipeline_args(args)
# Initialize demo
- demo = Txt2ImgPipeline(
- scheduler=args.scheduler,
- denoising_steps=args.denoising_steps,
- output_dir=args.output_dir,
- version=args.version,
- hf_token=args.hf_token,
- verbose=args.verbose,
- nvtx_profile=args.nvtx_profile,
- max_batch_size=max_batch_size,
- use_cuda_graph=args.use_cuda_graph)
+ demo = StableDiffusionPipeline(
+ pipeline_type=PIPELINE_TYPE.TXT2IMG,
+ **kwargs_init_pipeline)
# Load TensorRT engines and pytorch modules
- demo.loadEngines(args.engine_dir, args.onnx_dir, args.onnx_opset,
- opt_batch_size=len(prompt), opt_image_height=image_height, opt_image_width=image_width, \
- force_export=args.force_onnx_export, force_optimize=args.force_onnx_optimize, \
- force_build=args.force_engine_build, \
- static_batch=args.build_static_batch, static_shape=not args.build_dynamic_shape, \
- enable_refit=args.build_enable_refit, enable_preview=args.build_preview_features, enable_all_tactics=args.build_all_tactics, \
- timing_cache=args.timing_cache, onnx_refit_dir=args.onnx_refit_dir)
- demo.loadResources(image_height, image_width, batch_size, args.seed)
-
- if args.use_cuda_graph:
- # inference once to get cuda graph
- images = demo.infer(prompt, negative_prompt, image_height, image_width, warmup=True, verbose=False)
-
- print("[I] Warming up ..")
- for _ in range(args.num_warmup_runs):
- images = demo.infer(prompt, negative_prompt, image_height, image_width, warmup=True, verbose=False)
-
- print("[I] Running StableDiffusion pipeline")
- if args.nvtx_profile:
- cudart.cudaProfilerStart()
- images = demo.infer(prompt, negative_prompt, image_height, image_width, seed=args.seed, verbose=args.verbose)
- if args.nvtx_profile:
- cudart.cudaProfilerStop()
+ demo.loadEngines(
+ args.engine_dir,
+ args.framework_model_dir,
+ args.onnx_dir,
+ **kwargs_load_engine)
+
+ # Load resources
+ _, shared_device_memory = cudart.cudaMalloc(demo.calculateMaxDeviceMemory())
+ demo.activateEngines(shared_device_memory)
+ demo.loadResources(args.height, args.width, args.batch_size, args.seed)
+
+ # Run inference
+ demo.run(*args_run_demo)
demo.teardown()
diff --git a/demo/Diffusion/demo_txt2img_xl.py b/demo/Diffusion/demo_txt2img_xl.py
new file mode 100644
index 00000000..ea579279
--- /dev/null
+++ b/demo/Diffusion/demo_txt2img_xl.py
@@ -0,0 +1,151 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import argparse
+
+from cuda import cudart
+
+from stable_diffusion_pipeline import StableDiffusionPipeline
+from utilities import PIPELINE_TYPE, TRT_LOGGER, add_arguments, process_pipeline_args
+
+def parseArgs():
+ parser = argparse.ArgumentParser(description="Options for Stable Diffusion XL Txt2Img Demo", conflict_handler='resolve')
+ parser = add_arguments(parser)
+ parser.add_argument('--version', type=str, default="xl-1.0", choices=["xl-1.0", "xl-turbo"], help="Version of Stable Diffusion XL")
+ parser.add_argument('--height', type=int, default=1024, help="Height of image to generate (must be multiple of 8)")
+    parser.add_argument('--width', type=int, default=1024, help="Width of image to generate (must be multiple of 8)")
+ parser.add_argument('--num-warmup-runs', type=int, default=1, help="Number of warmup runs before benchmarking performance")
+
+ parser.add_argument('--guidance-scale', type=float, default=5.0, help="Value of classifier-free guidance scale (must be greater than 1)")
+
+ parser.add_argument('--enable-refiner', action='store_true', help="Enable SDXL-Refiner model")
+ parser.add_argument('--image-strength', type=float, default=0.3, help="Strength of transformation applied to input_image (must be between 0 and 1)")
+ parser.add_argument('--onnx-refiner-dir', default='onnx_xl_refiner', help="Directory for SDXL-Refiner ONNX models")
+ parser.add_argument('--engine-refiner-dir', default='engine_xl_refiner', help="Directory for SDXL-Refiner TensorRT engines")
+
+ return parser.parse_args()
+
+class StableDiffusionXLPipeline(StableDiffusionPipeline):
+ def __init__(self, vae_scaling_factor=0.13025, enable_refiner=False, **kwargs):
+ self.enable_refiner = enable_refiner
+ self.nvtx_profile = kwargs['nvtx_profile']
+ self.base = StableDiffusionPipeline(
+ pipeline_type=PIPELINE_TYPE.XL_BASE,
+ vae_scaling_factor=vae_scaling_factor,
+ return_latents=self.enable_refiner,
+ **kwargs)
+ if self.enable_refiner:
+ self.refiner = StableDiffusionPipeline(
+ pipeline_type=PIPELINE_TYPE.XL_REFINER,
+ vae_scaling_factor=vae_scaling_factor,
+ return_latents=False,
+ **kwargs)
+
+ def loadEngines(self, framework_model_dir, onnx_dir, engine_dir, onnx_refiner_dir='onnx_xl_refiner', engine_refiner_dir='engine_xl_refiner', **kwargs):
+ self.base.loadEngines(engine_dir, framework_model_dir, onnx_dir, **kwargs)
+ if self.enable_refiner:
+ self.refiner.loadEngines(engine_refiner_dir, framework_model_dir, onnx_refiner_dir, **kwargs)
+
+ def activateEngines(self, shared_device_memory=None):
+ self.base.activateEngines(shared_device_memory)
+ if self.enable_refiner:
+ self.refiner.activateEngines(shared_device_memory)
+
+ def loadResources(self, image_height, image_width, batch_size, seed):
+ self.base.loadResources(image_height, image_width, batch_size, seed)
+ if self.enable_refiner:
+ # Use a different seed for refiner - we arbitrarily use base seed+1, if specified.
+ self.refiner.loadResources(image_height, image_width, batch_size, ((seed+1) if seed is not None else None))
+
+ def get_max_device_memory(self):
+ max_device_memory = self.base.calculateMaxDeviceMemory()
+ if self.enable_refiner:
+ max_device_memory = max(max_device_memory, self.refiner.calculateMaxDeviceMemory())
+ return max_device_memory
+
+ def run(self, prompt, negative_prompt, height, width, batch_size, batch_count, num_warmup_runs, use_cuda_graph, **kwargs_infer_refiner):
+ # Process prompt
+ if not isinstance(prompt, list):
+ raise ValueError(f"`prompt` must be of type `str` list, but is {type(prompt)}")
+ prompt = prompt * batch_size
+
+ if not isinstance(negative_prompt, list):
+ raise ValueError(f"`--negative-prompt` must be of type `str` list, but is {type(negative_prompt)}")
+ if len(negative_prompt) == 1:
+ negative_prompt = negative_prompt * batch_size
+
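+        # The CUDA graph is captured during warmup, so force at least one warmup iteration when CUDA graph mode is enabled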
+ num_warmup_runs = max(1, num_warmup_runs) if use_cuda_graph else num_warmup_runs
+ if num_warmup_runs > 0:
+ print("[I] Warming up ..")
+ for _ in range(num_warmup_runs):
+ images, _ = self.base.infer(prompt, negative_prompt, height, width, warmup=True)
+                if self.enable_refiner:
+ images, _ = self.refiner.infer(prompt, negative_prompt, height, width, input_image=images, warmup=True, **kwargs_infer_refiner)
+
+ ret = []
+ for _ in range(batch_count):
+ print("[I] Running StableDiffusionXL pipeline")
+ if self.nvtx_profile:
+ cudart.cudaProfilerStart()
+ latents, time_base = self.base.infer(prompt, negative_prompt, height, width, warmup=False)
+ if self.enable_refiner:
+ images, time_refiner = self.refiner.infer(prompt, negative_prompt, height, width, input_image=latents, warmup=False, **kwargs_infer_refiner)
+ ret.append(images)
+ else:
+ ret.append(latents)
+
+ if self.nvtx_profile:
+ cudart.cudaProfilerStop()
+ if self.enable_refiner:
+ print('|-----------------|--------------|')
+ print('| {:^15} | {:>9.2f} ms |'.format('e2e', time_base + time_refiner))
+ print('|-----------------|--------------|')
+ return ret
+
+ def teardown(self):
+ self.base.teardown()
+ if self.enable_refiner:
+ self.refiner.teardown()
+
+if __name__ == "__main__":
+ print("[I] Initializing TensorRT accelerated StableDiffusionXL txt2img pipeline")
+ args = parseArgs()
+
+ kwargs_init_pipeline, kwargs_load_engine, args_run_demo = process_pipeline_args(args)
+
+ # Initialize demo
+ demo = StableDiffusionXLPipeline(vae_scaling_factor=0.13025, enable_refiner=args.enable_refiner, **kwargs_init_pipeline)
+
+ # Load TensorRT engines and pytorch modules
+ kwargs_load_refiner = {'onnx_refiner_dir': args.onnx_refiner_dir, 'engine_refiner_dir': args.engine_refiner_dir} if args.enable_refiner else {}
+ demo.loadEngines(
+ args.framework_model_dir,
+ args.onnx_dir,
+ args.engine_dir,
+ **kwargs_load_refiner,
+ **kwargs_load_engine)
+
+ # Load resources
+ _, shared_device_memory = cudart.cudaMalloc(demo.get_max_device_memory())
+ demo.activateEngines(shared_device_memory)
+ demo.loadResources(args.height, args.width, args.batch_size, args.seed)
+
+ # Run inference
+ kwargs_infer_refiner = {'image_strength': args.image_strength} if args.enable_refiner else {}
+ demo.run(*args_run_demo, **kwargs_infer_refiner)
+
+ demo.teardown()
diff --git a/demo/Diffusion/img2img_pipeline.py b/demo/Diffusion/img2img_pipeline.py
deleted file mode 100755
index 2a0b05d1..00000000
--- a/demo/Diffusion/img2img_pipeline.py
+++ /dev/null
@@ -1,115 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import numpy as np
-import nvtx
-import time
-import torch
-import tensorrt as trt
-from utilities import TRT_LOGGER
-from stable_diffusion_pipeline import StableDiffusionPipeline
-
-class Img2ImgPipeline(StableDiffusionPipeline):
- """
- Application showcasing the acceleration of Stable Diffusion Img2Img v1.4, v1.5, v2.0-base, v2.0, v2.1-base, v2.1 pipeline using NVidia TensorRT w/ Plugins.
- """
- def __init__(
- self,
- scheduler="DDIM",
- *args, **kwargs
- ):
- """
- Initializes the Img2Img Diffusion pipeline.
-
- Args:
- scheduler (str):
- The scheduler to guide the denoising process. Must be one of the [EulerA, DDIM, DPM, LMSD, PNDM].
- """
- super(Img2ImgPipeline, self).__init__(*args, **kwargs, \
- scheduler=scheduler, stages=['vae_encoder', 'clip', 'unet', 'vae'])
-
- def infer(
- self,
- prompt,
- negative_prompt,
- init_image,
- image_height,
- image_width,
- seed=None,
- strength=0.75,
- warmup=False,
- verbose=False
- ):
- """
- Run the diffusion pipeline.
-
- Args:
- prompt (str):
- The text prompt to guide image generation.
- negative_prompt (str):
- The prompt not to guide the image generation.
- init_image (image):
- Input image to be used as input.
- image_height (int):
- Height (in pixels) of the image to be generated. Must be a multiple of 8.
- image_width (int):
- Width (in pixels) of the image to be generated. Must be a multiple of 8.
- seed (int):
- Seed for the random generator
- strength (float):
- How much to transform the input image. Must be between 0 and 1
- warmup (bool):
- Indicate if this is a warmup run.
- verbose (bool):
- Verbose in logging
- """
- batch_size = len(prompt)
- assert len(prompt) == len(negative_prompt)
-
- with torch.inference_mode(), torch.autocast("cuda"), trt.Runtime(TRT_LOGGER):
- torch.cuda.synchronize()
- e2e_tic = time.perf_counter()
-
- # Initialize timesteps
- timesteps, t_start = self.initialize_timesteps(self.denoising_steps, strength)
- latent_timestep = timesteps[:1].repeat(batch_size)
-
- # Pre-process input image
- init_image = self.preprocess_images(batch_size, (init_image,))[0]
-
- # VAE encode init image
- init_latents = self.encode_image(init_image)
-
- # CLIP text encoder
- text_embeddings = self.encode_prompt(prompt, negative_prompt)
-
- # Add noise to latents using timesteps
- noise = torch.randn(init_latents.shape, generator=self.generator, device=self.device, dtype=torch.float32)
- latents = self.scheduler.add_noise(init_latents, noise, t_start, latent_timestep)
-
- # UNet denoiser
- latents = self.denoise_latent(latents, text_embeddings, timesteps=timesteps, step_offset=t_start)
-
- # VAE decode latent
- images = self.decode_latent(latents)
-
- torch.cuda.synchronize()
- e2e_toc = time.perf_counter()
-
- if not warmup:
- self.print_summary(self.denoising_steps, e2e_tic, e2e_toc, vae_enc=True)
- self.save_image(images, 'img2img', prompt)
diff --git a/demo/Diffusion/inpaint_pipeline.py b/demo/Diffusion/inpaint_pipeline.py
deleted file mode 100755
index 3a1ade5a..00000000
--- a/demo/Diffusion/inpaint_pipeline.py
+++ /dev/null
@@ -1,135 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import numpy as np
-import nvtx
-import time
-import torch
-import tensorrt as trt
-from utilities import prepare_mask_and_masked_image, TRT_LOGGER
-from stable_diffusion_pipeline import StableDiffusionPipeline
-
-class InpaintPipeline(StableDiffusionPipeline):
- """
- Application showcasing the acceleration of Stable Diffusion Inpainting v1.5, v2.0 pipeline using NVidia TensorRT w/ Plugins.
- """
- def __init__(
- self,
- scheduler="PNDM",
- *args, **kwargs
- ):
- """
- Initializes the Inpainting Diffusion pipeline.
-
- Args:
- scheduler (str):
- The scheduler to guide the denoising process. Must be one of the [PNDM].
- """
-
- if scheduler != "PNDM":
- raise ValueError(f"Inpainting only supports PNDM scheduler")
-
- super(InpaintPipeline, self).__init__(*args, **kwargs, \
- inpaint=True, scheduler=scheduler, stages=[ 'vae_encoder', 'clip', 'unet', 'vae'])
-
- def infer(
- self,
- prompt,
- negative_prompt,
- input_image,
- mask_image,
- image_height,
- image_width,
- seed=None,
- strength=0.75,
- warmup = False,
- verbose = False,
- ):
- """
- Run the diffusion pipeline.
-
- Args:
- prompt (str):
- The text prompt to guide image generation.
- negative_prompt (str):
- The prompt not to guide the image generation.
- input_image (image):
- Input image to be inpainted.
- mask_image (image):
- Mask image containg the region to be inpainted.
- image_height (int):
- Height (in pixels) of the image to be generated. Must be a multiple of 8.
- image_width (int):
- Width (in pixels) of the image to be generated. Must be a multiple of 8.
- seed (int):
- Seed for the random generator
- strength (float):
- How much to transform the input image. Must be between 0 and 1
- warmup (bool):
- Indicate if this is a warmup run.
- verbose (bool):
- Enable verbose logging.
- """
- batch_size = len(prompt)
- assert len(prompt) == len(negative_prompt)
-
- # Spatial dimensions of latent tensor
- latent_height = image_height // 8
- latent_width = image_width // 8
-
- with torch.inference_mode(), torch.autocast("cuda"), trt.Runtime(TRT_LOGGER):
- # Pre-initialize latents
- # TODO: unet_channels = 9?
- latents = self.initialize_latents( \
- batch_size=batch_size, \
- unet_channels=4, \
- latent_height=latent_height, \
- latent_width=latent_width
- )
-
- torch.cuda.synchronize()
- e2e_tic = time.perf_counter()
-
- # Pre-process input images
- mask, masked_image = self.preprocess_images(batch_size, prepare_mask_and_masked_image(input_image, mask_image))
- mask = torch.nn.functional.interpolate(mask, size=(latent_height, latent_width))
- mask = torch.cat([mask] * 2)
-
- # Initialize timesteps
- timesteps, t_start = self.initialize_timesteps(self.denoising_steps, strength)
-
- # VAE encode masked image
- masked_latents = self.encode_image(masked_image)
- masked_latents = torch.cat([masked_latents] * 2)
-
- # CLIP text encoder
- text_embeddings = self.encode_prompt(prompt, negative_prompt)
-
- # UNet denoiser
- latents = self.denoise_latent(latents, text_embeddings, timesteps=timesteps, \
- step_offset=t_start, mask=mask, masked_image_latents=masked_latents)
-
- # VAE decode latent
- images = self.decode_latent(latents)
-
- torch.cuda.synchronize()
- e2e_toc = time.perf_counter()
-
- if not warmup:
- self.print_summary(self.denoising_steps, e2e_tic, e2e_toc, vae_enc=True)
- self.save_image(images, 'inpaint', prompt)
-
diff --git a/demo/Diffusion/models.py b/demo/Diffusion/models.py
index bcf69b32..b1a196aa 100644
--- a/demo/Diffusion/models.py
+++ b/demo/Diffusion/models.py
@@ -15,17 +15,31 @@
# limitations under the License.
#
-from collections import OrderedDict
-from copy import deepcopy
-from diffusers.models import AutoencoderKL, UNet2DConditionModel
+from diffusers import DiffusionPipeline
+from diffusers.loaders import LoraLoaderMixin
+from diffusers.models import (
+ AutoencoderKL,
+ ControlNetModel,
+ UNet2DConditionModel
+)
+from diffusers.utils import convert_state_dict_to_diffusers
+import json
import numpy as np
-from onnx import shape_inference
+import onnx
+from onnx import numpy_helper, shape_inference
import onnx_graphsurgeon as gs
+import os
from polygraphy.backend.onnx.loader import fold_constants
+import re
+import tempfile
import torch
-from transformers import CLIPTextModel, CLIPTokenizer
-from cuda import cudart
-import onnx
+import torch.nn.functional as F
+from transformers import (
+ CLIPTextModel,
+ CLIPTextModelWithProjection,
+ CLIPTokenizer
+)
+from utilities import merge_loras
class Optimizer():
def __init__(
@@ -42,8 +56,7 @@ def info(self, prefix):
def cleanup(self, return_onnx=False):
self.graph.cleanup().toposort()
- if return_onnx:
- return gs.export_onnx(self.graph)
+ return gs.export_onnx(self.graph) if return_onnx else self.graph
def select_outputs(self, keep, names=None):
self.graph.outputs = [self.graph.outputs[o] for o in keep]
@@ -60,7 +73,17 @@ def fold_constants(self, return_onnx=False):
def infer_shapes(self, return_onnx=False):
onnx_graph = gs.export_onnx(self.graph)
if onnx_graph.ByteSize() > 2147483648:
- raise TypeError("ERROR: model size exceeds supported 2GB limit")
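+            # Graph exceeds the 2GB protobuf limit: save it with external data and run shape inference through the file-based API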
+ temp_dir = tempfile.TemporaryDirectory().name
+ os.makedirs(temp_dir, exist_ok=True)
+ onnx_orig_path = os.path.join(temp_dir, 'model.onnx')
+ onnx_inferred_path = os.path.join(temp_dir, 'inferred.onnx')
+ onnx.save_model(onnx_graph,
+ onnx_orig_path,
+ save_as_external_data=True,
+ all_tensors_to_one_file=True,
+ convert_attribute=False)
+ onnx.shape_inference.infer_shapes_path(onnx_orig_path, onnx_inferred_path)
+ onnx_graph = onnx.load(onnx_inferred_path)
else:
onnx_graph = shape_inference.infer_shapes(onnx_graph)
@@ -68,24 +91,100 @@ def infer_shapes(self, return_onnx=False):
if return_onnx:
return onnx_graph
-def get_path(version, inpaint=False):
+ def clip_add_hidden_states(self, return_onnx=False):
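+        # Expose the output of the second-to-last CLIP encoder layer as an additional 'hidden_states' graph output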
+ hidden_layers = -1
+ onnx_graph = gs.export_onnx(self.graph)
+ for i in range(len(onnx_graph.graph.node)):
+ for j in range(len(onnx_graph.graph.node[i].output)):
+ name = onnx_graph.graph.node[i].output[j]
+ if "layers" in name:
+ hidden_layers = max(int(name.split(".")[1].split("/")[0]), hidden_layers)
+ for i in range(len(onnx_graph.graph.node)):
+ for j in range(len(onnx_graph.graph.node[i].output)):
+ if onnx_graph.graph.node[i].output[j] == "/text_model/encoder/layers.{}/Add_1_output_0".format(hidden_layers-1):
+ onnx_graph.graph.node[i].output[j] = "hidden_states"
+ for j in range(len(onnx_graph.graph.node[i].input)):
+ if onnx_graph.graph.node[i].input[j] == "/text_model/encoder/layers.{}/Add_1_output_0".format(hidden_layers-1):
+ onnx_graph.graph.node[i].input[j] = "hidden_states"
+ if return_onnx:
+ return onnx_graph
+
+ def fuse_mha_qkv_int8_sq(self):
+ tensors = self.graph.tensors()
+ keys = tensors.keys()
+
+ # mha : fuse QKV QDQ nodes
+ # mhca : fuse KV QDQ nodes
+ q_pat = (
+ "/down_blocks.\\d+/attentions.\\d+/transformer_blocks"
+ ".\\d+/attn\\d+/to_q/input_quantizer/DequantizeLinear_output_0"
+ )
+ k_pat = (
+ "/down_blocks.\\d+/attentions.\\d+/transformer_blocks"
+ ".\\d+/attn\\d+/to_k/input_quantizer/DequantizeLinear_output_0"
+ )
+ v_pat = (
+ "/down_blocks.\\d+/attentions.\\d+/transformer_blocks"
+ ".\\d+/attn\\d+/to_v/input_quantizer/DequantizeLinear_output_0"
+ )
+
+ qs = list(sorted(map(
+ lambda x: x.group(0), # type: ignore
+ filter(lambda x: x is not None, [re.match(q_pat, key) for key in keys]),
+ )))
+ ks = list(sorted(map(
+ lambda x: x.group(0), # type: ignore
+ filter(lambda x: x is not None, [re.match(k_pat, key) for key in keys]),
+ )))
+ vs = list(sorted(map(
+ lambda x: x.group(0), # type: ignore
+ filter(lambda x: x is not None, [re.match(v_pat, key) for key in keys]),
+ )))
+
+ removed = 0
+ assert len(qs) == len(ks) == len(vs), "Failed to collect tensors"
+ for q, k, v in zip(qs, ks, vs):
+ is_mha = all(["attn1" in tensor for tensor in [q, k, v]])
+ is_mhca = all(["attn2" in tensor for tensor in [q, k, v]])
+ assert (is_mha or is_mhca) and (not (is_mha and is_mhca))
+
+ if is_mha:
+ tensors[k].outputs[0].inputs[0] = tensors[q]
+ tensors[v].outputs[0].inputs[0] = tensors[q]
+ del tensors[k]
+ del tensors[v]
+ removed += 2
+ else: # is_mhca
+ tensors[k].outputs[0].inputs[0] = tensors[v]
+ del tensors[k]
+ removed += 1
+ print(f"Removed {removed} QDQ nodes")
+ return removed
+
+
+def get_path(version, pipeline, controlnets=None):
+ if controlnets is not None:
+ return ["lllyasviel/sd-controlnet-" + modality for modality in controlnets]
+
if version == "1.4":
- if inpaint:
+ if pipeline.is_inpaint():
return "runwayml/stable-diffusion-inpainting"
else:
return "CompVis/stable-diffusion-v1-4"
elif version == "1.5":
- if inpaint:
+ if pipeline.is_inpaint():
return "runwayml/stable-diffusion-inpainting"
else:
return "runwayml/stable-diffusion-v1-5"
+ elif version == 'dreamshaper-7':
+ return 'Lykon/dreamshaper-7'
elif version == "2.0-base":
- if inpaint:
+ if pipeline.is_inpaint():
return "stabilityai/stable-diffusion-2-inpainting"
else:
return "stabilityai/stable-diffusion-2-base"
elif version == "2.0":
- if inpaint:
+ if pipeline.is_inpaint():
return "stabilityai/stable-diffusion-2-inpainting"
else:
return "stabilityai/stable-diffusion-2"
@@ -93,35 +192,135 @@ def get_path(version, inpaint=False):
return "stabilityai/stable-diffusion-2-1"
elif version == "2.1-base":
return "stabilityai/stable-diffusion-2-1-base"
+ elif version == 'xl-1.0':
+ if pipeline.is_sd_xl_base():
+ return "stabilityai/stable-diffusion-xl-base-1.0"
+ elif pipeline.is_sd_xl_refiner():
+ return "stabilityai/stable-diffusion-xl-refiner-1.0"
+ else:
+ raise ValueError(f"Unsupported SDXL 1.0 pipeline {pipeline.name}")
+ elif version == 'xl-turbo':
+ if pipeline.is_sd_xl_base():
+ return "stabilityai/sdxl-turbo"
+ else:
+ raise ValueError(f"Unsupported SDXL Turbo pipeline {pipeline.name}")
else:
raise ValueError(f"Incorrect version {version}")
-def get_embedding_dim(version):
- if version in ("1.4", "1.5"):
+def get_clip_embedding_dim(version, pipeline):
+ if version in ("1.4", "1.5", "dreamshaper-7"):
return 768
elif version in ("2.0", "2.0-base", "2.1", "2.1-base"):
return 1024
+ elif version in ("xl-1.0", "xl-turbo") and pipeline.is_sd_xl_base():
+ return 768
else:
- raise ValueError(f"Incorrect version {version}")
+ raise ValueError(f"Invalid version {version} + pipeline {pipeline}")
+
+def get_clipwithproj_embedding_dim(version, pipeline):
+ if version in ("xl-1.0", "xl-turbo"):
+ return 1280
+ else:
+ raise ValueError(f"Invalid version {version} + pipeline {pipeline}")
+
+def get_unet_embedding_dim(version, pipeline):
+ if version in ("1.4", "1.5", "dreamshaper-7"):
+ return 768
+ elif version in ("2.0", "2.0-base", "2.1", "2.1-base"):
+ return 1024
+ elif version in ("xl-1.0", "xl-turbo") and pipeline.is_sd_xl_base():
+ return 2048
+ elif version in ("xl-1.0", "xl-turbo") and pipeline.is_sd_xl_refiner():
+ return 1280
+ else:
+ raise ValueError(f"Invalid version {version} + pipeline {pipeline}")
+
+# FIXME after serialization support for torch.compile is added
+def get_checkpoint_dir(framework_model_dir, version, pipeline, subfolder, torch_inference):
+ return os.path.join(framework_model_dir, version, pipeline, subfolder)
+
+torch_inference_modes = ['default', 'reduce-overhead', 'max-autotune']
+# FIXME update callsites after serialization support for torch.compile is added
+def optimize_checkpoint(model, torch_inference):
+ if not torch_inference or torch_inference == 'eager':
+ return model
+ assert torch_inference in torch_inference_modes
+ return torch.compile(model, mode=torch_inference, dynamic=False, fullgraph=False)
+
+class LoraLoader(LoraLoaderMixin):
+ def __init__(self,
+ paths,
+ ):
+ self.paths = paths
+ self.state_dict = dict()
+ self.network_alphas = dict()
+
+ for path in paths:
+ state_dict, network_alphas = self.lora_state_dict(path)
+ is_correct_format = all("lora" in key for key in state_dict.keys())
+ if not is_correct_format:
+ raise ValueError("Invalid LoRA checkpoint.")
+
+ self.state_dict[path] = state_dict
+ self.network_alphas[path] = network_alphas
+
+ def get_dicts(self,
+ prefix='unet',
+ convert_to_diffusers=False,
+ ):
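+        # Filter each LoRA state dict down to the requested module prefix ('unet' or 'text_encoder') and strip the prefix from the keys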
+ state_dict = dict()
+ network_alphas = dict()
+
+ for path in self.paths:
+ keys = list(self.state_dict[path].keys())
+ if all(key.startswith(('unet', 'text_encoder')) for key in keys):
+ keys = [k for k in keys if k.startswith(prefix)]
+ if keys:
+ print(f"Processing {prefix} LoRA: {path}")
+ state_dict[path] = {k.replace(f"{prefix}.", ""): v for k, v in self.state_dict[path].items() if k in keys}
+
+ network_alphas[path] = None
+ if path in self.network_alphas and self.network_alphas[path] is not None:
+ alpha_keys = [k for k in self.network_alphas[path].keys() if k.startswith(prefix)]
+ network_alphas[path] = {
+ k.replace(f"{prefix}.", ""): v for k, v in self.network_alphas[path].items() if k in alpha_keys
+ }
+
+ else:
+ # Otherwise, we're dealing with the old format.
+ warn_message = "You have saved the LoRA weights using the old format. To convert LoRA weights to the new format, first load them in a dictionary and then create a new dictionary as follows: `new_state_dict = {f'unet.{module_name}': params for module_name, params in old_state_dict.items()}`."
+ print(warn_message)
+
+ return state_dict, network_alphas
+
class BaseModel():
- def __init__(
- self,
- hf_token,
- fp16=False,
+ def __init__(self,
+ version='1.5',
+ pipeline=None,
device='cuda',
+ hf_token='',
verbose=True,
- path="",
+ framework_model_dir='pytorch_model',
+ fp16=False,
+ int8=False,
max_batch_size=16,
- embedding_dim=768,
text_maxlen=77,
+ embedding_dim=768,
):
- self.name = "SD Model"
- self.hf_token = hf_token
- self.fp16 = fp16
+
+ self.name = self.__class__.__name__
+ self.pipeline = pipeline.name
+ self.version = version
+ self.path = get_path(version, pipeline)
self.device = device
+ self.hf_token = hf_token
+ self.hf_safetensor = not (pipeline.is_inpaint() and version in ("1.4", "1.5"))
self.verbose = verbose
- self.path = path
+ self.framework_model_dir = framework_model_dir
+
+ self.fp16 = fp16
+ self.int8 = int8
self.min_batch = 1
self.max_batch = max_batch_size
@@ -130,10 +329,22 @@ def __init__(
self.min_latent_shape = self.min_image_shape // 8
self.max_latent_shape = self.max_image_shape // 8
- self.embedding_dim = embedding_dim
self.text_maxlen = text_maxlen
+ self.embedding_dim = embedding_dim
+ self.extra_output_names = []
+
+ self.lora_dict = None
+
+ def get_pipeline(self):
+ model_opts = {'variant': 'fp16', 'torch_dtype': torch.float16} if self.fp16 else {}
+ return DiffusionPipeline.from_pretrained(
+ self.path,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token,
+ **model_opts,
+ ).to(self.device)
- def get_model(self):
+ def get_model(self, torch_inference=''):
pass
def get_input_names(self):
@@ -145,7 +356,7 @@ def get_output_names(self):
def get_dynamic_axes(self):
return None
- def get_sample_input(self, batch_size, image_height, image_width):
+ def get_sample_input(self, batch_size, image_height, image_width, static_shape):
pass
def get_input_profile(self, batch_size, image_height, image_width, static_batch, static_shape):
@@ -154,7 +365,108 @@ def get_input_profile(self, batch_size, image_height, image_width, static_batch,
def get_shape_dict(self, batch_size, image_height, image_width):
return None
- def optimize(self, onnx_graph):
+ # Helper utility for ONNX export
+ def export_onnx(
+ self,
+ onnx_path,
+ onnx_opt_path,
+ onnx_opset,
+ opt_image_height,
+ opt_image_width,
+ custom_model=None,
+ enable_lora_merge=False,
+ static_shape=False,
+ ):
+ onnx_opt_graph = None
+ # Export optimized ONNX model (if missing)
+ if not os.path.exists(onnx_opt_path):
+ if not os.path.exists(onnx_path):
+ print(f"[I] Exporting ONNX model: {onnx_path}")
+ def export_onnx(model):
+ if enable_lora_merge:
+ model = merge_loras(model, self.lora_dict, self.lora_alphas, self.lora_scales)
+ inputs = self.get_sample_input(1, opt_image_height, opt_image_width, static_shape)
+ torch.onnx.export(model,
+ inputs,
+ onnx_path,
+ export_params=True,
+ opset_version=onnx_opset,
+ do_constant_folding=True,
+ input_names=self.get_input_names(),
+ output_names=self.get_output_names(),
+ dynamic_axes=self.get_dynamic_axes(),
+ )
+ if custom_model:
+ with torch.inference_mode():
+ export_onnx(custom_model)
+ else:
+ with torch.inference_mode(), torch.autocast("cuda"):
+ export_onnx(self.get_model())
+ else:
+ print(f"[I] Found cached ONNX model: {onnx_path}")
+
+ print(f"[I] Optimizing ONNX model: {onnx_opt_path}")
+ onnx_opt_graph = self.optimize(onnx.load(onnx_path))
+ if onnx_opt_graph.ByteSize() > 2147483648:
+ onnx.save_model(
+ onnx_opt_graph,
+ onnx_opt_path,
+ save_as_external_data=True,
+ all_tensors_to_one_file=True,
+ convert_attribute=False)
+ else:
+ onnx.save(onnx_opt_graph, onnx_opt_path)
+ else:
+ print(f"[I] Found cached optimized ONNX model: {onnx_opt_path} ")
+
+ # Helper utility for weights map
+ def export_weights_map(self, onnx_opt_path, weights_map_path):
+ if not os.path.exists(weights_map_path):
+ onnx_opt_dir = os.path.dirname(onnx_opt_path)
+ onnx_opt_model = onnx.load(onnx_opt_path)
+ state_dict = self.get_model().state_dict()
+ # Create initializer data hashes
+ initializer_hash_mapping = {}
+ for initializer in onnx_opt_model.graph.initializer:
+ initializer_data = numpy_helper.to_array(initializer, base_dir=onnx_opt_dir).astype(np.float16)
+ initializer_hash = hash(initializer_data.data.tobytes())
+ initializer_hash_mapping[initializer.name] = (initializer_hash, initializer_data.shape)
+
+ weights_name_mapping = {}
+ weights_shape_mapping = {}
+ # set to keep track of initializers already added to the name_mapping dict
+ initializers_mapped = set()
+ for wt_name, wt in state_dict.items():
+ # get weight hash
+ wt = wt.cpu().detach().numpy().astype(np.float16)
+ wt_hash = hash(wt.data.tobytes())
+ wt_t_hash = hash(np.transpose(wt).data.tobytes())
+
+ for initializer_name, (initializer_hash, initializer_shape) in initializer_hash_mapping.items():
+ # Due to constant folding, some weights are transposed during export
+ # To account for the transpose op, we compare the initializer hash to the
+ # hash for the weight and its transpose
+ if wt_hash == initializer_hash or wt_t_hash == initializer_hash:
+ # The assert below ensures there is a 1:1 mapping between
+ # PyTorch and ONNX weight names. It can be removed in cases where 1:many
+ # mapping is found and name_mapping[wt_name] = list()
+ assert initializer_name not in initializers_mapped
+ weights_name_mapping[wt_name] = initializer_name
+ initializers_mapped.add(initializer_name)
+                    is_transpose = wt_hash != initializer_hash
+ weights_shape_mapping[wt_name] = (initializer_shape, is_transpose)
+
+ # Sanity check: Were any weights not matched
+ if wt_name not in weights_name_mapping:
+ print(f'[I] PyTorch weight {wt_name} not matched with any ONNX initializer')
+ print(f'[I] {len(weights_name_mapping.keys())} PyTorch weights were matched with ONNX initializers')
+ assert weights_name_mapping.keys() == weights_shape_mapping.keys()
+ with open(weights_map_path, 'w') as fp:
+ json.dump([weights_name_mapping, weights_shape_mapping], fp)
+ else:
+ print(f"[I] Found cached weights map: {weights_map_path} ")
+
+ def optimize(self, onnx_graph, return_onnx=True, **kwargs):
opt = Optimizer(onnx_graph, verbose=self.verbose)
opt.info(self.name + ': original')
opt.cleanup()
@@ -163,7 +475,10 @@ def optimize(self, onnx_graph):
opt.info(self.name + ': fold constants')
opt.infer_shapes()
opt.info(self.name + ': shape inference')
- onnx_opt_graph = opt.cleanup(return_onnx=True)
+ if kwargs.get('fuse_mha_qkv_int8', False):
+ opt.fuse_mha_qkv_int8_sq()
+ opt.info(self.name + ': fuse QKV nodes')
+ onnx_opt_graph = opt.cleanup(return_onnx=return_onnx)
opt.info(self.name + ': finished')
return onnx_opt_graph
@@ -191,28 +506,49 @@ def get_minmax_dims(self, batch_size, image_height, image_width, static_batch, s
max_latent_width = latent_width if static_shape else self.max_latent_shape
return (min_batch, max_batch, min_image_height, max_image_height, min_image_width, max_image_width, min_latent_height, max_latent_height, min_latent_width, max_latent_width)
-class CLIP(BaseModel):
+
+class CLIPModel(BaseModel):
def __init__(self,
- hf_token,
+ version,
+ pipeline,
device,
+ hf_token,
verbose,
- path,
+ framework_model_dir,
max_batch_size,
- embedding_dim
+ embedding_dim,
+ fp16=False,
+ output_hidden_states=False,
+ subfolder="text_encoder",
+ lora_dict=None,
+ lora_alphas=None,
):
- super(CLIP, self).__init__(hf_token, device=device, verbose=verbose, path=path, max_batch_size=max_batch_size, embedding_dim=embedding_dim)
- self.name = "CLIP"
+ super(CLIPModel, self).__init__(version, pipeline, device=device, hf_token=hf_token, verbose=verbose, framework_model_dir=framework_model_dir, fp16=fp16, max_batch_size=max_batch_size, embedding_dim=embedding_dim)
+ self.subfolder = subfolder
- def get_model(self):
- return CLIPTextModel.from_pretrained(self.path,
- subfolder="text_encoder",
- use_auth_token=self.hf_token).to(self.device)
+ # Output the final hidden state
+ if output_hidden_states:
+ self.extra_output_names = ['hidden_states']
+
+ def get_model(self, torch_inference=''):
+ clip_model_dir = get_checkpoint_dir(self.framework_model_dir, self.version, self.pipeline, self.subfolder, torch_inference)
+ if not os.path.exists(clip_model_dir):
+ model = CLIPTextModel.from_pretrained(self.path,
+ subfolder=self.subfolder,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token).to(self.device)
+ model.save_pretrained(clip_model_dir)
+ else:
+ print(f"[I] Load CLIP pytorch model from: {clip_model_dir}")
+ model = CLIPTextModel.from_pretrained(clip_model_dir).to(self.device)
+ model = optimize_checkpoint(model, torch_inference)
+ return model
def get_input_names(self):
return ['input_ids']
def get_output_names(self):
- return ['text_embeddings', 'pooler_output']
+ return ['text_embeddings']
def get_dynamic_axes(self):
return {
@@ -229,12 +565,15 @@ def get_input_profile(self, batch_size, image_height, image_width, static_batch,
def get_shape_dict(self, batch_size, image_height, image_width):
self.check_dims(batch_size, image_height, image_width)
- return {
+ output = {
'input_ids': (batch_size, self.text_maxlen),
'text_embeddings': (batch_size, self.text_maxlen, self.embedding_dim)
}
+ if 'hidden_states' in self.extra_output_names:
+ output["hidden_states"] = (batch_size, self.text_maxlen, self.embedding_dim)
+ return output
- def get_sample_input(self, batch_size, image_height, image_width):
+ def get_sample_input(self, batch_size, image_height, image_width, static_shape):
self.check_dims(batch_size, image_height, image_width)
return torch.zeros(batch_size, self.text_maxlen, dtype=torch.int32, device=self.device)
@@ -251,97 +590,389 @@ def optimize(self, onnx_graph):
opt.select_outputs([0], names=['text_embeddings']) # rename network output
opt.info(self.name + ': remove output[0]')
opt_onnx_graph = opt.cleanup(return_onnx=True)
+ if 'hidden_states' in self.extra_output_names:
+ opt_onnx_graph = opt.clip_add_hidden_states(return_onnx=True)
+ opt.info(self.name + ': added hidden_states')
opt.info(self.name + ': finished')
return opt_onnx_graph
-def make_CLIP(version, hf_token, device, verbose, max_batch_size, inpaint=False):
- return CLIP(hf_token=hf_token, device=device, verbose=verbose, path=get_path(version, inpaint=inpaint),
- max_batch_size=max_batch_size, embedding_dim=get_embedding_dim(version))
-class UNet(BaseModel):
+class CLIPWithProjModel(CLIPModel):
def __init__(self,
+ version,
+ pipeline,
+ device,
hf_token,
+ verbose,
+ framework_model_dir,
fp16=False,
- device='cuda',
- verbose=True,
- path="",
max_batch_size=16,
- embedding_dim=768,
- text_maxlen=77,
- unet_dim=4
+ output_hidden_states=False,
+ subfolder="text_encoder_2",
+ lora_dict=None,
+ lora_alphas=None,
):
- super(UNet, self).__init__(hf_token, fp16=fp16, device=device, verbose=verbose, path=path, max_batch_size=max_batch_size, embedding_dim=embedding_dim, text_maxlen=text_maxlen)
- self.unet_dim = unet_dim
- self.name = "UNet"
-
- def get_model(self):
- model_opts = {'revision': 'fp16', 'torch_dtype': torch.float16} if self.fp16 else {}
- return UNet2DConditionModel.from_pretrained(self.path,
- subfolder="unet",
- use_auth_token=self.hf_token,
- **model_opts).to(self.device)
+
+ super(CLIPWithProjModel, self).__init__(version, pipeline, device=device, hf_token=hf_token, verbose=verbose, framework_model_dir=framework_model_dir, fp16=fp16, max_batch_size=max_batch_size, embedding_dim=get_clipwithproj_embedding_dim(version, pipeline), output_hidden_states=output_hidden_states)
+ self.subfolder = subfolder
+
+ def get_model(self, torch_inference=''):
+ clip_model_dir = get_checkpoint_dir(self.framework_model_dir, self.version, self.pipeline, self.subfolder, torch_inference)
+ if not os.path.exists(clip_model_dir):
+ model = CLIPTextModelWithProjection.from_pretrained(self.path,
+ subfolder=self.subfolder,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token).to(self.device)
+ model.save_pretrained(clip_model_dir)
+ else:
+ print(f"[I] Load CLIP pytorch model from: {clip_model_dir}")
+ model = CLIPTextModelWithProjection.from_pretrained(clip_model_dir).to(self.device)
+ model = optimize_checkpoint(model, torch_inference)
+ return model
+
+ def get_shape_dict(self, batch_size, image_height, image_width):
+ self.check_dims(batch_size, image_height, image_width)
+ output = {
+ 'input_ids': (batch_size, self.text_maxlen),
+ 'text_embeddings': (batch_size, self.embedding_dim)
+ }
+ if 'hidden_states' in self.extra_output_names:
+ output["hidden_states"] = (batch_size, self.text_maxlen, self.embedding_dim)
+
+ return output
+
+
+class UNet2DConditionControlNetModel(torch.nn.Module):
+ def __init__(self, unet, controlnets) -> None:
+ super().__init__()
+ self.unet = unet
+ self.controlnets = controlnets
+
+ def forward(self, sample, timestep, encoder_hidden_states, images, controlnet_scales):
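+        # Run every ControlNet, scale its residuals by the matching conditioning strength, accumulate them, and pass the sums to the UNet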
+ for i, (image, conditioning_scale, controlnet) in enumerate(zip(images, controlnet_scales, self.controlnets)):
+ down_samples, mid_sample = controlnet(
+ sample,
+ timestep,
+ encoder_hidden_states=encoder_hidden_states,
+ controlnet_cond=image,
+ return_dict=False,
+ )
+
+ down_samples = [
+ down_sample * conditioning_scale
+ for down_sample in down_samples
+ ]
+ mid_sample *= conditioning_scale
+
+ # merge samples
+ if i == 0:
+ down_block_res_samples, mid_block_res_sample = down_samples, mid_sample
+ else:
+ down_block_res_samples = [
+ samples_prev + samples_curr
+ for samples_prev, samples_curr in zip(down_block_res_samples, down_samples)
+ ]
+ mid_block_res_sample += mid_sample
+
+ noise_pred = self.unet(
+ sample,
+ timestep,
+ encoder_hidden_states=encoder_hidden_states,
+ down_block_additional_residuals=down_block_res_samples,
+ mid_block_additional_residual=mid_block_res_sample
+ )
+ return noise_pred
+
+
+class UNetModel(BaseModel):
+ def __init__(self,
+ version,
+ pipeline,
+ device,
+ hf_token,
+ verbose,
+ framework_model_dir,
+ fp16 = False,
+ int8 = False,
+ max_batch_size = 16,
+ text_maxlen = 77,
+ controlnets = None,
+ lora_scales = None,
+ lora_dict = None,
+ lora_alphas = None,
+ do_classifier_free_guidance = False,
+ ):
+
+ super(UNetModel, self).__init__(version, pipeline, device=device, hf_token=hf_token, verbose=verbose, framework_model_dir=framework_model_dir, fp16=fp16, max_batch_size=max_batch_size, text_maxlen=text_maxlen, embedding_dim=get_unet_embedding_dim(version, pipeline))
+ self.subfolder = 'unet'
+ self.controlnets = get_path(version, pipeline, controlnets) if controlnets else None
+ self.unet_dim = (9 if pipeline.is_inpaint() else 4)
+ self.lora_scales = lora_scales
+ self.lora_dict = lora_dict
+ self.lora_alphas = lora_alphas
+ self.xB = 2 if do_classifier_free_guidance else 1 # batch multiplier
+
+ def get_model(self, torch_inference=''):
+ model_opts = {'variant': 'fp16', 'torch_dtype': torch.float16} if self.fp16 else {}
+ if self.controlnets:
+ unet_model = UNet2DConditionModel.from_pretrained(self.path,
+ subfolder=self.subfolder,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token,
+ **model_opts).to(self.device)
+ cnet_model_opts = {'torch_dtype': torch.float16} if self.fp16 else {}
+ controlnets = torch.nn.ModuleList([ControlNetModel.from_pretrained(path, **cnet_model_opts).to(self.device) for path in self.controlnets])
+ # FIXME - cache UNet2DConditionControlNetModel
+ model = UNet2DConditionControlNetModel(unet_model, controlnets)
+ else:
+ unet_model_dir = get_checkpoint_dir(self.framework_model_dir, self.version, self.pipeline, self.subfolder, torch_inference)
+ if not os.path.exists(unet_model_dir):
+ model = UNet2DConditionModel.from_pretrained(self.path,
+ subfolder=self.subfolder,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token,
+ **model_opts).to(self.device)
+ model.save_pretrained(unet_model_dir)
+ else:
+ print(f"[I] Load UNet pytorch model from: {unet_model_dir}")
+ model = UNet2DConditionModel.from_pretrained(unet_model_dir).to(self.device)
+ if torch_inference:
+ model.to(memory_format=torch.channels_last)
+ model = optimize_checkpoint(model, torch_inference)
+ return model
def get_input_names(self):
- return ['sample', 'timestep', 'encoder_hidden_states']
+ if self.controlnets is None:
+ return ['sample', 'timestep', 'encoder_hidden_states']
+ else:
+ return ['sample', 'timestep', 'encoder_hidden_states', 'images', 'controlnet_scales']
def get_output_names(self):
return ['latent']
def get_dynamic_axes(self):
+ xB = '2B' if self.xB == 2 else 'B'
+ if self.controlnets is None:
+ return {
+ 'sample': {0: xB, 2: 'H', 3: 'W'},
+ 'encoder_hidden_states': {0: xB},
+ 'latent': {0: xB, 2: 'H', 3: 'W'}
+ }
+ else:
+ return {
+ 'sample': {0: xB, 2: 'H', 3: 'W'},
+ 'encoder_hidden_states': {0: xB},
+ 'images': {1: xB, 3: '8H', 4: '8W'},
+ 'latent': {0: xB, 2: 'H', 3: 'W'}
+ }
+
+ def get_input_profile(self, batch_size, image_height, image_width, static_batch, static_shape):
+ # WAR to enable inference for H/W that are not multiples of 16
+ # If building with Dynamic Shapes: ensure image height and width are not multiples of 16 for ONNX export and TensorRT engine build
+ if not static_shape:
+ image_height = image_height - 8 if image_height % 16 == 0 else image_height
+ image_width = image_width - 8 if image_width % 16 == 0 else image_width
+ latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
+ min_batch, max_batch, min_image_height, max_image_height, min_image_width, max_image_width, min_latent_height, max_latent_height, min_latent_width, max_latent_width = \
+ self.get_minmax_dims(batch_size, image_height, image_width, static_batch, static_shape)
+ if self.controlnets is None:
+ return {
+ 'sample': [(self.xB*min_batch, self.unet_dim, min_latent_height, min_latent_width), (self.xB*batch_size, self.unet_dim, latent_height, latent_width), (self.xB*max_batch, self.unet_dim, max_latent_height, max_latent_width)],
+ 'encoder_hidden_states': [(self.xB*min_batch, self.text_maxlen, self.embedding_dim), (self.xB*batch_size, self.text_maxlen, self.embedding_dim), (self.xB*max_batch, self.text_maxlen, self.embedding_dim)]
+ }
+ else:
+ return {
+ 'sample': [(self.xB*min_batch, self.unet_dim, min_latent_height, min_latent_width),
+ (self.xB*batch_size, self.unet_dim, latent_height, latent_width),
+ (self.xB*max_batch, self.unet_dim, max_latent_height, max_latent_width)],
+ 'encoder_hidden_states': [(self.xB*min_batch, self.text_maxlen, self.embedding_dim),
+ (self.xB*batch_size, self.text_maxlen, self.embedding_dim),
+ (self.xB*max_batch, self.text_maxlen, self.embedding_dim)],
+ 'images': [(len(self.controlnets), self.xB*min_batch, 3, min_image_height, min_image_width),
+ (len(self.controlnets), self.xB*batch_size, 3, image_height, image_width),
+ (len(self.controlnets), self.xB*max_batch, 3, max_image_height, max_image_width)]
+ }
+
+
+ def get_shape_dict(self, batch_size, image_height, image_width):
+ latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
+ if self.controlnets is None:
+ return {
+ 'sample': (self.xB*batch_size, self.unet_dim, latent_height, latent_width),
+ 'encoder_hidden_states': (self.xB*batch_size, self.text_maxlen, self.embedding_dim),
+ 'latent': (self.xB*batch_size, 4, latent_height, latent_width)
+ }
+ else:
+ return {
+ 'sample': (self.xB*batch_size, self.unet_dim, latent_height, latent_width),
+ 'encoder_hidden_states': (self.xB*batch_size, self.text_maxlen, self.embedding_dim),
+ 'images': (len(self.controlnets), self.xB*batch_size, 3, image_height, image_width),
+ 'latent': (self.xB*batch_size, 4, latent_height, latent_width)
+ }
+
+ def get_sample_input(self, batch_size, image_height, image_width, static_shape):
+ # WAR to enable inference for H/W that are not multiples of 16
+ # If building with Dynamic Shapes: ensure image height and width are not multiples of 16 for ONNX export and TensorRT engine build
+ if not static_shape:
+ image_height = image_height - 8 if image_height % 16 == 0 else image_height
+ image_width = image_width - 8 if image_width % 16 == 0 else image_width
+ latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
+ dtype = torch.float16 if self.fp16 else torch.float32
+ if self.controlnets is None:
+ return (
+ torch.randn(batch_size, self.unet_dim, latent_height, latent_width, dtype=torch.float32, device=self.device),
+ torch.tensor([1.], dtype=torch.float32, device=self.device),
+ torch.randn(batch_size, self.text_maxlen, self.embedding_dim, dtype=dtype, device=self.device)
+ )
+ else:
+ return (
+ torch.randn(batch_size, self.unet_dim, latent_height, latent_width, dtype=torch.float32, device=self.device),
+ torch.tensor(999, dtype=torch.float32, device=self.device),
+ torch.randn(batch_size, self.text_maxlen, self.embedding_dim, dtype=dtype, device=self.device),
+ torch.randn(len(self.controlnets), batch_size, 3, image_height, image_width, dtype=dtype, device=self.device),
+ torch.randn(len(self.controlnets), dtype=dtype, device=self.device)
+ )
+
+
+class UNetXLModel(BaseModel):
+ def __init__(self,
+ version,
+ pipeline,
+ device,
+ hf_token,
+ verbose,
+ framework_model_dir,
+ fp16 = False,
+ int8 = False,
+ max_batch_size = 16,
+ text_maxlen = 77,
+ lora_scales = None,
+ lora_dict = None,
+ lora_alphas = None,
+ do_classifier_free_guidance = False,
+ ):
+ super(UNetXLModel, self).__init__(version, pipeline, device=device, hf_token=hf_token, verbose=verbose, framework_model_dir=framework_model_dir, fp16=fp16, max_batch_size=max_batch_size, text_maxlen=text_maxlen, embedding_dim=get_unet_embedding_dim(version, pipeline))
+ self.subfolder = 'unet'
+ self.unet_dim = (9 if pipeline.is_inpaint() else 4)
+ self.time_dim = (5 if pipeline.is_sd_xl_refiner() else 6)
+ self.lora_scales = lora_scales
+ self.lora_dict = lora_dict
+ self.lora_alphas = lora_alphas
+ self.xB = 2 if do_classifier_free_guidance else 1 # batch multiplier
+
+ def get_model(self, torch_inference=''):
+ model_opts = {'variant': 'fp16', 'torch_dtype': torch.float16} if self.fp16 else {}
+ unet_model_dir = get_checkpoint_dir(self.framework_model_dir, self.version, self.pipeline, self.subfolder, torch_inference)
+ if not os.path.exists(unet_model_dir):
+ model = UNet2DConditionModel.from_pretrained(self.path,
+ subfolder=self.subfolder,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token,
+ **model_opts).to(self.device)
+ # Use default attention processor for ONNX export
+ if not torch_inference:
+ model.set_default_attn_processor()
+ model.save_pretrained(unet_model_dir)
+ else:
+ print(f"[I] Load UNet pytorch model from: {unet_model_dir}")
+ model_load_opts = {'torch_dtype': torch.float16} if self.fp16 else {}
+ model = UNet2DConditionModel.from_pretrained(unet_model_dir, **model_load_opts).to(self.device)
+ model = optimize_checkpoint(model, torch_inference)
+ return model
+
+ def get_input_names(self):
+ return ['sample', 'timestep', 'encoder_hidden_states', 'text_embeds', 'time_ids']
+
+ def get_output_names(self):
+ return ['latent']
+
+ def get_dynamic_axes(self):
+ xB = '2B' if self.xB == 2 else 'B'
return {
- 'sample': {0: '2B', 2: 'H', 3: 'W'},
- 'encoder_hidden_states': {0: '2B'},
- 'latent': {0: '2B', 2: 'H', 3: 'W'}
+ 'sample': {0: xB, 2: 'H', 3: 'W'},
+ 'encoder_hidden_states': {0: xB},
+ 'latent': {0: xB, 2: 'H', 3: 'W'},
+ 'text_embeds': {0: xB},
+ 'time_ids': {0: xB}
}
def get_input_profile(self, batch_size, image_height, image_width, static_batch, static_shape):
+ # WAR to enable inference for H/W that are not multiples of 16
+ # If building with Dynamic Shapes: ensure image height and width are not multiples of 16 for ONNX export and TensorRT engine build
+ if not static_shape:
+ image_height = image_height - 8 if image_height % 16 == 0 else image_height
+ image_width = image_width - 8 if image_width % 16 == 0 else image_width
latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
min_batch, max_batch, _, _, _, _, min_latent_height, max_latent_height, min_latent_width, max_latent_width = \
self.get_minmax_dims(batch_size, image_height, image_width, static_batch, static_shape)
return {
- 'sample': [(2*min_batch, self.unet_dim, min_latent_height, min_latent_width), (2*batch_size, self.unet_dim, latent_height, latent_width), (2*max_batch, self.unet_dim, max_latent_height, max_latent_width)],
- 'encoder_hidden_states': [(2*min_batch, self.text_maxlen, self.embedding_dim), (2*batch_size, self.text_maxlen, self.embedding_dim), (2*max_batch, self.text_maxlen, self.embedding_dim)]
+ 'sample': [(self.xB*min_batch, self.unet_dim, min_latent_height, min_latent_width), (self.xB*batch_size, self.unet_dim, latent_height, latent_width), (self.xB*max_batch, self.unet_dim, max_latent_height, max_latent_width)],
+ 'encoder_hidden_states': [(self.xB*min_batch, self.text_maxlen, self.embedding_dim), (self.xB*batch_size, self.text_maxlen, self.embedding_dim), (self.xB*max_batch, self.text_maxlen, self.embedding_dim)],
+ 'text_embeds': [(self.xB*min_batch, 1280), (self.xB*batch_size, 1280), (self.xB*max_batch, 1280)],
+ 'time_ids': [(self.xB*min_batch, self.time_dim), (self.xB*batch_size, self.time_dim), (self.xB*max_batch, self.time_dim)]
}
def get_shape_dict(self, batch_size, image_height, image_width):
latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
return {
- 'sample': (2*batch_size, self.unet_dim, latent_height, latent_width),
- 'encoder_hidden_states': (2*batch_size, self.text_maxlen, self.embedding_dim),
- 'latent': (2*batch_size, 4, latent_height, latent_width)
+ 'sample': (self.xB*batch_size, self.unet_dim, latent_height, latent_width),
+ 'encoder_hidden_states': (self.xB*batch_size, self.text_maxlen, self.embedding_dim),
+ 'latent': (self.xB*batch_size, 4, latent_height, latent_width),
+ 'text_embeds': (self.xB*batch_size, 1280),
+ 'time_ids': (self.xB*batch_size, self.time_dim)
}
- def get_sample_input(self, batch_size, image_height, image_width):
+ def get_sample_input(self, batch_size, image_height, image_width, static_shape):
+ # WAR to enable inference for H/W that are not multiples of 16
+ # If building with Dynamic Shapes: ensure image height and width are not multiples of 16 for ONNX export and TensorRT engine build
+ if not static_shape:
+ image_height = image_height - 8 if image_height % 16 == 0 else image_height
+ image_width = image_width - 8 if image_width % 16 == 0 else image_width
latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
dtype = torch.float16 if self.fp16 else torch.float32
return (
- torch.randn(2*batch_size, self.unet_dim, latent_height, latent_width, dtype=torch.float32, device=self.device),
+ torch.randn(self.xB*batch_size, self.unet_dim, latent_height, latent_width, dtype=torch.float32, device=self.device),
torch.tensor([1.], dtype=torch.float32, device=self.device),
- torch.randn(2*batch_size, self.text_maxlen, self.embedding_dim, dtype=dtype, device=self.device)
+ torch.randn(self.xB*batch_size, self.text_maxlen, self.embedding_dim, dtype=dtype, device=self.device),
+ {
+ 'added_cond_kwargs': {
+ 'text_embeds': torch.randn(self.xB*batch_size, 1280, dtype=dtype, device=self.device),
+ 'time_ids' : torch.randn(self.xB*batch_size, self.time_dim, dtype=dtype, device=self.device)
+ }
+ }
)
-def make_UNet(version, hf_token, device, verbose, max_batch_size, inpaint=False):
- return UNet(hf_token=hf_token, fp16=True, device=device, verbose=verbose, path=get_path(version, inpaint=inpaint),
- max_batch_size=max_batch_size, embedding_dim=get_embedding_dim(version), unet_dim=(9 if inpaint else 4))
+ def optimize(self, onnx_graph):
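+        # Request Q/K/V quantizer fusion on top of the base optimizations; the fusion only matches int8 QDQ tensors, so non-quantized graphs pass through unchanged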
+ return super().optimize(onnx_graph, fuse_mha_qkv_int8=True)
-class VAE(BaseModel):
+class VAEModel(BaseModel):
def __init__(self,
- hf_token,
+ version,
+ pipeline,
device,
+ hf_token,
verbose,
- path,
- max_batch_size,
- embedding_dim
+ framework_model_dir,
+ fp16=False,
+ max_batch_size=16,
):
- super(VAE, self).__init__(hf_token, device=device, verbose=verbose, path=path, max_batch_size=max_batch_size, embedding_dim=embedding_dim)
- self.name = "VAE decoder"
+ super(VAEModel, self).__init__(version, pipeline, device=device, hf_token=hf_token, verbose=verbose, framework_model_dir=framework_model_dir, fp16=fp16, max_batch_size=max_batch_size)
+ self.subfolder = 'vae'
- def get_model(self):
- vae = AutoencoderKL.from_pretrained(self.path,
- subfolder="vae",
- use_auth_token=self.hf_token).to(self.device)
- vae.forward = vae.decode
- return vae
+ def get_model(self, torch_inference=''):
+ vae_decoder_model_path = get_checkpoint_dir(self.framework_model_dir, self.version, self.pipeline, self.subfolder, torch_inference)
+ if not os.path.exists(vae_decoder_model_path):
+ model = AutoencoderKL.from_pretrained(self.path,
+ subfolder=self.subfolder,
+ use_safetensors=self.hf_safetensor,
+ use_auth_token=self.hf_token).to(self.device)
+ model.save_pretrained(vae_decoder_model_path)
+ else:
+ print(f"[I] Load VAE decoder pytorch model from: {vae_decoder_model_path}")
+ model = AutoencoderKL.from_pretrained(vae_decoder_model_path).to(self.device)
+ model.forward = model.decode
+ model = optimize_checkpoint(model, torch_inference)
+ return model
def get_input_names(self):
return ['latent']
@@ -370,37 +1001,44 @@ def get_shape_dict(self, batch_size, image_height, image_width):
'images': (batch_size, 3, image_height, image_width)
}
- def get_sample_input(self, batch_size, image_height, image_width):
+ def get_sample_input(self, batch_size, image_height, image_width, static_shape):
latent_height, latent_width = self.check_dims(batch_size, image_height, image_width)
return torch.randn(batch_size, 4, latent_height, latent_width, dtype=torch.float32, device=self.device)
-def make_VAE(version, hf_token, device, verbose, max_batch_size, inpaint=False):
- return VAE(hf_token=hf_token, device=device, verbose=verbose, path=get_path(version, inpaint=inpaint),
- max_batch_size=max_batch_size, embedding_dim=get_embedding_dim(version))
class TorchVAEEncoder(torch.nn.Module):
- def __init__(self, token, device, path):
+ def __init__(self, version, pipeline, hf_token, device, path, framework_model_dir, hf_safetensor=False):
super().__init__()
- self.path = path
- self.vae_encoder = AutoencoderKL.from_pretrained(self.path, subfolder="vae", use_auth_token=token).to(device)
-
+ vae_encoder_model_dir = get_checkpoint_dir(framework_model_dir, version, pipeline, 'vae_encoder', '')
+ if not os.path.exists(vae_encoder_model_dir):
+ self.vae_encoder = AutoencoderKL.from_pretrained(path,
+ subfolder='vae',
+ use_safetensors=hf_safetensor,
+ use_auth_token=hf_token).to(device)
+ self.vae_encoder.save_pretrained(vae_encoder_model_dir)
+ else:
+ print(f"[I] Load VAE encoder pytorch model from: {vae_encoder_model_dir}")
+ self.vae_encoder = AutoencoderKL.from_pretrained(vae_encoder_model_dir).to(device)
+
def forward(self, x):
return self.vae_encoder.encode(x).latent_dist.sample()
-class VAEEncoder(BaseModel):
+
+class VAEEncoderModel(BaseModel):
def __init__(self,
- hf_token,
+ version,
+ pipeline,
device,
+ hf_token,
verbose,
- path,
- max_batch_size,
- embedding_dim
+ framework_model_dir,
+ fp16=False,
+ max_batch_size=16,
):
- super(VAEEncoder, self).__init__(hf_token, device=device, verbose=verbose, path=path, max_batch_size=max_batch_size, embedding_dim=embedding_dim)
- self.name = "VAE encoder"
+ super(VAEEncoderModel, self).__init__(version, pipeline, device=device, hf_token=hf_token, verbose=verbose, framework_model_dir=framework_model_dir, fp16=fp16, max_batch_size=max_batch_size)
- def get_model(self):
- vae_encoder = TorchVAEEncoder(self.hf_token, self.device, self.path)
+ def get_model(self, torch_inference=''):
+ vae_encoder = TorchVAEEncoder(self.version, self.pipeline, self.hf_token, self.device, self.path, self.framework_model_dir, hf_safetensor=self.hf_safetensor)
return vae_encoder
def get_input_names(self):
@@ -434,15 +1072,20 @@ def get_shape_dict(self, batch_size, image_height, image_width):
'latent': (batch_size, 4, latent_height, latent_width)
}
- def get_sample_input(self, batch_size, image_height, image_width):
+ def get_sample_input(self, batch_size, image_height, image_width, static_shape):
self.check_dims(batch_size, image_height, image_width)
return torch.randn(batch_size, 3, image_height, image_width, dtype=torch.float32, device=self.device)
-def make_VAEEncoder(version, hf_token, device, verbose, max_batch_size, inpaint=False):
- return VAEEncoder(hf_token=hf_token, device=device, verbose=verbose, path=get_path(version, inpaint=inpaint),
- max_batch_size=max_batch_size, embedding_dim=get_embedding_dim(version))
-def make_tokenizer(version, hf_token):
- return CLIPTokenizer.from_pretrained(get_path(version),
- subfolder="tokenizer",
- use_auth_token=hf_token)
+def make_tokenizer(version, pipeline, hf_token, framework_model_dir, subfolder="tokenizer", **kwargs):
+ tokenizer_model_dir = get_checkpoint_dir(framework_model_dir, version, pipeline.name, subfolder, '')
+ if not os.path.exists(tokenizer_model_dir):
+ model = CLIPTokenizer.from_pretrained(get_path(version, pipeline),
+ subfolder=subfolder,
+ use_safetensors=pipeline.is_sd_xl(),
+ use_auth_token=hf_token)
+ model.save_pretrained(tokenizer_model_dir)
+ else:
+ print(f"[I] Load tokenizer pytorch model from: {tokenizer_model_dir}")
+ model = CLIPTokenizer.from_pretrained(tokenizer_model_dir)
+ return model
diff --git a/demo/Diffusion/requirements.txt b/demo/Diffusion/requirements.txt
index df5a12a6..4de26381 100644
--- a/demo/Diffusion/requirements.txt
+++ b/demo/Diffusion/requirements.txt
@@ -1,15 +1,17 @@
accelerate
colored
+controlnet_aux==0.0.6
cuda-python
-diffusers==0.14.0
+diffusers==0.26.3
ftfy
matplotlib
nvtx
-onnx==1.13.1
-onnxruntime==1.14.1
---extra-index-url https://pypi.ngc.nvidia.com
-onnx-graphsurgeon==0.3.26
-polygraphy==0.47.1
+onnx==1.15.0
+onnxruntime==1.17.0
+opencv-python==4.8.0.74
scipy
-torch<2.0.0
-transformers==4.26.1
+transformers==4.31.0
+--extra-index-url https://pypi.nvidia.com
+nvidia-ammo==0.7.0
+onnx-graphsurgeon
+polygraphy
diff --git a/demo/Diffusion/stable_diffusion_pipeline.py b/demo/Diffusion/stable_diffusion_pipeline.py
index 7632995a..13bd4156 100755
--- a/demo/Diffusion/stable_diffusion_pipeline.py
+++ b/demo/Diffusion/stable_diffusion_pipeline.py
@@ -15,30 +15,69 @@
# limitations under the License.
#
+import ammo.torch.quantization as atq
+import calibration
from cuda import cudart
-import gc
-from models import make_CLIP, make_tokenizer, make_UNet, make_VAE, make_VAEEncoder
+from diffusers import (
+ DDIMScheduler,
+ DDPMScheduler,
+ EulerDiscreteScheduler,
+ EulerAncestralDiscreteScheduler,
+ LCMScheduler, LMSDiscreteScheduler,
+ PNDMScheduler,
+ UniPCMultistepScheduler,
+)
+from hashlib import md5
+import inspect
+from models import (
+ get_clip_embedding_dim,
+ get_path,
+ LoraLoader,
+ make_tokenizer,
+ CLIPModel,
+ CLIPWithProjModel,
+ UNetModel,
+ UNetXLModel,
+ VAEModel,
+ VAEEncoderModel,
+)
import numpy as np
import nvtx
-import os
+import json
import onnx
-from polygraphy import cuda
+import os
+import pathlib
+import tensorrt as trt
+import time
import torch
-from utilities import Engine, save_image
-from utilities import DPMScheduler, DDIMScheduler, EulerAncestralDiscreteScheduler, LMSDiscreteScheduler, PNDMScheduler
+from typing import Optional, List
+from utilities import (
+ PIPELINE_TYPE,
+ TRT_LOGGER,
+ Engine,
+ filter_func,
+ get_smoothquant_config,
+ get_refit_weights,
+ load_calib_prompts,
+ merge_loras,
+ prepare_mask_and_masked_image,
+ quantize_lvl,
+ replace_lora_layers,
+ save_image,
+ unload_model
+)
class StableDiffusionPipeline:
"""
- Application showcasing the acceleration of Stable Diffusion Txt2Img v1.4, v1.5, v2.0-base, v2.0, v2.1, v2.1-base pipeline using NVidia TensorRT w/ Plugins.
+ Application showcasing the acceleration of Stable Diffusion pipelines using NVIDIA TensorRT.
"""
def __init__(
self,
- version="2.1",
- inpaint=False,
- stages=['clip','unet','vae'],
+ version='1.5',
+ pipeline_type=PIPELINE_TYPE.TXT2IMG,
max_batch_size=16,
denoising_steps=50,
- scheduler="DDIM",
+ scheduler=None,
guidance_scale=7.5,
device='cuda',
output_dir='.',
@@ -46,6 +85,13 @@ def __init__(
verbose=False,
nvtx_profile=False,
use_cuda_graph=False,
+ vae_scaling_factor=0.18215,
+ framework_model_dir='pytorch_model',
+ controlnets=None,
+ lora_scale: Optional[List[float]] = None,
+ lora_path: Optional[List[str]] = None,
+ return_latents=False,
+ torch_inference='',
):
"""
Initializes the Diffusion pipeline.
@@ -53,17 +99,15 @@ def __init__(
Args:
version (str):
The version of the pipeline. Should be one of [1.4, 1.5, 2.0, 2.0-base, 2.1, 2.1-base]
- inpaint (bool):
- True if inpainting pipeline.
- stages (list):
- Ordered sequence of stages. Options: ['vae_encoder', 'clip','unet','vae']
+ pipeline_type (PIPELINE_TYPE):
+ Type of current pipeline.
max_batch_size (int):
Maximum batch size for dynamic batch engine.
denoising_steps (int):
The number of denoising steps.
More denoising steps usually lead to a higher quality image at the expense of slower inference.
scheduler (str):
- The scheduler to guide the denoising process. Must be one of [DDIM, DPM, EulerA, LMSD, PNDM].
+ The scheduler to guide the denoising process. Must be one of [DDIM, DDPM, EulerA, Euler, LCM, LMSD, PNDM, UniPC].
guidance_scale (float):
Guidance scale is enabled by setting as > 1.
Higher guidance scale encourages to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
@@ -79,126 +123,202 @@ def __init__(
Insert NVTX profiling markers.
use_cuda_graph (bool):
Use CUDA graph to capture engine execution and then launch inference
+ vae_scaling_factor (float):
+ VAE scaling factor
+ framework_model_dir (str):
+ cache directory for framework checkpoints
+ controlnets (str):
+ Which ControlNet/ControlNets to use.
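+ lora_scale (list):
+ Scale applied to each LoRA adapter, matched by index to lora_path.
+ lora_path (list):
+ Paths of the LoRA adapters to load.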
+ return_latents (bool):
+ Skip decoding the image and return latents instead.
+ torch_inference (str):
+ Run inference with PyTorch (using specified compilation mode) instead of TensorRT.
"""
self.denoising_steps = denoising_steps
- assert guidance_scale > 1.0
self.guidance_scale = guidance_scale
+ self.do_classifier_free_guidance = (guidance_scale > 1.0)
+ self.vae_scaling_factor = vae_scaling_factor
self.max_batch_size = max_batch_size
- # Limit the workspace size for systems with GPU memory larger
- # than 6 GiB to silence OOM warnings from TensorRT optimizer.
- _, free_mem, _ = cudart.cudaMemGetInfo()
- GiB = 2 ** 30
- if free_mem > 6*GiB:
- activation_carveout = 4*GiB
- self.max_workspace_size = free_mem - activation_carveout
- else:
- self.max_workspace_size = 0
-
+ self.framework_model_dir = framework_model_dir
self.output_dir = output_dir
+ for directory in [self.framework_model_dir, self.output_dir]:
+ if not os.path.exists(directory):
+ print(f"[I] Create directory: {directory}")
+ pathlib.Path(directory).mkdir(parents=True)
+
self.hf_token = hf_token
self.device = device
self.verbose = verbose
self.nvtx_profile = nvtx_profile
self.version = version
-
- # Schedule options
- sched_opts = {'num_train_timesteps': 1000, 'beta_start': 0.00085, 'beta_end': 0.012}
- if self.version in ("2.0", "2.1"):
- sched_opts['prediction_type'] = 'v_prediction'
+ self.controlnets = controlnets
+
+ # Pipeline type
+ self.pipeline_type = pipeline_type
+ if self.pipeline_type.is_txt2img() or self.pipeline_type.is_controlnet():
+ self.stages = ['clip','unet','vae']
+ elif self.pipeline_type.is_img2img() or self.pipeline_type.is_inpaint():
+ self.stages = ['vae_encoder', 'clip','unet','vae']
+ elif self.pipeline_type.is_sd_xl_base():
+ self.stages = ['clip', 'clip2', 'unetxl']
+ if not return_latents:
+ self.stages.append('vae')
+ elif self.pipeline_type.is_sd_xl_refiner():
+ self.stages = ['clip2', 'unetxl', 'vae']
else:
- sched_opts['prediction_type'] = 'epsilon'
+ raise ValueError(f"Unsupported pipeline {self.pipeline_type.name}.")
+ self.return_latents = return_latents
+
+ # Schedulers
+ map_version_scheduler = {
+ '1.4': 'PNDM',
+ '1.5': 'PNDM',
+ 'dreamshaper-7': 'PNDM',
+ '2.0-base': 'DDIM',
+ '2.0': 'DDIM',
+ '2.1-base': 'PNDM',
+ '2.1': 'DDIM',
+ 'xl-1.0' : 'Euler',
+ 'xl-turbo': 'EulerA'
+ }
+
+ if not scheduler:
+ scheduler = 'UniPC' if self.pipeline_type.is_controlnet() else map_version_scheduler.get(version, 'DDIM')
+ print(f"[I] Autoselected scheduler: {scheduler}")
+
+ def makeScheduler(cls, subfolder="scheduler", **kwargs):
+ return cls.from_pretrained(get_path(self.version, self.pipeline_type), subfolder=subfolder)
if scheduler == "DDIM":
- self.scheduler = DDIMScheduler(device=self.device, **sched_opts)
- elif scheduler == "DPM":
- self.scheduler = DPMScheduler(device=self.device, **sched_opts)
+ self.scheduler = makeScheduler(DDIMScheduler)
+ elif scheduler == "DDPM":
+ self.scheduler = makeScheduler(DDPMScheduler)
elif scheduler == "EulerA":
- self.scheduler = EulerAncestralDiscreteScheduler(device=self.device, **sched_opts)
+ self.scheduler = makeScheduler(EulerAncestralDiscreteScheduler)
+ elif scheduler == "Euler":
+ self.scheduler = makeScheduler(EulerDiscreteScheduler)
+ elif scheduler == "LCM":
+ self.scheduler = makeScheduler(LCMScheduler)
elif scheduler == "LMSD":
- self.scheduler = LMSDiscreteScheduler(device=self.device, **sched_opts)
+ self.scheduler = makeScheduler(LMSDiscreteScheduler)
elif scheduler == "PNDM":
- sched_opts["steps_offset"] = 1
- self.scheduler = PNDMScheduler(device=self.device, **sched_opts)
+ self.scheduler = makeScheduler(PNDMScheduler)
+ elif scheduler == "UniPC":
+ self.scheduler = makeScheduler(UniPCMultistepScheduler)
else:
- raise ValueError(f"Scheduler should be either DDIM, DPM, EulerA, LMSD or PNDM")
+ raise ValueError(f"Unsupported scheduler {scheduler}. Should be either DDIM, DDPM, EulerA, Euler, LCM, LMSD, PNDM, or UniPC.")
- self.stages = stages
- self.inpaint = inpaint
+ self.config = {}
+ if self.pipeline_type.is_sd_xl():
+ self.config['clip_hidden_states'] = True
+ self.torch_inference = torch_inference
self.use_cuda_graph = use_cuda_graph
- # initialized in loadResources()
- self.stream = None
- self.tokenizer = None
# initialized in loadEngines()
self.models = {}
+ self.torch_models = {}
self.engine = {}
self.shared_device_memory = None
+ # initialize lora loader and scales
+ self.lora_loader = None
+ self.lora_scales = dict()
+ if lora_path:
+ self.lora_loader = LoraLoader(lora_path)
+ assert len(lora_path) == len(lora_scale)
+ for i, path in enumerate(lora_path):
+ self.lora_scales[path] = lora_scale[i]
+
+ # initialized in loadResources()
+ self.events = {}
+ self.generator = None
+ self.markers = {}
+ self.seed = None
+ self.stream = None
+ self.tokenizer = None
+
def loadResources(self, image_height, image_width, batch_size, seed):
# Initialize noise generator
- self.generator = torch.Generator(device="cuda").manual_seed(seed) if seed else None
-
- # Pre-compute latent input scales and linear multistep coefficients
- self.scheduler.set_timesteps(self.denoising_steps)
- self.scheduler.configure()
+ if seed:
+ self.seed = seed
+ self.generator = torch.Generator(device="cuda").manual_seed(seed)
# Create CUDA events and stream
- self.events = {}
for stage in ['clip', 'denoise', 'vae', 'vae_encoder']:
- for marker in ['start', 'stop']:
- self.events[stage+'-'+marker] = cudart.cudaEventCreate()[1]
- self.stream = cuda.Stream()
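+ # Each stage keeps a [start, stop] pair of CUDA events used for latency reporting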
+ self.events[stage] = [cudart.cudaEventCreate()[1], cudart.cudaEventCreate()[1]]
+ self.stream = cudart.cudaStreamCreate()[1]
- # Allocate buffers for TensorRT engine bindings
- for model_name, obj in self.models.items():
- self.engine[model_name].allocate_buffers(shape_dict=obj.get_shape_dict(batch_size, image_height, image_width), device=self.device)
+ # Allocate TensorRT I/O buffers
+ if not self.torch_inference:
+ for model_name, obj in self.models.items():
+ self.engine[model_name].allocate_buffers(shape_dict=obj.get_shape_dict(batch_size, image_height, image_width), device=self.device)
def teardown(self):
for e in self.events.values():
- cudart.cudaEventDestroy(e)
+ cudart.cudaEventDestroy(e[0])
+ cudart.cudaEventDestroy(e[1])
for engine in self.engine.values():
del engine
if self.shared_device_memory:
- self.shared_device_memory.free()
+ cudart.cudaFree(self.shared_device_memory)
- self.stream.free()
+ cudart.cudaStreamDestroy(self.stream)
del self.stream
def cachedModelName(self, model_name):
- if self.inpaint:
+ if self.pipeline_type.is_inpaint():
model_name += '_inpaint'
return model_name
- def getOnnxPath(self, model_name, onnx_dir, opt=True):
- return os.path.join(onnx_dir, self.cachedModelName(model_name)+('.opt' if opt else '')+'.onnx')
+ def getOnnxPath(self, model_name, onnx_dir, opt=True, suffix=''):
+ onnx_model_dir = os.path.join(onnx_dir, self.cachedModelName(model_name)+suffix+('.opt' if opt else ''))
+ os.makedirs(onnx_model_dir, exist_ok=True)
+ return os.path.join(onnx_model_dir, 'model.onnx')
+
+ def getEnginePath(self, model_name, engine_dir, enable_refit=False, suffix=''):
+ return os.path.join(engine_dir, self.cachedModelName(model_name)+suffix+('.refit' if enable_refit else '')+'.trt'+trt.__version__+'.plan')
+
+ def getWeightsMapPath(self, model_name, onnx_dir):
+ onnx_model_dir = os.path.join(onnx_dir, self.cachedModelName(model_name)+'.opt')
+ os.makedirs(onnx_model_dir, exist_ok=True)
+ return os.path.join(onnx_model_dir, 'weights_map.json')
- def getEnginePath(self, model_name, engine_dir):
- return os.path.join(engine_dir, self.cachedModelName(model_name)+'.plan')
+ def getRefitNodesPath(self, model_name, onnx_dir, suffix=''):
+ onnx_model_dir = os.path.join(onnx_dir, self.cachedModelName(model_name)+'.opt')
+ os.makedirs(onnx_model_dir, exist_ok=True)
+ return os.path.join(onnx_model_dir, 'refit'+suffix+'.json')
+
+ def getStateDictPath(self, model_name, onnx_dir, suffix=''):
+ onnx_model_dir = os.path.join(onnx_dir, self.cachedModelName(model_name)+suffix)
+ os.makedirs(onnx_model_dir, exist_ok=True)
+ return os.path.join(onnx_model_dir, 'state_dict.pt')
def loadEngines(
self,
engine_dir,
+ framework_model_dir,
onnx_dir,
onnx_opset,
opt_batch_size,
opt_image_height,
opt_image_width,
- force_export=False,
- force_optimize=False,
- force_build=False,
static_batch=False,
static_shape=True,
enable_refit=False,
- enable_preview=False,
enable_all_tactics=False,
timing_cache=None,
- onnx_refit_dir=None,
+ int8=False,
+ quantization_level=2.5,
+ quantization_percentile=0.4,
+ quantization_alpha=0.6,
+ calibration_steps=384,
+ denoising_steps=50,
):
"""
Build and load engines for TensorRT accelerated inference.
@@ -206,9 +326,11 @@ def loadEngines(
Args:
engine_dir (str):
- Directory to write the TensorRT engines.
+ Directory to store the TensorRT engines.
+ framework_model_dir (str):
+ Directory to store the framework model ckpt.
onnx_dir (str):
- Directory to write the ONNX models.
+ Directory to store the ONNX models.
onnx_opset (int):
ONNX opset version to export the models.
opt_batch_size (int):
@@ -217,113 +339,229 @@ def loadEngines(
Image height to optimize for during engine building. Must be a multiple of 8.
opt_image_width (int):
Image width to optimize for during engine building. Must be a multiple of 8.
- force_export (bool):
- Force re-exporting the ONNX models.
- force_optimize (bool):
- Force re-optimizing the ONNX models.
- force_build (bool):
- Force re-building the TensorRT engine.
static_batch (bool):
Build engine only for specified opt_batch_size.
static_shape (bool):
Build engine only for specified opt_image_height & opt_image_width. Default = True.
enable_refit (bool):
Build engines with refit option enabled.
- enable_preview (bool):
- Enable TensorRT preview features.
enable_all_tactics (bool):
Enable all tactic sources during TensorRT engine builds.
timing_cache (str):
- Path to the timing cache to accelerate build or None
- onnx_refit_dir (str):
- Directory containing refit ONNX models.
+ Path to the timing cache to speed up TensorRT build.
"""
- # Load text tokenizer
- self.tokenizer = make_tokenizer(self.version, self.hf_token)
+ # Create directories if missing
+ for directory in [engine_dir, onnx_dir]:
+ if not os.path.exists(directory):
+ print(f"[I] Create directory: {directory}")
+ pathlib.Path(directory).mkdir(parents=True)
+
+ # Load text tokenizer(s)
+ if not self.pipeline_type.is_sd_xl_refiner():
+ self.tokenizer = make_tokenizer(self.version, self.pipeline_type, self.hf_token, framework_model_dir)
+ if self.pipeline_type.is_sd_xl():
+ self.tokenizer2 = make_tokenizer(self.version, self.pipeline_type, self.hf_token, framework_model_dir, subfolder='tokenizer_2')
# Load pipeline models
- models_args = {'version': self.version, 'hf_token': self.hf_token, 'device': self.device, \
- 'verbose': self.verbose, 'max_batch_size': self.max_batch_size}
- if 'vae_encoder' in self.stages:
- self.models['vae_encoder'] = make_VAEEncoder(inpaint=self.inpaint, **models_args)
+ models_args = {'version': self.version, 'pipeline': self.pipeline_type, 'device': self.device,
+ 'hf_token': self.hf_token, 'verbose': self.verbose, 'framework_model_dir': framework_model_dir,
+ 'max_batch_size': self.max_batch_size}
+
if 'clip' in self.stages:
- self.models['clip'] = make_CLIP(inpaint=self.inpaint, **models_args)
+ subfolder = 'text_encoder'
+ self.models['clip'] = CLIPModel(**models_args, fp16=True, embedding_dim=get_clip_embedding_dim(self.version, self.pipeline_type), output_hidden_states=self.config.get('clip_hidden_states', False), subfolder=subfolder)
+
+ if 'clip2' in self.stages:
+ subfolder = 'text_encoder_2'
+ self.models['clip2'] = CLIPWithProjModel(**models_args, fp16=True, output_hidden_states=self.config.get('clip_hidden_states', False), subfolder=subfolder)
+
+ lora_dict, lora_alphas = (None, None)
if 'unet' in self.stages:
- self.models['unet'] = make_UNet(inpaint=self.inpaint, **models_args)
+ if self.lora_loader:
+ lora_dict, lora_alphas = self.lora_loader.get_dicts('unet')
+ assert len(lora_dict) == len(self.lora_scales)
+ self.models['unet'] = UNetModel(**models_args, fp16=True, controlnets=self.controlnets,
+ lora_scales=self.lora_scales, lora_dict=lora_dict, lora_alphas=lora_alphas, do_classifier_free_guidance=self.do_classifier_free_guidance)
+
+ if 'unetxl' in self.stages:
+ if not self.pipeline_type.is_sd_xl_refiner() and self.lora_loader:
+ lora_dict, lora_alphas = self.lora_loader.get_dicts('unet')
+ assert len(lora_dict) == len(self.lora_scales)
+ self.models['unetxl'] = UNetXLModel(**models_args, fp16=True,
+ lora_scales=self.lora_scales, lora_dict=lora_dict, lora_alphas=lora_alphas, do_classifier_free_guidance=self.do_classifier_free_guidance)
+
+ vae_fp16 = not self.pipeline_type.is_sd_xl()
+
if 'vae' in self.stages:
- self.models['vae'] = make_VAE(inpaint=self.inpaint, **models_args)
+ self.models['vae'] = VAEModel(**models_args, fp16=vae_fp16)
+
+ if 'vae_encoder' in self.stages:
+ self.models['vae_encoder'] = VAEEncoderModel(**models_args, fp16=vae_fp16)
+
+ # Configure pipeline models to load
+ model_names = self.models.keys()
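+ # Suffix encoding the LoRA configuration (md5 of each path plus its scale) so cached ONNX models and engines are keyed per LoRA setup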
+ lora_suffix = '-'+'-'.join([str(md5(path.encode('utf-8')).hexdigest())+'-'+('%.2f' % self.lora_scales[path]) for path in sorted(self.lora_loader.paths)]) if self.lora_loader else ''
+ # Enable refit and LoRA merging only for UNet & UNetXL for now
+ do_engine_refit = dict(zip(model_names, [not self.pipeline_type.is_sd_xl_refiner() and enable_refit and model_name.startswith('unet') for model_name in model_names]))
+ do_lora_merge = dict(zip(model_names, [not enable_refit and self.lora_loader and model_name.startswith('unet') for model_name in model_names]))
+ # Fall back to PyTorch inference for all models when torch_inference is specified
+ torch_fallback = dict(zip(model_names, [self.torch_inference for model_name in model_names]))
+ model_suffix = dict(zip(model_names, [lora_suffix if do_lora_merge[model_name] else '' for model_name in model_names]))
+ use_int8 = dict.fromkeys(model_names, False)
+ if int8:
+ assert self.pipeline_type.is_sd_xl(), "int8 quantization only supported for SDXL pipeline"
+ use_int8['unetxl'] = True
+ model_suffix['unetxl'] += f"-int8.l{quantization_level}.bs2.s{denoising_steps}.c{calibration_steps}.p{quantization_percentile}.a{quantization_alpha}"
+ onnx_path = dict(zip(model_names, [self.getOnnxPath(model_name, onnx_dir, opt=False, suffix=model_suffix[model_name]) for model_name in model_names]))
+ onnx_opt_path = dict(zip(model_names, [self.getOnnxPath(model_name, onnx_dir, suffix=model_suffix[model_name]) for model_name in model_names]))
+ engine_path = dict(zip(model_names, [self.getEnginePath(model_name, engine_dir, do_engine_refit[model_name], suffix=model_suffix[model_name]) for model_name in model_names]))
+ weights_map_path = dict(zip(model_names, [(self.getWeightsMapPath(model_name, onnx_dir) if do_engine_refit[model_name] else None) for model_name in model_names]))
- # Export models to ONNX
for model_name, obj in self.models.items():
- engine_path = self.getEnginePath(model_name, engine_dir)
- if force_export or force_build or not os.path.exists(engine_path):
- onnx_path = self.getOnnxPath(model_name, onnx_dir, opt=False)
- onnx_opt_path = self.getOnnxPath(model_name, onnx_dir)
- if force_export or not os.path.exists(onnx_opt_path):
- if force_export or not os.path.exists(onnx_path):
- print(f"Exporting model: {onnx_path}")
- model = obj.get_model()
- with torch.inference_mode(), torch.autocast("cuda"):
- inputs = obj.get_sample_input(opt_batch_size, opt_image_height, opt_image_width)
- torch.onnx.export(model,
- inputs,
- onnx_path,
- export_params=True,
- opset_version=onnx_opset,
- do_constant_folding=True,
- input_names=obj.get_input_names(),
- output_names=obj.get_output_names(),
- dynamic_axes=obj.get_dynamic_axes(),
+ if torch_fallback[model_name]:
+ continue
+ # Export models to ONNX and save weights name mapping
+ do_export_onnx = not os.path.exists(engine_path[model_name]) and not os.path.exists(onnx_opt_path[model_name])
+ do_export_weights_map = weights_map_path[model_name] and not os.path.exists(weights_map_path[model_name])
+ if do_export_onnx or do_export_weights_map:
+ # Non-quantized ONNX export
+ if not use_int8[model_name]:
+ obj.export_onnx(onnx_path[model_name], onnx_opt_path[model_name], onnx_opset, opt_image_height, opt_image_width, enable_lora_merge=do_lora_merge[model_name], static_shape=static_shape)
+ else:
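+ # INT8 path: calibrate the UNet with AMMO, cache the calibrated state dict, then export a quantized ONNX model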
+ state_dict_path = self.getStateDictPath(model_name, onnx_dir, suffix=model_suffix[model_name])
+ if not os.path.exists(state_dict_path):
+ print(f"[I] Calibrated weights not found, generating {state_dict_path}")
+ pipeline = obj.get_pipeline()
+ model = pipeline.unet
+ replace_lora_layers(model)
+ calibration_file = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'calibration-prompts.txt')
+ # Use batch_size = 2 for UNet calibration
+ calibration_prompts = load_calib_prompts(2, calibration_file)
+ # TODO check size > calibration_steps
+ quant_config = get_smoothquant_config(model, quantization_level)
+ if quantization_percentile is not None:
+ quant_config["percentile"] = quantization_percentile
+ quant_config["base-step"] = int(denoising_steps)
+
+ atq.replace_quant_module(model)
+ atq.set_quantizer_by_cfg(model, quant_config["quant_cfg"])
+ if quantization_percentile is not None:
+ calibration.precentile_calib_mode(base_unet=model, quant_config=quant_config)
+ if quantization_alpha is not None:
+ calibration.reg_alpha_qkv(base_unet=model, alpha=quantization_alpha)
+
+ def do_calibrate(base, calibration_prompts, **kwargs):
+ for i_th, prompts in enumerate(calibration_prompts):
+ if i_th >= kwargs["calib_size"]:
+ return
+ base(
+ prompt=prompts,
+ num_inference_steps=kwargs["n_steps"],
+ negative_prompt=[
+ "normal quality, low quality, worst quality, low res, blurry, nsfw, nude"
+ ]
+ * len(prompts),
+ ).images
+
+ def calibration_loop():
+ do_calibrate(
+ base=pipeline,
+ calibration_prompts=calibration_prompts,
+ calib_size=calibration_steps,
+ n_steps=denoising_steps,
)
- del model
- torch.cuda.empty_cache()
- gc.collect()
- else:
- print(f"Found cached model: {onnx_path}")
- # Optimize onnx
- if force_optimize or not os.path.exists(onnx_opt_path):
- print(f"Generating optimizing model: {onnx_opt_path}")
- onnx_opt_graph = obj.optimize(onnx.load(onnx_path))
- onnx.save(onnx_opt_graph, onnx_opt_path)
+ print(f"[I] Performing int8 calibration for {calibration_steps} steps. This can take a long time.")
+ calibration.calibrate(model, quant_config["algorithm"], forward_loop=calibration_loop)
+ torch.save(model.state_dict(), state_dict_path)
+
+ print(f"[I] Generaing quantized ONNX model: {onnx_opt_path[model_name]}")
+ if not os.path.exists(onnx_path[model_name]):
+ model = obj.get_model()
+ replace_lora_layers(model)
+ atq.replace_quant_module(model)
+ quant_config = atq.INT8_DEFAULT_CFG
+ atq.set_quantizer_by_cfg(model, quant_config["quant_cfg"])
+ model.load_state_dict(torch.load(state_dict_path), strict=True)
+ quantize_lvl(model, quantization_level)
+ atq.disable_quantizer(model, filter_func)
+ model.to(torch.float32) # QDQ needs to be in FP32
else:
- print(f"Found cached optimized model: {onnx_opt_path} ")
+ model = None
+ obj.export_onnx(onnx_path[model_name], onnx_opt_path[model_name], onnx_opset, opt_image_height, opt_image_width, custom_model=model)
+
+ # FIXME do_export_weights_map needs ONNX graph
+ if do_export_weights_map:
+ print(f"[I] Saving weights map: {weights_map_path[model_name]}")
+ obj.export_weights_map(onnx_opt_path[model_name], weights_map_path[model_name])
# Build TensorRT engines
for model_name, obj in self.models.items():
- engine_path = self.getEnginePath(model_name, engine_dir)
- engine = Engine(engine_path)
- onnx_path = self.getOnnxPath(model_name, onnx_dir, opt=False)
- onnx_opt_path = self.getOnnxPath(model_name, onnx_dir)
-
- if force_build or not os.path.exists(engine.engine_path):
- engine.build(onnx_opt_path,
- fp16=True,
+ if torch_fallback[model_name]:
+ continue
+ engine = Engine(engine_path[model_name])
+ if not os.path.exists(engine_path[model_name]):
+ update_output_names = obj.get_output_names() + obj.extra_output_names if obj.extra_output_names else None
+ extra_build_args = {'verbose': self.verbose}
+ if use_int8[model_name]:
+ extra_build_args['int8'] = True
+ extra_build_args['precision_constraints'] = 'prefer'
+ extra_build_args['builder_optimization_level'] = 4
+ fp16amp = obj.fp16
+ engine.build(onnx_opt_path[model_name],
+ fp16=fp16amp,
input_profile=obj.get_input_profile(
opt_batch_size, opt_image_height, opt_image_width,
static_batch=static_batch, static_shape=static_shape
),
- enable_refit=enable_refit,
- enable_preview=enable_preview,
+ enable_refit=do_engine_refit[model_name],
enable_all_tactics=enable_all_tactics,
timing_cache=timing_cache,
- workspace_size=self.max_workspace_size)
+ update_output_names=update_output_names,
+ **extra_build_args)
self.engine[model_name] = engine
- # Load and activate TensorRT engines
- max_device_memory = 0
+ # Load TensorRT engines
for model_name, obj in self.models.items():
- engine = self.engine[model_name]
- engine.load()
+ if torch_fallback[model_name]:
+ continue
+ self.engine[model_name].load()
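+ # Refit the engine with LoRA-merged weights when refit is enabled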
+ if do_engine_refit[model_name] and obj.lora_dict:
+ assert weights_map_path[model_name]
+ with open(weights_map_path[model_name], 'r') as fp_wts:
+ print(f"[I] Loading weights map: {weights_map_path[model_name]} ")
+ [weights_name_mapping, weights_shape_mapping] = json.load(fp_wts)
+ refit_weights_path = self.getRefitNodesPath(model_name, engine_dir, suffix=lora_suffix)
+ if not os.path.exists(refit_weights_path):
+ print(f"[I] Saving refit weights: {refit_weights_path}")
+ model = merge_loras(obj.get_model(), obj.lora_dict, obj.lora_alphas, obj.lora_scales)
+ refit_weights = get_refit_weights(model.state_dict(), onnx_opt_path[model_name], weights_name_mapping, weights_shape_mapping)
+ torch.save(refit_weights, refit_weights_path)
+ unload_model(model)
+ else:
+ print(f"[I] Loading refit weights: {refit_weights_path}")
+ refit_weights = torch.load(refit_weights_path)
+ self.engine[model_name].refit(refit_weights, obj.fp16)
+
+ # Load torch models
+ for model_name, obj in self.models.items():
+ if torch_fallback[model_name]:
+ self.torch_models[model_name] = obj.get_model(torch_inference=self.torch_inference)
+
+ def calculateMaxDeviceMemory(self):
+ max_device_memory = 0
+ for model_name, engine in self.engine.items():
max_device_memory = max(max_device_memory, engine.engine.device_memory_size)
- if onnx_refit_dir:
- onnx_refit_path = self.getOnnxPath(model_name, onnx_refit_dir)
- if os.path.exists(onnx_refit_path):
- engine.refit(onnx_opt_path, onnx_refit_path)
+ return max_device_memory
- self.shared_device_memory = cuda.DeviceArray.raw((max_device_memory,))
+ def activateEngines(self, shared_device_memory=None):
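+ # All engines share a single device-memory allocation sized for the largest engine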
+ if shared_device_memory is None:
+ max_device_memory = self.calculateMaxDeviceMemory()
+ _, shared_device_memory = cudart.cudaMalloc(max_device_memory)
+ self.shared_device_memory = shared_device_memory
+ # Load and activate TensorRT engines
for engine in self.engine.values():
- engine.activate(reuse_device_memory=self.shared_device_memory.ptr)
+ engine.activate(reuse_device_memory=self.shared_device_memory)
def runEngine(self, model_name, feed_dict):
engine = self.engine[model_name]
@@ -337,147 +575,436 @@ def initialize_latents(self, batch_size, unet_channels, latent_height, latent_wi
latents = latents * self.scheduler.init_noise_sigma
return latents
- def initialize_timesteps(self, timesteps, strength):
- self.scheduler.set_timesteps(timesteps)
- offset = self.scheduler.steps_offset if hasattr(self.scheduler, "steps_offset") else 0
- init_timestep = int(timesteps * strength) + offset
- init_timestep = min(init_timestep, timesteps)
- t_start = max(timesteps - init_timestep + offset, 0)
- timesteps = self.scheduler.timesteps[t_start:].to(self.device)
- return timesteps, t_start
+ def profile_start(self, name, color='blue'):
+ if self.nvtx_profile:
+ self.markers[name] = nvtx.start_range(message=name, color=color)
+ if name in self.events:
+ cudart.cudaEventRecord(self.events[name][0], 0)
- def preprocess_images(self, batch_size, images=()):
+ def profile_stop(self, name):
+ if name in self.events:
+ cudart.cudaEventRecord(self.events[name][1], 0)
if self.nvtx_profile:
- nvtx_image_preprocess = nvtx.start_range(message='image_preprocess', color='pink')
- init_images=[]
+ nvtx.end_range(self.markers[name])
+
+ def preprocess_images(self, batch_size, images=()):
+ if not images:
+ return ()
+ self.profile_start('preprocess', color='pink')
+ input_images=[]
for image in images:
image = image.to(self.device).float()
- image = image.repeat(batch_size, 1, 1, 1)
- init_images .append(image)
- if self.nvtx_profile:
- nvtx.end_range(nvtx_image_preprocess)
- return tuple(init_images)
+ if image.shape[0] != batch_size:
+ image = image.repeat(batch_size, 1, 1, 1)
+ input_images.append(image)
+ self.profile_stop('preprocess')
+ return tuple(input_images)
+
+ def preprocess_controlnet_images(self, batch_size, images=None):
+ '''
+ images: List of PIL.Image.Image
+ '''
+ if images is None:
+ return None
+ self.profile_start('preprocess', color='pink')
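+ # Convert PIL images to NCHW float tensors in [0, 1] and repeat along the batch dimension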
+ images = [(np.array(i.convert("RGB")).astype(np.float32) / 255.0)[..., None].transpose(3, 2, 0, 1).repeat(batch_size, axis=0) for i in images]
+ # Duplicate each image for classifier-free guidance (unconditional + conditional batch)
+ images = [torch.cat([torch.from_numpy(i).to(self.device).float()] * 2) for i in images]
+ images = torch.cat([image[None, ...] for image in images], dim=0)
+ self.profile_stop('preprocess')
+ return images
- def encode_prompt(self, prompt, negative_prompt):
- if self.nvtx_profile:
- nvtx_clip = nvtx.start_range(message='clip', color='green')
- cudart.cudaEventRecord(self.events['clip-start'], 0)
+ def encode_prompt(self, prompt, negative_prompt, encoder='clip', pooled_outputs=False, output_hidden_states=False):
+ self.profile_start('clip', color='green')
+
+ tokenizer = self.tokenizer2 if encoder == 'clip2' else self.tokenizer
+
+ def tokenize(prompt, output_hidden_states):
+ text_input_ids = tokenizer(
+ prompt,
+ padding="max_length",
+ max_length=tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ ).input_ids.type(torch.int32).to(self.device)
+
+ text_hidden_states = None
+ if self.torch_inference:
+ outputs = self.torch_models[encoder](text_input_ids, output_hidden_states=output_hidden_states)
+ text_embeddings = outputs[0].clone()
+ if output_hidden_states:
+ text_hidden_states = outputs['hidden_states'][-2].clone()
+ else:
+ # NOTE: output tensor for CLIP must be cloned because it will be overwritten when called again for negative prompt
+ outputs = self.runEngine(encoder, {'input_ids': text_input_ids})
+ text_embeddings = outputs['text_embeddings'].clone()
+ if output_hidden_states:
+ text_hidden_states = outputs['hidden_states'].clone()
+ return text_embeddings, text_hidden_states
# Tokenize prompt
- text_input_ids = self.tokenizer(
- prompt,
- padding="max_length",
- max_length=self.tokenizer.model_max_length,
- truncation=True,
- return_tensors="pt",
- ).input_ids.type(torch.int32).to(self.device)
-
- text_input_ids_inp = text_input_ids
- # NOTE: output tensor for CLIP must be cloned because it will be overwritten when called again for negative prompt
- text_embeddings = self.runEngine('clip', {"input_ids": text_input_ids_inp})['text_embeddings'].clone()
-
- # Tokenize negative prompt
- uncond_input_ids = self.tokenizer(
- negative_prompt,
- padding="max_length",
- max_length=self.tokenizer.model_max_length,
- truncation=True,
- return_tensors="pt",
- ).input_ids.type(torch.int32).to(self.device)
- uncond_input_ids_inp = uncond_input_ids
- uncond_embeddings = self.runEngine('clip', {"input_ids": uncond_input_ids_inp})['text_embeddings']
-
- # Concatenate the unconditional and text embeddings into a single batch to avoid doing two forward passes for classifier free guidance
- text_embeddings = torch.cat([uncond_embeddings, text_embeddings]).to(dtype=torch.float16)
-
- cudart.cudaEventRecord(self.events['clip-stop'], 0)
- if self.nvtx_profile:
- nvtx.end_range(nvtx_clip)
+ text_embeddings, text_hidden_states = tokenize(prompt, output_hidden_states)
- return text_embeddings
+ if self.do_classifier_free_guidance:
+ # Tokenize negative prompt
+ uncond_embeddings, uncond_hidden_states = tokenize(negative_prompt, output_hidden_states)
- def denoise_latent(self, latents, text_embeddings, timesteps=None, step_offset=0, mask=None, masked_image_latents=None):
- cudart.cudaEventRecord(self.events['denoise-start'], 0)
- if not isinstance(timesteps, torch.Tensor):
- timesteps = self.scheduler.timesteps
- for step_index, timestep in enumerate(timesteps):
- if self.nvtx_profile:
- nvtx_latent_scale = nvtx.start_range(message='latent_scale', color='pink')
+ # Concatenate the unconditional and text embeddings into a single batch to avoid doing two forward passes for classifier free guidance
+ text_embeddings = torch.cat([uncond_embeddings, text_embeddings]).to(dtype=torch.float16)
- # Expand the latents if we are doing classifier free guidance
- latent_model_input = torch.cat([latents] * 2)
- latent_model_input = self.scheduler.scale_model_input(latent_model_input, step_offset + step_index, timestep)
- if isinstance(mask, torch.Tensor):
- latent_model_input = torch.cat([latent_model_input, mask, masked_image_latents], dim=1)
- if self.nvtx_profile:
- nvtx.end_range(nvtx_latent_scale)
-
- # Predict the noise residual
- if self.nvtx_profile:
- nvtx_unet = nvtx.start_range(message='unet', color='blue')
-
- embeddings_dtype = np.float16
- timestep_float = timestep.float() if timestep.dtype != torch.float32 else timestep
-
- sample_inp = latent_model_input
- timestep_inp = timestep_float
- embeddings_inp = text_embeddings
- noise_pred = self.runEngine('unet', {"sample": sample_inp, "timestep": timestep_inp, "encoder_hidden_states": embeddings_inp})['latent']
- if self.nvtx_profile:
- nvtx.end_range(nvtx_unet)
-
- if self.nvtx_profile:
- nvtx_latent_step = nvtx.start_range(message='latent_step', color='pink')
+ if pooled_outputs:
+ pooled_output = text_embeddings
- # Perform guidance
- noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
- noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)
+ if output_hidden_states:
+ text_embeddings = torch.cat([uncond_hidden_states, text_hidden_states]).to(dtype=torch.float16) if self.do_classifier_free_guidance else text_hidden_states
- latents = self.scheduler.step(noise_pred, latents, step_offset + step_index, timestep)
-
- if self.nvtx_profile:
- nvtx.end_range(nvtx_latent_step)
+ self.profile_stop('clip')
+ if pooled_outputs:
+ return text_embeddings, pooled_output
+ return text_embeddings
- latents = 1. / 0.18215 * latents
- cudart.cudaEventRecord(self.events['denoise-stop'], 0)
+ # from diffusers (get_timesteps)
+ def get_timesteps(self, num_inference_steps, strength, denoising_start=None):
+ # get the original timestep using init_timestep
+ if denoising_start is None:
+ init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
+ t_start = max(num_inference_steps - init_timestep, 0)
+ else:
+ t_start = 0
+
+ timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
+
+ # Strength is irrelevant if we directly request a timestep to start at;
+ # that is, strength is determined by the denoising_start instead.
+ if denoising_start is not None:
+ discrete_timestep_cutoff = int(
+ round(
+ self.scheduler.config.num_train_timesteps
+ - (denoising_start * self.scheduler.config.num_train_timesteps)
+ )
+ )
+
+ num_inference_steps = (timesteps < discrete_timestep_cutoff).sum().item()
+ if self.scheduler.order == 2 and num_inference_steps % 2 == 0:
+ # if the scheduler is a 2nd order scheduler we might have to do +1
+ # because `num_inference_steps` might be even given that every timestep
+ # (except the highest one) is duplicated. If `num_inference_steps` is even it would
+ # mean that we cut the timesteps in the middle of the denoising step
+ # (between 1st and 2nd devirative) which leads to incorrect results. By adding 1
+ # we ensure that the denoising process always ends after the 2nd derivate step of the scheduler
+ num_inference_steps = num_inference_steps + 1
+
+ # because t_n+1 >= t_n, we slice the timesteps starting from the end
+ timesteps = timesteps[-num_inference_steps:]
+ return timesteps, num_inference_steps
+
+ return timesteps, num_inference_steps - t_start
+
+ def denoise_latent(self,
+ latents,
+ text_embeddings,
+ denoiser='unet',
+ timesteps=None,
+ step_offset=0,
+ mask=None,
+ masked_image_latents=None,
+ image_guidance=1.5,
+ controlnet_imgs=None,
+ controlnet_scales=None,
+ text_embeds=None,
+ time_ids=None):
+
+ assert image_guidance > 1.0, "Image guidance has to be > 1.0"
+
+ controlnet_imgs = self.preprocess_controlnet_images(latents.shape[0], controlnet_imgs)
+
+ do_autocast = self.torch_inference != '' and self.models[denoiser].fp16
+ with torch.autocast('cuda', enabled=do_autocast):
+ self.profile_start('denoise', color='blue')
+ for step_index, timestep in enumerate(timesteps):
+ # Expand the latents if we are doing classifier free guidance
+ latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
+ latent_model_input = self.scheduler.scale_model_input(latent_model_input, timestep)
+ if isinstance(mask, torch.Tensor):
+ latent_model_input = torch.cat([latent_model_input, mask, masked_image_latents], dim=1)
+
+ # Predict the noise residual
+ if self.torch_inference:
+ params = {"sample": latent_model_input, "timestep": timestep, "encoder_hidden_states": text_embeddings}
+ if controlnet_imgs is not None:
+ params.update({"images": controlnet_imgs, "controlnet_scales": controlnet_scales})
+ added_cond_kwargs = {}
+ if text_embeds is not None:
+ added_cond_kwargs.update({'text_embeds': text_embeds})
+ if time_ids is not None:
+ added_cond_kwargs.update({'time_ids': time_ids})
+ if text_embeds is not None or time_ids is not None:
+ params.update({'added_cond_kwargs': added_cond_kwargs})
+ noise_pred = self.torch_models[denoiser](**params)["sample"]
+ else:
+ timestep_float = timestep.float() if timestep.dtype != torch.float32 else timestep
+
+ params = {"sample": latent_model_input, "timestep": timestep_float, "encoder_hidden_states": text_embeddings}
+ if controlnet_imgs is not None:
+ params.update({"images": controlnet_imgs, "controlnet_scales": controlnet_scales})
+ if text_embeds is not None:
+ params.update({'text_embeds': text_embeds})
+ if time_ids is not None:
+ params.update({'time_ids': time_ids})
+ noise_pred = self.runEngine(denoiser, params)['latent']
+
+ # Perform guidance
+ if self.do_classifier_free_guidance:
+ noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
+ noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)
+
+ # from diffusers (prepare_extra_step_kwargs)
+ extra_step_kwargs = {}
+ if "eta" in set(inspect.signature(self.scheduler.step).parameters.keys()):
+ # TODO: configurable eta
+ eta = 0.0
+ extra_step_kwargs["eta"] = eta
+ if "generator" in set(inspect.signature(self.scheduler.step).parameters.keys()):
+ extra_step_kwargs["generator"] = self.generator
+
+ latents = self.scheduler.step(noise_pred, timestep, latents, **extra_step_kwargs, return_dict=False)[0]
+
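+ # Rescale latents by the inverse VAE scaling factor before decoding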
+ latents = 1. / self.vae_scaling_factor * latents
+ latents = latents.to(dtype=torch.float32)
+
+ self.profile_stop('denoise')
return latents
- def encode_image(self, init_image):
- if self.nvtx_profile:
- nvtx_vae = nvtx.start_range(message='vae_encoder', color='red')
- cudart.cudaEventRecord(self.events['vae_encoder-start'], 0)
- init_latents = self.runEngine('vae_encoder', {"images": init_image})['latent']
- cudart.cudaEventRecord(self.events['vae_encoder-stop'], 0)
- if self.nvtx_profile:
- nvtx.end_range(nvtx_vae)
-
- init_latents = 0.18215 * init_latents
- return init_latents
+ def encode_image(self, input_image):
+ self.profile_start('vae_encoder', color='red')
+ if self.torch_inference:
+ image_latents = self.torch_models['vae_encoder'](input_image)
+ else:
+ image_latents = self.runEngine('vae_encoder', {'images': input_image})['latent']
+ image_latents = self.vae_scaling_factor * image_latents
+ self.profile_stop('vae_encoder')
+ return image_latents
def decode_latent(self, latents):
- if self.nvtx_profile:
- nvtx_vae = nvtx.start_range(message='vae', color='red')
- cudart.cudaEventRecord(self.events['vae-start'], 0)
- images = self.runEngine('vae', {"latent": latents})['images']
- cudart.cudaEventRecord(self.events['vae-stop'], 0)
- if self.nvtx_profile:
- nvtx.end_range(nvtx_vae)
+ self.profile_start('vae', color='red')
+ if self.torch_inference:
+ images = self.torch_models['vae'](latents)['sample']
+ else:
+ images = self.runEngine('vae', {'latent': latents})['images']
+ self.profile_stop('vae')
return images
- def print_summary(self, denoising_steps, tic, toc, vae_enc=False):
- print('|------------|--------------|')
- print('| {:^10} | {:^12} |'.format('Module', 'Latency'))
- print('|------------|--------------|')
- if vae_enc:
- print('| {:^10} | {:>9.2f} ms |'.format('VAE-Enc', cudart.cudaEventElapsedTime(self.events['vae_encoder-start'], self.events['vae_encoder-stop'])[1]))
- print('| {:^10} | {:>9.2f} ms |'.format('CLIP', cudart.cudaEventElapsedTime(self.events['clip-start'], self.events['clip-stop'])[1]))
- print('| {:^10} | {:>9.2f} ms |'.format('UNet x '+str(denoising_steps), cudart.cudaEventElapsedTime(self.events['denoise-start'], self.events['denoise-stop'])[1]))
- print('| {:^10} | {:>9.2f} ms |'.format('VAE-Dec', cudart.cudaEventElapsedTime(self.events['vae-start'], self.events['vae-stop'])[1]))
- print('|------------|--------------|')
- print('| {:^10} | {:>9.2f} ms |'.format('Pipeline', (toc - tic)*1000.))
- print('|------------|--------------|')
-
- def save_image(self, images, pipeline, prompt):
- # Save image
- image_name_prefix = pipeline+'-fp16'+''.join(set(['-'+prompt[i].replace(' ','_')[:10] for i in range(len(prompt))]))+'-'
- save_image(images, self.output_dir, image_name_prefix)
+ def print_summary(self, denoising_steps, walltime_ms, batch_size):
+ print('|-----------------|--------------|')
+ print('| {:^15} | {:^12} |'.format('Module', 'Latency'))
+ print('|-----------------|--------------|')
+ if 'vae_encoder' in self.stages:
+ print('| {:^15} | {:>9.2f} ms |'.format('VAE-Enc', cudart.cudaEventElapsedTime(self.events['vae_encoder'][0], self.events['vae_encoder'][1])[1]))
+ print('| {:^15} | {:>9.2f} ms |'.format('CLIP', cudart.cudaEventElapsedTime(self.events['clip'][0], self.events['clip'][1])[1]))
+ print('| {:^15} | {:>9.2f} ms |'.format('UNet'+('+CNet' if self.pipeline_type.is_controlnet() else '')+' x '+str(denoising_steps), cudart.cudaEventElapsedTime(self.events['denoise'][0], self.events['denoise'][1])[1]))
+ print('| {:^15} | {:>9.2f} ms |'.format('VAE-Dec', cudart.cudaEventElapsedTime(self.events['vae'][0], self.events['vae'][1])[1]))
+ print('|-----------------|--------------|')
+ print('| {:^15} | {:>9.2f} ms |'.format('Pipeline', walltime_ms))
+ print('|-----------------|--------------|')
+ print('Throughput: {:.2f} image/s'.format(batch_size*1000./walltime_ms))
+
+ def save_image(self, images, pipeline, prompt, seed):
+ # Save image
+ image_name_prefix = pipeline+''.join(set(['-'+prompt[i].replace(' ','_')[:10] for i in range(len(prompt))]))+'-'+str(seed)+'-'
+ save_image(images, self.output_dir, image_name_prefix)
+
+ def infer(
+ self,
+ prompt,
+ negative_prompt,
+ image_height,
+ image_width,
+ input_image=None,
+ image_strength=0.75,
+ mask_image=None,
+ controlnet_scales=None,
+ aesthetic_score=6.0,
+ negative_aesthetic_score=2.5,
+ warmup=False,
+ verbose=False,
+ save_image=True,
+ ):
+ """
+ Run the diffusion pipeline.
+
+ Args:
+ prompt (str):
+ The text prompt to guide image generation.
+ negative_prompt (str):
+ The prompt not to guide the image generation.
+ image_height (int):
+ Height (in pixels) of the image to be generated. Must be a multiple of 8.
+ image_width (int):
+ Width (in pixels) of the image to be generated. Must be a multiple of 8.
+ input_image (image):
+ Input image used to initialize the latents or to be inpainted.
+ image_strength (float):
+ Strength of transformation applied to input_image. Must be between 0 and 1.
+ mask_image (image):
+ Mask image containing the region to be inpainted.
+ controlnet_scales (torch.Tensor):
+ A tensor containing ControlNet scales, essential for multi-ControlNet.
+ Its length must equal the number of ControlNets.
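+ aesthetic_score (float):
+ Aesthetic score conditioning used by the SDXL refiner time embeddings.
+ negative_aesthetic_score (float):
+ Aesthetic score conditioning applied to the negative (unconditional) branch of the SDXL refiner.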
+ warmup (bool):
+ Indicate if this is a warmup run.
+ verbose (bool):
+ Enable verbose logging.
+ save_image (bool):
+ Save the generated image (if applicable)
+ """
+ assert len(prompt) == len(negative_prompt)
+ batch_size = len(prompt)
+
+ # Spatial dimensions of latent tensor
+ latent_height = image_height // 8
+ latent_width = image_width // 8
+
+ if self.generator and self.seed:
+ self.generator.manual_seed(self.seed)
+
+ num_inference_steps = self.denoising_steps
+
+ with torch.inference_mode(), trt.Runtime(TRT_LOGGER):
+ torch.cuda.synchronize()
+ e2e_tic = time.perf_counter()
+
+ # TODO: support custom timesteps
+ timesteps = None
+ if timesteps is not None:
+ if not ("timesteps" in set(inspect.signature(self.scheduler.set_timesteps).parameters.keys())):
+ raise ValueError(
+ f"The current scheduler class {self.scheduler.__class__}'s `set_timesteps` does not support custom"
+ f" timestep schedules. Please check whether you are using the correct scheduler."
+ )
+ self.scheduler.set_timesteps(timesteps=timesteps, device=self.device)
+ assert self.denoising_steps == len(self.scheduler.timesteps)
+ else:
+ self.scheduler.set_timesteps(self.denoising_steps, device=self.device)
+ timesteps = self.scheduler.timesteps.to(self.device)
+
+ denoise_kwargs = {}
+ if not (self.pipeline_type.is_img2img() or self.pipeline_type.is_sd_xl_refiner()):
+ # Initialize latents
+ latents = self.initialize_latents(batch_size=batch_size,
+ unet_channels=4,
+ latent_height=latent_height,
+ latent_width=latent_width)
+ if self.pipeline_type.is_controlnet():
+ denoise_kwargs.update({'controlnet_imgs': input_image, 'controlnet_scales': controlnet_scales})
+
+ # Pre-process and VAE encode input image
+ if self.pipeline_type.is_img2img() or self.pipeline_type.is_inpaint() or self.pipeline_type.is_sd_xl_refiner():
+ assert input_image is not None
+ # Initialize timesteps and pre-process input image
+ timesteps, num_inference_steps = self.get_timesteps(self.denoising_steps, image_strength)
+ denoise_kwargs.update({'timesteps': timesteps})
+ if self.pipeline_type.is_img2img() or self.pipeline_type.is_sd_xl_refiner():
+ latent_timestep = timesteps[:1].repeat(batch_size)
+ input_image = self.preprocess_images(batch_size, (input_image,))[0]
+ # Encode if not a latent
+ image_latents = input_image if input_image.shape[1] == 4 else self.encode_image(input_image)
+ # Add noise to latents using timesteps
+ noise = torch.randn(image_latents.shape, generator=self.generator, device=self.device, dtype=torch.float32)
+ latents = self.scheduler.add_noise(image_latents, noise, latent_timestep)
+ elif self.pipeline_type.is_inpaint():
+ mask, mask_image = self.preprocess_images(batch_size, prepare_mask_and_masked_image(input_image, mask_image))
+ mask = torch.nn.functional.interpolate(mask, size=(latent_height, latent_width))
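+ # Duplicate mask and masked-image latents to match the doubled batch used for classifier-free guidance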
+ mask = torch.cat([mask] * 2)
+ masked_image_latents = self.encode_image(mask_image)
+ masked_image_latents = torch.cat([masked_image_latents] * 2)
+ denoise_kwargs.update({'mask': mask, 'masked_image_latents': masked_image_latents})
+
+ # CLIP text encoder(s)
+ if self.pipeline_type.is_sd_xl():
+ text_embeddings2, pooled_embeddings2 = self.encode_prompt(prompt, negative_prompt,
+ encoder='clip2', pooled_outputs=True, output_hidden_states=True)
+
+ # Merge text embeddings
+ if self.pipeline_type.is_sd_xl_base():
+ text_embeddings = self.encode_prompt(prompt, negative_prompt, output_hidden_states=True)
+ text_embeddings = torch.cat([text_embeddings, text_embeddings2], dim=-1)
+ else:
+ text_embeddings = text_embeddings2
+
+ # Time embeddings
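+ # SDXL micro-conditioning: time_ids pack the original size, crop coordinates, and target size (the refiner uses an aesthetic score instead of the target size)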
+ def _get_add_time_ids(original_size, crops_coords_top_left, target_size, dtype, aesthetic_score=None, negative_aesthetic_score=None):
+ if self.pipeline_type.is_sd_xl_refiner(): #self.requires_aesthetics_score:
+ add_time_ids = list(original_size + crops_coords_top_left + (aesthetic_score,))
+ if self.do_classifier_free_guidance:
+ add_neg_time_ids = list(original_size + crops_coords_top_left + (negative_aesthetic_score,))
+ else:
+ add_time_ids = list(original_size + crops_coords_top_left + target_size)
+ if self.do_classifier_free_guidance:
+ add_neg_time_ids = list(original_size + crops_coords_top_left + target_size)
+ add_time_ids = torch.tensor([add_time_ids], dtype=dtype, device=self.device)
+ if self.do_classifier_free_guidance:
+ add_neg_time_ids = torch.tensor([add_neg_time_ids], dtype=dtype, device=self.device)
+ add_time_ids = torch.cat([add_neg_time_ids, add_time_ids], dim=0)
+ return add_time_ids
+
+ original_size = (image_height, image_width)
+ crops_coords_top_left = (0, 0)
+ target_size = (image_height, image_width)
+ if self.pipeline_type.is_sd_xl_refiner():
+ add_time_ids = _get_add_time_ids(
+ original_size, crops_coords_top_left, target_size, dtype=text_embeddings.dtype, aesthetic_score=aesthetic_score, negative_aesthetic_score=negative_aesthetic_score
+ )
+ else:
+ add_time_ids = _get_add_time_ids(
+ original_size, crops_coords_top_left, target_size, dtype=text_embeddings.dtype
+ )
+ add_time_ids = add_time_ids.repeat(batch_size, 1)
+ denoise_kwargs.update({'text_embeds': pooled_embeddings2, 'time_ids': add_time_ids})
+ else:
+ text_embeddings = self.encode_prompt(prompt, negative_prompt)
+
+ # UNet denoiser + (optional) ControlNet(s)
+ denoiser = 'unetxl' if self.pipeline_type.is_sd_xl() else 'unet'
+ latents = self.denoise_latent(latents, text_embeddings, denoiser=denoiser, **denoise_kwargs)
+
+ # VAE decode latent (if applicable)
+ if self.return_latents:
+ latents = latents * self.vae_scaling_factor
+ else:
+ images = self.decode_latent(latents)
+
+ torch.cuda.synchronize()
+ e2e_toc = time.perf_counter()
+
+ walltime_ms = (e2e_toc - e2e_tic) * 1000.
+ if not warmup:
+ self.print_summary(num_inference_steps, walltime_ms, batch_size)
+ if not self.return_latents and save_image:
+ self.save_image(images, self.pipeline_type.name.lower(), prompt, self.seed)
+
+ return (latents, walltime_ms) if self.return_latents else (images, walltime_ms)
+
+ def run(self, prompt, negative_prompt, height, width, batch_size, batch_count, num_warmup_runs, use_cuda_graph, **kwargs):
+ # Process prompt
+ if not isinstance(prompt, list):
+ raise ValueError(f"`prompt` must be of type `str` list, but is {type(prompt)}")
+ prompt = prompt * batch_size
+
+ if not isinstance(negative_prompt, list):
+ raise ValueError(f"`--negative-prompt` must be of type `str` list, but is {type(negative_prompt)}")
+ if len(negative_prompt) == 1:
+ negative_prompt = negative_prompt * batch_size
+
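+ # Ensure at least one warmup run when CUDA graphs are enabled (graph capture relies on a prior launch)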
+ num_warmup_runs = max(1, num_warmup_runs) if use_cuda_graph else num_warmup_runs
+ if num_warmup_runs > 0:
+ print("[I] Warming up ..")
+ for _ in range(num_warmup_runs):
+ self.infer(prompt, negative_prompt, height, width, warmup=True, **kwargs)
+
+ for _ in range(batch_count):
+ print("[I] Running StableDiffusion pipeline")
+ if self.nvtx_profile:
+ cudart.cudaProfilerStart()
+ self.infer(prompt, negative_prompt, height, width, warmup=False, **kwargs)
+ if self.nvtx_profile:
+ cudart.cudaProfilerStop()
diff --git a/demo/Diffusion/txt2img_pipeline.py b/demo/Diffusion/txt2img_pipeline.py
deleted file mode 100755
index 7a87cd1c..00000000
--- a/demo/Diffusion/txt2img_pipeline.py
+++ /dev/null
@@ -1,102 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import numpy as np
-import nvtx
-import time
-import torch
-import tensorrt as trt
-from utilities import TRT_LOGGER
-from stable_diffusion_pipeline import StableDiffusionPipeline
-
-class Txt2ImgPipeline(StableDiffusionPipeline):
- """
- Application showcasing the acceleration of Stable Diffusion Txt2Img v1.4, v1.5, v2.0, v2.0-base, v2.1, v2.1-base pipeline using NVidia TensorRT w/ Plugins.
- """
- def __init__(
- self,
- scheduler="DDIM",
- *args, **kwargs
- ):
- """
- Initializes the Txt2Img Diffusion pipeline.
-
- Args:
- scheduler (str):
- The scheduler to guide the denoising process. Must be one of the [DPM, LMSD, DDIM, EulerA, PNDM].
- """
- super(Txt2ImgPipeline, self).__init__(*args, **kwargs, \
- scheduler=scheduler, stages=['clip','unet','vae'])
-
- def infer(
- self,
- prompt,
- negative_prompt,
- image_height,
- image_width,
- seed=None,
- warmup=False,
- verbose=False
- ):
- """
- Run the diffusion pipeline.
-
- Args:
- prompt (str):
- The text prompt to guide image generation.
- negative_prompt (str):
- The prompt not to guide the image generation.
- image_height (int):
- Height (in pixels) of the image to be generated. Must be a multiple of 8.
- image_width (int):
- Width (in pixels) of the image to be generated. Must be a multiple of 8.
- seed (int):
- Seed for the random generator
- warmup (bool):
- Indicate if this is a warmup run.
- verbose (bool):
- Verbose in logging
- """
- assert len(prompt) == len(negative_prompt)
-
- with torch.inference_mode(), torch.autocast("cuda"), trt.Runtime(TRT_LOGGER):
- # Pre-initialize latents
- latents = self.initialize_latents( \
- batch_size=len(prompt), \
- unet_channels=4, \
- latent_height=(image_height // 8), \
- latent_width=(image_width // 8)
- )
-
- torch.cuda.synchronize()
- e2e_tic = time.perf_counter()
-
- # CLIP text encoder
- text_embeddings = self.encode_prompt(prompt, negative_prompt)
-
- # UNet denoiser
- latents = self.denoise_latent(latents, text_embeddings)
-
- # VAE decode latent
- images = self.decode_latent(latents)
-
- torch.cuda.synchronize()
- e2e_toc = time.perf_counter()
-
- if not warmup:
- self.print_summary(self.denoising_steps, e2e_tic, e2e_toc)
- self.save_image(images, 'txt2img', prompt)
diff --git a/demo/Diffusion/utilities.py b/demo/Diffusion/utilities.py
index fad7c4aa..62d582f5 100644
--- a/demo/Diffusion/utilities.py
+++ b/demo/Diffusion/utilities.py
@@ -1,5 +1,4 @@
#
-# Copyright 2022 The HuggingFace Inc. team.
# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
@@ -17,25 +16,36 @@
#
from collections import OrderedDict
-from copy import copy
+from cuda import cudart
+from diffusers.models.lora import LoRACompatibleConv, LoRACompatibleLinear
+from diffusers.utils.torch_utils import randn_tensor
+from enum import Enum, auto
+import gc
+from io import BytesIO
import numpy as np
import onnx
+from onnx import numpy_helper
import onnx_graphsurgeon as gs
import os
-import math
from PIL import Image
from polygraphy.backend.common import bytes_from_path
-from polygraphy.backend.trt import CreateConfig, Profile
-from polygraphy.backend.trt import engine_from_bytes, engine_from_network, network_from_onnx_path, save_engine
-from polygraphy.backend.trt import util as trt_util
-from polygraphy import cuda
+from polygraphy.backend.trt import (
+ CreateConfig,
+ ModifyNetworkOutputs,
+ Profile,
+ engine_from_bytes,
+ engine_from_network,
+ network_from_onnx_path,
+ save_engine
+)
+from polygraphy.logger import G_LOGGER
import random
+import re
+import requests
from scipy import integrate
import tensorrt as trt
import torch
-import requests
-from io import BytesIO
-from cuda import cudart
+import types
TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
@@ -60,6 +70,79 @@
# Map of torch dtype -> numpy dtype
torch_to_numpy_dtype_dict = {value : key for (key, value) in numpy_to_torch_dtype_dict.items()}
+def unload_model(model):
+ if model:
+ del model
+ torch.cuda.empty_cache()
+ gc.collect()
+
+def replace_lora_layers(model):
+ def lora_forward(self, x, scale=None):
+ return self._torch_forward(x)
+
+ for name, module in model.named_modules():
+ if isinstance(module, LoRACompatibleConv):
+ in_channels = module.in_channels
+ out_channels = module.out_channels
+ kernel_size = module.kernel_size
+ stride = module.stride
+ padding = module.padding
+ dilation = module.dilation
+ groups = module.groups
+ bias = module.bias
+
+ new_conv = torch.nn.Conv2d(
+ in_channels,
+ out_channels,
+ kernel_size,
+ stride=stride,
+ padding=padding,
+ dilation=dilation,
+ groups=groups,
+ bias=bias is not None,
+ )
+
+ new_conv.weight.data = module.weight.data.clone().to(module.weight.data.device)
+ if bias is not None:
+ new_conv.bias.data = module.bias.data.clone().to(module.bias.data.device)
+
+ # Replace the LoRACompatibleConv layer with the Conv2d layer
+ path = name.split(".")
+ sub_module = model
+ for p in path[:-1]:
+ sub_module = getattr(sub_module, p)
+ setattr(sub_module, path[-1], new_conv)
+ new_conv._torch_forward = new_conv.forward
+ new_conv.forward = types.MethodType(lora_forward, new_conv)
+
+ elif isinstance(module, LoRACompatibleLinear):
+ in_features = module.in_features
+ out_features = module.out_features
+ bias = module.bias
+
+ new_linear = torch.nn.Linear(in_features, out_features, bias=bias is not None)
+
+ new_linear.weight.data = module.weight.data.clone().to(module.weight.data.device)
+ if bias is not None:
+ new_linear.bias.data = module.bias.data.clone().to(module.bias.data.device)
+
+ # Replace the LoRACompatibleLinear layer with the Linear layer
+ path = name.split(".")
+ sub_module = model
+ for p in path[:-1]:
+ sub_module = getattr(sub_module, p)
+ setattr(sub_module, path[-1], new_linear)
+ new_linear._torch_forward = new_linear.forward
+ new_linear.forward = types.MethodType(lora_forward, new_linear)
+
+def merge_loras(model, lora_dict, lora_alphas, lora_scales):
+ assert len(lora_scales) == len(lora_dict)
+ for path, lora in lora_dict.items():
+ print(f"[I] Fusing LoRA: {path}, scale {lora_scales[path]}")
+ model.load_attn_procs(lora, network_alphas=lora_alphas[path])
+ model.fuse_lora(lora_scale=lora_scales[path])
+ return model
+
def CUASSERT(cuda_ret):
err = cuda_ret[0]
if err != cudart.cudaError_t.cudaSuccess:
@@ -68,6 +151,35 @@ def CUASSERT(cuda_ret):
return cuda_ret[1]
return None
+class PIPELINE_TYPE(Enum):
+ TXT2IMG = auto()
+ IMG2IMG = auto()
+ INPAINT = auto()
+ CONTROLNET = auto()
+ XL_BASE = auto()
+ XL_REFINER = auto()
+
+ def is_txt2img(self):
+ return self == self.TXT2IMG
+
+ def is_img2img(self):
+ return self == self.IMG2IMG
+
+ def is_inpaint(self):
+ return self == self.INPAINT
+
+ def is_controlnet(self):
+ return self == self.CONTROLNET
+
+ def is_sd_xl_base(self):
+ return self == self.XL_BASE
+
+ def is_sd_xl_refiner(self):
+ return self == self.XL_REFINER
+
+ def is_sd_xl(self):
+ return self.is_sd_xl_base() or self.is_sd_xl_refiner()
+
class Engine():
def __init__(
self,
@@ -81,116 +193,55 @@ def __init__(
self.cuda_graph_instance = None # cuda graph
def __del__(self):
- [buf.free() for buf in self.buffers.values() if isinstance(buf, cuda.DeviceArray) ]
del self.engine
del self.context
del self.buffers
del self.tensors
- def refit(self, onnx_path, onnx_refit_path):
- def convert_int64(arr):
- # TODO: smarter conversion
- if len(arr.shape) == 0:
- return np.int32(arr)
- return arr
-
- def add_to_map(refit_dict, name, values):
- if name in refit_dict:
- assert refit_dict[name] is None
- if values.dtype == np.int64:
- values = convert_int64(values)
- refit_dict[name] = values
-
- print(f"Refitting TensorRT engine with {onnx_refit_path} weights")
- refit_nodes = gs.import_onnx(onnx.load(onnx_refit_path)).toposort().nodes
-
- # Construct mapping from weight names in refit model -> original model
- name_map = {}
- for n, node in enumerate(gs.import_onnx(onnx.load(onnx_path)).toposort().nodes):
- refit_node = refit_nodes[n]
- assert node.op == refit_node.op
- # Constant nodes in ONNX do not have inputs but have a constant output
- if node.op == "Constant":
- name_map[refit_node.outputs[0].name] = node.outputs[0].name
- # Handle scale and bias weights
- elif node.op == "Conv":
- if node.inputs[1].__class__ == gs.Constant:
- name_map[refit_node.name+"_TRTKERNEL"] = node.name+"_TRTKERNEL"
- if node.inputs[2].__class__ == gs.Constant:
- name_map[refit_node.name+"_TRTBIAS"] = node.name+"_TRTBIAS"
- # For all other nodes: find node inputs that are initializers (gs.Constant)
- else:
- for i, inp in enumerate(node.inputs):
- if inp.__class__ == gs.Constant:
- name_map[refit_node.inputs[i].name] = inp.name
- def map_name(name):
- if name in name_map:
- return name_map[name]
- return name
-
- # Construct refit dictionary
- refit_dict = {}
+ def refit(self, refit_weights, is_fp16):
+ # Initialize refitter
refitter = trt.Refitter(self.engine, TRT_LOGGER)
- all_weights = refitter.get_all()
- for layer_name, role in zip(all_weights[0], all_weights[1]):
- # for speciailized roles, use a unique name in the map:
- if role == trt.WeightsRole.KERNEL:
- name = layer_name+"_TRTKERNEL"
- elif role == trt.WeightsRole.BIAS:
- name = layer_name+"_TRTBIAS"
- else:
- name = layer_name
-
- assert name not in refit_dict, "Found duplicate layer: " + name
- refit_dict[name] = None
-
-
- for n in refit_nodes:
- # Constant nodes in ONNX do not have inputs but have a constant output
- if n.op == "Constant":
- name = map_name(n.outputs[0].name)
- print(f"Add Constant {name}\n")
- add_to_map(refit_dict, name, n.outputs[0].values)
- # Handle scale and bias weights
- elif n.op == "Conv":
- if n.inputs[1].__class__ == gs.Constant:
- name = map_name(n.name+"_TRTKERNEL")
- add_to_map(refit_dict, name, n.inputs[1].values)
+ refitted_weights = set()
+ # iterate through all tensorrt refittable weights
+ for trt_weight_name in refitter.get_all_weights():
+ if trt_weight_name not in refit_weights:
+ continue
- if n.inputs[2].__class__ == gs.Constant:
- name = map_name(n.name+"_TRTBIAS")
- add_to_map(refit_dict, name, n.inputs[2].values)
+ # get weight from state dict
+ trt_datatype = trt.DataType.FLOAT
+ if is_fp16:
+ refit_weights[trt_weight_name] = refit_weights[trt_weight_name].half()
+ trt_datatype = trt.DataType.HALF
- # For all other nodes: find node inputs that are initializers (AKA gs.Constant)
- else:
- for inp in n.inputs:
- name = map_name(inp.name)
- if inp.__class__ == gs.Constant:
- add_to_map(refit_dict, name, inp.values)
-
- for layer_name, weights_role in zip(all_weights[0], all_weights[1]):
- if weights_role == trt.WeightsRole.KERNEL:
- custom_name = layer_name+"_TRTKERNEL"
- elif weights_role == trt.WeightsRole.BIAS:
- custom_name = layer_name+"_TRTBIAS"
- else:
- custom_name = layer_name
+ # construct trt.Weights and determine the trt.TensorLocation for this tensor
+ trt_wt_tensor = trt.Weights(trt_datatype, refit_weights[trt_weight_name].data_ptr(), torch.numel(refit_weights[trt_weight_name]))
+ trt_wt_location = trt.TensorLocation.DEVICE if refit_weights[trt_weight_name].is_cuda else trt.TensorLocation.HOST
- # Skip refitting Trilu for now; scalar weights of type int64 value 1 - for clip model
- if layer_name.startswith("onnx::Trilu"):
- continue
-
- if refit_dict[custom_name] is not None:
- refitter.set_weights(layer_name, weights_role, refit_dict[custom_name])
- else:
- print(f"[W] No refit weights for layer: {layer_name}")
+ # apply refit
+ refitter.set_named_weights(trt_weight_name, trt_wt_tensor, trt_wt_location)
+ refitted_weights.add(trt_weight_name)
+ assert set(refitted_weights) == set(refit_weights.keys())
if not refitter.refit_cuda_engine():
- print("Failed to refit!")
+ print("Error: failed to refit new weights.")
exit(0)
- def build(self, onnx_path, fp16, input_profile=None, enable_refit=False, enable_preview=False, enable_all_tactics=False, timing_cache=None, workspace_size=0):
+ print(f"[I] Total refitted weights {len(refitted_weights)}.")
+
+ def build(self,
+ onnx_path,
+ fp16=True,
+ tf32=False,
+ int8=False,
+ input_profile=None,
+ enable_refit=False,
+ enable_all_tactics=False,
+ timing_cache=None,
+ update_output_names=None,
+ verbose=False,
+ **extra_build_args
+ ):
print(f"Building TensorRT engine for {onnx_path}: {self.engine_path}")
p = Profile()
if input_profile:
@@ -198,28 +249,27 @@ def build(self, onnx_path, fp16, input_profile=None, enable_refit=False, enable_
assert len(dims) == 3
p.add(name, min=dims[0], opt=dims[1], max=dims[2])
- config_kwargs = {}
-
- config_kwargs['preview_features'] = [trt.PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
- if enable_preview:
- # Faster dynamic shapes made optional since it increases engine build time.
- config_kwargs['preview_features'].append(trt.PreviewFeature.FASTER_DYNAMIC_SHAPES_0805)
- if workspace_size > 0:
- config_kwargs['memory_pool_limits'] = {trt.MemoryPoolType.WORKSPACE: workspace_size}
if not enable_all_tactics:
- config_kwargs['tactic_sources'] = []
-
- engine = engine_from_network(
- network_from_onnx_path(onnx_path, flags=[trt.OnnxParserFlag.NATIVE_INSTANCENORM]),
- config=CreateConfig(fp16=fp16,
- refittable=enable_refit,
- profiles=[p],
- load_timing_cache=timing_cache,
- **config_kwargs
- ),
- save_timing_cache=timing_cache
- )
- save_engine(engine, path=self.engine_path)
+ extra_build_args['tactic_sources'] = []
+
+ network = network_from_onnx_path(onnx_path, flags=[trt.OnnxParserFlag.NATIVE_INSTANCENORM])
+ if update_output_names:
+ print(f"Updating network outputs to {update_output_names}")
+ network = ModifyNetworkOutputs(network, update_output_names)
+ with G_LOGGER.verbosity(G_LOGGER.EXTRA_VERBOSE if verbose else G_LOGGER.ERROR):
+ engine = engine_from_network(
+ network,
+ config=CreateConfig(fp16=fp16,
+ tf32=tf32,
+ int8=int8,
+ refittable=enable_refit,
+ profiles=[p],
+ load_timing_cache=timing_cache,
+ **extra_build_args
+ ),
+ save_timing_cache=timing_cache
+ )
+ save_engine(engine, path=self.engine_path)
def load(self):
print(f"Loading TensorRT engine: {self.engine_path}")
@@ -233,19 +283,21 @@ def activate(self, reuse_device_memory=None):
self.context = self.engine.create_execution_context()
def allocate_buffers(self, shape_dict=None, device='cuda'):
- for idx in range(trt_util.get_bindings_per_profile(self.engine)):
- binding = self.engine[idx]
- if shape_dict and binding in shape_dict:
- shape = shape_dict[binding]
+ for binding in range(self.engine.num_io_tensors):
+ name = self.engine.get_tensor_name(binding)
+ if shape_dict and name in shape_dict:
+ shape = shape_dict[name]
else:
- shape = self.engine.get_binding_shape(binding)
- dtype = trt.nptype(self.engine.get_binding_dtype(binding))
- if self.engine.binding_is_input(binding):
- self.context.set_binding_shape(idx, shape)
+ shape = self.engine.get_tensor_shape(name)
+ dtype = trt.nptype(self.engine.get_tensor_dtype(name))
+ if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
+ self.context.set_input_shape(name, shape)
tensor = torch.empty(tuple(shape), dtype=numpy_to_torch_dtype_dict[dtype]).to(device=device)
- self.tensors[binding] = tensor
+ self.tensors[name] = tensor
+
def infer(self, feed_dict, stream, use_cuda_graph=False):
+
for name, buf in feed_dict.items():
self.tensors[name].copy_(buf)
@@ -254,883 +306,25 @@ def infer(self, feed_dict, stream, use_cuda_graph=False):
if use_cuda_graph:
if self.cuda_graph_instance is not None:
- CUASSERT(cudart.cudaGraphLaunch(self.cuda_graph_instance, stream.ptr))
- CUASSERT(cudart.cudaStreamSynchronize(stream.ptr))
+ CUASSERT(cudart.cudaGraphLaunch(self.cuda_graph_instance, stream))
+ CUASSERT(cudart.cudaStreamSynchronize(stream))
else:
# do inference before CUDA graph capture
- noerror = self.context.execute_async_v3(stream.ptr)
+ noerror = self.context.execute_async_v3(stream)
if not noerror:
raise ValueError(f"ERROR: inference failed.")
# capture cuda graph
- CUASSERT(cudart.cudaStreamBeginCapture(stream.ptr, cudart.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal))
- self.context.execute_async_v3(stream.ptr)
- self.graph = CUASSERT(cudart.cudaStreamEndCapture(stream.ptr))
+ CUASSERT(cudart.cudaStreamBeginCapture(stream, cudart.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal))
+ self.context.execute_async_v3(stream)
+ self.graph = CUASSERT(cudart.cudaStreamEndCapture(stream))
self.cuda_graph_instance = CUASSERT(cudart.cudaGraphInstantiate(self.graph, 0))
else:
- noerror = self.context.execute_async_v3(stream.ptr)
+ noerror = self.context.execute_async_v3(stream)
if not noerror:
raise ValueError(f"ERROR: inference failed.")
return self.tensors
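A hedged sketch of driving the updated `Engine.infer()` path, which now takes a raw cudart stream handle rather than a wrapper object with a `.ptr` attribute. The engine path, tensor name, and shapes below are placeholders, and the `Engine` constructor is assumed to take the serialized engine path.

```python
# Hedged sketch: engine path, tensor names, and shapes are placeholders.
import torch
from cuda import cudart
from utilities import Engine, CUASSERT

engine = Engine("engine/unet.trt")                  # hypothetical engine file
engine.load()
engine.activate()
engine.allocate_buffers(shape_dict={"sample": (2, 4, 64, 64)}, device="cuda")

stream = CUASSERT(cudart.cudaStreamCreate())        # raw cudaStream_t handle, no .ptr wrapper
feed_dict = {"sample": torch.randn(2, 4, 64, 64, device="cuda")}
outputs = engine.infer(feed_dict, stream, use_cuda_graph=False)
```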
-
-class LMSDiscreteScheduler():
- def __init__(
- self,
- device = 'cuda',
- beta_start = 0.00085,
- beta_end = 0.012,
- num_train_timesteps = 1000,
- steps_offset = 0,
- prediction_type = 'epsilon'
- ):
- self.num_train_timesteps = num_train_timesteps
- self.order = 4
-
- self.beta_start = beta_start
- self.beta_end = beta_end
- betas = (torch.linspace(beta_start**0.5, beta_end**0.5, self.num_train_timesteps, dtype=torch.float32) ** 2)
- alphas = 1.0 - betas
- self.alphas_cumprod = torch.cumprod(alphas, dim=0)
-
- sigmas = np.array(((1 - self.alphas_cumprod) / self.alphas_cumprod) ** 0.5)
- sigmas = np.concatenate([sigmas[::-1], [0.0]]).astype(np.float32)
- self.sigmas = torch.from_numpy(sigmas)
-
- # standard deviation of the initial noise distribution
- self.init_noise_sigma = self.sigmas.max()
-
- self.device = device
- self.steps_offset = steps_offset
- self.prediction_type = prediction_type
-
- def set_timesteps(self, steps):
- self.num_inference_steps = steps
-
- timesteps = np.linspace(0, self.num_train_timesteps - 1, steps, dtype=float)[::-1].copy()
- sigmas = np.array(((1 - self.alphas_cumprod) / self.alphas_cumprod) ** 0.5)
- sigmas = np.interp(timesteps, np.arange(0, len(sigmas)), sigmas)
- sigmas = np.concatenate([sigmas, [0.0]]).astype(np.float32)
- self.sigmas = torch.from_numpy(sigmas).to(device=self.device)
-
- # Move all timesteps to correct device beforehand
- self.timesteps = torch.from_numpy(timesteps).to(device=self.device).float()
- self.derivatives = []
-
- def scale_model_input(self, sample: torch.FloatTensor, idx, *args, **kwargs) -> torch.FloatTensor:
- return sample * self.latent_scales[idx]
-
- def configure(self):
- order = self.order
- self.lms_coeffs = []
- self.latent_scales = [1./((sigma**2 + 1) ** 0.5) for sigma in self.sigmas]
-
- def get_lms_coefficient(order, t, current_order):
- """
- Compute a linear multistep coefficient.
- """
- def lms_derivative(tau):
- prod = 1.0
- for k in range(order):
- if current_order == k:
- continue
- prod *= (tau - self.sigmas[t - k]) / (self.sigmas[t - current_order] - self.sigmas[t - k])
- return prod
- integrated_coeff = integrate.quad(lms_derivative, self.sigmas[t], self.sigmas[t + 1], epsrel=1e-4)[0]
- return integrated_coeff
-
- for step_index in range(self.num_inference_steps):
- order = min(step_index + 1, order)
- self.lms_coeffs.append([get_lms_coefficient(order, step_index, curr_order) for curr_order in range(order)])
-
- def step(self, output, latents, idx, timestep):
- # compute the previous noisy sample x_t -> x_t-1
- # 1. compute predicted original sample (x_0) from sigma-scaled predicted noise
- sigma = self.sigmas[idx]
- if self.prediction_type == "epsilon":
- pred_original_sample = latents - sigma * output
- elif self.prediction_type == "v_prediction":
- # * c_out + input * c_skip
- pred_original_sample = output * (-sigma / (sigma**2 + 1) ** 0.5) + (latents / (sigma**2 + 1))
- else:
- raise ValueError(
- f"prediction_type given as {self.config.prediction_type} must be one of `epsilon`, or `v_prediction`"
- )
- # 2. Convert to an ODE derivative
- derivative = (latents - pred_original_sample) / sigma
- self.derivatives.append(derivative)
- if len(self.derivatives) > self.order:
- self.derivatives.pop(0)
- # 3. Compute previous sample based on the derivatives path
- prev_sample = latents + sum(
- coeff * derivative for coeff, derivative in zip(self.lms_coeffs[idx], reversed(self.derivatives))
- )
-
- return prev_sample
-
- def add_noise(self, init_latents, noise, idx, latent_timestep):
- sigma = self.sigmas[idx]
-
- noisy_latents = init_latents + noise * sigma
- return noisy_latents
-
-class DDIMScheduler():
- def __init__(
- self,
- device='cuda',
- num_train_timesteps: int = 1000,
- beta_start: float = 0.0001,
- beta_end: float = 0.02,
- clip_sample: bool = False,
- set_alpha_to_one: bool = False,
- steps_offset: int = 1,
- prediction_type: str = "epsilon",
- ):
- # this schedule is very specific to the latent diffusion model.
- betas = (
- torch.linspace(beta_start**0.5, beta_end**0.5, num_train_timesteps, dtype=torch.float32) ** 2
- )
-
- alphas = 1.0 - betas
- self.alphas_cumprod = torch.cumprod(alphas, dim=0)
-
- # standard deviation of the initial noise distribution
- self.init_noise_sigma = 1.0
-
- # At every step in ddim, we are looking into the previous alphas_cumprod
- # For the final step, there is no previous alphas_cumprod because we are already at 0
- # `set_alpha_to_one` decides whether we set this parameter simply to one or
- # whether we use the final alpha of the "non-previous" one.
- self.final_alpha_cumprod = torch.tensor(1.0) if set_alpha_to_one else self.alphas_cumprod[0]
-
- # setable values
- self.num_inference_steps = None
- self.timesteps = torch.from_numpy(np.arange(0, num_train_timesteps)[::-1].copy().astype(np.int64))
- self.steps_offset = steps_offset
- self.num_train_timesteps = num_train_timesteps
- self.clip_sample = clip_sample
- self.prediction_type = prediction_type
- self.device = device
-
- def configure(self):
- variance = np.zeros(self.num_inference_steps, dtype=np.float32)
- for idx, timestep in enumerate(self.timesteps):
- prev_timestep = timestep - self.num_train_timesteps // self.num_inference_steps
- variance[idx] = self._get_variance(timestep, prev_timestep)
- self.variance = torch.from_numpy(variance).to(self.device)
-
- timesteps = self.timesteps.long().cpu()
- self.alphas_cumprod = self.alphas_cumprod[timesteps].to(self.device)
- self.final_alpha_cumprod = self.final_alpha_cumprod.to(self.device)
-
- def scale_model_input(self, sample: torch.FloatTensor, idx, *args, **kwargs) -> torch.FloatTensor:
- return sample
-
- def _get_variance(self, timestep, prev_timestep):
- alpha_prod_t = self.alphas_cumprod[timestep]
- alpha_prod_t_prev = self.alphas_cumprod[prev_timestep] if prev_timestep >= 0 else self.final_alpha_cumprod
- beta_prod_t = 1 - alpha_prod_t
- beta_prod_t_prev = 1 - alpha_prod_t_prev
-
- variance = (beta_prod_t_prev / beta_prod_t) * (1 - alpha_prod_t / alpha_prod_t_prev)
-
- return variance
-
- def set_timesteps(self, num_inference_steps: int):
- self.num_inference_steps = num_inference_steps
- step_ratio = self.num_train_timesteps // self.num_inference_steps
- # creates integer timesteps by multiplying by ratio
- # casting to int to avoid issues when num_inference_step is power of 3
- timesteps = (np.arange(0, num_inference_steps) * step_ratio).round()[::-1].copy().astype(np.int64)
- self.timesteps = torch.from_numpy(timesteps).to(self.device)
- self.timesteps += self.steps_offset
-
- def step(self, model_output, sample, idx, timestep,
- eta: float = 0.0,
- use_clipped_model_output: bool = False,
- generator=None,
- variance_noise: torch.FloatTensor = None,
- ):
- if self.num_inference_steps is None:
- raise ValueError(
- "Number of inference steps is 'None', you need to run 'set_timesteps' after creating the scheduler"
- )
-
- # See formulas (12) and (16) of DDIM paper https://arxiv.org/pdf/2010.02502.pdf
- # Ideally, read DDIM paper in-detail understanding
-
- # Notation ( ->
- # - pred_noise_t -> e_theta(x_t, t)
- # - pred_original_sample -> f_theta(x_t, t) or x_0
- # - std_dev_t -> sigma_t
- # - eta -> η
- # - pred_sample_direction -> "direction pointing to x_t"
- # - pred_prev_sample -> "x_t-1"
-
- prev_idx = idx + 1
- alpha_prod_t = self.alphas_cumprod[idx]
- alpha_prod_t_prev = self.alphas_cumprod[prev_idx] if prev_idx < self.num_inference_steps else self.final_alpha_cumprod
-
- beta_prod_t = 1 - alpha_prod_t
-
- # 3. compute predicted original sample from predicted noise also called
- # "predicted x_0" of formula (12) from https://arxiv.org/pdf/2010.02502.pdf
- if self.prediction_type == "epsilon":
- pred_original_sample = (sample - beta_prod_t ** (0.5) * model_output) / alpha_prod_t ** (0.5)
- elif self.prediction_type == "sample":
- pred_original_sample = model_output
- elif self.prediction_type == "v_prediction":
- pred_original_sample = (alpha_prod_t**0.5) * sample - (beta_prod_t**0.5) * model_output
- # predict V
- model_output = (alpha_prod_t**0.5) * model_output + (beta_prod_t**0.5) * sample
- else:
- raise ValueError(
- f"prediction_type given as {self.prediction_type} must be one of `epsilon`, `sample`, or"
- " `v_prediction`"
- )
-
- # 4. Clip "predicted x_0"
- if self.clip_sample:
- pred_original_sample = torch.clamp(pred_original_sample, -1, 1)
-
- # 5. compute variance: "sigma_t(η)" -> see formula (16)
- # σ_t = sqrt((1 − α_t−1)/(1 − α_t)) * sqrt(1 − α_t/α_t−1)
- variance = self.variance[idx]
- std_dev_t = eta * variance ** (0.5)
-
- if use_clipped_model_output:
- # the model_output is always re-derived from the clipped x_0 in Glide
- model_output = (sample - alpha_prod_t ** (0.5) * pred_original_sample) / beta_prod_t ** (0.5)
-
- # 6. compute "direction pointing to x_t" of formula (12) from https://arxiv.org/pdf/2010.02502.pdf
- pred_sample_direction = (1 - alpha_prod_t_prev - std_dev_t**2) ** (0.5) * model_output
-
- # 7. compute x_t without "random noise" of formula (12) from https://arxiv.org/pdf/2010.02502.pdf
- prev_sample = alpha_prod_t_prev ** (0.5) * pred_original_sample + pred_sample_direction
-
- if eta > 0:
- # randn_like does not support generator https://github.com/pytorch/pytorch/issues/27072
- device = model_output.device
- if variance_noise is not None and generator is not None:
- raise ValueError(
- "Cannot pass both generator and variance_noise. Please make sure that either `generator` or"
- " `variance_noise` stays `None`."
- )
-
- if variance_noise is None:
- variance_noise = torch.randn(
- model_output.shape, generator=generator, device=device, dtype=model_output.dtype
- )
- variance = variance ** (0.5) * eta * variance_noise
-
- prev_sample = prev_sample + variance
-
- return prev_sample
-
- def add_noise(self, init_latents, noise, idx, latent_timestep):
- sqrt_alpha_prod = self.alphas_cumprod[idx] ** 0.5
- sqrt_one_minus_alpha_prod = (1 - self.alphas_cumprod[idx]) ** 0.5
- noisy_latents = sqrt_alpha_prod * init_latents + sqrt_one_minus_alpha_prod * noise
-
- return noisy_latents
-
-
-class EulerAncestralDiscreteScheduler():
- def __init__(
- self,
- num_train_timesteps: int = 1000,
- beta_start: float = 0.0001,
- beta_end: float = 0.02,
- device = 'cuda',
- steps_offset = 0,
- prediction_type = "epsilon"
- ):
- # this schedule is very specific to the latent diffusion model.
- betas = (
- torch.linspace(beta_start**0.5, beta_end**0.5, num_train_timesteps, dtype=torch.float32) ** 2
- )
-
- alphas = 1.0 - betas
- self.alphas_cumprod = torch.cumprod(alphas, dim=0)
-
- sigmas = np.array(((1 - self.alphas_cumprod) / self.alphas_cumprod) ** 0.5)
- sigmas = np.concatenate([sigmas[::-1], [0.0]]).astype(np.float32)
- self.sigmas = torch.from_numpy(sigmas)
-
- # standard deviation of the initial noise distribution
- self.init_noise_sigma = self.sigmas.max()
-
- # setable values
- self.num_inference_steps = None
- timesteps = np.linspace(0, num_train_timesteps - 1, num_train_timesteps, dtype=float)[::-1].copy()
- self.timesteps = torch.from_numpy(timesteps)
- self.is_scale_input_called = False
- self.device = device
- self.num_train_timesteps = num_train_timesteps
- self.steps_offset = steps_offset
- self.prediction_type = prediction_type
-
- def scale_model_input(
- self, sample: torch.FloatTensor, idx, timestep, *args, **kwargs
- ) -> torch.FloatTensor:
- if isinstance(timestep, torch.Tensor):
- timestep = timestep.to(self.timesteps.device)
- step_index = (self.timesteps == timestep).nonzero().item()
- sigma = self.sigmas[step_index]
- sample = sample / ((sigma**2 + 1) ** 0.5)
- self.is_scale_input_called = True
- return sample
-
- def set_timesteps(self, num_inference_steps: int):
- self.num_inference_steps = num_inference_steps
-
- timesteps = np.linspace(0, self.num_train_timesteps - 1, num_inference_steps, dtype=np.float32)[::-1].copy()
- sigmas = np.array(((1 - self.alphas_cumprod) / self.alphas_cumprod) ** 0.5)
- sigmas = np.interp(timesteps, np.arange(0, len(sigmas)), sigmas)
- sigmas = np.concatenate([sigmas, [0.0]]).astype(np.float32)
- self.sigmas = torch.from_numpy(sigmas).to(device=self.device)
- self.timesteps = torch.from_numpy(timesteps).to(device=self.device)
-
- def configure(self):
- dts = np.zeros(self.num_inference_steps, dtype=np.float32)
- sigmas_up = np.zeros(self.num_inference_steps, dtype=np.float32)
- for idx, timestep in enumerate(self.timesteps):
- step_index = (self.timesteps == timestep).nonzero().item()
- sigma = self.sigmas[step_index]
-
- sigma_from = self.sigmas[step_index]
- sigma_to = self.sigmas[step_index + 1]
- sigma_up = (sigma_to**2 * (sigma_from**2 - sigma_to**2) / sigma_from**2) ** 0.5
- sigma_down = (sigma_to**2 - sigma_up**2) ** 0.5
- dt = sigma_down - sigma
- dts[idx] = dt
- sigmas_up[idx] = sigma_up
-
- self.dts = torch.from_numpy(dts).to(self.device)
- self.sigmas_up = torch.from_numpy(sigmas_up).to(self.device)
-
- def step(
- self, model_output, sample, idx, timestep,
- generator = None,
- ):
- step_index = (self.timesteps == timestep).nonzero().item()
- sigma = self.sigmas[step_index]
-
- # 1. compute predicted original sample (x_0) from sigma-scaled predicted noise
- if self.prediction_type == "epsilon":
- pred_original_sample = sample - sigma * model_output
- elif self.prediction_type == "v_prediction":
- # * c_out + input * c_skip
- pred_original_sample = model_output * (-sigma / (sigma**2 + 1) ** 0.5) + (sample / (sigma**2 + 1))
- else:
- raise ValueError(
- f"prediction_type given as {self.prediction_type} must be one of `epsilon`, or `v_prediction`"
- )
-
- sigma_up = self.sigmas_up[idx]
-
- # 2. Convert to an ODE derivative
- derivative = (sample - pred_original_sample) / sigma
-
- dt = self.dts[idx]
-
- prev_sample = sample + derivative * dt
-
- device = model_output.device
- noise = torch.randn(model_output.shape, dtype=model_output.dtype, device=device, generator=generator).to(
- device
- )
-
- prev_sample = prev_sample + noise * sigma_up
-
- return prev_sample
-
- def add_noise(
- self, original_samples, noise, idx, timestep=None):
- step_index = (self.timesteps == timestep).nonzero().item()
- noisy_samples = original_samples + noise * self.sigmas[step_index]
- return noisy_samples
-
-
-class DPMScheduler():
- def __init__(
- self,
- beta_start = 0.00085,
- beta_end = 0.012,
- num_train_timesteps = 1000,
- solver_order = 2,
- predict_epsilon = True,
- thresholding = False,
- dynamic_thresholding_ratio = 0.995,
- sample_max_value = 1.0,
- algorithm_type = "dpmsolver++",
- solver_type = "midpoint",
- lower_order_final = True,
- device = 'cuda',
- steps_offset = 0,
- prediction_type = 'epsilon'
- ):
- # this schedule is very specific to the latent diffusion model.
- self.betas = (
- torch.linspace(beta_start**0.5, beta_end**0.5, num_train_timesteps, dtype=torch.float32) ** 2
- )
-
- self.device = device
- self.alphas = 1.0 - self.betas
- self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
- # Currently we only support VP-type noise schedule
- self.alpha_t = torch.sqrt(self.alphas_cumprod)
- self.sigma_t = torch.sqrt(1 - self.alphas_cumprod)
- self.lambda_t = torch.log(self.alpha_t) - torch.log(self.sigma_t)
- self.steps_offset = steps_offset
-
- # standard deviation of the initial noise distribution
- self.init_noise_sigma = 1.0
-
- self.algorithm_type = algorithm_type
- self.predict_epsilon = predict_epsilon
- self.thresholding = thresholding
- self.dynamic_thresholding_ratio = dynamic_thresholding_ratio
- self.sample_max_value = sample_max_value
- self.lower_order_final = lower_order_final
- self.prediction_type = prediction_type
-
- # settings for DPM-Solver
- if algorithm_type not in ["dpmsolver", "dpmsolver++"]:
- raise NotImplementedError(f"{algorithm_type} does is not implemented for {self.__class__}")
- if solver_type not in ["midpoint", "heun"]:
- raise NotImplementedError(f"{solver_type} does is not implemented for {self.__class__}")
-
- # setable values
- self.num_inference_steps = None
- self.solver_order = solver_order
- self.num_train_timesteps = num_train_timesteps
- self.solver_type = solver_type
-
- self.first_order_first_coef = []
- self.first_order_second_coef = []
-
- self.second_order_first_coef = []
- self.second_order_second_coef = []
- self.second_order_third_coef = []
-
- self.third_order_first_coef = []
- self.third_order_second_coef = []
- self.third_order_third_coef = []
- self.third_order_fourth_coef = []
-
- def scale_model_input(self, sample: torch.FloatTensor, *args, **kwargs) -> torch.FloatTensor:
- return sample
-
- def configure(self):
- lower_order_nums = 0
- for step_index in range(self.num_inference_steps):
- step_idx = step_index
- timestep = self.timesteps[step_idx]
-
- prev_timestep = 0 if step_idx == len(self.timesteps) - 1 else self.timesteps[step_idx + 1]
-
- self.dpm_solver_first_order_coefs_precompute(timestep, prev_timestep)
-
- timestep_list = [self.timesteps[step_index - 1], timestep]
- self.multistep_dpm_solver_second_order_coefs_precompute(timestep_list, prev_timestep)
-
- timestep_list = [self.timesteps[step_index - 2], self.timesteps[step_index - 1], timestep]
- self.multistep_dpm_solver_third_order_coefs_precompute(timestep_list, prev_timestep)
-
- if lower_order_nums < self.solver_order:
- lower_order_nums += 1
-
- def dpm_solver_first_order_coefs_precompute(self, timestep, prev_timestep):
- lambda_t, lambda_s = self.lambda_t[prev_timestep], self.lambda_t[timestep]
- alpha_t, alpha_s = self.alpha_t[prev_timestep], self.alpha_t[timestep]
- sigma_t, sigma_s = self.sigma_t[prev_timestep], self.sigma_t[timestep]
- h = lambda_t - lambda_s
- if self.algorithm_type == "dpmsolver++":
- self.first_order_first_coef.append(sigma_t / sigma_s)
- self.first_order_second_coef.append(alpha_t * (torch.exp(-h) - 1.0))
- elif self.algorithm_type == "dpmsolver":
- self.first_order_first_coef.append(alpha_t / alpha_s)
- self.first_order_second_coef.append(sigma_t * (torch.exp(h) - 1.0))
-
- def multistep_dpm_solver_second_order_coefs_precompute(self, timestep_list, prev_timestep):
- t, s0, s1 = prev_timestep, timestep_list[-1], timestep_list[-2]
- lambda_t, lambda_s0, lambda_s1 = self.lambda_t[t], self.lambda_t[s0], self.lambda_t[s1]
- alpha_t, alpha_s0 = self.alpha_t[t], self.alpha_t[s0]
- sigma_t, sigma_s0 = self.sigma_t[t], self.sigma_t[s0]
- h = lambda_t - lambda_s0
- if self.algorithm_type == "dpmsolver++":
- # See https://arxiv.org/abs/2211.01095 for detailed derivations
- if self.solver_type == "midpoint":
- self.second_order_first_coef.append(sigma_t / sigma_s0)
- self.second_order_second_coef.append((alpha_t * (torch.exp(-h) - 1.0)))
- self.second_order_third_coef.append(0.5 * (alpha_t * (torch.exp(-h) - 1.0)))
- elif self.solver_type == "heun":
- self.second_order_first_coef.append(sigma_t / sigma_s0)
- self.second_order_second_coef.append((alpha_t * (torch.exp(-h) - 1.0)))
- self.second_order_third_coef.append(alpha_t * ((torch.exp(-h) - 1.0) / h + 1.0))
- elif self.algorithm_type == "dpmsolver":
- # See https://arxiv.org/abs/2206.00927 for detailed derivations
- if self.solver_type == "midpoint":
- self.second_order_first_coef.append(alpha_t / alpha_s0)
- self.second_order_second_coef.append((sigma_t * (torch.exp(h) - 1.0)))
- self.second_order_third_coef.append(0.5 * (sigma_t * (torch.exp(h) - 1.0)))
- elif self.solver_type == "heun":
- self.second_order_first_coef.append(alpha_t / alpha_s0)
- self.second_order_second_coef.append((sigma_t * (torch.exp(h) - 1.0)))
- self.second_order_third_coef.append((sigma_t * ((torch.exp(h) - 1.0) / h - 1.0)))
-
- def multistep_dpm_solver_third_order_coefs_precompute(self, timestep_list, prev_timestep):
- t, s0 = prev_timestep, timestep_list[-1]
- lambda_t, lambda_s0 = (
- self.lambda_t[t],
- self.lambda_t[s0]
- )
- alpha_t, alpha_s0 = self.alpha_t[t], self.alpha_t[s0]
- sigma_t, sigma_s0 = self.sigma_t[t], self.sigma_t[s0]
- h = lambda_t - lambda_s0
- if self.algorithm_type == "dpmsolver++":
- self.third_order_first_coef.append(sigma_t / sigma_s0)
- self.third_order_second_coef.append(alpha_t * (torch.exp(-h) - 1.0))
- self.third_order_third_coef.append(alpha_t * ((torch.exp(-h) - 1.0) / h + 1.0))
- self.third_order_fourth_coef.append(alpha_t * ((torch.exp(-h) - 1.0 + h) / h**2 - 0.5))
- elif self.algorithm_type == "dpmsolver":
- self.third_order_first_coef.append(alpha_t / alpha_s0)
- self.third_order_second_coef.append(sigma_t * (torch.exp(h) - 1.0))
- self.third_order_third_coef.append(sigma_t * ((torch.exp(h) - 1.0) / h - 1.0))
- self.third_order_fourth_coef.append(sigma_t * ((torch.exp(h) - 1.0 - h) / h**2 - 0.5))
-
- def set_timesteps(self, num_inference_steps):
- self.num_inference_steps = num_inference_steps
- timesteps = (
- np.linspace(0, self.num_train_timesteps - 1, num_inference_steps + 1)
- .round()[::-1][:-1]
- .copy()
- .astype(np.int32)
- )
- self.timesteps = torch.from_numpy(timesteps).to(self.device)
- self.model_outputs = [
- None,
- ] * self.solver_order
- self.lower_order_nums = 0
-
- def convert_model_output(
- self, model_output, timestep, sample
- ):
- # DPM-Solver++ needs to solve an integral of the data prediction model.
- if self.algorithm_type == "dpmsolver++":
- if self.prediction_type == "epsilon":
- alpha_t, sigma_t = self.alpha_t[timestep], self.sigma_t[timestep]
- x0_pred = (sample - sigma_t * model_output) / alpha_t
- elif self.prediction_type == "v_prediction":
- alpha_t, sigma_t = self.alpha_t[timestep], self.sigma_t[timestep]
- x0_pred = alpha_t * sample - sigma_t * model_output
- else:
- raise ValueError(
- f"prediction_type given as {self.prediction_type} must be one of `epsilon`, or"
- " `v_prediction` for the DPMScheduler."
- )
-
- if self.thresholding:
- # Dynamic thresholding in https://arxiv.org/abs/2205.11487
- dynamic_max_val = torch.quantile(
- torch.abs(x0_pred).reshape((x0_pred.shape[0], -1)), self.dynamic_thresholding_ratio, dim=1
- )
- dynamic_max_val = torch.maximum(
- dynamic_max_val,
- self.sample_max_value * torch.ones_like(dynamic_max_val).to(dynamic_max_val.device),
- )[(...,) + (None,) * (x0_pred.ndim - 1)]
- x0_pred = torch.clamp(x0_pred, -dynamic_max_val, dynamic_max_val) / dynamic_max_val
- return x0_pred
- # DPM-Solver needs to solve an integral of the noise prediction model.
- elif self.algorithm_type == "dpmsolver":
- if self.prediction_type == "epsilon":
- return model_output
- elif self.prediction_type == "v_prediction":
- alpha_t, sigma_t = self.alpha_t[timestep], self.sigma_t[timestep]
- epsilon = alpha_t * model_output + sigma_t * sample
- return epsilon
- else:
- raise ValueError(
- f"prediction_type given as {self.prediction_type} must be one of `epsilon` or"
- " `v_prediction` for the DPMScheduler."
- )
-
- def dpm_solver_first_order_update(
- self,
- idx,
- model_output,
- sample
- ):
- first_coef = self.first_order_first_coef[idx]
- second_coef = self.first_order_second_coef[idx]
-
- if self.algorithm_type == "dpmsolver++":
- x_t = first_coef * sample - second_coef * model_output
- elif self.algorithm_type == "dpmsolver":
- x_t = first_coef * sample - second_coef * model_output
- return x_t
-
- def multistep_dpm_solver_second_order_update(
- self,
- idx,
- model_output_list,
- timestep_list,
- prev_timestep,
- sample
- ):
- t, s0, s1 = prev_timestep, timestep_list[-1], timestep_list[-2]
- m0, m1 = model_output_list[-1], model_output_list[-2]
- lambda_t, lambda_s0, lambda_s1 = self.lambda_t[t], self.lambda_t[s0], self.lambda_t[s1]
- h, h_0 = lambda_t - lambda_s0, lambda_s0 - lambda_s1
- r0 = h_0 / h
- D0, D1 = m0, (1.0 / r0) * (m0 - m1)
-
- first_coef = self.second_order_first_coef[idx]
- second_coef = self.second_order_second_coef[idx]
- third_coef = self.second_order_third_coef[idx]
-
- if self.algorithm_type == "dpmsolver++":
- # See https://arxiv.org/abs/2211.01095 for detailed derivations
- if self.solver_type == "midpoint":
- x_t = (
- first_coef * sample
- - second_coef * D0
- - third_coef * D1
- )
- elif self.solver_type == "heun":
- x_t = (
- first_coef * sample
- - second_coef * D0
- + third_coef * D1
- )
- elif self.algorithm_type == "dpmsolver":
- # See https://arxiv.org/abs/2206.00927 for detailed derivations
- if self.solver_type == "midpoint":
- x_t = (
- first_coef * sample
- - second_coef * D0
- - third_coef * D1
- )
- elif self.solver_type == "heun":
- x_t = (
- first_coef * sample
- - second_coef * D0
- - third_coef * D1
- )
- return x_t
-
- def multistep_dpm_solver_third_order_update(
- self,
- idx,
- model_output_list,
- timestep_list,
- prev_timestep,
- sample
- ):
- t, s0, s1, s2 = prev_timestep, timestep_list[-1], timestep_list[-2], timestep_list[-3]
- m0, m1, m2 = model_output_list[-1], model_output_list[-2], model_output_list[-3]
- lambda_t, lambda_s0, lambda_s1, lambda_s2 = (
- self.lambda_t[t],
- self.lambda_t[s0],
- self.lambda_t[s1],
- self.lambda_t[s2],
- )
- h, h_0, h_1 = lambda_t - lambda_s0, lambda_s0 - lambda_s1, lambda_s1 - lambda_s2
- r0, r1 = h_0 / h, h_1 / h
- D0 = m0
- D1_0, D1_1 = (1.0 / r0) * (m0 - m1), (1.0 / r1) * (m1 - m2)
- D1 = D1_0 + (r0 / (r0 + r1)) * (D1_0 - D1_1)
- D2 = (1.0 / (r0 + r1)) * (D1_0 - D1_1)
-
- first_coef = self.third_order_first_coef[idx]
- second_coef = self.third_order_second_coef[idx]
- third_coef = self.third_order_third_coef[idx]
- fourth_coef = self.third_order_fourth_coef[idx]
-
- if self.algorithm_type == "dpmsolver++":
- # See https://arxiv.org/abs/2206.00927 for detailed derivations
- x_t = (
- first_coef * sample
- - second_coef * D0
- + third_coef * D1
- - fourth_coef * D2
- )
- elif self.algorithm_type == "dpmsolver":
- # See https://arxiv.org/abs/2206.00927 for detailed derivations
- x_t = (
- first_coef * sample
- - second_coef * D0
- - third_coef * D1
- - fourth_coef * D2
- )
- return x_t
-
- def step(self, output, latents, step_index, timestep):
- if self.num_inference_steps is None:
- raise ValueError(
- "Number of inference steps is 'None', you need to run 'set_timesteps' after creating the scheduler"
- )
-
- prev_timestep = 0 if step_index == len(self.timesteps) - 1 else self.timesteps[step_index + 1]
- lower_order_final = (
- (step_index == len(self.timesteps) - 1) and self.lower_order_final and len(self.timesteps) < 15
- )
- lower_order_second = (
- (step_index == len(self.timesteps) - 2) and self.lower_order_final and len(self.timesteps) < 15
- )
-
- output = self.convert_model_output(output, timestep, latents)
- for i in range(self.solver_order - 1):
- self.model_outputs[i] = self.model_outputs[i + 1]
- self.model_outputs[-1] = output
-
- if self.solver_order == 1 or self.lower_order_nums < 1 or lower_order_final:
- prev_sample = self.dpm_solver_first_order_update(step_index, output, latents)
- elif self.solver_order == 2 or self.lower_order_nums < 2 or lower_order_second:
- timestep_list = [self.timesteps[step_index - 1], timestep]
- prev_sample = self.multistep_dpm_solver_second_order_update(
- step_index, self.model_outputs, timestep_list, prev_timestep, latents
- )
- else:
- timestep_list = [self.timesteps[step_index - 2], self.timesteps[step_index - 1], timestep]
- prev_sample = self.multistep_dpm_solver_third_order_update(
- step_index, self.model_outputs, timestep_list, prev_timestep, latents
- )
-
- if self.lower_order_nums < self.solver_order:
- self.lower_order_nums += 1
-
- return prev_sample
-
- def add_noise(self, init_latents, noise, idx, latent_timestep):
- self.alphas_cumprod = self.alphas_cumprod.to(device=init_latents.device, dtype=init_latents.dtype)
- timestep = latent_timestep.to(init_latents.device).long()
-
- sqrt_alpha_prod = self.alphas_cumprod[timestep] ** 0.5
- sqrt_one_minus_alpha_prod = (1 - self.alphas_cumprod[timestep]) ** 0.5
- noisy_latents = sqrt_alpha_prod * init_latents + sqrt_one_minus_alpha_prod * noise
-
- return noisy_latents
-
-
-class PNDMScheduler():
- def __init__(
- self,
- device = 'cuda',
- beta_start = 0.00085,
- beta_end = 0.012,
- num_train_timesteps = 1000,
- steps_offset: int = 0,
- prediction_type = 'epsilon'
- ):
- self.device = device
- self.num_train_timesteps = num_train_timesteps
- self.pndm_order = 4
-
- self.beta_start = beta_start
- self.beta_end = beta_end
- betas = (torch.linspace(beta_start**0.5, beta_end**0.5, self.num_train_timesteps, dtype=torch.float32) ** 2)
- alphas = 1.0 - betas
- self.alphas_cumprod = torch.cumprod(alphas, dim=0).to(device=self.device)
- self.final_alpha_cumprod = self.alphas_cumprod[0]
-
- # standard deviation of the initial noise distribution
- self.init_noise_sigma = 1.0
- self.steps_offset = steps_offset
-
- # running values
- self.counter = 0
- self.cur_sample = None
- self.ets = []
- self.prediction_type = prediction_type
-
- def set_timesteps(self, steps):
- self.num_inference_steps = steps
-
- self.step_ratio = self.num_train_timesteps // self.num_inference_steps
- # creates integer timesteps by multiplying by ratio
- timesteps = (np.arange(0, self.num_inference_steps) * self.step_ratio).round()
- timesteps += self.steps_offset
-
- # for some models like stable diffusion the prk steps can/should be skipped to produce better results
- plms_timesteps = np.concatenate([timesteps[:-1], timesteps[-2:-1], timesteps[-1:]])[::-1].copy()
- self.timesteps = torch.from_numpy(plms_timesteps).to(self.device)
-
- # reset running values
- self.counter = 0
- self.cur_sample = None
- self.ets = []
-
- def scale_model_input(self, sample: torch.FloatTensor, idx, *args, **kwargs) -> torch.FloatTensor:
- return sample
-
- def configure(self):
- self.alphas_cumprod_prev = torch.roll(self.alphas_cumprod, shifts=self.step_ratio)
- self.alphas_cumprod_prev[:self.step_ratio] = self.final_alpha_cumprod
- self.sample_coeff = (self.alphas_cumprod_prev / self.alphas_cumprod) ** (0.5)
-
- self.beta_cumprod = 1 - self.alphas_cumprod
- self.beta_cumprod_prev = 1 - self.alphas_cumprod_prev
- self.model_output_denom_coeff = self.alphas_cumprod * (self.beta_cumprod_prev) ** (0.5) + (
- self.alphas_cumprod * self.beta_cumprod * self.alphas_cumprod_prev) ** (0.5)
-
- timesteps = self.timesteps.cpu().long()
-
- self.alphas_cumprod = self.alphas_cumprod[timesteps]
- self.beta_cumprod = self.beta_cumprod[timesteps]
- self.alphas_cumprod_prev = self.alphas_cumprod_prev[timesteps]
- self.sample_coeff = self.sample_coeff[timesteps]
- self.model_output_denom_coeff = self.model_output_denom_coeff[timesteps]
-
- def step(self, output, sample, idx, timestep):
- # step_plms: propagate the sample with the linear multi-step method. This has one forward pass with multiple
- # times to approximate the solution.
-
- # prev_timestep = timestep - self.step_ratio
-
- if self.counter != 1:
- self.ets = self.ets[-3:]
- self.ets.append(output)
- # else:
- # prev_timestep = timestep
- # timestep = timestep + self.step_ratio
-
- if len(self.ets) == 1 and self.counter == 0:
- output = output
- self.cur_sample = sample
- elif len(self.ets) == 1 and self.counter == 1:
- output = (output + self.ets[-1]) / 2
- sample = self.cur_sample
- self.cur_sample = None
- elif len(self.ets) == 2:
- output = (3 * self.ets[-1] - self.ets[-2]) / 2
- elif len(self.ets) == 3:
- output = (23 * self.ets[-1] - 16 * self.ets[-2] + 5 * self.ets[-3]) / 12
- else:
- output = (1 / 24) * (55 * self.ets[-1] - 59 * self.ets[-2] + 37 * self.ets[-3] - 9 * self.ets[-4])
-
- if self.prediction_type == "v_prediction":
- output = (self.alphas_cumprod[idx]**0.5) * output + (self.beta_cumprod[idx]**0.5) * sample
- elif self.prediction_type != "epsilon":
- raise ValueError(
- f"prediction_type given as {self.prediction_type} must be one of `epsilon` or `v_prediction`"
- )
-
- prev_sample = (
- self.sample_coeff[idx] * sample - (self.alphas_cumprod_prev[idx] - self.alphas_cumprod[idx]) * output / self.model_output_denom_coeff[idx]
- )
- self.counter += 1
-
- return prev_sample
-
- def add_noise(self, init_latents, noise, idx, latent_timestep):
- sqrt_alpha_prod = self.alphas_cumprod[idx] ** 0.5
- sqrt_one_minus_alpha_prod = (1 - self.alphas_cumprod[idx]) ** 0.5
- noisy_latents = sqrt_alpha_prod * init_latents + sqrt_one_minus_alpha_prod * noise
-
- return noisy_latents
-
def save_image(images, image_path_dir, image_name_prefix):
"""
Save the generated images to png files.
@@ -1178,42 +372,206 @@ def download_image(url):
response = requests.get(url)
return Image.open(BytesIO(response.content)).convert("RGB")
+def get_refit_weights(state_dict, onnx_opt_path, weight_name_mapping, weight_shape_mapping):
+ onnx_opt_dir = os.path.dirname(onnx_opt_path)
+ onnx_opt_model = onnx.load(onnx_opt_path)
+ # Create initializer data hashes
+ initializer_hash_mapping = {}
+ for initializer in onnx_opt_model.graph.initializer:
+ initializer_data = numpy_helper.to_array(initializer, base_dir=onnx_opt_dir).astype(np.float16)
+ initializer_hash = hash(initializer_data.data.tobytes())
+ initializer_hash_mapping[initializer.name] = initializer_hash
+
+ refit_weights = OrderedDict()
+ for wt_name, wt in state_dict.items():
+ # query initializer to compare
+ initializer_name = weight_name_mapping[wt_name]
+ initializer_hash = initializer_hash_mapping[initializer_name]
+
+ # get shape transform info
+ initializer_shape, is_transpose = weight_shape_mapping[wt_name]
+ if is_transpose:
+ wt = torch.transpose(wt, 0, 1)
+ else:
+ wt = torch.reshape(wt, initializer_shape)
+
+ # include weight if hashes differ
+ wt_hash = hash(wt.cpu().detach().numpy().astype(np.float16).data.tobytes())
+ if initializer_hash != wt_hash:
+ refit_weights[initializer_name] = wt.contiguous()
+ return refit_weights
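A hedged sketch of how `get_refit_weights()` is meant to feed `Engine.refit()` after LoRA fusion. The mapping dictionaries and the ONNX path are assumptions captured at export time and are not defined in this patch.

```python
# Hedged sketch: weight_name_mapping / weight_shape_mapping and the ONNX path are placeholders.
unet = merge_loras(unet, lora_dict, lora_alphas, lora_scales)   # fuse LoRA weights into the torch model

refit_weights = get_refit_weights(
    state_dict=unet.state_dict(),
    onnx_opt_path="onnx/unet.opt/model.onnx",                   # hypothetical optimized ONNX path
    weight_name_mapping=weight_name_mapping,                    # torch weight name -> ONNX initializer name
    weight_shape_mapping=weight_shape_mapping,                  # torch weight name -> (initializer shape, is_transpose)
)
engine.refit(refit_weights, is_fp16=True)                       # casts weights to FP16 before set_named_weights
```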
+
+def load_calib_prompts(batch_size, calib_data_path):
+ with open(calib_data_path, "r") as file:
+ lst = [line.rstrip("\n") for line in file]
+ return [lst[i : i + batch_size] for i in range(0, len(lst), batch_size)]
+
+def filter_func(name):
+ pattern = re.compile(
+ r".*(time_emb_proj|time_embedding|conv_in|conv_out|conv_shortcut|add_embedding).*"
+ )
+ return pattern.match(name) is not None
+
+def quantize_lvl(unet, quant_level=2.5):
+ """
+ We should disable the unwanted quantizer when exporting the onnx
+ Because in the current ammo setting, it will load the quantizer amax for all the layers even
+ if we didn't add that unwanted layer into the config during the calibration
+ """
+ for name, module in unet.named_modules():
+ if isinstance(module, torch.nn.Conv2d):
+ module.input_quantizer.enable()
+ module.weight_quantizer.enable()
+ elif isinstance(module, torch.nn.Linear):
+ if (
+ (quant_level >= 2 and "ff.net" in name)
+ or (quant_level >= 2.5 and ("to_q" in name or "to_k" in name or "to_v" in name))
+ or quant_level == 3
+ ):
+ module.input_quantizer.enable()
+ module.weight_quantizer.enable()
+ else:
+ module.input_quantizer.disable()
+ module.weight_quantizer.disable()
+
+def get_smoothquant_config(model, quant_level=3):
+ quant_config = {
+ "quant_cfg": {},
+ "algorithm": "smoothquant",
+ }
+ for name, module in model.named_modules():
+ w_name = f"{name}*weight_quantizer"
+ i_name = f"{name}*input_quantizer"
+
+ if (
+ w_name in quant_config["quant_cfg"].keys() # type: ignore
+ or i_name in quant_config["quant_cfg"].keys() # type: ignore
+ ):
+ continue
+ if filter_func(name):
+ continue
+ if isinstance(module, torch.nn.Linear):
+ if (
+ (quant_level >= 2 and "ff.net" in name)
+ or (quant_level >= 2.5 and ("to_q" in name or "to_k" in name or "to_v" in name))
+ or quant_level == 3
+ ):
+ quant_config["quant_cfg"][w_name] = {"num_bits": 8, "axis": 0} # type: ignore
+ quant_config["quant_cfg"][i_name] = {"num_bits": 8, "axis": -1} # type: ignore
+ elif isinstance(module, torch.nn.Conv2d):
+ quant_config["quant_cfg"][w_name] = {"num_bits": 8, "axis": 0} # type: ignore
+ quant_config["quant_cfg"][i_name] = {"num_bits": 8, "axis": None} # type: ignore
+ return quant_config
+
+class PercentileAmaxes:
+ def __init__(self, total_step, percentile) -> None:
+ self.data = {}
+ self.total_step = total_step
+ self.percentile = percentile
+ self.i = 0
+
+ def append(self, item):
+ _cur_step = self.i % self.total_step
+ if _cur_step not in self.data.keys():
+ self.data[_cur_step] = item
+ else:
+ self.data[_cur_step] = np.maximum(self.data[_cur_step], item)
+ self.i += 1
+
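A hedged sketch of how the int8 calibration helpers above are intended to compose. The calibration prompt file is a placeholder, and the actual AMMO SmoothQuant calibration call is intentionally omitted because its API is not part of this patch; only the helpers defined above are used.

```python
# Hedged sketch: the AMMO calibration step itself is omitted; only helpers defined above are used.
calib_prompts = load_calib_prompts(batch_size=2, calib_data_path="calib_prompts.txt")  # hypothetical file
quant_config = get_smoothquant_config(unet, quant_level=3.0)   # per-layer int8 config; filter_func matches are skipped
# ... run AMMO SmoothQuant calibration on `unet` with `quant_config` and `calib_prompts` here ...
quantize_lvl(unet, quant_level=3.0)                            # keep only the quantizers wanted at this level before ONNX export
```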
def add_arguments(parser):
# Stable Diffusion configuration
- parser.add_argument('--version', type=str, default="2.1", choices=["1.4", "1.5", "2.0", "2.0-base", "2.1", "2.1-base"], help="Version of Stable Diffusion")
+ parser.add_argument('--version', type=str, default="1.5", choices=["1.4", "1.5", "dreamshaper-7", "2.0-base", "2.0", "2.1-base", "2.1", "xl-1.0", "xl-turbo"], help="Version of Stable Diffusion")
parser.add_argument('prompt', nargs = '*', help="Text prompt(s) to guide image generation")
parser.add_argument('--negative-prompt', nargs = '*', default=[''], help="The negative prompt(s) to guide the image generation.")
- parser.add_argument('--repeat-prompt', type=int, default=1, choices=[1, 2, 4, 8, 16], help="Number of times to repeat the prompt (batch size multiplier)")
+ parser.add_argument('--batch-size', type=int, default=1, choices=[1, 2, 4], help="Batch size (repeat prompt)")
+ parser.add_argument('--batch-count', type=int, default=1, help="Number of images to generate in sequence, one at a time.")
parser.add_argument('--height', type=int, default=512, help="Height of image to generate (must be multiple of 8)")
parser.add_argument('--width', type=int, default=512, help="Width of image to generate (must be multiple of 8)")
- parser.add_argument('--denoising-steps', type=int, default=50, help="Number of denoising steps")
+ parser.add_argument('--denoising-steps', type=int, default=30, help="Number of denoising steps")
+ parser.add_argument('--scheduler', type=str, default=None, choices=["DDIM", "DDPM", "EulerA", "Euler", "LCM", "LMSD", "PNDM", "UniPC"], help="Scheduler for diffusion process")
+ parser.add_argument('--guidance-scale', type=float, default=7.5, help="Value of classifier-free guidance scale (must be greater than 1)")
+ parser.add_argument('--lora-scale', type=float, nargs='+', default=None, help="Scale of LoRA weights, default 1 (must be between 0 and 1)")
+ parser.add_argument('--lora-path', type=str, nargs='+', default=None, help="Path to LoRA adaptor. Ex: 'latent-consistency/lcm-lora-sdv1-5'")
# ONNX export
- parser.add_argument('--onnx-opset', type=int, default=17, choices=range(7,18), help="Select ONNX opset version to target for exported models")
+ parser.add_argument('--onnx-opset', type=int, default=19, choices=range(7,20), help="Select ONNX opset version to target for exported models")
parser.add_argument('--onnx-dir', default='onnx', help="Output directory for ONNX export")
- parser.add_argument('--onnx-refit-dir', help="ONNX models to load the weights from")
- parser.add_argument('--force-onnx-export', action='store_true', help="Force ONNX export of CLIP, UNET, and VAE models")
- parser.add_argument('--force-onnx-optimize', action='store_true', help="Force ONNX optimizations for CLIP, UNET, and VAE models")
+
+ # Framework model ckpt
+ parser.add_argument('--framework-model-dir', default='pytorch_model', help="Directory for HF saved models")
# TensorRT engine build
parser.add_argument('--engine-dir', default='engine', help="Output directory for TensorRT engines")
- parser.add_argument('--force-engine-build', action='store_true', help="Force rebuilding the TensorRT engine")
+ parser.add_argument('--int8', action='store_true', help="Apply int8 quantization.")
+ parser.add_argument('--quantization-level', type=float, default=3.0, choices=[1.0, 2.0, 2.5, 3.0], help="int8/fp8 quantization level, 1: CNN, 2: CNN+FFN, 2.5: CNN+FFN+QKV, 3: CNN+FC")
parser.add_argument('--build-static-batch', action='store_true', help="Build TensorRT engines with fixed batch size.")
parser.add_argument('--build-dynamic-shape', action='store_true', help="Build TensorRT engines with dynamic image shapes.")
parser.add_argument('--build-enable-refit', action='store_true', help="Enable Refit option in TensorRT engines during build.")
- parser.add_argument('--build-preview-features', action='store_true', help="Build TensorRT engines with preview features.")
parser.add_argument('--build-all-tactics', action='store_true', help="Build TensorRT engines using all tactic sources.")
parser.add_argument('--timing-cache', default=None, type=str, help="Path to the precached timing measurements to accelerate build.")
# TensorRT inference
parser.add_argument('--num-warmup-runs', type=int, default=5, help="Number of warmup runs before benchmarking performance")
- parser.add_argument('--nvtx-profile', action='store_true', help="Enable NVTX markers for performance profiling")
- parser.add_argument('--seed', type=int, default=None, help="Seed for random generator to get consistent results")
parser.add_argument('--use-cuda-graph', action='store_true', help="Enable cuda graph")
+ parser.add_argument('--nvtx-profile', action='store_true', help="Enable NVTX markers for performance profiling")
+ parser.add_argument('--torch-inference', default='', help="Run inference with PyTorch (using specified compilation mode) instead of TensorRT.")
+ parser.add_argument('--seed', type=int, default=None, help="Seed for random generator to get consistent results")
parser.add_argument('--output-dir', default='output', help="Output directory for logs and image artifacts")
parser.add_argument('--hf-token', type=str, help="HuggingFace API access token for downloading model checkpoints")
parser.add_argument('-v', '--verbose', action='store_true', help="Show verbose output")
return parser
-
+def process_pipeline_args(args):
+ if args.height % 8 != 0 or args.width % 8 != 0:
+ raise ValueError(f"Image height and width have to be divisible by 8 but specified as: {args.image_height} and {args.width}.")
+
+ max_batch_size = 4
+ if args.batch_size > max_batch_size:
+ raise ValueError(f"Batch size {args.batch_size} is larger than allowed {max_batch_size}.")
+
+ if args.use_cuda_graph and (not args.build_static_batch or args.build_dynamic_shape):
+ raise ValueError(f"Using CUDA graph requires static dimensions. Enable `--build-static-batch` and do not specify `--build-dynamic-shape`")
+
+ if args.int8 and not args.version.startswith('xl'):
+ raise ValueError(f"int8 quantization only supported for SDXL pipeline.")
+
+ if args.lora_scale:
+ for lora_scale in (lora_scale for lora_scale in args.lora_scale if not 0 <= lora_scale <= 1):
+ raise ValueError(f"Scale of LoRA weights must be between 0 and 1, provided {lora_scale}")
+
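+ # Options collected for pipeline initialization (model version, scheduler, guidance, LoRA, output/logging settings).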
+ kwargs_init_pipeline = {
+ 'version': args.version,
+ 'max_batch_size': max_batch_size,
+ 'denoising_steps': args.denoising_steps,
+ 'scheduler': args.scheduler,
+ 'guidance_scale': args.guidance_scale,
+ 'output_dir': args.output_dir,
+ 'hf_token': args.hf_token,
+ 'verbose': args.verbose,
+ 'nvtx_profile': args.nvtx_profile,
+ 'use_cuda_graph': args.use_cuda_graph,
+ 'lora_scale': args.lora_scale,
+ 'lora_path': args.lora_path,
+ 'framework_model_dir': args.framework_model_dir,
+ 'torch_inference': args.torch_inference,
+ }
+
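+ # Options collected for ONNX export and TensorRT engine loading/building (opset, optimization shapes, tactics, refit, int8 settings).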
+ kwargs_load_engine = {
+ 'onnx_opset': args.onnx_opset,
+ 'opt_batch_size': args.batch_size,
+ 'opt_image_height': args.height,
+ 'opt_image_width': args.width,
+ 'static_batch': args.build_static_batch,
+ 'static_shape': not args.build_dynamic_shape,
+ 'enable_all_tactics': args.build_all_tactics,
+ 'enable_refit': args.build_enable_refit,
+ 'timing_cache': args.timing_cache,
+ 'int8': args.int8,
+ 'quantization_level': args.quantization_level,
+ 'denoising_steps': args.denoising_steps,
+ }
+
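+ # Remaining arguments forwarded to the demo run (prompts, image shape, batching, warmup, CUDA graph).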
+ args_run_demo = (args.prompt, args.negative_prompt, args.height, args.width, args.batch_size, args.batch_count, args.num_warmup_runs, args.use_cuda_graph)
+
+ return kwargs_init_pipeline, kwargs_load_engine, args_run_demo
diff --git a/demo/HuggingFace/.gitignore b/demo/HuggingFace/.gitignore
deleted file mode 100644
index 18a62d0f..00000000
--- a/demo/HuggingFace/.gitignore
+++ /dev/null
@@ -1,3 +0,0 @@
-*.pyc
-__pycache__/
-**/temp/
\ No newline at end of file
diff --git a/demo/HuggingFace/BART/BARTModelConfig.py b/demo/HuggingFace/BART/BARTModelConfig.py
deleted file mode 100755
index f8ea3bd7..00000000
--- a/demo/HuggingFace/BART/BARTModelConfig.py
+++ /dev/null
@@ -1,306 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import argparse
-
-from collections import namedtuple, OrderedDict
-from itertools import product
-from typing import Dict
-
-# TRT-HuggingFace
-from NNDF.networks import Precision, NetworkMetadata, NNConfig, Dims
-from NNDF.interface import MetadataArgparseInteropMixin
-
-# Limitation of namedtuples. You must declare namedtuples in module scope and not in classes.
-# Otherwise pickle doesn't work.
-# See: https://stackoverflow.com/questions/4677012/python-cant-pickle-type-x-attribute-lookup-failed
-_BARTMetadata = namedtuple("BARTMetadata", ["kv_cache"])
-
-
-class BARTMetadata(_BARTMetadata, MetadataArgparseInteropMixin):
- @staticmethod
- def add_args(parser: argparse.ArgumentParser) -> None:
- """Add commandline interface parser."""
- network_group = parser.add_argument_group("BART network")
- network_group.add_argument(
- "--variant",
- help="BART variant to generate",
- choices=BARTModelTRTConfig.TARGET_MODELS,
- required=True,
- )
- network_group.add_argument(
- "--enable-kv-cache",
- help="BART enable KV cache",
- action="store_true",
- default=False,
- )
- network_group.add_argument(
- "--num-beams", type=int, default=1, help="Enables beam search during decoding."
- )
-
- @staticmethod
- def from_args(args: argparse.Namespace):
- return NetworkMetadata(
- variant=args.variant,
- precision=Precision(fp16=False),
- other=BARTMetadata(kv_cache=args.enable_kv_cache),
- )
-
- @staticmethod
- def add_inference_args(parser: argparse.ArgumentParser) -> None:
- BARTMetadata.add_args(parser)
- inference_group = parser.add_argument_group("inference group")
- inference_group.add_argument(
- "--fp16", action="store_true", help="Enables fp16 TensorRT tactics."
- )
-
- @staticmethod
- def from_inference_args(args: argparse.Namespace):
- base_metadata = BARTMetadata.from_args(args)
- return base_metadata._replace(precision=Precision(fp16=args.fp16))
-
-
- @staticmethod
- def add_benchmarking_args(parser: argparse.ArgumentParser) -> None:
- benchmarking_group = parser.add_argument_group("benchmarking group")
- benchmarking_group.add_argument(
- "--input-seq-len",
- type=int,
- help="Specify fixed input sequence length for perf benchmarking. Required for benchmark except when both input_profile_max and output_profile_max are provided for trt",
- )
- benchmarking_group.add_argument(
- "--output-seq-len",
- type=int,
- help="Specify fixed output sequence length for perf benchmarking. Required for benchmark except when both input_profile_max and output_profile_max are provided for trt",
- )
-
-BARTBenchmarkingArgs = namedtuple("BARTBenchmarkingArgs", ["input_seq_len", "output_seq_len"])
-
-# trt has more benchmarking arguments
-BARTTRTBenchmarkingArgs = namedtuple("BARTTRTBenchmarkingArgs", ["input_seq_len", "output_seq_len", "input_profile_max_len", "output_profile_max_len"])
-
-class BARTModelTRTConfig(NNConfig):
-
- TARGET_MODELS = ["facebook/bart-base", "facebook/bart-large", "facebook/bart-large-cnn", "facebook/mbart-large-50"]
-
- MAX_DECODER_WORKSPACE_MB = {
- TARGET_MODELS[0]: 3072,
- TARGET_MODELS[1]: 3072,
- TARGET_MODELS[2]: 3072,
- TARGET_MODELS[3]: 3072,
- }
-
- # bart-base: 12-layer, 768-hidden, 139M parameters
- # bart-large: 24-layer, 1024-hidden, 406M parameters
- # in all bart variants, # of encoder layers and # of decoder layers are the same
- NUMBER_OF_LAYERS = {
- TARGET_MODELS[0]: 12,
- TARGET_MODELS[1]: 24,
- TARGET_MODELS[2]: 24,
- TARGET_MODELS[3]: 24,
- }
-
- NUMBER_OF_DECODER_LAYERS = {
- TARGET_MODELS[0]: 6,
- TARGET_MODELS[1]: 12,
- TARGET_MODELS[2]: 12,
- TARGET_MODELS[3]: 12,
- }
-
- # in all bart variants, # of heads in encoder and decoder are the same
- NUMBER_OF_HEADS = {
- TARGET_MODELS[0]: 12,
- TARGET_MODELS[1]: 16,
- TARGET_MODELS[2]: 16,
- TARGET_MODELS[3]: 16,
- }
-
- MAX_SEQUENCE_LENGTH = {
- TARGET_MODELS[0]: 768,
- TARGET_MODELS[1]: 1024,
- TARGET_MODELS[2]: 1024,
- TARGET_MODELS[3]: 1024,
- }
-
- # encoder hidden size is not necessarily same as max sequence length. Separate for clarification
- ENCODER_HIDDEN_SIZE = {
- TARGET_MODELS[0]: 768,
- TARGET_MODELS[1]: 1024,
- TARGET_MODELS[2]: 1024,
- TARGET_MODELS[3]: 1024,
- }
-
- # To achieve identical results with original HuggingFace implementation, the min_length in model config should be consistent with each model variant
- # see task-specific params in config.json of each variant model
- MIN_OUTPUT_LENGTH = {
- TARGET_MODELS[0]: 0,
- TARGET_MODELS[1]: 0,
- TARGET_MODELS[2]: 56,
- TARGET_MODELS[3]: 0,
- }
-
- #TODO: this might better be an inference time input like the `max_length` arg in generate() and greedy_search(). The change needed is in NNDF/interface.py:__call__ so it's a fundamental change affecting GPT2 and T5 code. Here I just put this option in BART model config for now. But it's also reasonable to treat this as a model config, because the TRT engine building may need this to have fixed dimension (e.g., to enable KV-cache)
- # see task-specific params in config.json of each variant model
- MAX_OUTPUT_LENGTH = {
- TARGET_MODELS[0]: 768,
- TARGET_MODELS[1]: 1024,
- TARGET_MODELS[2]: 142,
- TARGET_MODELS[3]: 200,
- }
-
- # BART specific configs: https://huggingface.co/facebook/bart-base/blob/main/config.json
- NO_REPEAT_NGRAM_SIZE = 3
- BOS_TOKEN_ID = 0
- EOS_TOKEN_ID = 2
-
- VOCAB_SIZE = {
- TARGET_MODELS[0]: 50265,
- TARGET_MODELS[1]: 50265,
- TARGET_MODELS[2]: 50264, # for bart-large-cnn config it's 50264 somehow. If not change here, results are incorrect since the trt results dimension reshape depends on this
- TARGET_MODELS[3]: 250054 # for mbart multilingual models, vocab size is much larger
- }
-
- NETWORK_FULL_NAME = "full"
- NETWORK_DECODER_SEGMENT_NAME = "decoder"
- NETWORK_ENCODER_SEGMENT_NAME = "encoder"
- NETWORK_SEGMENTS = [NETWORK_DECODER_SEGMENT_NAME, NETWORK_ENCODER_SEGMENT_NAME]
-
- def __init__(self):
- precision_fp16 = [False, True]
- kv_caches = [False, True]
-
- variants = []
- for variant, fp16, kv_cache in product(
- BARTModelTRTConfig.TARGET_MODELS, precision_fp16, kv_caches
- ):
- variants.append(
- NetworkMetadata(
- variant=variant,
- precision=Precision(fp16=fp16),
- other=BARTMetadata(kv_cache=kv_cache),
- )
- )
-
- super().__init__("BART", variants=variants)
-
- def get_python_requirements(self):
- base_requirements = super().get_python_requirements()
- base_requirements.append("transformers==4.8.0")
- return base_requirements
-
- def get_network_segments(self):
- """
- Returns exportable segments for the given network.
- Used in the case where a single network needs to
- be exported into multiple parts.
- """
- return BARTModelTRTConfig.NETWORK_SEGMENTS
-
- def get_metadata_string(self, metadata: NetworkMetadata) -> str:
- # Remove redundant bart name prefix
- if "mbart" in metadata.variant:
- metadata = metadata._replace(variant=metadata.variant.replace("facebook/mbart-","mbart-"))
- else:
- metadata = metadata._replace(variant=metadata.variant.replace("facebook/bart-",""))
- return super().get_metadata_string(metadata)
-
- @staticmethod
- def get_input_dims(metadata) -> Dict:
- """
- Returns dictionary encoding of input dimensions.
- Keys will be equal to get_model_segments()
-
- Returns:
- (Dict[str, Dims]): {"decoder": Dims, "encoder": Dims}
- """
- decoder_inputs_dict = OrderedDict(
- {
- "input_ids": (Dims.BATCH, Dims.SEQUENCE),
- "encoder_hidden_states": (
- Dims.BATCH,
- Dims.create_new_sequence_dim("encoder_hidden_length"),
- BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[metadata.variant], # dim not containing string 'Dims.BATCH' or 'Dims.SEQUENCE' will be non-dynamic axis
- ),
- }
- )
- if metadata.other.kv_cache:
- # for KV cache version, we need add per-layer KV cache inputs. `past_key_values` at each layer is (self-attention K, self-attention V, cross-attention K, cross-attention V)
- for i in range(BARTModelTRTConfig.NUMBER_OF_DECODER_LAYERS[metadata.variant]):
- # decoder self-attention KV cache (dim[0] & dim[2] are dynamic, and dim[2] varies at each decoding timestep)
- self_attention_past_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("past_decoder_length"), "embedding_size_per_head")
- decoder_inputs_dict[f"past_key_values.{i}.decoder.key"] = self_attention_past_kv_dims
- decoder_inputs_dict[f"past_key_values.{i}.decoder.value"] = self_attention_past_kv_dims
-
- # encoder-decoder cross-attention KV cache (dim[0] & dim[2] are dynamic, but dim[2] is constant at each decoding timestep)
- cross_attention_past_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("encoder_length"), "embedding_size_per_head")
- decoder_inputs_dict[f"past_key_values.{i}.encoder.key"] = cross_attention_past_kv_dims
- decoder_inputs_dict[f"past_key_values.{i}.encoder.value"] = cross_attention_past_kv_dims
-
- decoder_inputs = Dims(decoder_inputs_dict)
-
- encoder_inputs = Dims(OrderedDict({"input_ids": (Dims.BATCH, Dims.SEQUENCE)}))
-
- return {
- BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME: decoder_inputs,
- BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME: encoder_inputs,
- }
-
- @staticmethod
- def get_output_dims(metadata) -> Dict:
- """
- Returns dictionary encoding of output dimensions.
- Keys will be equal to get_model_segments()
-
- Returns:
- (Dict[str, Dims]): {"decoder": Dims, "encoder": Dims}
- """
- decoder_outputs_dict = OrderedDict(
- {"hidden_states": (Dims.BATCH, Dims.SEQUENCE)})
-
- if metadata.other.kv_cache:
- # for KV cache version, we need add per-layer KV cache inputs. `past_key_values` at each layer is (self-attention K, self-attention V, cross-attention K, cross-attention V)
-
- # for all BART variants, # encoder layers = # decoder layers, so just divide total # layers by 2
- for i in range(BARTModelTRTConfig.NUMBER_OF_DECODER_LAYERS[metadata.variant]):
- # decoder self-attention KV cache (dim[0] & dim[2] are dynamic, and dim[2] varies at each decoding timestep)
- self_attention_present_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("decoder_length"), "embedding_size_per_head")
- decoder_outputs_dict[f"present_key_values.{i}.decoder.key"] = self_attention_present_kv_dims
- decoder_outputs_dict[f"present_key_values.{i}.decoder.value"] = self_attention_present_kv_dims
-
- # encoder-decoder cross-attention KV cache (dim[0] & dim[2] are dynamic, but dim[2] is constant at each decoding timestep)
- cross_attention_present_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("encoder_length"), "embedding_size_per_head")
- decoder_outputs_dict[f"present_key_values.{i}.encoder.key"] = cross_attention_present_kv_dims
- decoder_outputs_dict[f"present_key_values.{i}.encoder.value"] = cross_attention_present_kv_dims
-
- decoder_outputs = Dims(decoder_outputs_dict)
-
- encoder_outputs = Dims(
- OrderedDict(
- {
- "hidden_states": (
- Dims.BATCH,
- Dims.SEQUENCE,
- BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[metadata.variant],
- )
- }
- )
- )
-
- return {
- BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME: decoder_outputs,
- BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME: encoder_outputs,
- }
diff --git a/demo/HuggingFace/BART/checkpoint.toml b/demo/HuggingFace/BART/checkpoint.toml
deleted file mode 100755
index 52add215..00000000
--- a/demo/HuggingFace/BART/checkpoint.toml
+++ /dev/null
@@ -1,26 +0,0 @@
-# Default requirements
-[BART.all.default.all.summarization]
-
-input = "NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms. TensorRT, built on the NVIDIA CUDA parallel programming model, enables developers to optimize inference by leveraging libraries, development tools, and technologies in CUDA-X for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, TensorRT also uses sparse tensor cores for an additional performance boost."
-
-[BART.all."facebook/bart-base".all.summarization]
-
-label = "NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms. TensorR, built on the NVIDIA CUDA parallel programming model, enables developers to accelerate inference by leveraging libraries, development tools, and technologies in CUDA-X for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, Tensor RT also uses sparse tensor cores for an additional performance boost."
-
-[BART.all."facebook/bart-large".all.summarization]
-
-label = "NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms. Tensor RT is the first GPU-based inference platform to use NVIDIA's CUDA-X architecture. TenseRT, built on the NVIDIA CUDA parallel programming model, enables developers to analyze neural network data and perform inference by leveraging libraries, development tools, and technologies in CUDA, including CUDA for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, TensorRex also uses sparse tensor cores for an additional performance boost."
-
-[BART.all."facebook/bart-large-cnn".all.summarization]
-
-label = "TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference. TensorRT is built on the NVIDIA CUDA parallel programming model. With new NVIDIA Ampere Architecture GPUs, Tensor RT also uses sparse tensor cores for an additional performance boost."
-
-[BART.all."facebook/mbart-large-50".all.summarization]
-
-label = "NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms. TensorTM, built on the NVIDIA CUDA parallel programming model, enables developers of applications to optimise inference by leveraging libraries, development tools, and technologies in CUDA-X for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, Tensor RT also uses sparse tensor cores for an additional performance boost."
-
-# There is a weird bug in Frameworks where the output is incorrect
-# when compared to OnnxRT. Frameworks only the first two sentence is generated.
-[BART.native."facebook/bart-large-cnn".summarization]
-
-label = "TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference. TensorRT is built on the NVIDIA CUDA parallel programming model."
diff --git a/demo/HuggingFace/BART/export.py b/demo/HuggingFace/BART/export.py
deleted file mode 100755
index f3730178..00000000
--- a/demo/HuggingFace/BART/export.py
+++ /dev/null
@@ -1,419 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Contains logic that captures BART HuggingFace models into ONNX models.
-"""
-
-from itertools import islice
-from json import encoder
-import os
-from collections import OrderedDict
-
-# tensorrt
-import tensorrt as trt
-
-# polygraphy
-from polygraphy.backend.trt import Profile
-
-# torch
-import torch
-from torch.nn import Module
-
-# huggingface
-from transformers.generation_utils import GenerationMixin
-from transformers.modeling_outputs import Seq2SeqLMOutput
-from transformers import BartForConditionalGeneration
-
-# TRT-HuggingFace
-from BART.BARTModelConfig import BARTModelTRTConfig
-from NNDF.tensorrt_utils import OnnxProcessOperation, process_onnx
-from NNDF.networks import NetworkMetadata, Precision, Dims
-from NNDF.logger import G_LOGGER
-from NNDF.models import (
- TRTEngineFile,
- TorchModelFile,
- ONNXModelFile,
- ModelFileConverter,
-)
-
-def add_extra_fp32(network_definition):
- """
- Force operations involved in layer norm to run in FP32 precision.
- """
- pow_ops = {}
- for layer_index, layer in enumerate(network_definition[1]):
- if layer.type == trt.LayerType.IDENTITY:
- all_fp32 = all([layer.output_type_is_set(o) and layer.get_output_type(o) == trt.float32 for o in range(layer.num_outputs)])
- if all_fp32:
- if layer.get_input(0).dtype == trt.float32:
- layer.precision = trt.float32
-
- if layer.type == trt.LayerType.ELEMENTWISE:
- layer.__class__ = getattr(trt, "IElementWiseLayer")
- if layer.op == trt.ElementWiseOperation.POW:
- pow_ops[layer] = layer_index
- layer.precision = trt.float32
- layer.set_output_type(0, trt.float32)
-
- for _, index in pow_ops.items():
- # Iterate from few layers before pow to include residual add and cast op.
- # Iterate till 10 layers after pow op to include all operations included in layer norm.
- START_OFFSET = 4
- END_OFFSET = 12
- for i in range(index-START_OFFSET, index+END_OFFSET):
- l = network_definition[1].get_layer(i)
- if l.type == trt.LayerType.REDUCE:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.ELEMENTWISE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.SUM:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.UNARY:
- l.__class__ = getattr(trt, "IUnaryLayer")
- if l.op == trt.UnaryOperation.SQRT:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.ELEMENTWISE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.DIV:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.ELEMENTWISE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.PROD:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- return network_definition
-
-# Torch File Encoding #
-class BARTDecoderTorchFile(TorchModelFile):
- class TorchModule(Module, GenerationMixin):
- """
- A simplied definition of BART Decoder without support for loss.
- Decoder with lm-head attached.
- """
-
- def __init__(self, decoder, lm_head, final_logits_bias, config):
- super().__init__()
- self.decoder = decoder
- self.lm_head = lm_head
- self.bias = final_logits_bias
- self.config = config
-
- @staticmethod
- def _reorder_cache(past, beam_idx):
- return BartForConditionalGeneration._reorder_cache(past, beam_idx)
-
- def prepare_inputs_for_generation(self, input_ids, past=None, use_cache=None, **kwargs):
- # cut decoder_input_ids if past is used
- if past is not None:
- input_ids = input_ids[:, -1:]
-
- ret = {
- "input_ids": input_ids,
- "encoder_hidden_states": kwargs["encoder_hidden_states"],
- }
-
- # To really enable KV cache in HuggingFace, these args must be passed. Just specifying use_cache = True in BartConfig is not enough. Also see the additional "past_key_values" fields in the forward() return below.
- if self.config.use_cache:
- ret["use_cache"] = use_cache
- ret["past_key_values"] = past
-
- return ret
-
- def forward(self, input_ids, encoder_hidden_states, **kwargs):
- decoder_outputs = self.decoder(
- input_ids=input_ids,
- encoder_hidden_states=encoder_hidden_states,
- **kwargs
- )
-
- sequence_output = decoder_outputs[0]
- self.bias = self.bias.to(sequence_output.device)
- logits = self.lm_head(sequence_output) + self.bias
-
- # temporary solution: force connection between encoder_hidden_states and outputs in KV cache mode, otherwise onnx.export elimiates it and cause inconsistency between non-KV cache & KV cache and also T5 & BART
- if self.config.use_cache:
- logits = logits.view(encoder_hidden_states.size(0),logits.size(1), logits.size(2)) # (batch_size, seq_len, vocab_size)
-
- if not kwargs.get("return_dict", False):
- return (logits,) + decoder_outputs[1:]
-
- return Seq2SeqLMOutput(logits=logits, past_key_values=decoder_outputs.past_key_values if self.config.use_cache else None,)
-
- def __init__(self, model, network_metadata):
- super().__init__(model, BARTDecoderConverter, network_metadata)
-
-
-class BARTEncoderTorchFile(TorchModelFile):
- """Creation of a class to output only the last hidden state from the encoder."""
-
- class TorchModule(Module, GenerationMixin):
- def __init__(self, encoder):
- super().__init__()
- self.encoder = encoder
-
- def forward(self, *input, **kwargs):
- return self.encoder(*input, **kwargs)[0]
-
- def __call__(self, *args, **kwargs):
- return self.forward(*args, **kwargs)
-
- def __init__(self, model, network_metadata):
- super().__init__(model, BARTEncoderConverter, network_metadata)
-
-
-# ONNX File Encoding #
-class BARTEncoderONNXFile(ONNXModelFile):
- def __init__(self, model, network_metadata):
- super().__init__(model, BARTEncoderConverter, network_metadata)
-
-
-class BARTDecoderONNXFile(ONNXModelFile):
- def __init__(self, model, network_metadata):
- super().__init__(model, BARTDecoderConverter, network_metadata)
-
-
-# TRT Engine File Encoding #
-class BARTDecoderTRTEngine(TRTEngineFile):
-
- def __init__(self, model, network_metadata):
- super().__init__(model, BARTDecoderConverter, network_metadata)
- self.max_trt_workspace = BARTModelTRTConfig.MAX_DECODER_WORKSPACE_MB[network_metadata.variant]
-
- def get_network_definition(self, network_definition):
- return add_extra_fp32(network_definition)
-
- def use_obey_precision_constraints(self):
- return self.network_metadata.precision.fp16
-
-
-class BARTEncoderTRTEngine(TRTEngineFile):
-
- def __init__(self, model, network_metadata):
- super().__init__(model, BARTEncoderConverter, network_metadata)
- self.max_trt_workspace = 2048
-
- def get_network_definition(self, network_definition):
- return add_extra_fp32(network_definition)
-
- def use_obey_precision_constraints(self):
- return self.network_metadata.precision.fp16
-
-# Converters #
-class BARTDecoderConverter(ModelFileConverter):
- def __init__(self):
- super().__init__(BARTDecoderTorchFile, BARTDecoderONNXFile, BARTDecoderTRTEngine)
-
- def torch_to_onnx(
- self, output_fpath: str, model: Module, network_metadata: NetworkMetadata
- ):
- """
- Exports a given huggingface BART to decoder architecture only.
-
- Args:
- output_prefix (str): Path to the onnx file
- model (torch.Model): Model loaded torch class
-
- Returns:
- BARTDecoderONNXFile: ONNX decoder object.
- """
-
- input_ids = torch.tensor([[42] * 10])
- # Exporting the decoder requires a basic instance of the encoder
- # Create one temporarily
- simplified_encoder = BARTEncoderTorchFile.TorchModule(model.get_encoder())
- # Exports to ONNX
- decoder_with_lm_head_and_bias = BARTDecoderTorchFile.TorchModule(
- model.get_decoder(), model.lm_head, model.final_logits_bias, model.config
- )
-
- inputs = BARTModelTRTConfig.get_input_dims(network_metadata)["decoder"]
- outputs = BARTModelTRTConfig.get_output_dims(network_metadata)["decoder"]
-
- # Exports to ONNX
- opt_args={}
-
- version_major = int((torch.__version__).split('.')[0])
- version_minor = int((torch.__version__).split('.')[1])
- if version_major < 1 or (version_major == 1 and version_minor < 11):
- opt_args['use_external_data_format'] = True
-
- if not network_metadata.other.kv_cache:
- # This code allows for huggingface compatible torch class to use onnx exporter
- old_forward = decoder_with_lm_head_and_bias.forward
- def _export_forward(*args, **kwargs):
- result = old_forward(*args, **kwargs)
- return result[0]
- decoder_with_lm_head_and_bias.forward = _export_forward
-
- torch.onnx.export(
- decoder_with_lm_head_and_bias,
- (input_ids, simplified_encoder(input_ids)),
- output_fpath,
- export_params=True,
- opset_version=12,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
- else:
- encoder_hidden_states = simplified_encoder(input_ids)
- decoder_output = decoder_with_lm_head_and_bias(input_ids[:,:-1], encoder_hidden_states) # decoder output at t-1 step (logits, past_key_values from 0 to t-1)
- past_key_values = decoder_output[1]
-
- decoder_root, decoder_fullname = os.path.split(output_fpath)
- # Split kv and non kv onnx into separate folders to avoid weight overlap
- non_kv_root = os.path.join(decoder_root, "non-kv")
- kv_root = os.path.join(decoder_root, "kv")
- decoder_name, decoder_ext = os.path.splitext(decoder_fullname)
- non_kv_fpath = os.path.join(non_kv_root, decoder_name + "-non-kv" + decoder_ext)
- kv_fpath = os.path.join(kv_root, decoder_fullname)
-
- # This code allows for huggingface compatible torch class to use onnx exporter (change just before onnx.export)
- old_forward = decoder_with_lm_head_and_bias.forward
- def _export_forward(input_ids, encoder_hidden_states, past_key_values):
- result = old_forward(input_ids, encoder_hidden_states, past_key_values=past_key_values)
- return (result[0], result[1])
- decoder_with_lm_head_and_bias.forward = _export_forward
-
- torch.onnx.export(
- decoder_with_lm_head_and_bias,
- (input_ids[:,-1:], encoder_hidden_states,past_key_values),
- # (1) input_ids should be the t token (last one) while past_key_values is 0 to t-1 caches
- # (2) since past_key_values is kwargs, ideally use "(input_ids[:,-1:], encoder_hidden_states, {"past_key_values": past_key_values})",
- # but onnx.export seems to unable to take kwargs properly (although PyTorch 1.11 claims it supports already).
- # Therefore, we need to wrap inside _export_forward() and make past_key_values indeed a kwargs
- kv_fpath,
- export_params=True,
- opset_version=12,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- # dual-engine approach: also export non-kv onnx model. Note that this is different from the original "non-kv" model. This one traces the `use_cache` path and have present_key_values output
- def _export_forward(input_ids, encoder_hidden_states, use_cache):
- result = old_forward(input_ids, encoder_hidden_states, use_cache=use_cache)
- return (result[0], result[1])
- decoder_with_lm_head_and_bias.forward = _export_forward
-
- # inputs are same as non-kv model
- # outputs are same as kv model
- dict_inputs = inputs.get_dims()
- dict_inputs_non_kv = OrderedDict({k: dict_inputs[k] for k in ["input_ids", "encoder_hidden_states"]})
- inputs_non_kv = Dims(dict_inputs_non_kv)
-
- torch.onnx.export(
- decoder_with_lm_head_and_bias,
- (input_ids[:,-1:], encoder_hidden_states, True),
- non_kv_fpath,
- export_params=True,
- opset_version=12,
- input_names=inputs_non_kv.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs_non_kv.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- if network_metadata.precision.fp16:
- G_LOGGER.debug("Clamping FP16 weights for BART")
- # BART doesn't have T5's Add-Cast-Pow ordering issue
- if network_metadata.other.kv_cache:
- # both onnx files need clamp
- process_onnx([OnnxProcessOperation.CLAMP_WEIGHTS], kv_fpath, kv_fpath)
- process_onnx([OnnxProcessOperation.CLAMP_WEIGHTS], non_kv_fpath, non_kv_fpath)
-
- else:
- process_onnx([OnnxProcessOperation.CLAMP_WEIGHTS], output_fpath, output_fpath)
-
- return BARTDecoderONNXFile(output_fpath, network_metadata)
-
-
-class BARTEncoderConverter(ModelFileConverter):
- def __init__(self):
- super().__init__(BARTEncoderTorchFile, BARTEncoderONNXFile, BARTEncoderTRTEngine)
-
- def torch_to_onnx(
- self, output_fpath: str, model: Module, network_metadata: NetworkMetadata
- ):
- """
- Exports a given huggingface BART to encoder architecture only.
-
- Args:
- output_prefix (str): Path to the onnx file
- model (torch.Model): Model loaded torch class
-
- Returns:
- Tuple[str]: Names of generated models
- """
- input_ids = torch.tensor([[42] * 10])
- simplified_encoder = BARTEncoderTorchFile.TorchModule(model.get_encoder())
- inputs = BARTModelTRTConfig.get_input_dims(network_metadata)["encoder"]
- outputs = BARTModelTRTConfig.get_output_dims(network_metadata)["encoder"]
-
- # Exports to ONNX
- opt_args={}
-
- version_major = int((torch.__version__).split('.')[0])
- version_minor = int((torch.__version__).split('.')[1])
- if version_major < 1 or (version_major == 1 and version_minor < 11):
- opt_args['use_external_data_format'] = True
- torch.onnx._export(
- simplified_encoder,
- input_ids,
- output_fpath,
- export_params=True,
- opset_version=12,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- if network_metadata.precision.fp16:
- G_LOGGER.debug("Clamping FP16 weights for BART")
- # BART doesn't have T5's Add-Cast-Pow ordering issue
- process_onnx([OnnxProcessOperation.CLAMP_WEIGHTS], output_fpath, output_fpath)
-
- return BARTEncoderONNXFile(output_fpath, network_metadata)
diff --git a/demo/HuggingFace/BART/frameworks.py b/demo/HuggingFace/BART/frameworks.py
deleted file mode 100644
index 3df3e908..00000000
--- a/demo/HuggingFace/BART/frameworks.py
+++ /dev/null
@@ -1,373 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import os
-import sys
-
-from typing import List, Union
-
-# huggingface
-from transformers import (
- BartForConditionalGeneration,
- BartTokenizer,
- BartConfig,
- MBartForConditionalGeneration,
- MBart50Tokenizer,
-)
-
-# torch
-import torch
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# TRT-HuggingFace
-from NNDF.interface import FrameworkCommand
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkResult,
- NetworkMetadata,
- NetworkRuntime,
- NetworkModels,
- NetworkModel,
- TimingProfile,
-)
-from BART.export import BARTEncoderTorchFile, BARTDecoderTorchFile
-from BART.BARTModelConfig import BARTModelTRTConfig, BARTBenchmarkingArgs
-from BART.measurements import decoder_inference, encoder_inference, full_inference_greedy, full_inference_beam, calculate_perplexity
-from NNDF.general_utils import confirm_folder_delete, NNFolderWorkspace
-
-
-class BARTHuggingFace(FrameworkCommand):
- def __init__(self):
- super().__init__(
- BARTModelTRTConfig, description="Runs framework results for BART model."
- )
-
- self.onnx_BART_encoder = None
- self.onnx_BART_decoder = None
- self.torch_BART_dir = None
-
- def generate_and_download_framework(
- self, metadata: NetworkMetadata, workspace: NNFolderWorkspace
- ) -> NetworkModels:
-
- cache_variant = False
- if metadata.other.kv_cache:
- cache_variant = True
-
- trt_BART_config = self.config
- metadata_serialized = trt_BART_config.get_metadata_string(metadata)
- workspace_dir, encoder_onnx_root, decoder_onnx_root = workspace.set_model_path(metadata_serialized, is_encoder_decoder = True)
- pytorch_model_dir = os.path.join(workspace_dir, "pytorch_model")
-
- # We keep track of the generated torch location for cleanup later
- self.torch_BART_dir = pytorch_model_dir
-
- model = None
- tfm_config = BartConfig(
- use_cache=cache_variant,
- num_layers=BARTModelTRTConfig.NUMBER_OF_LAYERS[metadata.variant],
- ) # Note
- if not os.path.exists(pytorch_model_dir):
- # mbart variant cannot be recognized by HF yet
- if "mbart" not in metadata.variant:
- # Generate the pre-trained weights
- model = BartForConditionalGeneration(tfm_config).from_pretrained(
- metadata.variant
- )
- else:
- model = MBartForConditionalGeneration.from_pretrained(metadata.variant)
-
- model.config.use_cache = cache_variant # somehow the use_cache config automatically set to True even though specified in tfm_config before. Force change
- model.save_pretrained(pytorch_model_dir)
- print("Pytorch Model saved to {}".format(pytorch_model_dir))
- else:
- print(
- "Frameworks file already exists, skipping generation and loading from file instead."
- )
- if "mbart" not in metadata.variant:
- model = BartForConditionalGeneration(tfm_config).from_pretrained(
- pytorch_model_dir
- )
- else:
- model = MBartForConditionalGeneration.from_pretrained(pytorch_model_dir)
-
- model.config.use_cache = cache_variant # somehow the use_cache config automatically set to True even though specified in tfm_config before. Force change
-
- # These ONNX models can be converted using special encoder and decoder classes.
- encoder_onnx_model_fpath = os.path.join(encoder_onnx_root, metadata_serialized + "-encoder.onnx")
- decoder_onnx_model_fpath = os.path.join(decoder_onnx_root, metadata_serialized + "-decoder-with-lm-head.onnx")
-
- BART_encoder = BARTEncoderTorchFile(model, metadata)
- BART_decoder = BARTDecoderTorchFile(model, metadata)
- self.onnx_BART_encoder = BART_encoder.as_onnx_model(
- encoder_onnx_model_fpath, force_overwrite=False
- )
- self.onnx_BART_decoder = BART_decoder.as_onnx_model(
- decoder_onnx_model_fpath, force_overwrite=False
- )
-
- onnx_models = [
- NetworkModel(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=self.onnx_BART_decoder.fpath,
- ),
- NetworkModel(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=self.onnx_BART_encoder.fpath,
- ),
- ]
- torch_models = [
- NetworkModel(
- name=BARTModelTRTConfig.NETWORK_FULL_NAME, fpath=pytorch_model_dir
- )
- ]
-
- return NetworkModels(torch=torch_models, onnx=onnx_models, trt=None)
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_onnx_model: bool = True,
- keep_pytorch_model: bool = True,
- ) -> None:
- """
- Cleans up the working directory and leaves models if available.
- Should not assume any functions from the framework class has been called.
- Return:
- None
- """
- # Clean-up generated files
- if not keep_onnx_model:
- if self.onnx_BART_decoder is not None:
- self.onnx_BART_decoder.cleanup()
- if self.onnx_BART_encoder is not None:
- self.onnx_BART_encoder.cleanup()
-
- if not keep_pytorch_model:
- # Using rmtree can be dangerous, have user confirm before deleting.
- confirm_folder_delete(
- self.torch_BART_dir,
- prompt="Confirm you want to delete downloaded pytorch model folder?",
- )
-
- if not keep_pytorch_model and not keep_onnx_model:
- workspace.cleanup(force_remove=False)
-
- def setup_tokenizer_and_model(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- ):
- tokenizer = BartTokenizer.from_pretrained(metadata.variant)
-
- # By default, huggingface model structure is one giant file.
- BART_torch_fpath = network_fpaths.torch[0].fpath
- config = BartConfig(
- use_cache=metadata.other.kv_cache,
- num_layers=BARTModelTRTConfig.NUMBER_OF_LAYERS[metadata.variant],
- )
- BART_model = BartForConditionalGeneration(config).from_pretrained(BART_torch_fpath)
- if "mbart" in metadata.variant:
- BART_model = MBartForConditionalGeneration(config).from_pretrained(BART_torch_fpath)
- tokenizer = MBart50Tokenizer.from_pretrained(metadata.variant, src_lang="en_XX")
-
- BART_torch_encoder = BARTEncoderTorchFile.TorchModule(BART_model.get_encoder())
- BART_torch_decoder = BARTDecoderTorchFile.TorchModule(
- BART_model.get_decoder(), BART_model.lm_head, BART_model.final_logits_bias, BART_model.config
- )
-
- return tokenizer, BART_torch_encoder, BART_torch_decoder
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- inference_input: str,
- timing_profile: TimingProfile,
- use_cpu: bool,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: BARTBenchmarkingArgs = None,
- ) -> Union[NetworkResult, BenchmarkingResult]:
-
- tokenizer, BART_torch_encoder, BART_torch_decoder = self.setup_tokenizer_and_model(metadata, network_fpaths)
-
- # Prepare the input tokens and find output sequence length.
- if not benchmarking_mode:
- output_seq_len = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- max_seq_len = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- input_seq_len = benchmarking_args.input_seq_len if benchmarking_args.input_seq_len > 0 else max_seq_len
- output_seq_len = benchmarking_args.output_seq_len if benchmarking_args.output_seq_len > 0 else max_seq_len
- input_ids = torch.randint(0, BARTModelTRTConfig.VOCAB_SIZE[metadata.variant], (batch_size, input_seq_len))
-
- encoder_last_hidden_state, encoder_e2e_time = encoder_inference(
- BART_torch_encoder, input_ids, timing_profile, use_cuda=(not use_cpu)
- )
-
- # Need to feed the decoder a new empty input_ids for text generation.
- decoder_output_len = output_seq_len // 2 if (not metadata.other.kv_cache) else 1
- decoder_input_ids = torch.full(
- (batch_size, decoder_output_len), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
- )
-
- _, decoder_e2e_time = decoder_inference(
- BART_torch_decoder, decoder_input_ids, encoder_last_hidden_state, timing_profile, use_cuda=(not use_cpu), use_cache=metadata.other.kv_cache
- )
-
- if num_beams == 1:
- decoder_output, full_e2e_runtime = full_inference_greedy(
- BART_torch_encoder,
- BART_torch_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- use_cuda=(not use_cpu),
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- )
- else:
- decoder_output, full_e2e_runtime = full_inference_beam(
- BART_torch_encoder,
- BART_torch_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- num_beams=num_beams,
- max_length=output_seq_len,
- min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- )
-
- # Prepare runtime results.
- runtime=[
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- runtime=encoder_e2e_time,
- ),
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=network_fpaths)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- decoder_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=decoder_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=network_fpaths,
- )
-
- def execute_calculate_perplexity(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- encoder_input: str,
- decoder_input: str,
- ):
- tokenizer, BART_torch_encoder, BART_torch_decoder = self.setup_tokenizer_and_model(metadata, network_fpaths)
- encoder_input_ids = tokenizer([encoder_input], padding=True, return_tensors="pt").input_ids
- decoder_input_ids = tokenizer([decoder_input], padding=True, return_tensors="pt").input_ids
- perplexity = calculate_perplexity(
- BART_torch_encoder, BART_torch_decoder, tokenizer, encoder_input_ids, decoder_input_ids,
- BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant],
- )
- return perplexity
-
- def run_framework(
- self,
- metadata: NetworkMetadata,
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_pytorch_model: bool,
- timing_profile: TimingProfile,
- use_cpu: bool = False,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
- """
- Main entry point of our function which compiles and generates our model data.
- """
- inference_results = []
- ppl_results = []
- workspace = NNFolderWorkspace(
- self.config.network_name, metadata, working_directory
- )
- try:
- network_fpaths = self.generate_and_download_framework(metadata, workspace)
- if not benchmarking_mode:
- for ninput in network_input:
- inference_results.append(
- self.execute_inference(
- metadata, network_fpaths, ninput, timing_profile, use_cpu, batch_size, args.num_beams
- )
- )
- if perplexity_reference is not None:
- assert len(network_input) == len(perplexity_reference), "Encoder and decoder inputs must pair up"
- for ei, di in zip(network_input, perplexity_reference):
- ppl_results.append(
- self.execute_calculate_perplexity(
- metadata, network_fpaths, ei, di
- )
- )
- else:
- benchmarking_args = BARTBenchmarkingArgs(args.input_seq_len, args.output_seq_len)
- inference_results = self.execute_inference(
- metadata, network_fpaths, None, timing_profile, use_cpu, batch_size, args.num_beams, True, benchmarking_args
- )
- finally:
- self.cleanup(workspace, keep_onnx_model, keep_pytorch_model)
-
- return inference_results, ppl_results
-
-
-# Entry point
-RUN_CMD = BARTHuggingFace()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/BART/hf.py b/demo/HuggingFace/BART/hf.py
deleted file mode 100755
index ae79b64c..00000000
--- a/demo/HuggingFace/BART/hf.py
+++ /dev/null
@@ -1,68 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Obtain the benchmark timing and output from the original HuggingFace BART model.
-
-Usage: python3 hf.py --variant facebook/bart-base [--enable-kv-cache] [--fp16]
-"""
-
-import time
-from transformers import BartTokenizer, BartForConditionalGeneration
-import argparse
-
-parser = argparse.ArgumentParser()
-parser.add_argument("--variant", help="Name of BART variant.")
-parser.add_argument("--enable-kv-cache", help="Bart enable KV cache", action="store_true", default=False)
-parser.add_argument("--fp16", help="Bart FP16", action="store_true", default=False)
-parser.add_argument("--num-beams", type=int, default=1, help="Enables beam search during decoding.")
-
-args = parser.parse_args()
-
-model = BartForConditionalGeneration.from_pretrained(args.variant) # facebook/bart-base, facebook/bart-large, facebook/bart-large-cnn
-tokenizer = BartTokenizer.from_pretrained(args.variant)
-model = model.to('cuda').eval()
-
-if args.fp16:
- model = model.half()
-
-ARTICLE_TO_SUMMARIZE = (
- "NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms. TensorRT, built on the NVIDIA CUDA parallel programming model, enables developers to optimize inference by leveraging libraries, development tools, and technologies in CUDA-X for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, TensorRT also uses sparse tensor cores for an additional performance boost."
-)
-
-input_ids = tokenizer([ARTICLE_TO_SUMMARIZE], padding=True, return_tensors="pt").input_ids.to('cuda')
-
-warmup = 10
-for i in range(warmup):
- summary_ids = model.generate(input_ids, max_length=1024, num_beams=args.num_beams, use_cache=args.enable_kv_cache)
-
-start = time.time()
-trials = 10
-
-input_ids = tokenizer([ARTICLE_TO_SUMMARIZE], padding=True, return_tensors="pt").input_ids.to('cuda')
-
-for i in range(trials):
- # Generate Summary. Note: generate() method already has torch.no_grad() decorator.
- summary_ids = model.generate(input_ids, max_length=1024, num_beams=args.num_beams, use_cache=args.enable_kv_cache)
-
-end = time.time()
-
-output = tokenizer.decode(summary_ids[-1,:], skip_special_tokens=True)
-
-print('BART output: ', output)
-print(f"Input sequence length: {input_ids.size(1)}, Output sequence length: {summary_ids[-1,:].size(0)}")
-print("Average run time: {:.2f} ms".format((end - start)/trials*1000))
diff --git a/demo/HuggingFace/BART/measurements.py b/demo/HuggingFace/BART/measurements.py
deleted file mode 100644
index 54f809b0..00000000
--- a/demo/HuggingFace/BART/measurements.py
+++ /dev/null
@@ -1,280 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Utils specific to BART network.
-"""
-
-# torch
-import torch
-
-# from HuggingFace transformers
-from transformers.generation_logits_process import (
- NoRepeatNGramLogitsProcessor,
- MinLengthLogitsProcessor,
- ForcedBOSTokenLogitsProcessor,
- ForcedEOSTokenLogitsProcessor,
- LogitsProcessorList,
-)
-from transformers.generation_stopping_criteria import (
- MaxLengthCriteria,
- StoppingCriteriaList,
-)
-from transformers.generation_beam_search import (
- BeamSearchScorer,
-)
-
-from BART.BARTModelConfig import BARTModelTRTConfig
-
-# TRT-HuggingFace
-from NNDF.general_utils import measure_python_inference_code
-from NNDF.torch_utils import use_cuda, expand_inputs_for_beam_search
-from NNDF.tensorrt_utils import TRTNativeRunner
-from NNDF.logger import G_LOGGER
-
-@use_cuda
-def decoder_inference(
- BART_decoder, input_ids, encoder_last_hidden_state, timing_profile, use_cuda=True, use_cache=False, past_key_values=None
-):
- # This implementation is a bit ugly. Moving implementation of the model to check HFRunner would be cleaner.
- if isinstance(BART_decoder, TRTNativeRunner):
- # Function is technically in BARTTRTDecoder however due to circular import, TRTNativeRunner in this module scope
- # implies the existence of this function.
- BART_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_state)
- BART_decoder.set_return_device("cuda" if use_cuda else "cpu")
-
- def decoder_stmt():
- BART_decoder(
- input_ids=input_ids, encoder_hidden_states=encoder_last_hidden_state, use_cache=use_cache,
- past_key_values=past_key_values
- )
-
- decoder_e2e_time = measure_python_inference_code(decoder_stmt, timing_profile)
-
- return (decoder_stmt(), decoder_e2e_time)
-
-
-@use_cuda
-def encoder_inference(BART_encoder, input_ids, timing_profile, use_cuda=True):
- encoder_stmt = lambda: BART_encoder(input_ids=input_ids)
- encoder_e2e_time = measure_python_inference_code(encoder_stmt, timing_profile)
-
- return (encoder_stmt(), encoder_e2e_time)
-
-
-# Code specifically for Pythonic inference measurement used across all BART related scripts
-@use_cuda
-def full_inference_greedy(
- BART_encoder,
- BART_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length,
- min_length=0,
- batch_size=1,
- use_cuda=True,
- early_stopping=False,
- use_cache=False
-):
- G_LOGGER.info("Running full inference with greedy decoding...")
-
- stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length)])
- no_repeat_ngram_size = BARTModelTRTConfig.NO_REPEAT_NGRAM_SIZE
- logits_processor = LogitsProcessorList([
- NoRepeatNGramLogitsProcessor(no_repeat_ngram_size),
- MinLengthLogitsProcessor(min_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token)),
- ForcedBOSTokenLogitsProcessor(tokenizer.convert_tokens_to_ids(tokenizer.bos_token)),
- ForcedEOSTokenLogitsProcessor(max_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token))
- ]) # by checking HuggingFace's generate() implementation carefully, the default logits processor for BART has no_repeat_ngram_size = 3 and forced_eos_token_id = 2. In this way we can get identical results with raw HuggingFace
-
- decoder_input_ids = torch.full(
- (batch_size, 1), tokenizer.convert_tokens_to_ids(tokenizer.eos_token), dtype=torch.int32
- )
-
- if use_cuda:
- decoder_input_ids = decoder_input_ids.to("cuda")
- else:
- decoder_input_ids = decoder_input_ids.to("cpu")
-
- def _e2e():
- with torch.no_grad():
- encoder_last_hidden_state = BART_encoder(input_ids=input_ids)
- decoder_output_greedy = BART_decoder.greedy_search(
- input_ids=decoder_input_ids,
- encoder_hidden_states=encoder_last_hidden_state,
- stopping_criteria=stopping_criteria,
- logits_processor=logits_processor,
- use_cache=use_cache
- )
- return decoder_output_greedy
-
- # With e2e we can opt to bind inputs only once for hidden states for optimization
- def _e2e_trt():
- with torch.no_grad():
- encoder_last_hidden_state = BART_encoder(input_ids=input_ids)
- BART_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_state)
- decoder_output_greedy = BART_decoder.greedy_search(
- input_ids=decoder_input_ids,
- encoder_hidden_states=encoder_last_hidden_state,
- stopping_criteria=stopping_criteria,
- logits_processor=logits_processor,
- use_cache=use_cache
- )
- return decoder_output_greedy
-
- measurement_function = _e2e
- if isinstance(BART_decoder, TRTNativeRunner):
- BART_decoder.set_return_device("cuda" if use_cuda else "cpu")
- measurement_function = _e2e_trt
-
- full_e2e_time = measure_python_inference_code(measurement_function, timing_profile)
-
- return (measurement_function(), full_e2e_time)
-
-@use_cuda
-def full_inference_beam(
- BART_encoder,
- BART_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- num_beams,
- max_length,
- min_length=0,
- batch_size=1,
- use_cuda=True,
- early_stopping=False, # Now used to control beam search early_stopping to have the same meaning as HuggingFace
- use_cache=False
-):
-
- G_LOGGER.info(f"Running full inference with beam search (num_beams = {num_beams}) decoding...")
-
- stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length)])
- no_repeat_ngram_size = BARTModelTRTConfig.NO_REPEAT_NGRAM_SIZE
- logits_processor = LogitsProcessorList([
- NoRepeatNGramLogitsProcessor(no_repeat_ngram_size),
- MinLengthLogitsProcessor(min_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token)),
- ForcedBOSTokenLogitsProcessor(tokenizer.convert_tokens_to_ids(tokenizer.bos_token)),
- ForcedEOSTokenLogitsProcessor(max_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token))
- ]) # HuggingFace's generate() applies a default logits processor for BART with no_repeat_ngram_size = 3 and forced_eos_token_id = 2; replicating it here makes the results identical to raw HuggingFace
-
- decoder_input_ids = torch.full(
- (batch_size, 1), tokenizer.convert_tokens_to_ids(tokenizer.eos_token), dtype=torch.int32
- )
- decoder_input_ids = expand_inputs_for_beam_search(decoder_input_ids, expand_size=num_beams)
-
- if use_cuda:
- decoder_input_ids = decoder_input_ids.to("cuda")
- else:
- decoder_input_ids = decoder_input_ids.to("cpu")
-
- def _e2e():
- with torch.no_grad():
- # beam scorer must be reset before each beam search run, otherwise beam search will be skipped due to scorer cache
- beam_scorer = BeamSearchScorer(
- batch_size=batch_size,
- num_beams=num_beams,
- device="cuda" if use_cuda else "cpu",
- do_early_stopping=early_stopping
- )
-
- encoder_last_hidden_state = BART_encoder(input_ids=input_ids)
-
- encoder_last_hidden_state = expand_inputs_for_beam_search(encoder_last_hidden_state, expand_size=num_beams)
-
- decoder_output_beam = BART_decoder.beam_search(
- input_ids=decoder_input_ids,
- beam_scorer=beam_scorer,
- encoder_hidden_states=encoder_last_hidden_state,
- stopping_criteria=stopping_criteria,
- logits_processor=logits_processor,
- use_cache=use_cache
- )
- return decoder_output_beam
-
- # For the TRT end-to-end path we can bind the encoder hidden states only once, as an optimization
- def _e2e_trt():
- with torch.no_grad():
- # beam scorer must be reset before each beam search run, otherwise beam search will be skipped due to scorer cache
- beam_scorer = BeamSearchScorer(
- batch_size=batch_size,
- num_beams=num_beams,
- device="cuda" if use_cuda else "cpu",
- do_early_stopping=early_stopping
- )
-
- encoder_last_hidden_state = BART_encoder(input_ids=input_ids)
-
- encoder_last_hidden_state = expand_inputs_for_beam_search(encoder_last_hidden_state, expand_size=num_beams)
-
- BART_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_state)
- decoder_output_beam = BART_decoder.beam_search(
- input_ids=decoder_input_ids,
- beam_scorer=beam_scorer,
- encoder_hidden_states=encoder_last_hidden_state,
- stopping_criteria=stopping_criteria,
- logits_processor=logits_processor,
- use_cache=use_cache
- )
- return decoder_output_beam
-
- measurement_function = _e2e
- if isinstance(BART_decoder, TRTNativeRunner):
- BART_decoder.set_return_device("cuda" if use_cuda else "cpu")
- measurement_function = _e2e_trt
-
- full_e2e_time = measure_python_inference_code(measurement_function, timing_profile)
-
- return (measurement_function(), full_e2e_time)
-
-
-@use_cuda
-def calculate_perplexity(
- BART_encoder,
- BART_decoder,
- tokenizer,
- input_ids,
- decoder_input_ids,
- max_seq_len=None,
- use_cuda=True,
-):
- encoder_last_hidden_state = BART_encoder(input_ids=input_ids)
- if isinstance(BART_decoder, TRTNativeRunner):
- BART_decoder.set_return_device("cuda" if use_cuda else "cpu")
- BART_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_state)
-
- # Shift right: prepend the EOS token (BART's decoder start token) to the decoder inputs
- decoder_input_ids_padded = torch.full(
- decoder_input_ids.size()[:-1] + (decoder_input_ids.size()[-1] + 1,),
- tokenizer.convert_tokens_to_ids(tokenizer.eos_token),
- dtype=decoder_input_ids.dtype,
- )
- decoder_input_ids_padded[..., 1:] = decoder_input_ids
-
- if use_cuda:
- encoder_last_hidden_state = encoder_last_hidden_state.to("cuda")
- decoder_input_ids_padded = decoder_input_ids_padded.to("cuda")
-
- with torch.no_grad():
- if max_seq_len is not None:
- decoder_input_ids_padded = decoder_input_ids_padded[:, :max_seq_len]
- logits = BART_decoder(decoder_input_ids_padded, encoder_last_hidden_state, return_dict=True).logits
- # Truncate the last prediction
- logits = logits[:, :-1, :]
- loss = torch.nn.CrossEntropyLoss()(logits.permute((0, 2, 1)), decoder_input_ids)
- return torch.exp(loss).item()
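
The `calculate_perplexity` helper above follows the standard teacher-forced formulation: the reference decoder ids are shifted right by one position behind the decoder start token, the decoder produces logits for every position, the prediction after the last target is dropped, and perplexity is the exponential of the token-level cross-entropy. A minimal, framework-agnostic sketch of that calculation (assuming a `decoder` callable that returns logits of shape `(batch, seq_len, vocab)`; the function and argument names here are illustrative, not part of the demo):

```python
import torch

def perplexity_from_logits(decoder, decoder_input_ids, start_token_id):
    # Shift the reference ids right by one and prepend the decoder start token.
    shifted = torch.full(
        decoder_input_ids.shape[:-1] + (decoder_input_ids.shape[-1] + 1,),
        start_token_id,
        dtype=decoder_input_ids.dtype,
    )
    shifted[..., 1:] = decoder_input_ids

    with torch.no_grad():
        logits = decoder(shifted)       # (batch, seq_len + 1, vocab), assumed
        logits = logits[:, :-1, :]      # drop the prediction after the last target
        # CrossEntropyLoss expects (batch, vocab, seq_len) logits vs. (batch, seq_len) targets.
        loss = torch.nn.CrossEntropyLoss()(
            logits.permute(0, 2, 1), decoder_input_ids.long()
        )
    return torch.exp(loss).item()
```
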
diff --git a/demo/HuggingFace/BART/onnxrt.py b/demo/HuggingFace/BART/onnxrt.py
deleted file mode 100644
index b7523e0d..00000000
--- a/demo/HuggingFace/BART/onnxrt.py
+++ /dev/null
@@ -1,353 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Executes ONNX Runtime framework code. See README.md for more information.
-"""
-
-import os
-import sys
-from typing import Dict, List, Tuple
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# huggingface
-from transformers import BartTokenizer, BartConfig, PretrainedConfig, MBart50Tokenizer
-from transformers.generation_utils import GenerationMixin
-from transformers.modeling_outputs import Seq2SeqLMOutput
-
-# torch
-import torch
-
-# TRT-HuggingFace
-from NNDF.interface import OnnxRTCommand
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkMetadata,
- NetworkModels,
- NetworkModel,
- NetworkResult,
- NetworkRuntime,
- Precision,
- TimingProfile,
-)
-
-from NNDF.general_utils import NNFolderWorkspace
-from NNDF.tensorrt_utils import PolygraphyOnnxRunner
-from BART.frameworks import BARTHuggingFace
-from BART.BARTModelConfig import BARTModelTRTConfig, BARTBenchmarkingArgs
-from BART.measurements import decoder_inference, encoder_inference, full_inference_greedy, full_inference_beam
-
-class OnnxHFRunner(PolygraphyOnnxRunner, GenerationMixin):
- """Runner that adds interop support for HF and HF provided greedy_search functions."""
-
- def __init__(self, engine_fpath: str, network_metadata: NetworkMetadata, tfm_config: PretrainedConfig):
- super().__init__(engine_fpath, network_metadata)
- # required for greedy search used by generation mixin
- self.config = tfm_config
-
-class BARTOnnxEncoder(OnnxHFRunner):
- """OnnxRT implemented network interface that is mainly to check correctness."""
-
- def forward(self, input_ids, *args, **kwargs):
- # Unoptimized unconditional transfer to numpy for interfacing with polygraphy
- input_ids = input_ids.cpu().numpy().astype("int64")
- return torch.from_numpy(self.trt_context.infer({"input_ids": input_ids})["hidden_states"])
-
-class BARTOnnxDecoder(OnnxHFRunner):
- def prepare_inputs_for_generation(self, input_ids, **kwargs):
- return {
- "input_ids": input_ids,
- "encoder_hidden_states": kwargs["encoder_hidden_states"],
- }
-
- def forward(self, input_ids, encoder_hidden_states, *args, **kwargs):
- # Unoptimized unconditional transfer to numpy for interfacing with polygraphy
- input_ids = input_ids.cpu().numpy().astype("int64")
- encoder_hidden_states = encoder_hidden_states.cpu().numpy().astype("float32")
-
- logits = self.trt_context.infer(
- {"input_ids": input_ids, "encoder_hidden_states": encoder_hidden_states}
- )["hidden_states"]
-
- return Seq2SeqLMOutput(logits=torch.from_numpy(logits))
-
-class BARTONNXRT(OnnxRTCommand):
- def __init__(self):
- super().__init__(
- BARTModelTRTConfig,
- "Runs polygraphy results for BART model.",
- BARTHuggingFace,
- )
- self.BART_ort_decoder = None
- self.BART_ort_encoder = None
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_onnx_model: bool = False,
- keep_torch_model: bool = False,
- ) -> None:
- # Deactivates context
- if self.BART_ort_encoder:
- self.BART_ort_encoder.release()
- if self.BART_ort_decoder:
- self.BART_ort_decoder.release()
-
- self.frameworks_cmd.cleanup(workspace, keep_onnx_model, keep_torch_model)
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Dict[str, NetworkModel],
- inference_input: str,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: BARTBenchmarkingArgs = None,
- ) -> NetworkResult:
-
- if "mbart" not in metadata.variant:
- tokenizer = BartTokenizer.from_pretrained(metadata.variant)
- else:
- tokenizer = MBart50Tokenizer.from_pretrained(metadata.variant, src_lang="en_XX")
-
- # Prepare the input tokens and find out output sequence length.
- if not benchmarking_mode:
- output_seq_len = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- max_seq_len = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- input_seq_len = benchmarking_args.input_seq_len if benchmarking_args.input_seq_len > 0 else max_seq_len
- output_seq_len = benchmarking_args.output_seq_len if benchmarking_args.output_seq_len > 0 else max_seq_len
- input_ids = torch.randint(0, BARTModelTRTConfig.VOCAB_SIZE[metadata.variant], (batch_size, input_seq_len))
-
- encoder_last_hidden_state, encoder_e2e_time = encoder_inference(
- self.BART_ort_encoder, input_ids, timing_profile
- )
-
- # Need to feed the decoder a new empty input_ids for text generation.
- decoder_output_len = output_seq_len // 2 if (not metadata.other.kv_cache) else 1
- decoder_input_ids = torch.full(
- (batch_size, decoder_output_len), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
- )
-
- _, decoder_e2e_time = decoder_inference(
- self.BART_ort_decoder,
- decoder_input_ids,
- encoder_last_hidden_state,
- timing_profile,
- use_cuda=False,
- )
-
- if num_beams == 1:
- decoder_output, full_e2e_runtime = full_inference_greedy(
- self.BART_ort_encoder,
- self.BART_ort_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- use_cuda=False,
- use_cache=metadata.other.kv_cache,
- batch_size=batch_size,
- )
- else:
- decoder_output, full_e2e_runtime = full_inference_beam(
- self.BART_ort_encoder,
- self.BART_ort_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- num_beams=num_beams,
- max_length=output_seq_len,
- min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- use_cuda=False,
- use_cache=metadata.other.kv_cache,
- batch_size=batch_size,
- )
-
- # Prepare runtime results.
- runtime=[
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- runtime=encoder_e2e_time,
- ),
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
- models=NetworkModels(
- torch=None,
- onnx=list(onnx_fpaths.values()),
- trt=None
- )
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=models)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- decoder_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=decoder_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=models,
- )
-
- def run_onnxrt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- ) -> List[NetworkResult]:
- workspace = NNFolderWorkspace(
- self.frameworks_cmd.config.network_name, metadata, working_directory
- )
-
- results = []
- try:
- if metadata.other.kv_cache:
- assert False, "OnnxRT currently does not support kv cache."
- # no fpath provided for onnx files, download them
- if len(onnx_fpaths) == 0:
- onnx_fpaths = self.frameworks_cmd.generate_and_download_framework(
- metadata, workspace
- ).onnx
- else:
- keep_onnx_model = True
- keep_torch_model = True
-
- # The number of output networks must not exceed the number of network segments explicitly defined by the configuration file.
- assert len(onnx_fpaths) == len(
- BARTModelTRTConfig.NETWORK_SEGMENTS
- ), "There should only be {} exported ONNX segments in BART model.".format(
- len(BARTModelTRTConfig.NETWORK_SEGMENTS)
- )
-
- lookup_onnx_table = {v.name: v for v in onnx_fpaths}
-
- tfm_config = BartConfig(
- use_cache=metadata.other.kv_cache,
- num_layers=BARTModelTRTConfig.NUMBER_OF_LAYERS[metadata.variant],
- )
- self.BART_ort_encoder = BARTOnnxEncoder(
- lookup_onnx_table["encoder"].fpath, metadata, tfm_config
- )
- self.BART_ort_decoder = BARTOnnxDecoder(
- lookup_onnx_table["decoder"].fpath, metadata, tfm_config
- )
-
- if not benchmarking_mode:
- for ninput in network_input:
- results.append(
- self.execute_inference(
- metadata, lookup_onnx_table, ninput, timing_profile, batch_size, args.num_beams
- )
- )
- else:
- benchmarking_args = BARTBenchmarkingArgs(args.input_seq_len, args.output_seq_len)
- results = self.execute_inference(
- metadata, lookup_onnx_table, None, timing_profile, batch_size, args.num_beams, True, benchmarking_args
- )
-
- finally:
- self.cleanup(workspace, keep_onnx_model, keep_torch_model)
-
- return results
-
- def add_args(self, parser) -> None:
- super().add_args(parser)
- onnx_group = parser.add_argument_group("onnx models")
- onnx_group.add_argument(
- "--onnx-decoder-fpath",
- default=None,
- help="Path to ONNX decoder. If None is supplied, scripts will generate them from HuggingFace.",
- )
- onnx_group.add_argument(
- "--onnx-encoder-fpath",
- default=None,
- help="Path to ONNX encoder. If None is supplied, scripts will generate them from HuggingFace.",
- )
-
- def args_to_network_models(self, args) -> List[NetworkModel]:
- # Check if both flags are given otherwise error out
- decoder_fpath_check = args.onnx_decoder_fpath is None
- encoder_fpath_check = args.onnx_encoder_fpath is None
-
- network_models = None
- if decoder_fpath_check and encoder_fpath_check:
- network_models = tuple()
- elif decoder_fpath_check or encoder_fpath_check:
- raise self._parser.error(
- "Both --onnx-decoder-fpath and --onnx-encoder-fpath must be given. Otherwise neither should be provided for script to download them."
- )
- else:
- onnx_decoder = NetworkModel(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=args.onnx_decoder_fpath,
- )
- onnx_encoder = NetworkModel(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=args.onnx_encoder_fpath,
- )
- network_models = (onnx_decoder, onnx_encoder)
-
- return network_models
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- """Override args to metadata to use export subroutine."""
- frameworks_parsed_metadata = self.frameworks_cmd.args_to_network_metadata(args)
-
- return NetworkMetadata(
- variant=frameworks_parsed_metadata.variant,
- precision=Precision(fp16=args.fp16),
- other=frameworks_parsed_metadata.other,
- )
-
-
-RUN_CMD = BARTONNXRT()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/BART/trt.py b/demo/HuggingFace/BART/trt.py
deleted file mode 100644
index 85bb2790..00000000
--- a/demo/HuggingFace/BART/trt.py
+++ /dev/null
@@ -1,1159 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import os
-import sys
-import copy
-from typing import Dict, List, Tuple, Union
-from functools import reduce
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# polygraphy
-from polygraphy.backend.trt import Profile
-
-# tensorrt
-import tensorrt as trt
-
-# torch
-import torch
-
-# huggingface
-from transformers import BartTokenizer, BartConfig, MBart50Tokenizer
-from transformers.modeling_outputs import Seq2SeqLMOutput
-from transformers.configuration_utils import PretrainedConfig
-from transformers.generation_utils import GenerationMixin
-
-# tensorrt
-from tensorrt import PreviewFeature
-
-# TRT-HuggingFace
-from NNDF.interface import TRTInferenceCommand
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkMetadata,
- NetworkModels,
- NetworkModel,
- NetworkResult,
- NetworkRuntime,
- Precision,
- TimingProfile,
-)
-
-from NNDF.tensorrt_utils import TRTNativeRunner, set_kv_data, allocate_binding_buffer, setup_benchmark_arg
-from NNDF.torch_utils import expand_inputs_for_beam_search
-from NNDF.general_utils import NNFolderWorkspace
-from BART.frameworks import BARTHuggingFace
-from BART.BARTModelConfig import BARTModelTRTConfig, BARTMetadata, BARTTRTBenchmarkingArgs
-from BART.measurements import decoder_inference, encoder_inference, full_inference_greedy, full_inference_beam, calculate_perplexity
-from BART.export import BARTDecoderONNXFile, BARTEncoderONNXFile
-from NNDF.models import TRTEngineFile
-from NNDF.logger import G_LOGGER
-
-# from HuggingFace transformers
-from transformers.generation_logits_process import (
- NoRepeatNGramLogitsProcessor,
- MinLengthLogitsProcessor,
- ForcedBOSTokenLogitsProcessor,
- ForcedEOSTokenLogitsProcessor,
- LogitsProcessorList,
-)
-from transformers.generation_stopping_criteria import (
- MaxLengthCriteria,
- StoppingCriteriaList,
-)
-from transformers.generation_beam_search import (
- BeamSearchScorer,
-)
-
-class TRTHFRunner(TRTNativeRunner, GenerationMixin):
- """Runner that adds interop support for HF and HF provided greedy_search functions."""
-
- # Stores the encoder input length received at runtime, which is used to slice decoder inputs.
- ENCODER_LENGTH = 0
- def _allocate_memory(self,
- input_shapes: Dict[str, tuple],
- input_types: Dict[str, torch.dtype],
- output_shapes: Dict[str, tuple],
- output_types: Dict[str, torch.dtype]):
- """Helper function for binding several inputs at once and pre-allocating the results."""
- # Allocate memories as 1D linear buffers for simpler handling of dynamic shapes.
- self.inputs = allocate_binding_buffer(input_types, input_shapes)
- self.outputs = allocate_binding_buffer(output_types, output_shapes)
-
- bindings = [None] * self.trt_engine.num_bindings
-
- for input_name, input_array in self.inputs.items():
- # Allocate memory for inputs
- input_idx = self.trt_engine.get_binding_index(input_name)
- self.trt_context.set_binding_shape(input_idx, input_shapes[input_name])
- bindings[input_idx] = input_array.data_ptr()
-
- assert self.trt_context.all_binding_shapes_specified
-
- for output_name, output_array in self.outputs.items():
- # Output shape should be allocated from context size
- output_idx = self.trt_engine.get_binding_index(output_name)
- bindings[output_idx] = output_array.data_ptr()
-
- return bindings
-
- def __init__(
- self,
- trt_engine_file: TRTEngineFile,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1
- ):
- super().__init__(trt_engine_file, network_metadata)
- self.config = hf_config
- self.batch_size = batch_size
-
-class BARTTRTEncoder(TRTHFRunner):
- """TRT implemented network interface that can be used to measure inference time."""
-
- def __init__(
- self,
- trt_engine_file: str,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1,
- benchmarking_args: BARTTRTBenchmarkingArgs = None
- ):
- super().__init__(trt_engine_file, network_metadata, hf_config, batch_size = batch_size)
- # In benchmarking mode, the max_sequence_length should be the designated input_profile_max_len
- if benchmarking_args is not None and benchmarking_args.input_profile_max_len is not None:
- self.max_sequence_length = benchmarking_args.input_profile_max_len
- else:
- self.max_sequence_length = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[network_metadata.variant]
- self.encoder_hidden_size = BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[network_metadata.variant]
-
- # We only have one profile to select so we can just grab the profile at the start of the class
- self.profile_idx = self.get_optimization_profile(batch_size=self.batch_size, sequence_length=1)
-
- self.input_shapes = {
- "input_ids": (self.batch_size, self.max_sequence_length)
- }
- self.input_types = {
- "input_ids": torch.int32
- }
- self.output_shapes = {
- "hidden_states": (self.batch_size, self.max_sequence_length, self.encoder_hidden_size)
- }
- self.output_types = {
- "hidden_states": torch.float32
- }
- self.bindings = self._allocate_memory(self.input_shapes, self.input_types, self.output_shapes, self.output_types)
-
- def forward(self, input_ids, *args, **kwargs):
- bs = self.batch_size
- max_length = self.max_sequence_length
- TRTHFRunner.ENCODER_LENGTH = input_ids.shape[1]
- input_length = input_ids.shape[1]
- encoder_hidden_size = self.encoder_hidden_size
-
- # Check if the input data is on CPU (which usually means PyTorch does not support the current GPU).
- is_cpu_mode = (input_ids.device == torch.device("cpu"))
-
- # We allocate the buffers using max_length, but we only need the first portion of it, so copy the data into the
- # first portion of the input buffer.
- # TODO: Could we just reuse input_ids' data_ptr() as the first binding when input_ids is already contiguous to
- # avoid an additional D2D?
- if is_cpu_mode:
- self.inputs["input_ids"] = input_ids.int().flatten().contiguous().cuda()
- self.bindings[0] = self.inputs["input_ids"].data_ptr()
- else:
- self.inputs["input_ids"][:bs * input_length] = input_ids.flatten()
-
- # Set the binding shape of input_ids, which should be (bs, input_length).
- self.trt_context.set_binding_shape(0, input_ids.shape)
-
- # Launch TRT inference.
- # TODO: Could we use execute_v2_async() instead of execute_v2()?
- self.trt_context.execute_v2(bindings=self.bindings)
-
- # We allocate the buffers using max_length, but we only need the first portion of it, so get only the first
- # portion of the output buffer and return that.
- # TODO: Could we construct a Torch tensor using given data_ptr() to avoid this D2D copy?
- hidden_states_output = self.outputs["hidden_states"]
- if is_cpu_mode:
- hidden_states_output = hidden_states_output.cpu()
-
- folded = hidden_states_output[:bs * input_length * encoder_hidden_size].view(bs, input_length, encoder_hidden_size)
-
- return folded
-
-class BARTTRTDecoder(TRTHFRunner):
-
- def __init__(
- self,
- trt_engine_file: TRTEngineFile,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_args: BARTTRTBenchmarkingArgs = None
- ):
- super().__init__(trt_engine_file, network_metadata, hf_config, batch_size = batch_size)
-
- # In benchmarking mode, the max_sequence_length should be the user-provided input_profile_max_len
- if benchmarking_args is not None and benchmarking_args.input_profile_max_len is not None:
- self.max_sequence_length = benchmarking_args.input_profile_max_len
- else:
- self.max_sequence_length = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[network_metadata.variant]
-
- # Similarly, the max_output_length should be the user-provided output_profile_max_len
- if benchmarking_args is not None and benchmarking_args.output_profile_max_len is not None:
- self.max_output_length = benchmarking_args.output_profile_max_len
- else:
- self.max_output_length = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[network_metadata.variant]
-
- self.encoder_hidden_size = BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[network_metadata.variant]
- self.num_heads = BARTModelTRTConfig.NUMBER_OF_HEADS[network_metadata.variant]
- self.embedding_size_per_head = self.encoder_hidden_size // self.num_heads
-
- # We only have one profile to select so we can just grab the profile at the start of the class
- self.profile_idx = self.get_optimization_profile(batch_size=self.batch_size * num_beams, sequence_length=1)
- input_profile_length = self.max_output_length if (not self.config.use_cache) else 1
- self.input_types = {
- "input_ids": torch.int32,
- "encoder_hidden_states": torch.float32
- }
- self.input_shapes = {
- "input_ids": (self.batch_size * num_beams, input_profile_length),
- "encoder_hidden_states": (self.batch_size * num_beams, self.max_sequence_length, self.encoder_hidden_size)
- }
-
- self.output_shapes = {
- "hidden_states": (self.batch_size * num_beams, self.max_output_length, BARTModelTRTConfig.VOCAB_SIZE[network_metadata.variant])
- }
- self.output_types = {
- "hidden_states": torch.float32
- }
-
- if self.config.use_cache:
-
- self.num_decoder_layers = BARTModelTRTConfig.NUMBER_OF_DECODER_LAYERS[network_metadata.variant]
- # Set kv cache shape and type
- for i in range(self.num_decoder_layers):
- kv_type_dict = {"encoder": torch.float32, "decoder": torch.float32}
- set_kv_data(self.input_types, "past", i, kv_type_dict)
- set_kv_data(self.output_types,"present", i, kv_type_dict)
-
- self_attention_kv_shape = (self.batch_size * num_beams, self.num_heads, self.max_output_length - 1, self.embedding_size_per_head)
- cross_attention_kv_shape = (self.batch_size * num_beams, self.num_heads, self.max_sequence_length, self.embedding_size_per_head)
- kv_shape_dict = {"encoder": cross_attention_kv_shape, "decoder": self_attention_kv_shape}
-
- set_kv_data(self.input_shapes, "past", i, kv_shape_dict)
- set_kv_data(self.output_shapes, "present", i, kv_shape_dict)
-
- self.kv_cache_binding_offset = 2 # 0: input_ids, 1: encoder_hidden_states, kv cache input indices start from 2
-
- self.bindings = self._allocate_memory(self.input_shapes, self.input_types, self.output_shapes, self.output_types)
-
- # Optimization bit
- self.persist_encoder_hidden_states = False
- self.persist_cross_attention_kv_cache = False
-
- self.use_non_kv_engine = self.config.use_cache
- # trick: set flag based on kv cache mode. This maintains code simplicity in forward() where a common codeblock is shared between non kv-cache & kv-cache modes
- # non kv-cache mode: False. Then in forward(), trt_context and bindings are set to the default ones
- # kv-cache mode: True. By default 1st decoding step starts with non-kv engine's context and binding; then flag gets updated in prepare_inputs_for_generation()
-
- self.return_device = torch.device('cuda')
-
- self.variant = network_metadata.variant # record variant name to later index the vocab_size in forward()
-
- def set_non_kv_engine_for_kv_mode(self, trt_engine_file_non_kv: TRTEngineFile):
- # same steps in tensorrt_utils.py: TRTNativeRunner
- with open(trt_engine_file_non_kv.fpath, "rb") as f:
- self.trt_engine_non_kv = self.trt_runtime.deserialize_cuda_engine(f.read())
- self.trt_context_non_kv = self.trt_engine_non_kv.create_execution_context()
-
- # Input does not have kv cache, so only input_ids and encoder_hidden_states
- self.input_types_non_kv = {k: self.input_types[k] for k in ["input_ids", "encoder_hidden_states"]}
- self.input_shapes_non_kv = {k: self.input_shapes[k] for k in ["input_ids", "encoder_hidden_states"]}
-
- # Output is the same as kv
- self.output_types_non_kv = copy.deepcopy(self.output_types)
- self.output_shapes_non_kv = copy.deepcopy(self.output_shapes)
-
- # follow same steps in _allocate_memory
- self.inputs_non_kv = allocate_binding_buffer(self.input_types_non_kv, self.input_shapes_non_kv)
- self.outputs_non_kv = allocate_binding_buffer(self.output_types_non_kv, self.output_shapes_non_kv)
-
- bindings = [None] * self.trt_engine_non_kv.num_bindings
-
- for input_name, input_array in self.inputs_non_kv.items():
- # Allocate memory for inputs
- input_idx = self.trt_engine_non_kv.get_binding_index(input_name)
- self.trt_context_non_kv.set_binding_shape(input_idx, self.input_shapes_non_kv[input_name])
- bindings[input_idx] = input_array.data_ptr()
-
- assert self.trt_context_non_kv.all_binding_shapes_specified
-
- for output_name, output_array in self.outputs_non_kv.items():
- # Output shape should be allocated from context size
- output_idx = self.trt_engine_non_kv.get_binding_index(output_name)
- bindings[output_idx] = output_array.data_ptr()
-
- self.bindings_non_kv = bindings
-
- G_LOGGER.info("Non-KV cache engine setup is successful in KV cache mode.")
-
- def set_encoder_hidden_states_for_inference_cycle(self, encoder_hidden_states):
- """Used to cache encoder hidden state runs across same encoder sessions"""
- self.persist_encoder_hidden_states = True
-
- bs = encoder_hidden_states.shape[0] # in beam search mode, bs is batch_size * num_beams
- encoder_hidden_size = self.encoder_hidden_size
- encoder_length = TRTHFRunner.ENCODER_LENGTH
- if encoder_hidden_states.device == torch.device("cpu"):
- self.inputs["encoder_hidden_states"] = encoder_hidden_states.flatten().contiguous().cuda()
- self.bindings[1] = self.inputs["encoder_hidden_states"].data_ptr()
- else:
- self.inputs["encoder_hidden_states"][:bs * encoder_length * encoder_hidden_size] = encoder_hidden_states.flatten()
-
- # for dual-engine approach in kv cache mode, set these for the non-kv engine as well
- if self.use_non_kv_engine:
- if encoder_hidden_states.device == torch.device("cpu"):
- self.inputs_non_kv["encoder_hidden_states"] = encoder_hidden_states.flatten().contiguous().cuda()
- self.bindings_non_kv[1] = self.inputs_non_kv["encoder_hidden_states"].data_ptr()
- else:
- self.inputs_non_kv["encoder_hidden_states"][:bs * encoder_length * encoder_hidden_size] = encoder_hidden_states.flatten()
-
- def set_cross_attention_kv_cache_for_inference_cycle(self, past_key_values):
- """
- Used to cache encoder-decoder cross attention kv caches across same encoder sessions.
-
- Unlike the self-attention cache, cross attention is constant during the decoding process, so we only need to set its bindings once at the first decoding step and skip it in all later steps (via the self.persist_cross_attention_kv_cache flag)
- """
- self.persist_cross_attention_kv_cache = True
-
- bs = past_key_values[0][0].shape[0] # In beam search, it should be batch_size * num_beams
- encoder_length = TRTHFRunner.ENCODER_LENGTH if past_key_values is not None else 0
- num_heads = self.num_heads
- embedding_size_per_head = self.embedding_size_per_head
-
- for i in range(self.num_decoder_layers):
-
- # Set the binding shape of cross-attention KV caches, which should be (bs, num_heads, encoder_length, embedding_size_per_head).
- cross_attention_kv_shape = (bs, num_heads, encoder_length, embedding_size_per_head)
- cross_attention_kv_flatten_length = bs * num_heads * encoder_length * embedding_size_per_head
-
- if past_key_values is not None:
- if past_key_values[0][0].device == torch.device("cpu"):
- self.inputs[f"past_key_values.{i}.encoder.key"] = past_key_values[i][2].flatten().contiguous().cuda()
- self.bindings[self.kv_cache_binding_offset+4*i+2] = self.inputs[f"past_key_values.{i}.encoder.key"].data_ptr()
-
- self.inputs[f"past_key_values.{i}.encoder.value"] = past_key_values[i][3].flatten().contiguous().cuda()
- self.bindings[self.kv_cache_binding_offset+4*i+3] = self.inputs[f"past_key_values.{i}.encoder.value"].data_ptr()
- else:
- self.inputs[f"past_key_values.{i}.encoder.key"][:cross_attention_kv_flatten_length] = past_key_values[i][2].flatten()
-
- self.inputs[f"past_key_values.{i}.encoder.value"][:cross_attention_kv_flatten_length] = past_key_values[i][3].flatten()
-
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i + 2, cross_attention_kv_shape)
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i + 3, cross_attention_kv_shape)
-
- def set_return_device(self, return_device):
- """
- Sets the device that returned tensors are moved to via to(). The device name should match torch device strings: cuda, cpu, etc.
- This is used in our measurement code.
- """
- self.return_device = return_device
-
- def _reorder_cache(self, past, beam_idx):
- reordered_past = ()
- for layer_past in past:
- # cached cross_attention states don't have to be reordered -> they are always the same
- reordered_past += (
- tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
- )
- return reordered_past
-
- def forward(self, input_ids, encoder_hidden_states, *args, **kwargs):
- # Get the batch size.
- bs = input_ids.shape[0] # in beam search mode, bs is batch_size * num_beams
-
- # Get the maximum sequence length.
- max_length = self.max_sequence_length
-
- # Get the vocab size.
- vocab_size = BARTModelTRTConfig.VOCAB_SIZE[self.variant]
-
- # Actual sequence length of the input_ids and the output hidden_states.
- input_length = input_ids.shape[1]
-
- # The sequence length of the encoder_hidden_states.
- encoder_length = TRTHFRunner.ENCODER_LENGTH
-
- # Encoder hidden size
- encoder_hidden_size = self.encoder_hidden_size
-
- # KV cache flag
- use_cache = kwargs.get("use_cache", False)
-
- # flag for switch between dual engines
- non_kv_flag = self.use_non_kv_engine or (self.config.use_cache and kwargs.get("past_key_values") is None)
- # condition 1: during e2e decoding test, based on flag
- # condition 2: during single-step decoder test, depending on whether past_key_values is empty
- # note: without --enable-kv-cache arg, this flag should remain False
-
- # denote as variable to allow switch between non-kv and kv engines in kv cache mode
- trt_context = self.trt_context_non_kv if non_kv_flag else self.trt_context
- bindings = self.bindings_non_kv if non_kv_flag else self.bindings
- inputs = self.inputs_non_kv if non_kv_flag else self.inputs
- outputs = self.outputs_non_kv if non_kv_flag else self.outputs
-
- # Check if the input data is on CPU (which usually means PyTorch does not support the current GPU).
- is_cpu_mode = (input_ids.device == torch.device("cpu")) or (self.return_device == "cpu")
-
- # We allocate the buffers using max_length, but we only need the first portion of it, so copy the data into the
- # first portion of the input buffer.
- # TODO: Could we just reuse input_ids' data_ptr() as the first binding when input_ids is already contiguous to
- # avoid an additional D2D?
- if is_cpu_mode:
- inputs["input_ids"] = input_ids.int().flatten().contiguous().cuda()
- bindings[0] = inputs["input_ids"].data_ptr()
- else:
- inputs["input_ids"][:bs * input_length] = input_ids.flatten()
-
- # Set the binding shape of input_ids, which should be (bs, input_length).
- trt_context.set_binding_shape(0, input_ids.shape)
-
- # If encoder hidden states have not been copied yet, copy the hidden states to the input buffer.
- if not self.persist_encoder_hidden_states:
- if is_cpu_mode:
- inputs["encoder_hidden_states"] = encoder_hidden_states.flatten().contiguous().cuda()
- bindings[1] = inputs["encoder_hidden_states"].data_ptr()
- else:
- inputs["encoder_hidden_states"][:bs * encoder_length * encoder_hidden_size] = encoder_hidden_states.flatten()
-
- # Set the binding shape of encoder_hidden_states, which should be (bs, encoder_length, encoder_hidden_size).
- trt_context.set_binding_shape(1, (bs, encoder_length, encoder_hidden_size))
-
- if self.config.use_cache: # or use_cache
- if non_kv_flag:
- # use non-kv engine, no additional inputs
- past_decoder_length = 0
- else:
- # use kv engine
- past_key_values = kwargs.get("past_key_values") # set by prepare_inputs_for_generation() during the HF e2e pipeline; when testing the decoder alone, this field must be set explicitly
- past_decoder_length = past_key_values[0][0].size(2)
- num_heads = self.num_heads
- embedding_size_per_head = self.embedding_size_per_head
-
- # for all BART variants, # encoder layers = # decoder layers, so just divide total # layers by 2
- for i in range(self.num_decoder_layers):
-
- # Set the binding shape of self-attention KV caches, which should be (bs, num_heads, past_decoder_length, embedding_size_per_head).
- self_attention_kv_shape = (bs, num_heads, past_decoder_length, embedding_size_per_head)
- self_attention_kv_flatten_length = bs * num_heads * past_decoder_length * embedding_size_per_head
-
- if past_key_values is not None:
- if past_key_values[0][0].device == torch.device("cpu"):
- inputs[f"past_key_values.{i}.decoder.key"] = past_key_values[i][0].flatten().contiguous().cuda()
- bindings[self.kv_cache_binding_offset+4*i] = inputs[f"past_key_values.{i}.decoder.key"].data_ptr()
-
- inputs[f"past_key_values.{i}.decoder.value"] = past_key_values[i][1].flatten().contiguous().cuda()
- bindings[self.kv_cache_binding_offset+4*i+1] = inputs[f"past_key_values.{i}.decoder.value"].data_ptr()
-
- else:
- inputs[f"past_key_values.{i}.decoder.key"][:self_attention_kv_flatten_length] = past_key_values[i][0].flatten()
-
- inputs[f"past_key_values.{i}.decoder.value"][:self_attention_kv_flatten_length] = past_key_values[i][1].flatten()
-
- trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i, self_attention_kv_shape)
- trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i + 1, self_attention_kv_shape)
-
- # Set the binding shape of cross-attention KV caches, which should be (bs, num_heads, encoder_length, embedding_size_per_head).
- # since cross-attention KV cache dimension is fixed, we set once at the start and skip later
- if not self.persist_cross_attention_kv_cache:
- self.set_cross_attention_kv_cache_for_inference_cycle(past_key_values)
-
- # Launch TRT inference.
- # TODO: Could we use execute_v2_async() instead of execute_v2()? Current profiling shows that there is a
- # synchronization inside TRT's inference body, so this change may not be needed.
- trt_context.execute_v2(bindings=bindings)
-
- # We allocate the buffers using max_length, but we only need the first portion of it, so get only the first
- # portion of the output buffer and return that.
- # TODO: Could we construct a Torch tensor using given data_ptr() to avoid this D2D copy?
- hidden_states_output = outputs["hidden_states"]
- if is_cpu_mode:
- hidden_states_output = hidden_states_output.cpu()
-
- folded = hidden_states_output[:bs * input_length * vocab_size].view(bs, input_length, vocab_size)
- present_key_values = None
- if self.config.use_cache:
- # 1st decoding step and steps after handle the outputs in the same way
- present_key_values = ()
- curr_decoder_length = past_decoder_length + input_length
- num_heads = self.num_heads
- embedding_size_per_head = self.embedding_size_per_head
-
- for i in range(self.num_decoder_layers):
-
- self_attention_kv_shape = (bs, num_heads, curr_decoder_length, embedding_size_per_head)
- self_attention_kv_flatten_length = bs * num_heads * curr_decoder_length * embedding_size_per_head
-
- cross_attention_kv_shape = (bs, num_heads, encoder_length, embedding_size_per_head)
- cross_attention_kv_flatten_length = bs * num_heads * encoder_length * embedding_size_per_head
-
- self_attn_k_output = outputs[f"present_key_values.{i}.decoder.key"]
- self_attn_v_output = outputs[f"present_key_values.{i}.decoder.value"]
- if is_cpu_mode:
- self_attn_k_output = self_attn_k_output.cpu()
- self_attn_v_output = self_attn_v_output.cpu()
-
- self_attn_k = self_attn_k_output[:self_attention_kv_flatten_length].view(*self_attention_kv_shape)
- self_attn_v = self_attn_v_output[:self_attention_kv_flatten_length].view(*self_attention_kv_shape)
-
- cross_attn_k = None
- cross_attn_v = None
- if is_cpu_mode or non_kv_flag:
- cross_attn_k_output = outputs[f"present_key_values.{i}.encoder.key"]
- cross_attn_v_output = outputs[f"present_key_values.{i}.encoder.value"]
- if is_cpu_mode:
- cross_attn_k_output = cross_attn_k_output.cpu()
- cross_attn_v_output = cross_attn_v_output.cpu()
- cross_attn_k = cross_attn_k_output[:cross_attention_kv_flatten_length].view(*cross_attention_kv_shape)
- cross_attn_v = cross_attn_v_output[:cross_attention_kv_flatten_length].view(*cross_attention_kv_shape)
-
- present_key_values += ((self_attn_k, self_attn_v, cross_attn_k, cross_attn_v), ) # make multi-dim tuple
-
- # Transfer predictions back from GPU to do greedy search
- return Seq2SeqLMOutput(logits=folded.to(self.return_device), past_key_values=present_key_values,)
-
- def prepare_inputs_for_generation(self, input_ids, past=None, use_cache=None, **kwargs):
- # in HuggingFace generation_utils.py, this function will be called at each decoding step, before running the decoder's forward().
- # So we can use it to set the flag indicating if this is the 1st decoding step (use non-kv engine) or steps after (use kv engine)
- # cut decoder_input_ids if past is used (with past cache, only need to process the current length 1 token)
- # also, if past exists, it means we're at > 1 decoding steps thus set non-kv engine flag to False
- if past is not None:
- input_ids = input_ids[:, -1:]
- self.use_non_kv_engine = False
-
- ret = {
- "input_ids": input_ids,
- "encoder_hidden_states": kwargs["encoder_hidden_states"],
- }
-
- if self.config.use_cache:
- ret["use_cache"] = use_cache
- ret["past_key_values"] = past
-
- return ret
-
-
-class BARTTRT(TRTInferenceCommand):
- def __init__(self):
- super().__init__(
- BARTModelTRTConfig,
- "Runs trt results for BART model.",
- BARTHuggingFace,
- )
- self.BART_trt_decoder = None
- self.BART_trt_encoder = None
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_trt_engine: bool = False,
- keep_onnx_model: bool = False,
- keep_torch_model: bool = False,
- ) -> None:
- # Deactivates context
- if self.BART_trt_encoder:
- self.BART_trt_encoder.release()
- if self.BART_trt_decoder:
- self.BART_trt_decoder.release()
-
- if not keep_trt_engine:
- self.BART_trt_encoder_engine.cleanup()
- self.BART_trt_decoder_engine.cleanup()
- # TODO: Avoid using workspace.metadata to handle non_kv removals.
- if workspace.metadata.other.kv_cache:
- self.BART_trt_decoder_engine_non_kv.cleanup()
-
- self.frameworks_cmd.cleanup(workspace, keep_onnx_model, keep_torch_model)
-
- def setup(self, encoder, decoder):
- self.BART_trt_encoder = encoder
- self.BART_trt_decoder = decoder
-
- def generate(
- self,
- input_ids,
- min_length: int = None,
- max_length: int = None,
- num_beams: int = 1,
- use_cache: bool = False,
- early_stopping: bool = True, # Deprecated
- ):
- batch_size = input_ids.shape[0]
-
- if max_length is None:
- max_length = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[self.metadata.variant]
-
- if min_length is None:
- min_length = BARTModelTRTConfig.MIN_OUTPUT_LENGTH[self.metadata.variant]
-
- stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length)])
- logits_processor = LogitsProcessorList([
- NoRepeatNGramLogitsProcessor(BARTModelTRTConfig.NO_REPEAT_NGRAM_SIZE),
- MinLengthLogitsProcessor(min_length, BARTModelTRTConfig.EOS_TOKEN_ID),
- ForcedBOSTokenLogitsProcessor(BARTModelTRTConfig.BOS_TOKEN_ID),
- ForcedEOSTokenLogitsProcessor(max_length, BARTModelTRTConfig.EOS_TOKEN_ID)
- ])
-
- decoder_input_ids = torch.full(
- (batch_size, 1), BARTModelTRTConfig.EOS_TOKEN_ID, dtype=torch.int32
- ).to("cuda")
-
- if num_beams == 1:
- G_LOGGER.info("Running full inference with greedy decoding...")
- encoder_last_hidden_state = self.BART_trt_encoder(input_ids=input_ids)
- self.BART_trt_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_state)
- decoder_output = self.BART_trt_decoder.greedy_search(
- input_ids=decoder_input_ids,
- encoder_hidden_states=encoder_last_hidden_state,
- stopping_criteria=stopping_criteria,
- logits_processor=logits_processor,
- use_cache=use_cache
- )
- else:
- G_LOGGER.info(f"Running full inference with beam search (num_beams = {num_beams}) decoding...")
-
- beam_scorer = BeamSearchScorer(
- batch_size=batch_size,
- num_beams=num_beams,
- device="cuda",
- do_early_stopping=early_stopping,
- )
-
- decoder_input_ids = expand_inputs_for_beam_search(decoder_input_ids, expand_size=num_beams)
-
- encoder_last_hidden_state = self.BART_trt_encoder(input_ids=input_ids)
-
- encoder_last_hidden_state = expand_inputs_for_beam_search(encoder_last_hidden_state, expand_size=num_beams)
-
- self.BART_trt_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_state)
- decoder_output = self.BART_trt_decoder.beam_search(
- input_ids=decoder_input_ids,
- beam_scorer=beam_scorer,
- encoder_hidden_states=encoder_last_hidden_state,
- stopping_criteria=stopping_criteria,
- logits_processor=logits_processor,
- use_cache=use_cache
- )
-
- self.reset_decoder_state()
-
- return decoder_output
-
- def reset_decoder_state(self):
- # During execute_inference, set_encoder_hidden_states_for_inference_cycle will be called in full_inference_greedy anyway to overwrite the saved encoder_hidden_states
- # But explicitly resetting this flag is still beneficial
- self.BART_trt_decoder.persist_encoder_hidden_states = False
- # Because the same decoder is reused across inputs, its flags need to be reset between inputs.
- # TODO: In BARTTRTDecoder, a dedicated reset function may be needed to handle this after each task.
- if self.metadata.other.kv_cache:
- self.BART_trt_decoder.persist_cross_attention_kv_cache = False
- self.BART_trt_decoder.use_non_kv_engine = self.metadata.other.kv_cache
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Dict[str, NetworkModel],
- inference_input: str,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: BARTTRTBenchmarkingArgs = None,
- ) -> Union[NetworkResult, BenchmarkingResult]:
- if "mbart" not in metadata.variant:
- tokenizer = BartTokenizer.from_pretrained(metadata.variant)
- else:
- tokenizer = MBart50Tokenizer.from_pretrained(metadata.variant, src_lang="en_XX")
-
- # Prepare the input tokens and find output sequence length.
- if not benchmarking_mode:
- output_seq_len = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- input_seq_len = benchmarking_args.input_seq_len
- output_seq_len = benchmarking_args.output_seq_len
-
- input_ids = torch.randint(0, BARTModelTRTConfig.VOCAB_SIZE[metadata.variant], (batch_size, input_seq_len))
-
- encoder_last_hidden_state, encoder_e2e_time = encoder_inference(
- self.BART_trt_encoder, input_ids, timing_profile
- )
-
- # Need to feed the decoder a new empty input_ids for text generation.
- decoder_output_len = output_seq_len // 2 if (not metadata.other.kv_cache) else 1
- decoder_input_ids = torch.full(
- (batch_size, decoder_output_len), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
- )
-
- _, decoder_e2e_time = decoder_inference(
- self.BART_trt_decoder,
- expand_inputs_for_beam_search(decoder_input_ids, num_beams) if num_beams > 1 else decoder_input_ids,
- expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state,
- timing_profile,
- use_cache=metadata.other.kv_cache,
- )
-
- if num_beams == 1:
- decoder_output, full_e2e_runtime = full_inference_greedy(
- self.BART_trt_encoder,
- self.BART_trt_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- )
- else:
- decoder_output, full_e2e_runtime = full_inference_beam(
- self.BART_trt_encoder,
- self.BART_trt_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- num_beams=num_beams,
- max_length=output_seq_len,
- min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- )
-
- # Prepare runtime results.
- runtime=[
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- runtime=encoder_e2e_time,
- ),
- NetworkRuntime(
- name=BARTModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
- models=NetworkModels(
- torch=None,
- onnx=list(onnx_fpaths.values()),
- trt=[
- NetworkModel(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=self.BART_trt_decoder_engine.fpath,
- ),
- NetworkModel(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=self.BART_trt_encoder_engine.fpath,
- ),
- ],
- )
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=models)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- decoder_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=decoder_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=models,
- )
-
- def execute_calculate_perplexity(
- self,
- metadata: NetworkMetadata,
- encoder_input: str,
- decoder_input: str,
- batch_size: int,
- ):
- if "mbart" not in metadata.variant:
- tokenizer = BartTokenizer.from_pretrained(metadata.variant)
- else:
- tokenizer = MBart50Tokenizer.from_pretrained(metadata.variant, src_lang="en_XX")
-
- encoder_input_ids = tokenizer([encoder_input] * batch_size, padding=True, return_tensors="pt").input_ids
- decoder_input_ids = tokenizer([decoder_input] * batch_size, padding=True, return_tensors="pt").input_ids
-
- perplexity = calculate_perplexity(
- self.BART_trt_encoder, self.BART_trt_decoder, tokenizer, encoder_input_ids, decoder_input_ids,
- BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant],
- )
- return perplexity
-
- def _setup_engines(
- self,
- metadata: NetworkMetadata,
- hash_onnx_fpath: Dict[str, NetworkModel],
- batch_size: int,
- num_beams: int,
- disable_preview_dynamic_shapes: bool,
- benchmarking_args: BARTTRTBenchmarkingArgs = None,
- seq_tag: bool = False, # whether the benchmark engine tag format should be seq or max
- ) -> None:
-
- # The number of output networks must not exceed the number of network segments explicitly defined by the configuration file.
- assert len(hash_onnx_fpath) == len(
- BARTModelTRTConfig.NETWORK_SEGMENTS
- ), "There should only be {} exported ONNX segments in BART model.".format(
- len(BARTModelTRTConfig.NETWORK_SEGMENTS)
- )
-
- decoder_onnx_fpath = hash_onnx_fpath[
- BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME
- ].fpath
- encoder_onnx_fpath = hash_onnx_fpath[
- BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME
- ].fpath
-
- # Generate optimization profiles.
- # non-benchmarking mode: opt profile length is by default half of the max profile
- # benchmarking mode: user can specify opt and max profile by flags. If no additional benchmarking flags are provided, it will just use the non-benchmarking mode defaults
- max_sequence_length = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- max_output_length = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
- opt_input_seq_len = max_sequence_length // 2
- opt_output_seq_len = max_output_length // 2
-
- # benchmarking flags
- if benchmarking_args is not None:
- max_sequence_length = benchmarking_args.input_profile_max_len
- max_output_length = benchmarking_args.output_profile_max_len
- opt_input_seq_len = benchmarking_args.input_seq_len
- opt_output_seq_len = benchmarking_args.output_seq_len
-
- encoder_hidden_size = BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[metadata.variant]
-
- encoder_profiles = [
- Profile().add(
- "input_ids",
- min=(batch_size, 1),
- opt=(batch_size, opt_input_seq_len),
- max=(batch_size, max_sequence_length),
- )
- ]
-
- # Set up the non kv engine, used for non-kv mode and kv mode generation phase (1st decoder run uses the non-kv profile to generate kv cache)
- dec_profiles_non_kv = Profile()
-
- # for beam search, decoder engine's inputs are expanded `num_beams` times
- # optimization profiles should be changed accordingly, but onnx models can be shared across greedy/beam because the first dim (batch size) is already a dynamic value, so no change needed in export.py
- if not metadata.other.kv_cache:
- dec_profiles_non_kv = dec_profiles_non_kv.add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, opt_output_seq_len),
- max=(batch_size * num_beams, max_output_length),
- )
- else:
- dec_profiles_non_kv = dec_profiles_non_kv.add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, 1),
- max=(batch_size * num_beams, 1),
- )
-
- dec_profiles_non_kv = dec_profiles_non_kv.add(
- "encoder_hidden_states",
- min=(batch_size * num_beams, 1, encoder_hidden_size),
- opt=(batch_size * num_beams, opt_input_seq_len, encoder_hidden_size),
- max=(batch_size * num_beams, max_sequence_length, encoder_hidden_size),
- )
-
- decoder_profiles_non_kv = [dec_profiles_non_kv]
- dec_profiles_kv = copy.deepcopy(dec_profiles_non_kv)
- if metadata.other.kv_cache:
-
- num_heads = BARTModelTRTConfig.NUMBER_OF_HEADS[metadata.variant]
- embedding_size_per_head = encoder_hidden_size // num_heads
- num_decoder_layers = BARTModelTRTConfig.NUMBER_OF_DECODER_LAYERS[metadata.variant]
-
- self_attention_profile = {
- "min": (batch_size * num_beams, num_heads, 0, embedding_size_per_head),
- "opt": (batch_size * num_beams, num_heads, opt_output_seq_len - 1, embedding_size_per_head),
- "max": (batch_size * num_beams, num_heads, max_output_length - 1, embedding_size_per_head),
- }
- cross_attention_profile = {
- "min": (batch_size * num_beams, num_heads, 1, embedding_size_per_head),
- "opt": (batch_size * num_beams, num_heads, opt_input_seq_len, embedding_size_per_head),
- "max": (batch_size * num_beams, num_heads, max_sequence_length, embedding_size_per_head),
- }
-
- for i in range(num_decoder_layers):
- dec_profiles_kv = dec_profiles_kv.add(
- f"past_key_values.{i}.decoder.key",
- **self_attention_profile
- )
- dec_profiles_kv = dec_profiles_kv.add(
- f"past_key_values.{i}.decoder.value",
- **self_attention_profile
- )
- dec_profiles_kv = dec_profiles_kv.add(
- f"past_key_values.{i}.encoder.key",
- **cross_attention_profile
- )
- dec_profiles_kv = dec_profiles_kv.add(
- f"past_key_values.{i}.encoder.value",
- **cross_attention_profile
- )
- decoder_profiles_kv = [dec_profiles_kv]
-
- decoder_profiles = decoder_profiles_kv if (metadata.other.kv_cache) else decoder_profiles_non_kv
-
- # Convert ONNX models to TRT engines.
- if benchmarking_args is None:
- engine_tag = "bs{}".format(batch_size)
- # When the user does not provide any profile_max_len, use the sequence lengths as the tag; both max values fall back to the config max
- elif seq_tag:
- engine_tag = "bs{}-inseq{}-outseq{}".format(batch_size, benchmarking_args.input_seq_len, benchmarking_args.output_seq_len)
- # When the user provides profile_max_len, the engine can be reused later with different seq_len values
- else:
- engine_tag = "bs{}-inmax{}-outmax{}".format(batch_size, benchmarking_args.input_profile_max_len, benchmarking_args.output_profile_max_len)
-
- if num_beams > 1:
- engine_tag += "-beam{}".format(num_beams)
-
- preview_features = [PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
- if disable_preview_dynamic_shapes:
- engine_tag += "-noPreviewFasterDynamicShapes"
- else:
- preview_features.append(PreviewFeature.FASTER_DYNAMIC_SHAPES_0805)
-
- self.BART_trt_encoder_engine = BARTEncoderONNXFile(
- encoder_onnx_fpath, metadata
- ).as_trt_engine(
- encoder_onnx_fpath + "-{}.engine".format(engine_tag).replace(f"-beam{num_beams}", ""), # encoder engine name not affected by beam search
- profiles=encoder_profiles,
- preview_features=preview_features
- )
-
- if not metadata.other.kv_cache:
- self.BART_trt_decoder_engine = BARTDecoderONNXFile(
- decoder_onnx_fpath, metadata
- ).as_trt_engine(
- os.path.splitext(decoder_onnx_fpath)[0] + "-{}.engine".format(engine_tag),
- profiles=decoder_profiles,
- preview_features=preview_features
- )
- else:
- decoder_root, decoder_fullname = os.path.split(decoder_onnx_fpath)
- # Split kv and non kv engines into separate folders to avoid weight overlap
- non_kv_root = os.path.join(decoder_root, "non-kv")
- kv_root = os.path.join(decoder_root, "kv")
- decoder_name, decoder_ext = os.path.splitext(decoder_fullname)
- decoder_onnx_non_kv_fpath = os.path.join(non_kv_root, decoder_name + "-non-kv" + decoder_ext)
- decoder_onnx_kv_fpath = os.path.join(kv_root, decoder_fullname)
- self.BART_trt_decoder_engine = BARTDecoderONNXFile(
- decoder_onnx_kv_fpath, metadata
- ).as_trt_engine(
- os.path.splitext(decoder_onnx_kv_fpath)[0] + "-{}.engine".format(engine_tag),
- profiles=decoder_profiles,
- preview_features=preview_features
- )
-            # Dual-engine approach: we still need to set up the non-kv engine in kv mode.
-            # Note: workspace cleanup is not handled for these extra non-kv files.
- self.BART_trt_decoder_engine_non_kv = BARTDecoderONNXFile(
- decoder_onnx_non_kv_fpath, metadata
- ).as_trt_engine(
- os.path.splitext(decoder_onnx_non_kv_fpath)[0] + "-{}.engine".format(engine_tag),
- profiles=decoder_profiles_non_kv,
- preview_features=preview_features
- )
-
- # Create BARTTRTEncoder and BARTTRTDecoder instances.
- tfm_config = BartConfig(
- use_cache=metadata.other.kv_cache,
- num_layers=BARTModelTRTConfig.NUMBER_OF_LAYERS[metadata.variant],
- )
- self.BART_trt_encoder = BARTTRTEncoder(
- self.BART_trt_encoder_engine, metadata, tfm_config, batch_size=batch_size, benchmarking_args = benchmarking_args
- )
- self.BART_trt_decoder = BARTTRTDecoder(
- self.BART_trt_decoder_engine, metadata, tfm_config, batch_size=batch_size, num_beams=num_beams, benchmarking_args = benchmarking_args
- )
-
- if metadata.other.kv_cache:
-            # Switching between BARTTRTDecoder instances is impossible (because the HF decoding step is bound to one decoder). Therefore, we add the non-kv engine inside the same decoder --> the decoder contains two TRT engines.
- self.BART_trt_decoder.set_non_kv_engine_for_kv_mode(self.BART_trt_decoder_engine_non_kv)
-
- def run_trt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_trt_engine: bool,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- disable_preview_dynamic_shapes: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult] :
-
- self.working_directory = working_directory
- workspace = self._setup_workspace(metadata, working_directory)
-
-        # Keep the ONNX and Torch models if they are provided by the user.
- if len(onnx_fpaths) == 0:
- onnx_fpaths = self._download_models(workspace, metadata)
- else:
- keep_onnx_model = True
- keep_torch_model = True
-
- hash_onnx_fpath = {v.name: v for v in onnx_fpaths}
-
- inference_results = []
- ppl_results = []
- try:
- if not benchmarking_mode:
- self._setup_engines(metadata, hash_onnx_fpath, batch_size, args.num_beams, disable_preview_dynamic_shapes)
- for ninput in network_input:
- inference_results.append(
- self.execute_inference(
- metadata, hash_onnx_fpath, ninput, timing_profile, batch_size, args.num_beams
- )
- )
- self.reset_decoder_state()
-
- if perplexity_reference is not None:
- assert len(network_input) == len(perplexity_reference), "Encoder and decoder inputs must pair up"
-
- if metadata.other.kv_cache or (args.num_beams > 1):
- G_LOGGER.warning("Skipping perplexity calculation for TRT with KV cache or beam search because it is not supported yet.")
- else:
- for ei, di in zip(network_input, perplexity_reference):
- ppl_results.append(
- self.execute_calculate_perplexity(metadata, ei, di, batch_size)
- )
-
- else:
-                # Check that input_seq_len and output_seq_len are valid and within the required range
- max_input_seq_len = BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- max_output_seq_len = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
-
- seq_tag = args.input_profile_max_len is None and args.output_profile_max_len is None
-                # The user must provide either a pair of [input/output]_profile_max_len values or a pair of [input/output]_seq_len values
- if args.input_profile_max_len is None or args.output_profile_max_len is None:
- if args.input_seq_len is None or args.output_seq_len is None:
- assert False, "Please provide at least one pair of inputs: [input/output]_seq_len or [input/output]_profile_max_len"
-
- input_profile_max_len = setup_benchmark_arg(args.input_profile_max_len, "input_profile_max_len", max_input_seq_len)
- output_profile_max_len = setup_benchmark_arg(args.output_profile_max_len, "output_profile_max_len", max_output_seq_len)
- input_seq_len = setup_benchmark_arg(args.input_seq_len, "input_seq_len", input_profile_max_len // 2)
- output_seq_len = setup_benchmark_arg(args.output_seq_len, "output_seq_len", output_profile_max_len // 2)
-
- benchmarking_args = BARTTRTBenchmarkingArgs(input_seq_len, output_seq_len, input_profile_max_len, output_profile_max_len)
-
- # Assert to ensure the validity of benchmarking arguments
-                assert benchmarking_args.input_seq_len <= benchmarking_args.input_profile_max_len, "input_seq_len should be <= input_profile_max_len = {} in benchmarking mode".format(benchmarking_args.input_profile_max_len)
-                assert benchmarking_args.output_seq_len <= benchmarking_args.output_profile_max_len, "output_seq_len should be <= output_profile_max_len = {} in benchmarking mode".format(benchmarking_args.output_profile_max_len)
-                assert benchmarking_args.input_profile_max_len <= max_input_seq_len, "Model config restricts input_profile_max_len to <= {} in benchmarking mode".format(max_input_seq_len)
-                assert benchmarking_args.output_profile_max_len <= max_output_seq_len, "Model config restricts output_profile_max_len to <= {} in benchmarking mode".format(max_output_seq_len)
-
- self._setup_engines(metadata, hash_onnx_fpath, batch_size, args.num_beams, disable_preview_dynamic_shapes, benchmarking_args, seq_tag)
- inference_results = self.execute_inference(
- metadata, hash_onnx_fpath, None, timing_profile, batch_size, args.num_beams, True, benchmarking_args
- )
-
- finally:
- self.cleanup(workspace, keep_trt_engine, keep_onnx_model, keep_torch_model)
-
- return inference_results, ppl_results
-
- def add_args(self, parser) -> None:
- super().add_args(parser)
- polygraphy_group = parser.add_argument_group("polygraphy models")
- polygraphy_group.add_argument(
- "--onnx-decoder-fpath",
- default=None,
-            help="Path to the ONNX decoder. If None is supplied, the script will generate it from HuggingFace.",
- )
- polygraphy_group.add_argument(
- "--onnx-encoder-fpath",
- default=None,
-            help="Path to the ONNX encoder. If None is supplied, the script will generate it from HuggingFace.",
- )
-
- def args_to_network_models(self, args) -> List[NetworkModel]:
-        # Check that both flags are given; otherwise error out
- decoder_fpath_check = args.onnx_decoder_fpath is None
- encoder_fpath_check = args.onnx_encoder_fpath is None
-
- network_models = None
- if decoder_fpath_check and encoder_fpath_check:
- network_models = tuple()
- elif decoder_fpath_check or encoder_fpath_check:
- raise self._parser.error(
-                "Both --onnx-decoder-fpath and --onnx-encoder-fpath must be given; otherwise, provide neither so that the script can download them."
- )
- else:
- onnx_decoder = NetworkModel(
- name=BARTModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=args.onnx_decoder_fpath,
- )
- onnx_encoder = NetworkModel(
- name=BARTModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=args.onnx_encoder_fpath,
- )
- network_models = (onnx_decoder, onnx_encoder)
-
- return network_models
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- frameworks_parsed_metadata = self.frameworks_cmd.args_to_network_metadata(args)
-
- return NetworkMetadata(
- variant=frameworks_parsed_metadata.variant,
- precision=Precision(fp16=args.fp16),
- other=frameworks_parsed_metadata.other,
- )
-
-
-RUN_CMD = BARTTRT()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
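
The decoder profiles built above follow a single pattern: chain `Profile.add()` calls and reuse one shape dictionary per attention type, so every layer's `past_key_values` tensor gets identical min/opt/max bounds. A minimal, self-contained sketch of that pattern with Polygraphy is shown below; the batch size, beam width, head count, head size, and layer count are illustrative placeholders rather than values taken from the demo.

```python
# Sketch only: mirrors the profile-building pattern used by the removed BART runner.
# All sizes below are illustrative, not the demo's real configuration.
from polygraphy.backend.trt import Profile

batch_size, num_beams, num_heads, head_size, num_layers = 1, 2, 16, 64, 2
bs = batch_size * num_beams

profile = Profile()
profile.add("input_ids", min=(bs, 1), opt=(bs, 64), max=(bs, 128))

# One shape triple shared by every self-attention past_key_values tensor;
# the third (sequence) dimension grows from 0 as decoding proceeds.
self_attention_shapes = {
    "min": (bs, num_heads, 0, head_size),
    "opt": (bs, num_heads, 63, head_size),
    "max": (bs, num_heads, 127, head_size),
}
for i in range(num_layers):
    profile.add(f"past_key_values.{i}.decoder.key", **self_attention_shapes)
    profile.add(f"past_key_values.{i}.decoder.value", **self_attention_shapes)
```

`Profile.add()` returns the profile itself, which is why the removed code can reassign the result of each chained call.
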
diff --git a/demo/HuggingFace/CHANGELOG.md b/demo/HuggingFace/CHANGELOG.md
deleted file mode 100644
index 188e3f45..00000000
--- a/demo/HuggingFace/CHANGELOG.md
+++ /dev/null
@@ -1,78 +0,0 @@
-# HF-OSS Demo changelog
-
-Uses [changelog conventions](https://keepachangelog.com/en/1.0.0/).
-Uses [semantic versioning](https://semver.org/).
-
-## Guiding Principles
-- Changelogs are for humans, not machines.
-- There should be an entry for every single version.
-- The same types of changes should be grouped.
-- Versions and sections should be linkable.
-- The latest version comes first.
-- The release date of each version is displayed.
-- Mention whether you follow Semantic Versioning.
-
-## Types of changes
-- `Added` for new features.
-- `Changed` for changes in existing functionality.
-- `Deprecated` for soon-to-be removed features.
-- `Removed` for now removed features.
-- `Fixed` for any bug fixes.
-- `Security` in case of vulnerabilities.
-
-# [1.3.4] - 2023-02-02
-- Changed GPT2 demo kv cache TRT to 1 engine, 2 optimization profiles
-- Added fp16 support for GPT2
-
-# [1.3.3] - 2023-01-04
-- Deprecated the max workspace size flag in favor of memory pool limits for TensorRT
-- Added t5-11b support
-- Changed T5 demo kv cache TRT memory organization to avoid D2D copy
-
-# [1.3.2] - 2022-11-17
-- Added beam search support for GPT2 demo
-- Added KV cache support for GPT2 demo
-- Fixed perplexity calculation array size exceeding max_length
-- Fixed trt KV cache engine profile to only accept input_length = 1
-- Fixed external onnx weight file name overwrite issue
-
-# [1.3.1] - 2022-11-04
-- Added beam search support for T5 demo
-- Added KV cache support for T5 demo
-
-# [1.3.0] - 2022-11-03
-- Added perplexity calculation for all samples
-- Added precision override to checkpoints.
-- Fixed TensorRT BART checkpoint not working.
-
-# [1.2.5] - 2022-10-08
-- Added beam search support for BART
-
-# [1.2.4] - 2022-09-30
-- Added notebooks for BART demo
-- Enabled flexible control over (a) percentile latency reports and (b) engine-building profiles other than the standard maximum input/output length config
-
-# [1.2.3] - 2022-06-30
-- Added KV cache support for BART demo
-
-# [1.2.2] - 2022-06-14
-- Added BART demo
-
-# [1.2.1] - 2022-05-20
-
-- Added `benchmark` action to T5 frameworks/onnxrt and GPT2 frameworks/trt for performance benchmarking. It uses random
- inputs with fixed lengths and disables early stopping such that we can compare the performance with other frameworks.
-- Added `batch_size > 1` support to GPT2 trt sample.
-
-# [1.2.0] - 2022-03-29
-
-- Added `benchmark` action to T5 trt for performance benchmarking. It uses random inputs with fixed lengths and disables
- early stopping such that we can compare the performance with other frameworks.
-
-# [1.1.0] - 2022-02-09
-
-- Added `-o` or `--save-output-fpath` which saves a pickled version of the `NetworkResult` object. Useful for testing.
-
-# [1.0.0] - 2022
-
-- Added initial working example of HF samples and notebooks.
diff --git a/demo/HuggingFace/GPT2/GPT2ModelConfig.py b/demo/HuggingFace/GPT2/GPT2ModelConfig.py
deleted file mode 100644
index a0edca9f..00000000
--- a/demo/HuggingFace/GPT2/GPT2ModelConfig.py
+++ /dev/null
@@ -1,198 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import argparse
-
-from collections import namedtuple, OrderedDict
-from itertools import product
-from typing import Dict
-
-# TRT-HuggingFace
-from NNDF.networks import Precision, NetworkMetadata, NNConfig, Dims
-from NNDF.interface import MetadataArgparseInteropMixin
-
-# Limitation of namedtuples. You must declare namedtuples in module scope and not in classes.
-# Otherwise pickle doesn't work.
-# See: https://stackoverflow.com/questions/4677012/python-cant-pickle-type-x-attribute-lookup-failed
-_GPT2Metadata = namedtuple("GPT2Metadata", ["kv_cache"])
-
-
-class GPT2Metadata(_GPT2Metadata, MetadataArgparseInteropMixin):
- @staticmethod
- def add_args(parser: argparse.ArgumentParser) -> None:
- """Add commandline interface parser."""
- network_group = parser.add_argument_group("GPT2 network")
- network_group.add_argument(
- "--variant",
- help="GPT2 variant to generate",
- choices=GPT2ModelTRTConfig.TARGET_MODELS,
- required=True,
- )
- network_group.add_argument(
- "--enable-kv-cache",
- help="GPT2 enable KV cache",
- action="store_true",
- default=False,
- )
- network_group.add_argument(
- "--num-beams", type=int, default=1, help="Enables beam search during decoding."
- )
-
- network_group.add_argument(
- "--fp16", action="store_true", help="Enables fp16 TensorRT tactics."
- )
-
- @staticmethod
- def from_args(args: argparse.Namespace):
- return NetworkMetadata(
- variant=args.variant,
- precision=Precision(fp16=args.fp16),
- other=GPT2Metadata(kv_cache=args.enable_kv_cache),
- )
-
- @staticmethod
- def add_benchmarking_args(parser: argparse.ArgumentParser) -> None:
- benchmarking_group = parser.add_argument_group("benchmarking group")
- benchmarking_group.add_argument(
- "--input-seq-len",
- type=int,
- help="Specify fixed input sequence length for perf benchmarking.",
- )
- benchmarking_group.add_argument(
- "--output-seq-len",
- type=int,
- help="Specify fixed output sequence length for perf benchmarking.",
- )
-
-
-GPT2BenchmarkingArgs = namedtuple("GPT2BenchmarkingArgs", ["input_seq_len", "output_seq_len"])
-GPT2TRTBenchmarkingArgs = namedtuple("GPT2BenchmarkingArgs", ["input_seq_len", "output_seq_len", "input_profile_max_len", "output_profile_max_len"])
-
-
-class GPT2ModelTRTConfig(NNConfig):
- TARGET_MODELS = ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl", "EleutherAI/gpt-j-6B"]
- NETWORK_DECODER_SEGMENT_NAME = "gpt2_decoder"
- NETWORK_SEGMENTS = [NETWORK_DECODER_SEGMENT_NAME]
- NETWORK_FULL_NAME = "full"
-
- NUMBER_OF_LAYERS = {
- TARGET_MODELS[0]: 12,
- TARGET_MODELS[1]: 24,
- TARGET_MODELS[2]: 36,
- TARGET_MODELS[3]: 48,
- TARGET_MODELS[4]: 28,
- }
-
- # This corresponds to max_length in task_specific_params for text-generation.
-    # Neither the input nor the output length should exceed 50.
- MAX_LENGTH = {
- TARGET_MODELS[0]: 50,
- TARGET_MODELS[1]: 50,
- TARGET_MODELS[2]: 50,
- TARGET_MODELS[3]: 50,
- TARGET_MODELS[4]: 50,
- }
-
- MIN_OUTPUT_LENGTH = {
- TARGET_MODELS[0]: 0,
- TARGET_MODELS[1]: 0,
- TARGET_MODELS[2]: 0,
- TARGET_MODELS[3]: 0,
- TARGET_MODELS[4]: 0,
- }
-
- def __init__(self):
- precision_fp16 = [False, True]
- kv_caches = [False, True]
- variants = []
- for variant, fp16, kv_cache in product(
- GPT2ModelTRTConfig.TARGET_MODELS, precision_fp16, kv_caches
- ):
- variants.append(
- NetworkMetadata(
- variant=variant,
- precision=Precision(fp16=fp16),
- other=GPT2Metadata(kv_cache=kv_cache),
- )
- )
-
- super().__init__("GPT2", variants=variants)
-
- def get_python_requirements(self):
- base_requirements = super().get_python_requirements()
- base_requirements.append('transformers==4.20.0; python_version>="3.7"')
- base_requirements.append('transformers==4.18.0; python_version<"3.7"')
- return base_requirements
-
- def get_metadata_string(self, metadata: NetworkMetadata) -> str:
- # Remove redundant GPT2 name
- metadata = metadata._replace(variant=metadata.variant.lstrip("GPT2-"))
- metadata = metadata._replace(variant=metadata.variant.lstrip("EleutherAI/"))
- return super().get_metadata_string(metadata)
-
- @staticmethod
- def get_input_dims(metadata) -> Dict:
- """
- Returns dictionary encoding of input dimensions.
- Returns:
- (Dict[str, Dims]): {"decoder": Dims}
- """
- decoder_inputs_dict = OrderedDict({"input_ids": (Dims.BATCH, Dims.SEQUENCE)})
- if metadata.other.kv_cache:
-            # For the KV cache version, we need to add per-layer KV cache inputs. `past_key_values` at each layer is (self-attention K, self-attention V)
- for i in range(GPT2ModelTRTConfig.NUMBER_OF_LAYERS[metadata.variant]):
- # decoder self-attention KV cache (dim[0] & dim[2] are dynamic, and dim[2] varies at each decoding timestep)
- self_attention_past_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("past_decoder_length"), "embedding_size_per_head")
- decoder_inputs_dict[f"past_key_values.{i}.decoder.key"] = self_attention_past_kv_dims
- decoder_inputs_dict[f"past_key_values.{i}.decoder.value"] = self_attention_past_kv_dims
-
- decoder_inputs = Dims(decoder_inputs_dict)
-
- return {
- GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME: decoder_inputs
- }
-
- @staticmethod
- def get_output_dims(metadata) -> Dict:
- """
- Returns dictionary encoding of output dimensions.
-
- Returns:
- (Dict[str, Dims]): {"decoder": Dims, "encoder": Dims}
- """
- decoder_outputs_dict = OrderedDict(
- {
- "logits": (
- Dims.BATCH,
- Dims.SEQUENCE,
- "vocab_size"
- )
- }
- )
- if metadata.other.kv_cache:
-            # For the KV cache version, we need to add per-layer KV cache outputs. `present_key_values` at each layer is (self-attention K, self-attention V)
- for i in range(GPT2ModelTRTConfig.NUMBER_OF_LAYERS[metadata.variant]):
- # decoder self-attention KV cache (dim[0] & dim[2] are dynamic, and dim[2] varies at each decoding timestep)
- self_attention_present_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("decoder_length"), "embedding_size_per_head")
- decoder_outputs_dict[f"present_key_values.{i}.decoder.key"] = self_attention_present_kv_dims
- decoder_outputs_dict[f"present_key_values.{i}.decoder.value"] = self_attention_present_kv_dims
-
- decoder_outputs = Dims(decoder_outputs_dict)
-
- return {
- GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME: decoder_outputs
- }
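
The dimension encodings returned by `get_input_dims()` and `get_output_dims()` are ultimately turned into dynamic-axes declarations for the ONNX exporter: the batch and (past-)sequence dimensions vary, while `num_heads` and `embedding_size_per_head` stay fixed. A rough equivalent written directly against `torch.onnx.export`'s `dynamic_axes` argument might look as follows; the layer count is an illustrative placeholder for `NUMBER_OF_LAYERS[variant]`.

```python
# Sketch: how the per-layer KV-cache tensor naming maps onto ONNX dynamic axes.
# num_layers is a placeholder; the demo reads it from NUMBER_OF_LAYERS[variant].
num_layers = 2

dynamic_axes = {
    "input_ids": {0: "batch", 1: "sequence"},
    "logits": {0: "batch", 1: "sequence"},
}
for i in range(num_layers):
    for kv in ("key", "value"):
        # dim 0 (batch) and dim 2 (decoded length so far) are dynamic.
        dynamic_axes[f"past_key_values.{i}.decoder.{kv}"] = {0: "batch", 2: "past_decoder_length"}
        dynamic_axes[f"present_key_values.{i}.decoder.{kv}"] = {0: "batch", 2: "decoder_length"}
```
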
diff --git a/demo/HuggingFace/GPT2/checkpoint.toml b/demo/HuggingFace/GPT2/checkpoint.toml
deleted file mode 100644
index 4815250f..00000000
--- a/demo/HuggingFace/GPT2/checkpoint.toml
+++ /dev/null
@@ -1,108 +0,0 @@
-[GPT2.all.default.all.generate]
-
-input = '''
-TensorRT is a Deep Learning compiler used for deep learning.
-'''
-
-label = '''
-TensorRT is a Deep Learning compiler used for deep learning.\n\nThe main goal of the project is to create a tool that can be used to train neural networks.\n\nThe main goal of the project is to create a tool that can
-'''
-
-[GPT2.all.gpt2-medium.all.generate]
-
-label = '''
-TensorRT is a Deep Learning compiler used for deep learning.\n\nTensorRT is a Deep Learning compiler used for deep learning. TensorRT is a deep learning library for Python.\n\nTensorRT is a deep learning library for
-'''
-
-[GPT2.all.gpt2-large.all.generate]
-
-label = '''
-TensorRT is a Deep Learning compiler used for deep learning.\n\nTensorRT is a Deep Learning compiler used for deep learning. TensorFlow is a high-performance, open-source, cross-platform, high-performance, machine
-'''
-
-[GPT2.all.gpt2-xl.all.generate]
-
-label = '''
-TensorRT is a Deep Learning compiler used for deep learning.\n\nThe library is written in C++ and uses Boost.Python.\n\nThe library is available on GitHub.\n\nInstallation\n\nThe library is available on GitHub.\n
-'''
-
-[GPT2.all."EleutherAI/gpt-j-6B".all.generate]
-
-label = '''
-TensorRT is a Deep Learning compiler used for deep learning.\n\nTensorRT is a deep learning compiler that enables you to run deep learning models on NVIDIA GPUs.\n\nTensorRT is a deep learning compiler that enables you to run
-'''
-
-[GPT2.all.default.fp16.generate]
-
-label = '''
-TensorRT is a Deep Learning compiler used for deep learning.\n\nThe main goal of the project is to provide a way to build a deep learning framework that can be used to build a deep learning framework for a wide range of applications.\n
-'''
-
-[GPT2.all.default.all.generate_b]
-
-input = '''
-GPT-2 is a transformer based model pretrained on a large corpus.
-'''
-
-label = '''
-GPT-2 is a transformer based model pretrained on a large corpus.\n\nThe model is based on the following assumptions:\n\nThe model is based on the following assumptions:\n\nThe model is based on the following assumptions:\n
-'''
-
-[GPT2.all.gpt2-medium.all.generate_b]
-
-label = '''
-GPT-2 is a transformer based model pretrained on a large corpus.\n\nThe model is trained on a large corpus of data, and the model is trained on a large number of training examples. The model is trained on a large number
-'''
-
-[GPT2.all.gpt2-large.all.generate_b]
-
-label = '''
-GPT-2 is a transformer based model pretrained on a large corpus.\n\nThe model is trained on the following data:\n\nThe corpus consists of the following text files:\n\nThe corpus is split into two parts:\n\n
-'''
-
-[GPT2.all.gpt2-xl.all.generate_b]
-
-label = '''
-GPT-2 is a transformer based model pretrained on a large corpus.\n\nThe model is trained on the MNIST dataset, which contains over 100,000 handwritten digits. The training data is split into two parts: the training set and
-'''
-
-[GPT2.all."EleutherAI/gpt-j-6B".all.generate_b]
-
-label = '''
-GPT-2 is a transformer based model pretrained on a large corpus.\n\n- **GPT-2-PT**: The same as GPT-2 but with the pretrained model.\n\n- **
-'''
-
-[GPT2.all.default.all.generate_c]
-
-input = '''
-If I fall asleep then I am going to wake up in 8 hours.
-'''
-
-label = '''
-If I fall asleep then I am going to wake up in 8 hours.\n\nI am not going to sleep for 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours
-'''
-
-[GPT2.all.gpt2-medium.all.generate_c]
-
-label = '''
-If I fall asleep then I am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours
-'''
-
-[GPT2.all.gpt2-large.all.generate_c]
-
-label = '''
-If I fall asleep then I am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours
-'''
-
-[GPT2.all.gpt2-xl.all.generate_c]
-
-label = '''
-If I fall asleep then I am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours.\n\nI am going to wake up in 8 hours
-'''
-
-[GPT2.all."EleutherAI/gpt-j-6B".all.generate_c]
-
-label = '''
-If I fall asleep then I am going to wake up in 8 hours.\n\nI am going to be in the same place.\n\nI am going to be in the same place.\n\nI am going to be in the same place
-'''
-
diff --git a/demo/HuggingFace/GPT2/export.py b/demo/HuggingFace/GPT2/export.py
deleted file mode 100644
index cbd06964..00000000
--- a/demo/HuggingFace/GPT2/export.py
+++ /dev/null
@@ -1,258 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Contains logic that captures GPT2 HuggingFace models into ONNX models and TRT engines.
-"""
-
-from itertools import tee
-import os
-from collections import OrderedDict
-
-# tensorrt
-import tensorrt as trt
-
-# polygraphy
-from polygraphy.backend.trt import Profile
-
-# torch
-import torch
-from torch.nn import Module
-
-# # huggingface
-from transformers.generation_utils import GenerationMixin
-from transformers.modeling_outputs import CausalLMOutputWithPast
-from transformers import GPT2Tokenizer
-
-# TRT-HuggingFace
-from GPT2.GPT2ModelConfig import GPT2ModelTRTConfig
-from NNDF.networks import NetworkMetadata, Dims
-from NNDF.logger import G_LOGGER
-from NNDF.models import (
- TRTEngineFile,
- TorchModelFile,
- ONNXModelFile,
- ModelFileConverter,
-)
-
-class GPT2TorchFile(TorchModelFile):
- class TorchModule(Module, GenerationMixin):
- """
-        A simplified definition of GPT2 with an LM head.
- """
-
- def __init__(self, transformer, lm_head, config):
- super().__init__()
- self.transformer = transformer
- self.lm_head = lm_head
- self.config = config
- self.device = torch.device('cuda') # WAR to avoid beam search in framework
- self.main_input_name = "input_ids" # For better HuggingFace version compatibility
-
- def prepare_inputs_for_generation(self, input_ids, past = None, use_cache=None, **kwargs):
- # Todo (@pchadha): add position_ids, token_type_ids support
- # cut decoder_input_ids if past is used
- if past is not None:
- input_ids = input_ids[:, -1:]
-
- return {
- "input_ids": input_ids,
- "use_cache": use_cache,
- "past_key_values": past
- }
-
- def forward(self, input_ids, **kwargs):
- transformer_outputs = self.transformer(input_ids, **kwargs)
- hidden_states = transformer_outputs[0]
- lm_logits = self.lm_head(hidden_states)
-
- return CausalLMOutputWithPast(
- logits=lm_logits,
- past_key_values=transformer_outputs.past_key_values
- )
-
- def _reorder_cache(self, past, beam_idx):
- """
- This function is used to re-order the :obj:`past_key_values` cache if
- :meth:`~transformers.PreTrainedModel.beam_search` or :meth:`~transformers.PreTrainedModel.beam_sample` is
- called. This is required to match :obj:`past_key_values` with the correct beam_idx at every generation step.
- """
- return tuple(
- tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past)
- for layer_past in past
- )
-
- def __call__(self, *args, **kwargs):
- return self.forward(*args, **kwargs)
-
- def __init__(self, model, network_metadata):
- super().__init__(model, GPT2Converter, network_metadata)
-
-
-class GPT2ONNXFile(ONNXModelFile):
- def __init__(self, model, network_metadata):
- super().__init__(model, GPT2Converter, network_metadata)
-
-
-# TRT Engine File Encoding #
-class GPT2TRTEngine(TRTEngineFile):
- def __init__(self, model, network_metadata):
- super().__init__(model, GPT2Converter, network_metadata)
-
- def use_obey_precision_constraints(self):
- return self.network_metadata.precision.fp16
-
- def get_network_definition(self, network_definition):
-
- def pairwise(iterable):
- a, b = tee(iterable)
- next(b, None)
- return zip(a, b)
-
- indices = list(range(0, network_definition[1].num_layers))
- for i, i_next in pairwise(indices):
- l = network_definition[1].get_layer(i)
- l_next = network_definition[1].get_layer(i_next)
-
- if not all([l.get_output(i).is_execution_tensor for i in range(l.num_outputs)]):
- continue
-
- if l.get_output_type(0) != trt.float32:
- continue
-
- if l.type == trt.LayerType.ELEMENTWISE and l_next.type == trt.LayerType.REDUCE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.POW:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- l_next.precision = trt.float32
- l_next.set_output_type(0, trt.float32)
-
- if self.network_metadata.precision.fp16:
- for i in range(network_definition[1].num_inputs):
- t = network_definition[1].get_input(i)
- if t.dtype == trt.float32:
- t.dtype = trt.float16
-
- for i in range(network_definition[1].num_outputs):
- t = network_definition[1].get_output(i)
- if t.dtype == trt.float32:
- t.dtype = trt.float16
-
- return network_definition
-
-# Converters
-class GPT2Converter(ModelFileConverter):
- def __init__(self):
- super().__init__(GPT2TorchFile, GPT2ONNXFile, GPT2TRTEngine)
-
- def torch_to_onnx(
- self, output_fpath: str, model: Module, network_metadata: NetworkMetadata
- ):
- """
- Exports a GPT2LMHead model to ONNX.
-
- Args:
-            output_fpath (str): Path to the ONNX file
-            model (torch.nn.Module): Loaded torch model
-
- Returns:
- GPT2ONNXFile: ONNX GPT2 decoder object.
- """
- # Currently does not support exporting GPU models to onnx.
- device = model.device
- tokenizer = GPT2Tokenizer.from_pretrained(network_metadata.variant)
- input_ids = torch.tensor(
- [
- tokenizer.encode(
- "Here is some text to encode Hello World", add_special_tokens=True
- )
- ]
- ).to(device)
-
- gpt2_model = GPT2TorchFile.TorchModule(
- model.transformer, model.lm_head, model.config
- )
-
- inputs = GPT2ModelTRTConfig.get_input_dims(network_metadata)[
- GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME
- ]
- outputs = GPT2ModelTRTConfig.get_output_dims(network_metadata)[
- GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME
- ]
-
- # Exports to ONNX
- opt_args={}
-
- version_major = int((torch.__version__).split('.')[0])
- version_minor = int((torch.__version__).split('.')[1])
- if version_major < 1 or (version_major == 1 and version_minor < 11):
- opt_args['use_external_data_format'] = True
- if not network_metadata.other.kv_cache:
-            # This wrapper lets the HuggingFace-compatible torch class work with the ONNX exporter.
-            # It restricts the number of outputs to 1 when non-KV-cache mode is used;
-            # otherwise the model would automatically output key/value pairs as well.
- old_forward = gpt2_model.forward
- def _export_forward(input_ids, **kwargs):
- result = old_forward(input_ids, use_cache = False, **kwargs)
- return result[0]
- gpt2_model.forward = _export_forward
-
- torch.onnx.export(
- gpt2_model,
- input_ids,
- output_fpath,
- opset_version=13,
- do_constant_folding=True,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
- else:
- decoder_output = gpt2_model(input_ids, use_cache = True)
- past_key_values = decoder_output[1]
-
- # Exporting the kv cache engine
- old_forward = gpt2_model.forward
- def _export_forward(input_ids, past_key_values, **kwargs):
- result = old_forward(input_ids, past_key_values=past_key_values, use_cache=True, **kwargs)
- return (result[0], result[1])
- gpt2_model.forward = _export_forward
-
- torch.onnx.export(
- gpt2_model,
- (input_ids, past_key_values),
- output_fpath,
- opset_version=13,
- do_constant_folding=True,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- return GPT2ONNXFile(output_fpath, network_metadata)
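
The exporter above works by temporarily replacing the model's `forward` with a thin wrapper so that the traced graph has a fixed output signature: logits only in non-KV-cache mode, logits plus present key/values in KV-cache mode. A generic sketch of that wrapping trick for the non-KV-cache path is shown below; the helper name and argument names are illustrative and not part of the demo's API.

```python
# Sketch of the forward-wrapping trick used above, for the non-KV-cache export path.
# `model` is any HuggingFace-style causal-LM module; names here are illustrative.
import torch

def export_without_cache(model, sample_input_ids, output_fpath):
    original_forward = model.forward

    def _export_forward(input_ids, **kwargs):
        # Force use_cache=False and keep only the logits so the exported
        # ONNX graph has a single, stable output.
        return original_forward(input_ids, use_cache=False, **kwargs)[0]

    model.forward = _export_forward
    try:
        torch.onnx.export(
            model,
            sample_input_ids,
            output_fpath,
            opset_version=13,
            do_constant_folding=True,
            input_names=["input_ids"],
            output_names=["logits"],
            dynamic_axes={
                "input_ids": {0: "batch", 1: "sequence"},
                "logits": {0: "batch", 1: "sequence"},
            },
        )
    finally:
        model.forward = original_forward  # restore the real forward afterwards
```
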
diff --git a/demo/HuggingFace/GPT2/frameworks.py b/demo/HuggingFace/GPT2/frameworks.py
deleted file mode 100644
index d430a056..00000000
--- a/demo/HuggingFace/GPT2/frameworks.py
+++ /dev/null
@@ -1,318 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import os
-import sys
-import argparse
-
-from typing import List, Union
-
-# huggingface
-from transformers import (
- AutoConfig,
- AutoModelForCausalLM,
- # GPT-J uses GPT2 tokenizer
- GPT2Tokenizer,
-)
-
-# torch
-import torch
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# helpers
-from NNDF.interface import FrameworkCommand
-from NNDF.general_utils import confirm_folder_delete, NNFolderWorkspace
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkResult,
- NetworkMetadata,
- NetworkRuntime,
- Precision,
- NetworkModel,
- NetworkModels,
- TimingProfile,
-)
-from GPT2.export import GPT2TorchFile
-from GPT2.GPT2ModelConfig import GPT2ModelTRTConfig, GPT2BenchmarkingArgs
-from GPT2.measurements import gpt2_inference, full_inference, calculate_perplexity
-
-
-class GPT2HuggingFace(FrameworkCommand):
- def __init__(self):
- super().__init__(
- GPT2ModelTRTConfig, description="Runs framework results for GPT2 model."
- )
-
- # Default inference input used during inference stage
- self.onnx_gpt2 = None
- self.torch_gpt2_dir = None
-
- def generate_and_download_framework(
- self, metadata: NetworkMetadata, workspace: NNFolderWorkspace
- ) -> NetworkModels:
-
- trt_gpt2_config = self.config
- metadata_serialized = trt_gpt2_config.get_metadata_string(metadata)
- workspace_dir, _ , onnx_root = workspace.set_model_path(metadata_serialized, is_encoder_decoder = False)
- pytorch_model_dir = os.path.join(workspace_dir, "pytorch_model")
- # We keep track of the generated torch location for cleanup later
- self.torch_gpt2_dir = pytorch_model_dir
-
- if not os.path.exists(pytorch_model_dir):
- # Generate the pre-trained weights
- model = AutoModelForCausalLM.from_pretrained(metadata.variant, use_cache = metadata.other.kv_cache)
- model.save_pretrained(pytorch_model_dir)
- print("Pytorch Model saved to {}".format(pytorch_model_dir))
- else:
- print(
- "Frameworks file already exists, skipping generation and loading from file instead."
- )
- model = AutoModelForCausalLM.from_pretrained(pytorch_model_dir)
-
- onnx_model_fpath = os.path.join(onnx_root, metadata_serialized + ".onnx")
-
- gpt2 = GPT2TorchFile(model, metadata)
- self.onnx_gpt2 = gpt2.as_onnx_model(onnx_model_fpath, force_overwrite=False)
-
- onnx_models = [
- NetworkModel(
- name=GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=self.onnx_gpt2.fpath,
- )
- ]
- torch_models = [
- NetworkModel(
- name=GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=pytorch_model_dir,
- )
- ]
-
- return NetworkModels(torch=torch_models, onnx=onnx_models, trt=None)
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- save_onnx_model: bool = True,
- keep_pytorch_model: bool = True,
- ) -> None:
- """
-        Cleans up the working directory, leaving the models in place if requested.
-        Should not assume that any functions from the framework class have been called.
- Returns:
- None
- """
- # Clean-up generated files
- if not save_onnx_model and self.onnx_gpt2 is not None:
- self.onnx_gpt2.cleanup()
-
- if not keep_pytorch_model:
- # Using rmtree can be dangerous, have user confirm before deleting.
- confirm_folder_delete(
- self.torch_gpt2_dir,
- prompt="Confirm you want to delete downloaded pytorch model folder?",
- )
-
- if not keep_pytorch_model and not save_onnx_model:
- workspace.cleanup(force_remove=False)
-
- def setup_tokenizer_and_model(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- ):
- tokenizer = GPT2Tokenizer.from_pretrained(metadata.variant)
-
-        # GPT2 has no pad token set by default. Use a custom token; only "generate()" will
-        # automatically replace it with the EOS token when running in generation mode.
- tokenizer.add_special_tokens({"pad_token": "[PAD]"})
-
- # By default, HuggingFace model structure is one giant file.
- gpt2_torch_fpath = network_fpaths.torch[0].fpath
- gpt2_model = AutoModelForCausalLM.from_pretrained(gpt2_torch_fpath)
-
- # Framework fp16 does not support cpu mode for GPT2
- if metadata.precision.fp16:
- gpt2_model = gpt2_model.cuda().half()
-
- gpt2_torch = GPT2TorchFile.TorchModule(
- gpt2_model.transformer, gpt2_model.lm_head, gpt2_model.config
- )
-
- return tokenizer, gpt2_torch
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- inference_input: str,
- timing_profile: TimingProfile,
- use_cpu: bool,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: GPT2BenchmarkingArgs = None,
- ) -> Union[NetworkResult, BenchmarkingResult]:
-
- tokenizer, gpt2_torch = self.setup_tokenizer_and_model(metadata, network_fpaths)
- config = gpt2_torch.config
-        # Prepare the input tokens and determine the output sequence length.
- if not benchmarking_mode:
- output_seq_len = GPT2ModelTRTConfig.MAX_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- input_seq_len = benchmarking_args.input_seq_len
- output_seq_len = benchmarking_args.output_seq_len
- input_ids = torch.randint(0, config.vocab_size, (batch_size, input_seq_len))
-
- # get single decoder iteration inference timing profile
- _, decoder_e2e_time = gpt2_inference(
- gpt2_torch,
- input_ids,
- timing_profile,
- use_cuda=(not use_cpu),
- use_cache = metadata.other.kv_cache,
- )
-
- # get complete decoder inference result and its timing profile
- sample_output, full_e2e_runtime = full_inference(
- gpt2_torch,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=GPT2ModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- use_cuda=(not use_cpu),
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- num_beams=num_beams
- )
-
- # Prepare runtime results.
- runtime = [
- NetworkRuntime(
- name=GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=GPT2ModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=network_fpaths)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- sample_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=sample_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=network_fpaths,
- )
-
- def execute_calculate_perplexity(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- reference: str,
- ):
- tokenizer, gpt2_torch = self.setup_tokenizer_and_model(metadata, network_fpaths)
- reference = reference.replace("\\n", "\n")
- ppl_input_ids = tokenizer([reference], padding=True, return_tensors="pt").input_ids
- perplexity = calculate_perplexity(
- gpt2_torch, ppl_input_ids, GPT2ModelTRTConfig.MAX_LENGTH[metadata.variant]
- )
-
- return perplexity
-
- def run_framework(
- self,
- metadata: NetworkMetadata,
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_pytorch_model: bool,
- timing_profile: TimingProfile,
- use_cpu: bool = False,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
-
- """
-        Main entry point: generates the framework models and runs inference over them.
- """
- inference_results = []
- ppl_results = []
- workspace = NNFolderWorkspace(
- self.config.network_name, metadata, working_directory
- )
- try:
- network_fpaths = self.generate_and_download_framework(metadata, workspace)
- if not benchmarking_mode:
- for ninput in network_input:
- inference_results.append(
- self.execute_inference(
- metadata, network_fpaths, ninput, timing_profile, use_cpu, batch_size, args.num_beams
- )
- )
- if perplexity_reference is not None:
- for r in perplexity_reference:
- ppl_results.append(
- self.execute_calculate_perplexity(
- metadata, network_fpaths, r
- )
- )
- else:
- benchmarking_args = GPT2BenchmarkingArgs(args.input_seq_len, args.output_seq_len)
- inference_results = self.execute_inference(
- metadata, network_fpaths, None, timing_profile, use_cpu, batch_size, args.num_beams, True, benchmarking_args
- )
- finally:
- self.cleanup(workspace, keep_onnx_model, keep_pytorch_model)
-
- return inference_results, ppl_results
-
- def args_to_network_metadata(self, args: argparse.Namespace) -> NetworkMetadata:
- return NetworkMetadata(
- variant=args.variant,
- precision=Precision(fp16=args.fp16),
- other=self.config.MetadataClass(kv_cache=args.enable_kv_cache),
- )
-
-
-# Entry point
-RUN_CMD = GPT2HuggingFace()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/GPT2/measurements.py b/demo/HuggingFace/GPT2/measurements.py
deleted file mode 100644
index f783f872..00000000
--- a/demo/HuggingFace/GPT2/measurements.py
+++ /dev/null
@@ -1,99 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Utils specific to GPT2 network.
-"""
-
-# torch
-import torch
-
-
-# from HuggingFace transformers
-from transformers.generation_logits_process import (
- MinLengthLogitsProcessor,
- LogitsProcessorList,
- ForcedEOSTokenLogitsProcessor,
-)
-from transformers.generation_stopping_criteria import (
- MaxLengthCriteria,
- StoppingCriteriaList,
-)
-
-# TRT-HuggingFace
-from NNDF.general_utils import measure_python_inference_code
-from NNDF.torch_utils import use_cuda
-from NNDF.tensorrt_utils import TRTNativeRunner
-
-@use_cuda
-def gpt2_inference(gpt2, input_ids, timing_profile, use_cuda=True, use_cache=False, past_key_values = None):
- gpt2_stmt = lambda: gpt2(input_ids=input_ids, use_cache=use_cache, past_key_values=past_key_values)
- gpt2_e2e_time = measure_python_inference_code(gpt2_stmt, timing_profile)
- return (gpt2_stmt(), gpt2_e2e_time)
-
-
-# Code specifically for Pythonic inference measurement used across all GPT2 related scripts
-@use_cuda
-def full_inference(
- gpt2,
- input_ids,
- tokenizer,
- timing_profile,
- max_length,
- min_length = 0,
- use_cuda=True,
- batch_size=1,
- early_stopping=False,
- use_cache=False,
- num_beams = 1,
-):
-
- if isinstance(gpt2, TRTNativeRunner):
- gpt2.set_return_device("cuda" if use_cuda else "cpu")
-
- def _e2e():
- with torch.no_grad():
- output = gpt2.generate(
- input_ids,
- max_length=max_length,
- min_length=min_length,
- batch_size=batch_size,
- num_beams=num_beams,
- use_cache=use_cache,
- early_stopping=early_stopping
- )
-
- return output
-
- full_e2e_time = measure_python_inference_code(_e2e, timing_profile)
- return (_e2e(), full_e2e_time)
-
-
-@use_cuda
-def calculate_perplexity(gpt2, input_ids, max_seq_len=None, use_cuda=True):
- if isinstance(gpt2, TRTNativeRunner):
- gpt2.set_return_device("cuda" if use_cuda else "cpu")
-
- with torch.no_grad():
- if max_seq_len is not None:
- input_ids = input_ids[:, :max_seq_len]
- logits = gpt2(input_ids).logits
- # Shift logits and target ids so that probabilities generated by token < n line up with output token n.
- shifted_logits = logits[:, :-1, :]
- target_ids = input_ids[:, 1:]
- loss = torch.nn.CrossEntropyLoss()(shifted_logits.permute((0, 2, 1)), target_ids)
- return torch.exp(loss).item()
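
The perplexity helper above relies on a one-token shift: the logits produced at position n-1 are scored against the token at position n, and perplexity is the exponential of the mean cross-entropy over those pairs. A tiny self-contained sketch of the same computation, with random tensors standing in for real model outputs, is:

```python
# Sketch of the shifted-logits perplexity computation, using random stand-in data.
import torch

batch, seq_len, vocab = 1, 8, 50257
logits = torch.randn(batch, seq_len, vocab)            # model scores per position
input_ids = torch.randint(0, vocab, (batch, seq_len))

shifted_logits = logits[:, :-1, :]                     # prediction for token n comes from position n-1
target_ids = input_ids[:, 1:]                          # token n is the target
loss = torch.nn.CrossEntropyLoss()(shifted_logits.permute(0, 2, 1), target_ids)
perplexity = torch.exp(loss).item()
```
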
diff --git a/demo/HuggingFace/GPT2/trt.py b/demo/HuggingFace/GPT2/trt.py
deleted file mode 100644
index 411eb72c..00000000
--- a/demo/HuggingFace/GPT2/trt.py
+++ /dev/null
@@ -1,757 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import os
-import sys
-import copy
-from typing import Dict, List, Tuple, Union
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# numpy
-import numpy as np
-
-# polygraphy
-from polygraphy.backend.trt import Profile
-
-# torch
-import torch
-
-# huggingface
-from transformers import GPT2Tokenizer, AutoConfig
-from transformers.modeling_outputs import CausalLMOutputWithPast
-from transformers.configuration_utils import PretrainedConfig
-from transformers.generation_utils import GenerationMixin
-
-# tensorrt
-from tensorrt import PreviewFeature
-
-# TRT-HuggingFace
-from NNDF.interface import TRTInferenceCommand
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkMetadata,
- NetworkModels,
- NetworkModel,
- NetworkResult,
- NetworkRuntime,
- Precision,
- TimingProfile,
-)
-
-from NNDF.tensorrt_utils import TRTNativeRunner, TRTPolygraphyRunner, set_kv_data, allocate_binding_buffer, setup_benchmark_arg
-from NNDF.torch_utils import expand_inputs_for_beam_search
-from GPT2.frameworks import GPT2HuggingFace
-from NNDF.general_utils import NNFolderWorkspace
-from GPT2.GPT2ModelConfig import GPT2ModelTRTConfig, GPT2BenchmarkingArgs, GPT2TRTBenchmarkingArgs
-from GPT2.measurements import gpt2_inference, full_inference, calculate_perplexity
-from GPT2.export import GPT2ONNXFile, GPT2TRTEngine
-from NNDF.models import TRTEngineFile
-from NNDF.logger import G_LOGGER
-
-class TRTHFRunner(TRTNativeRunner, GenerationMixin):
-    """Runner that adds interop support for HF and HF-provided greedy_search functions."""
-
- # Stores the encoder input length received at runtime, which is used to slice decoder inputs.
- ENCODER_LENGTH = 0
- def _allocate_memory(self,
- input_shapes: Dict[str, tuple],
- input_types: Dict[str, torch.dtype],
- output_shapes: Dict[str, tuple],
- output_types: Dict[str, torch.dtype]):
- """Helper function for binding several inputs at once and pre-allocating the results."""
- # Allocate memories as 1D linear buffers for simpler handling of dynamic shapes.
- self.inputs = allocate_binding_buffer(input_types, input_shapes)
- self.outputs = allocate_binding_buffer(output_types, output_shapes)
-
- bindings = [None] * self.trt_engine.num_bindings
-
- for input_name, input_array in self.inputs.items():
- # Allocate memory for inputs
- input_idx = self.trt_engine.get_binding_index(input_name)
- self.trt_context.set_binding_shape(input_idx, input_shapes[input_name])
- bindings[input_idx] = input_array.data_ptr()
-
- assert self.trt_context.all_binding_shapes_specified
-
- for output_name, output_array in self.outputs.items():
- # Output shape should be allocated from context size
- output_idx = self.trt_engine.get_binding_index(output_name)
- bindings[output_idx] = output_array.data_ptr()
-
- return bindings
-
- def __init__(
- self,
- trt_engine_file: TRTEngineFile,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1
- ):
- super().__init__(trt_engine_file, network_metadata)
- self.config = hf_config
- self.batch_size = batch_size
-
-class GPT2TRTDecoder(TRTHFRunner):
- def __init__(
- self,
- trt_engine_file: str,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_args: GPT2BenchmarkingArgs = None
- ):
- super().__init__(trt_engine_file, network_metadata, hf_config, batch_size = batch_size)
- self.network_metadata = network_metadata
- self.data_type = torch.float32 if not network_metadata.precision.fp16 else torch.float16
-        # In benchmarking mode, if input_profile_max_len is provided, it should be used as the maximum input length.
- if benchmarking_args is not None:
- if benchmarking_args.input_profile_max_len is not None:
- self.max_input_length = benchmarking_args.input_profile_max_len
- else:
- self.max_input_length = hf_config.n_positions
-        # In non-benchmarking mode, we are given a text generation task, so max_length is used as the maximum sequence length.
- else:
- self.max_sequence_length = GPT2ModelTRTConfig.MAX_LENGTH[network_metadata.variant]
-
- # Similarly, the max_output_length should be the user-provided output_profile_max_len if provided
- if benchmarking_args is not None and benchmarking_args.output_profile_max_len is not None:
- self.max_output_length = benchmarking_args.output_profile_max_len
- else:
- self.max_output_length = self.max_sequence_length
-
- self.main_input_name = "input_ids"
- self.num_heads = self.config.n_head
- self.embedding_size_per_head = self.config.n_embd // self.num_heads
- self.num_decoder_layers = self.config.n_layer
-
- self.profile_idx = 0
- self.bindings = [0] * self.trt_engine.num_bindings
- self.logits = torch.zeros((self.batch_size * num_beams, self.max_output_length, hf_config.vocab_size), dtype = self.data_type).cuda()
- self.bindings[self.trt_engine.get_binding_index("logits")] = self.logits.data_ptr()
- # This will be used to calculate the offset for each binding
- self.num_bindings = self.trt_engine.num_bindings // 2 if self.config.use_cache else self.trt_engine.num_bindings
-
- if self.config.use_cache:
- self.bindings[self.trt_engine.get_binding_index("logits") + self.num_bindings] = self.logits.data_ptr()
-
-        # Pointing the input and output bindings at the same buffer does not work for GPT2. Separate caches are needed, and the buffer addresses are swapped after each iteration.
- self.self_attention_cache_1 = {}
- self.self_attention_cache_2 = {}
-
- self_attention_kv_shape = (self.batch_size * num_beams, self.num_heads, self.max_output_length - 1, self.embedding_size_per_head)
-
- # Set kv cache shape and type
- for i in range(self.num_decoder_layers):
- for code in ["key", "value"]:
-
- self_attention_name = f"key_values.{i}.decoder.{code}"
- kv_buffer_1 = torch.zeros(self_attention_kv_shape, dtype = self.data_type).cuda()
- kv_buffer_2 = torch.zeros(self_attention_kv_shape, dtype = self.data_type).cuda()
- self.self_attention_cache_1[self_attention_name] = kv_buffer_1
- self.self_attention_cache_2[self_attention_name] = kv_buffer_2
-
- input_idx = self.trt_engine.get_binding_index("past_" + self_attention_name)
- output_idx = self.trt_engine.get_binding_index("present_" + self_attention_name)
-
- self.bindings[input_idx] = kv_buffer_1.data_ptr() # Generation phase
- self.bindings[output_idx] = kv_buffer_2.data_ptr()
-
- # Context mode will always use buffer 1 as output
- self.bindings[input_idx + self.num_bindings] = 0 # Context phase, should be 0
- self.bindings[output_idx + self.num_bindings] = kv_buffer_1.data_ptr()
-
- self.kv_cache_binding_offset = 1 # 0: input_ids, kv cache input indices start from 1
- self.past_decoder_length = 0
- self.use_cache_1_as_input = True
- self._set_context_mode_trt_context()
-
- self.context_mode = self.config.use_cache
- self.return_device = torch.device('cuda')
- self.device = torch.device('cuda')
-
- def reset(self):
- '''
- Resets the input specific fields after finishing a task.
- '''
- self.context_mode = self.config.use_cache
-
- def _switch_input_output_binding(self):
- '''
-        For kv cache mode, switch the input and output pointers to avoid data concurrency issues and D2D copies.
- '''
-        # In context mode the output goes to cache 1; when cache 1 is also used as input, there is no need to switch bindings.
- if not (self.use_cache_1_as_input and self.context_mode):
- for i in range(self.num_decoder_layers):
- for code in ["key", "value"]:
- self_attention_name = f"key_values.{i}.decoder.{code}"
- input_idx = self.trt_engine.get_binding_index("past_" + self_attention_name)
- output_idx = self.trt_engine.get_binding_index("present_" + self_attention_name)
-
- # Switch generation mode kv cache bindings
- temp = self.bindings[output_idx]
- self.bindings[output_idx] = self.bindings[input_idx]
- self.bindings[input_idx] = temp
- self.use_cache_1_as_input = not self.use_cache_1_as_input
-
- def prepare_inputs_for_generation(self, input_ids, past = None, use_cache = None, **kwargs):
- # TODO: add position_ids, token_type_ids support
- if past is not None:
- input_ids = input_ids[:, -1:]
- self.context_mode = False
- else:
- self.context_mode = self.config.use_cache
-
- return {
- "input_ids": input_ids,
- "past_key_values": past,
- "use_cache": use_cache,
- }
-
- def set_return_device(self, return_device):
- """
-        Sets the device that returned tensors are moved to via to(). Device names should match torch device names: cuda, cpu, etc.
- This is used in our measurement code.
- """
- self.return_device = return_device
-
- def _reorder_cache(self, past: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor) -> Tuple[Tuple[torch.Tensor]]:
- """
- This function is used to re-order the :obj:`past_key_values` cache if
- :meth:`~transformers.PreTrainedModel.beam_search` or :meth:`~transformers.PreTrainedModel.beam_sample` is
- called. This is required to match :obj:`past_key_values` with the correct beam_idx at every generation step.
- """
- return tuple(
- tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past)
- for layer_past in past
- )
-
- def _set_context_mode_trt_context(self):
- # Create TRT context for context mode (1st decoder run) with optimization profile = 1
- self.context_trt_context = self.trt_engine.create_execution_context()
- self.context_trt_context.active_optimization_profile = 1
-
-
- def forward(self, input_ids, *args, **kwargs):
- bs = input_ids.shape[0]
- input_length = input_ids.shape[1]
-
-        # Check whether the input data is on the CPU (which usually means PyTorch does not support the current GPU).
- is_cpu_mode = (input_ids.device == torch.device("cpu")) or (self.return_device == "cpu")
-
- if is_cpu_mode:
- input_ids = input_ids.int().cuda()
-
- # Set the binding shape of input_ids, which should be (bs, input_length).
- if not self.context_mode:
- self.bindings[0] = input_ids.int().data_ptr()
- self.trt_context.set_binding_shape(0, input_ids.shape)
- else:
- self.bindings[self.num_bindings] = input_ids.int().data_ptr()
- self.context_trt_context.set_binding_shape(self.num_bindings, input_ids.shape)
-
- if self.config.use_cache:
- if self.context_mode:
- self.past_decoder_length = 0
-
- self_attention_kv_shape = (bs, self.num_heads, self.past_decoder_length, self.embedding_size_per_head)
-
- for i in range(self.num_decoder_layers):
- if not self.context_mode:
-                    # Optimization Profile 0 is the generation phase, which takes KV cache inputs
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+2*i, self_attention_kv_shape)
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+2*i + 1, self_attention_kv_shape)
- else:
-                    # Optimization Profile 1 is the context phase, which takes zero-length KV cache inputs
- self.context_trt_context.set_binding_shape(self.kv_cache_binding_offset+2*i + self.num_bindings, self_attention_kv_shape)
- self.context_trt_context.set_binding_shape(self.kv_cache_binding_offset+2*i + 1 + self.num_bindings, self_attention_kv_shape)
-
- # Launch TRT inference.
- if not self.context_mode:
- assert self.trt_context.all_binding_shapes_specified
- self.trt_context.execute_v2(bindings=self.bindings)
- else:
- assert self.context_trt_context.all_binding_shapes_specified
- self.context_trt_context.execute_v2(bindings=self.bindings)
-
-        # For bs > 1, this is required, so this D2D copy cannot be avoided.
- logits_length = bs * input_length * self.config.vocab_size
- logits = self.logits.flatten()[:logits_length].view(bs, input_length, self.config.vocab_size)
-
- if is_cpu_mode:
- logits = logits.cpu()
-
- present_key_values = None
- if self.config.use_cache:
- self.past_decoder_length += input_length
-
- present_key_values = ()
- self_attention_cache = self.self_attention_cache_1 if self.use_cache_1_as_input or (self.profile_idx == 0) else self.self_attention_cache_2
-
- for i in range(self.num_decoder_layers):
-
- self_attention_k_output = self_attention_cache[f"key_values.{i}.decoder.key"]
- self_attention_v_output = self_attention_cache[f"key_values.{i}.decoder.value"]
-
- if is_cpu_mode:
- self_attention_k_output = self_attention_k_output.cpu()
- self_attention_v_output = self_attention_v_output.cpu()
-
- present_key_values += ((self_attention_k_output, self_attention_v_output),)
-
- self._switch_input_output_binding()
- return CausalLMOutputWithPast(logits=logits.to(self.return_device), past_key_values = present_key_values)
-
-class GPT2TRT(TRTInferenceCommand):
- def __init__(self):
- super().__init__(
- GPT2ModelTRTConfig, "Runs polygraphy results for GPT2 model.", GPT2HuggingFace
- )
- self.gpt2_trt = None
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_trt_engine: bool = False,
- keep_onnx_model: bool = False,
- keep_torch_model: bool = False,
- ) -> None:
- # Deactivates context
- if self.gpt2_trt is not None:
- self.gpt2_trt.release()
-
- if not keep_trt_engine:
- self.gpt2_trt_engine.cleanup()
-
- self.frameworks_cmd.cleanup(workspace, keep_onnx_model, keep_torch_model)
-
- def generate(
- self,
- input_ids,
- min_length: int = None,
- max_length: int = None,
- num_beams: int = 1,
- use_cache: bool = False,
- early_stopping: bool = True,
- ):
- if max_length is None:
- max_length = GPT2ModelTRTConfig.MAX_OUTPUT_LENGTH[self.metadata.variant]
-
- if min_length is None:
- min_length = GPT2ModelTRTConfig.MIN_OUTPUT_LENGTH[self.metadata.variant]
-
- output = self.gpt2_trt.generate(
- input_ids,
- max_length=max_length,
- min_length=min_length,
- num_beams=num_beams,
- use_cache=use_cache,
- early_stopping=early_stopping
- )
-
- self.gpt2_trt.reset()
- return output
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Dict[str, NetworkModel],
- inference_input: str,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: GPT2TRTBenchmarkingArgs = None,
- ) -> Union[NetworkResult, BenchmarkingResult]:
-
- tokenizer = GPT2Tokenizer.from_pretrained(metadata.variant)
-
-        # GPT2 does not have a pad token set by default, so add a custom one.
-        # Only generate() will automatically replace it with the EOS token in generation mode.
- tokenizer.add_special_tokens({"pad_token": "[PAD]"})
- hf_config = self.gpt2_trt.config
-
- # Prepare the input tokens and find out output sequence length.
- if not benchmarking_mode:
- output_seq_len = GPT2ModelTRTConfig.MAX_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, return_tensors="pt").input_ids
- else:
- input_seq_len = benchmarking_args.input_seq_len
- output_seq_len = benchmarking_args.output_seq_len
- input_ids = torch.randint(0, hf_config.vocab_size, (batch_size, input_seq_len))
-
- # get single decoder iteration inference timing profile
- _, decoder_e2e_time = gpt2_inference(
- self.gpt2_trt,
- expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids,
- timing_profile,
- use_cache = metadata.other.kv_cache,
- )
-
- # get complete decoder inference result and its timing profile
- sample_output, full_e2e_runtime = full_inference(
- self.gpt2_trt,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=GPT2ModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- num_beams=num_beams,
- )
-
- # Prepare runtime results.
- runtime = [
- NetworkRuntime(
- name=GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=GPT2ModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
- models = NetworkModels(
- torch=None,
- onnx=list(onnx_fpaths.values()),
- trt=[
- NetworkModel(
- name=GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=self.gpt2_trt_engine.fpath,
- ),
- ],
- )
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=models)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- sample_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=sample_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=models,
- )
-
- def execute_calculate_perplexity(
- self,
- metadata: NetworkMetadata,
- reference: str,
- batch_size: int,
- ):
- tokenizer = GPT2Tokenizer.from_pretrained(metadata.variant)
-
-        # GPT2 does not have a pad token set by default, so add a custom one.
-        # Only generate() will automatically replace it with the EOS token in generation mode.
- tokenizer.add_special_tokens({"pad_token": "[PAD]"})
- reference = reference.replace("\\n", "\n")
- ppl_input_ids = tokenizer([reference] * batch_size, padding=False, return_tensors="pt").input_ids
-
- perplexity = calculate_perplexity(
- self.gpt2_trt, ppl_input_ids, GPT2ModelTRTConfig.MAX_LENGTH[metadata.variant]
- )
- return perplexity
-
- def _setup_engines(
- self,
- metadata: NetworkMetadata,
- hash_onnx_fpath: Dict[str, NetworkModel],
- batch_size: int,
- num_beams: int,
- disable_preview_dynamic_shapes: bool,
- benchmarking_args: GPT2TRTBenchmarkingArgs = None,
- seq_tag: bool = False, # whether the benchmark engine tag format should be seq or max
- ) -> None:
-
- hf_config = AutoConfig.from_pretrained(
- metadata.variant,
- use_cache=metadata.other.kv_cache
- )
-
-        # The number of exported ONNX segments must match the network segments defined by the configuration file.
- assert len(hash_onnx_fpath) == len(
- GPT2ModelTRTConfig.NETWORK_SEGMENTS
- ), "There should only be {} exported ONNX segments in GPT2 model.".format(
- len(GPT2ModelTRTConfig.NETWORK_SEGMENTS)
- )
-
- decoder_onnx_fpath = hash_onnx_fpath[
- GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME
- ].fpath
-
- # Generate optimization profiles.
-        # Non-benchmarking mode: the opt profile length defaults to half of the max profile length.
-        # Benchmarking mode: the user can specify the opt and max profile lengths via flags; if no additional benchmarking flags are provided, the non-benchmarking defaults are used.
-        # Note that the max length should be set to GPT2's MAX_LENGTH for text generation.
- max_sequence_length = GPT2ModelTRTConfig.MAX_LENGTH[metadata.variant]
- max_output_length = GPT2ModelTRTConfig.MAX_LENGTH[metadata.variant]
- opt_input_seq_len = max_sequence_length // 2
- opt_output_seq_len = max_output_length // 2
-
- # benchmarking flags
- if benchmarking_args is not None:
- max_sequence_length = benchmarking_args.input_profile_max_len
- max_output_length = benchmarking_args.output_profile_max_len
- opt_input_seq_len = benchmarking_args.input_seq_len
- opt_output_seq_len = benchmarking_args.output_seq_len
-
- if not hf_config.use_cache:
- # If not using kv cache, only input_ids is passed
- decoder_profiles = [Profile().add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, opt_output_seq_len),
- max=(batch_size * num_beams, max_output_length),
- )]
- else:
- num_heads = hf_config.n_head
- embedding_size_per_head = hf_config.n_embd // num_heads
- num_layers = hf_config.n_layer
-
- # context phase uses the provided input_ids to generate hidden states and self attention kv cache
- # It is only used in the 1st decoder run.
- dec_profiles_context = Profile().add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, opt_output_seq_len),
- max=(batch_size * num_beams, max_output_length),
- )
- self_attention_profile_context = {
- "min": (batch_size * num_beams, num_heads, 0, embedding_size_per_head),
- "opt": (batch_size * num_beams, num_heads, 0, embedding_size_per_head),
- "max": (batch_size * num_beams, num_heads, 0, embedding_size_per_head),
- }
-
- # generation phase uses previous self attention kv cache with the last input_ids token to generate the next hidden states and self attention kv cache
- # This optimization profile is used after the 1st decoder run.
- dec_profiles_generation = Profile().add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, 1),
- max=(batch_size * num_beams, 1),
- )
-
- self_attention_profile_generation = {
- "min": (batch_size * num_beams, num_heads, 1, embedding_size_per_head),
- "opt": (batch_size * num_beams, num_heads, opt_output_seq_len - 1, embedding_size_per_head),
- "max": (batch_size * num_beams, num_heads, max_output_length - 1, embedding_size_per_head),
- }
-
- for i in range(num_layers):
- dec_profiles_context = dec_profiles_context.add(
- f"past_key_values.{i}.decoder.key",
- **self_attention_profile_context
- ).add(
- f"past_key_values.{i}.decoder.value",
- **self_attention_profile_context
- )
-
- dec_profiles_generation = dec_profiles_generation.add(
- f"past_key_values.{i}.decoder.key",
- **self_attention_profile_generation
- ).add(
- f"past_key_values.{i}.decoder.value",
- **self_attention_profile_generation
- )
-
-            # TensorRT accepts multiple optimization profiles for the same engine.
-            # Profile 1 is only used in the first decoder iteration (context phase).
- decoder_profiles = [dec_profiles_generation, dec_profiles_context]
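The comments above describe the two-profile scheme; at runtime it maps onto one TensorRT execution context per profile, roughly as in the sketch below (an illustrative helper only, assuming `engine` is an `ICudaEngine` built with the two profiles in this order; binding management is omitted):

```python
import tensorrt as trt

def make_decoder_contexts(engine):
    """Create one execution context per optimization profile (sketch only)."""
    generation_ctx = engine.create_execution_context()    # a new context defaults to profile 0 (generation)
    context_phase_ctx = engine.create_execution_context()
    context_phase_ctx.active_optimization_profile = 1     # profile 1 handles the first (context) decoder run
    return generation_ctx, context_phase_ctx
```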
-
- # Convert ONNX models to TRT engines.
- if benchmarking_args is None:
- engine_tag = "bs{}".format(batch_size)
-        # When the user does not provide any profile_max_len, use the sequence lengths as the tag; both max lengths fall back to the config max.
- elif seq_tag:
- engine_tag = "bs{}-inseq{}-outseq{}".format(batch_size, benchmarking_args.input_seq_len, benchmarking_args.output_seq_len)
-        # When the user provides profile_max_len, the engine can be reused later with different seq_len values.
- else:
- engine_tag = "bs{}-inmax{}-outmax{}".format(batch_size, benchmarking_args.input_profile_max_len, benchmarking_args.output_profile_max_len)
-
- if num_beams > 1:
- engine_tag += "-beam{}".format(num_beams)
-
- preview_features = [PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
- if disable_preview_dynamic_shapes:
- engine_tag += "-noPreviewFasterDynamicShapes"
- else:
- preview_features.append(PreviewFeature.FASTER_DYNAMIC_SHAPES_0805)
-
- self.gpt2_trt_engine = GPT2ONNXFile(
- decoder_onnx_fpath, metadata
- ).as_trt_engine(
- os.path.splitext(decoder_onnx_fpath)[0] + "-{}.engine".format(engine_tag),
- profiles=decoder_profiles,
- preview_features=preview_features
- )
- self.gpt2_trt = GPT2TRTDecoder(
- self.gpt2_trt_engine, metadata, hf_config, batch_size=batch_size, num_beams=num_beams, benchmarking_args = benchmarking_args
- )
-
- def run_trt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_trt_engine: bool,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- disable_preview_dynamic_shapes: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
-
- workspace = self._setup_workspace(metadata, working_directory)
-
- # no fpath provided for onnx files, download them
- if len(onnx_fpaths) == 0:
- onnx_fpaths = self._download_models(workspace, metadata)
- else:
- keep_onnx_model = True
- keep_torch_model = True
-
- hash_onnx_fpath = {v.name: v for v in onnx_fpaths}
-
- inference_results = []
- ppl_results = []
- try:
- if not benchmarking_mode:
- self._setup_engines(metadata, hash_onnx_fpath, batch_size, args.num_beams, disable_preview_dynamic_shapes)
- for ninput in network_input:
- inference_results.append(
- self.execute_inference(
- metadata, hash_onnx_fpath, ninput, timing_profile, batch_size, args.num_beams
- )
- )
- # reset the decoder
- self.gpt2_trt.reset()
-
- if perplexity_reference is not None:
- assert len(network_input) == len(perplexity_reference), "Inputs must pair up"
- if metadata.other.kv_cache or (args.num_beams > 1):
- G_LOGGER.warning("Skipping perplexity calculation for TRT with KV cache or beam search because it is not supported yet.")
- else:
- for r in perplexity_reference:
- ppl_results.append(
- self.execute_calculate_perplexity(metadata, r, batch_size)
- )
- else:
- hf_config = AutoConfig.from_pretrained(metadata.variant, use_cache = metadata.other.kv_cache)
-                # Check that input_seq_len and output_seq_len are valid and within the required range.
- max_input_seq_len = hf_config.n_positions
- max_output_seq_len = hf_config.n_positions
-
- seq_tag = args.input_profile_max_len is None and args.output_profile_max_len is None
-                # The user must provide either a pair of [input/output]_profile_max_len or a pair of [input/output]_seq_len.
- if args.input_profile_max_len is None or args.output_profile_max_len is None:
- if args.input_seq_len is None or args.output_seq_len is None:
- assert False, "Please provide at least one pair of inputs: [input/output]_seq_len or [input/output]_profile_max_len"
-
- input_profile_max_len = setup_benchmark_arg(args.input_profile_max_len, "input_profile_max_len", max_input_seq_len)
- output_profile_max_len = setup_benchmark_arg(args.output_profile_max_len, "output_profile_max_len", max_output_seq_len)
- input_seq_len = setup_benchmark_arg(args.input_seq_len, "input_seq_len", input_profile_max_len // 2)
- output_seq_len = setup_benchmark_arg(args.output_seq_len, "output_seq_len", output_profile_max_len // 2)
-
- benchmarking_args = GPT2TRTBenchmarkingArgs(input_seq_len, output_seq_len, input_profile_max_len, output_profile_max_len)
-
- # Assert to ensure the validity of benchmarking arguments
- assert benchmarking_args.input_seq_len <= benchmarking_args.input_profile_max_len, "input_seq_len should <= input_profile_max_len = {} for benchmarking mode".format(benchmarking_args.input_profile_max_len)
- assert benchmarking_args.output_seq_len <= benchmarking_args.output_profile_max_len, "output_seq_len should <= output_profile_max_len = {} for benchmarking mode".format(benchmarking_args.output_profile_max_len)
- assert benchmarking_args.input_profile_max_len <= max_input_seq_len, "Model config restrict input_profile_max_len <= {} for benchmark mode".format(max_input_seq_len)
- assert benchmarking_args.output_profile_max_len <= max_output_seq_len, "Model config restrict output_profile_max_len <= {} for benchmark mode".format(max_output_seq_len)
-                # GPT2 is a text generation model, so output_seq_len must be at least input_seq_len.
-                assert benchmarking_args.input_seq_len <= benchmarking_args.output_seq_len, "GPT2 model text generation requires output_seq_len >= input_seq_len."
-                assert benchmarking_args.input_profile_max_len <= benchmarking_args.output_profile_max_len, "GPT2 model text generation requires output_profile_max_len >= input_profile_max_len."
- self._setup_engines(metadata, hash_onnx_fpath, batch_size, args.num_beams, disable_preview_dynamic_shapes, benchmarking_args, seq_tag)
- inference_results = self.execute_inference(
- metadata, hash_onnx_fpath, None, timing_profile, batch_size, args.num_beams, True, benchmarking_args
- )
-
- finally:
- self.cleanup(workspace, keep_trt_engine, keep_onnx_model, keep_torch_model)
-
- return inference_results, ppl_results
-
- def add_args(self, parser) -> None:
- super().add_args(parser)
-
- # use the same args as frameworks.py
- self.frameworks_cmd.add_args(parser)
- polygraphy_group = parser.add_argument_group("polygraphy")
- polygraphy_group.add_argument(
- "--onnx-fpath",
- default=None,
- help="Path to GPT2 ONNX model. If None is supplied, scripts will generate them from HuggingFace.",
- )
- polygraphy_group.add_argument(
- "--fp16", action="store_true", help="Enables fp16 TensorRT tactics."
- )
- polygraphy_group.add_argument(
- "--save-trt-engine",
- action="store_true",
- help="Saves TensorRT runtime engine in working directory.",
- )
-
- def args_to_network_models(self, args) -> List[NetworkModel]:
- gpt2_fpath_check = args.onnx_fpath is None
-
- network_models = None
- if gpt2_fpath_check:
- network_models = tuple()
- else:
- onnx_decoder = NetworkModel(
- name=GPT2ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=args.onnx_fpath,
- )
-            network_models = (onnx_decoder,)
-
- return network_models
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- frameworks_parsed_metadata = self.frameworks_cmd.args_to_network_metadata(args)
-
- return NetworkMetadata(
- variant=frameworks_parsed_metadata.variant,
- precision=Precision(fp16=args.fp16),
- other=frameworks_parsed_metadata.other,
- )
-
-
-RUN_CMD = GPT2TRT()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/NNDF/README.md b/demo/HuggingFace/NNDF/README.md
deleted file mode 100644
index 8a0cc98c..00000000
--- a/demo/HuggingFace/NNDF/README.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Neural Network Driven Framework
-
-NNDF is a collection of files and formats that provides an underlying policy and flow for TensorRT network onboarders to follow.
-NNDF is inspired by HuggingFace and PyTorch common design architectures where the Neural Network is divided into two abstractions:
-
-* High level abstractions via configuration files
-* Low level abstractions via I/O classes
-
-## Benefits
-
-Because NNDF is inspired by existing successful network frameworks, interoperating with HuggingFace, Torch, and other
-frameworks becomes trivial and code can often be reused. See for example `GenerationMixin`, which HuggingFace uses to
-implement `greedy_decoder` and `beam_search`. Using NNDF, we can call `beam_search` and other search functions directly.
-
-In other words:
-
-* Re-use high level measurement tools supplied by well known frameworks
-* Ensure a fair platform for timing TRT performance alongside other frameworks by using the same post-processing code.
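As a rough illustration of the pattern described above (not demo code; `TRTDecoderStub` and `run_trt_step` are placeholder names), a TRT-backed decoder only has to look like a HuggingFace causal LM for `generate()` and its search functions to drive it:

```python
import torch
from transformers.modeling_outputs import CausalLMOutputWithPast

def run_trt_step(input_ids: torch.Tensor) -> torch.Tensor:
    # Placeholder for a TensorRT execute call; returns random logits of the right shape.
    vocab_size = 50257
    return torch.randn(input_ids.shape[0], input_ids.shape[1], vocab_size)

class TRTDecoderStub:
    def forward(self, input_ids: torch.Tensor, **kwargs) -> CausalLMOutputWithPast:
        # Returning the standard HuggingFace output type is what lets beam_search & co. be reused.
        return CausalLMOutputWithPast(logits=run_trt_step(input_ids))
```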
diff --git a/demo/HuggingFace/NNDF/checkpoints.py b/demo/HuggingFace/NNDF/checkpoints.py
deleted file mode 100644
index 3c94ea32..00000000
--- a/demo/HuggingFace/NNDF/checkpoints.py
+++ /dev/null
@@ -1,155 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Helper file for generating common checkpoints.
-"""
-
-import itertools
-from typing import List
-
-# TRT-HuggingFace
-from NNDF.networks import NetworkMetadata, NetworkResult
-from NNDF.interface import VALID_FRAMEWORKS
-
-# externals
-import toml
-class NNTomlCheckpoint:
- """
- Loads a toml checkpoint file for comparing labels and inputs.
- The following nested key structure is required:
-
- [Network.Framework.Variant.Precision]
-
-    For each category, you can assign a default behaviour using a special key
- defined by CHECKPOINT_STRUCTURE_FLAT.
-
-    The default keys defined by CHECKPOINT_STRUCTURE_FLAT are reserved and must not collide with real framework, variant, or precision values.
- """
-
- # The checkpoint structure and their default keys
- CHECKPOINT_STRUCTURE_FLAT = {
- "framework": "all",
- "variant": "default",
- "precision": "all"
- }
-
- def __init__(self, fpath: str, framework: str, network_name: str, metadata: NetworkMetadata):
- """Loads the toml file for processing."""
- data = {}
- with open(fpath) as f:
- data = toml.load(f)
-
- assert framework in VALID_FRAMEWORKS
- # These keys are reserved to indicate the default state.
- assert self.CHECKPOINT_STRUCTURE_FLAT["framework"] not in VALID_FRAMEWORKS
-
- # Select the current input data
- # try to get the base data
- network_data = data.get(network_name, {})
-
- cur_keys = {
- "framework": framework,
- "variant": metadata.variant,
- "precision": "fp16" if metadata.precision.fp16 else "fp32"
- }
-
-        combined_keys = [[self.CHECKPOINT_STRUCTURE_FLAT[k], cur_keys[k]] for k in self.CHECKPOINT_STRUCTURE_FLAT.keys()]
- # A helper function for flattening the getters.
- def flat_getter(d=network_data, *args):
- for k in args:
- if k not in d:
- return {}
- d = d[k]
- return d
-
- # self.data stores several keys:
- # {"checkpoint_name": {"label": xxx, "input": xxx}}
-        # The loop below merges the more specific checkpoint entries on top of these defaults.
- self.data = network_data["all"]["default"]["all"]
- for keys in itertools.product(*combined_keys):
- values = flat_getter(network_data, *keys)
- if len(values) == 0:
- continue
- for data_k, data_v in self.data.items():
- if data_k in values:
- self.data[data_k] = {**data_v, **values[data_k]}
-
- # Used when accuracy() is called
- self._lookup_cache = None
-
- def _iterate_data(self, slice: List[str], skip_keyword: str = "skip"):
- """
- Helper for child classes to iterate through a slice of data.
-
- Return:
- (Union[Dict[str, str], List[str]]): Returns a list of all value keys given in 'slice' or if more than one value is given for 'slice' then a dictionary instead.
- """
- returns_dict = len(slice) > 1
- for value in self.data.values():
-            if skip_keyword in value:
- continue
-
- if returns_dict:
- yield {s: value[s] for s in slice}
- else:
- yield value[slice[0]]
-
-
-class NNSemanticCheckpoint(NNTomlCheckpoint):
- """Requires the following data structure:
-
- [...]
- [input_a]
- label = "sample_label"
- input = "sample_input"
-
- [input_b]
- label = "sample_label"
- input = "sample_input"
-
-    The following are reserved keywords:
-    framework = "all" indicates rules apply to all frameworks.
-    variant = "default" indicates rules apply to all variants.
-    precision = "all" indicates rules apply to all precisions.
- """
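For reference, a hedged sketch of what such a checkpoint might look like when parsed with the same `toml` package the loader uses (the network and variant names below are placeholders, not necessarily the exact keys used by the demo):

```python
import toml

snippet = """
[GPT2.all.default.all.input_a]
label = "sample_label"
input = "sample_input"

[GPT2.all.default.all.input_b]
label = "sample_label"
input = "sample_input"
"""

data = toml.loads(snippet)
# Nesting follows [Network.Framework.Variant.Precision.checkpoint_name].
print(data["GPT2"]["all"]["default"]["all"]["input_a"]["input"])  # -> "sample_input"
```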
-
- def __iter__(self):
- return self._iterate_data(["label", "input"])
-
- def labels(self):
- return self._iterate_data(["label"])
-
- def inputs(self):
- return self._iterate_data(["input"])
-
- def accuracy(self, results: List[NetworkResult]) -> float:
- # Hash checkpoints by their input
- if self._lookup_cache is None:
- self._lookup_cache = {}
- for k, v in self.data.items():
- self._lookup_cache[v["input"]] = k
-
- correct_count = 0
- for r in results:
-            # Find the data that corresponds to the input
- key = self._lookup_cache[r.input]
- # remove new line characters
- r_new = r.semantic_output[0] if isinstance(r.semantic_output, list) else r.semantic_output
- correct_count += int(self.data[key]["label"].replace('\\n','').replace('\n','') == r_new.replace('\\n','').replace('\n',''))
-
- return correct_count / len(results)
diff --git a/demo/HuggingFace/NNDF/cuda_bootstrapper.py b/demo/HuggingFace/NNDF/cuda_bootstrapper.py
deleted file mode 100644
index e9fdb26d..00000000
--- a/demo/HuggingFace/NNDF/cuda_bootstrapper.py
+++ /dev/null
@@ -1,101 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Holds logic for modifying and removing invalid CUDA libraries in LD_LIBRARY_PATH.
-
-Users may have CUDA libraries in LD_LIBRARY_PATH which can cause issues with Torch's cuBLAS.
-This problem only occurs on Linux.
-See:
- https://github.com/pytorch/pytorch/issues/94294
- https://github.com/pytorch/pytorch/issues/64097
-"""
-
-import os
-import sys
-import glob
-import shutil
-
-import subprocess as sp
-from NNDF.logger import G_LOGGER
-
-def bootstrap_ld_library_path() -> bool:
- """
- Modifies the LD_LIBRARY_PATH if applicable and then spawns a child process
- using first "poetry" and then "python3"/"python" if "poetry" fails.
- """
- if os.environ.get("TRT_OSS_DISABLE_BOOTSTRAP") or "linux" not in sys.platform:
- return False
-
- # Walk through each path in environment to see if there are cublas libraries being loaded.
- paths = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
- new_paths = []
- modified_path = False
- for path in paths:
- for lib in ("cublas", "cudart", "cublasLt"):
- g = glob.glob(os.path.join(path, f"lib{lib}.so.*"))
- if g:
- modified_path = True
- G_LOGGER.warning(f"Discarding `{path}` from LD_LIBRARY_PATH since it contains CUDA libraries.")
- break
- else:
- new_paths.append(path)
-
-
- if not modified_path:
- return False
- else:
- warning_msg = ("Attempting to bootstrap altered LD_LIBRARY_PATH. "
- "\nYou can disable this with TRT_OSS_DISABLE_BOOTSTRAP=1 however frameworks performance may be impacted. "
-            "\nThere are known issues with cuBLAS loading and PyTorch compatibility "
-            "that are still being resolved for most CUDA <= 12.1 and Torch setups. See: "
- "\n - https://github.com/pytorch/pytorch/issues/94294"
- "\n - https://github.com/pytorch/pytorch/issues/64097\n")
- G_LOGGER.warning(warning_msg)
-
- G_LOGGER.info(f"CUDA detected in path. Restarting scripts with modified LD_LIBRARY_PATH: {new_paths}")
- os.environ["LD_LIBRARY_PATH"] = os.pathsep.join(new_paths)
-    # Set TRT_OSS_DISABLE_BOOTSTRAP so the spawned child process does not attempt to bootstrap again.
- os.environ["TRT_OSS_DISABLE_BOOTSTRAP"] = "1"
-
- # Spawn a new child process instead.
- try:
- # Use the same python exe that invoked this script
- default_python = sys.executable
-
- # Demo supports both poetry and python3 invocation.
- # Check if poetry works first.
- cmd = [default_python] + list(sys.argv)
- if shutil.which("poetry") is not None:
- poetry_cmd = ["poetry", "run"] + cmd
-
-            # The poetry command is tried first. If it fails, we ignore the error and fall back to the default python.
- try:
- # Instantiate a secondary child process.
- sp.check_call(" ".join(poetry_cmd), env=dict(os.environ), cwd=os.getcwd(), shell=True)
- return True
- except:
- pass
-
- # Default python fallback.
- sp.check_call(" ".join(cmd), env=dict(os.environ), cwd=os.getcwd(), shell=True)
- except Exception as e:
- G_LOGGER.error("Unable to start a new process with modified LD_LIBRARY_PATH. Consider removing CUDA lib in LD_LIBRARY_PATH manually.")
- G_LOGGER.error(str(e))
- G_LOGGER.warning("Attempting to continue with demo.")
-
- return True
diff --git a/demo/HuggingFace/NNDF/general_utils.py b/demo/HuggingFace/NNDF/general_utils.py
deleted file mode 100644
index f8fb9897..00000000
--- a/demo/HuggingFace/NNDF/general_utils.py
+++ /dev/null
@@ -1,286 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""Common utils used by demo folder.
-Note:
-- For now, users/developers that are contributing to TensorRT OSS should NOT import non-default Python packages in this file, because the test pipeline's boot-up process cannot load extra dependencies. In the near future, alternative solutions such as creating a separate boot-up util list may be possible.
-- Users/developers that are just using the TensorRT OSS without contributing are still free to modify this file and customize for deployment.
-"""
-
-import os
-import shutil
-import timeit
-import math
-
-from datetime import datetime
-from shutil import rmtree
-from typing import Callable, Union, List
-from collections import defaultdict
-from statistics import mean, median
-from glob import glob
-
-# NNDF
-from NNDF.networks import NNConfig, NetworkResult, NetworkMetadata, TimingProfile
-from NNDF.logger import G_LOGGER
-
-# Used for HuggingFace setting random seed
-RANDOM_SEED = 42
-
-# Networks #
-def register_network_folders(
- root_dir: str, config_file_str: str = "*Config.py"
-) -> List[str]:
- networks = []
- for network_configs in glob(os.path.join(root_dir, "*", config_file_str)):
- network_name = os.path.split(os.path.split(network_configs)[0])[1]
- networks.append(network_name)
- return networks
-
-
-def process_results(category: List[str], results: List[NetworkResult], nconfig: NNConfig):
- """
- Calculate and process results across multiple runs.
- """
- general_stats = ["script", "accuracy"]
- runtime_result_row_names = list(nconfig.NETWORK_SEGMENTS)
- if nconfig.NETWORK_FULL_NAME not in nconfig.NETWORK_SEGMENTS:
- runtime_result_row_names.append(nconfig.NETWORK_FULL_NAME)
-
- rows = []
- row_entry = []
- for cat, result in zip(category, results):
- # Process runtime results for each group
- runtime_results = defaultdict(list)
- for runtimes in [nr.median_runtime for nr in result.network_results]:
- for runtime in runtimes:
- runtime_results[runtime.name].append(runtime.runtime)
-
- # Calculate average runtime for each group
- average_group_runtime = {k: mean(v) for k, v in runtime_results.items()}
- row_entry = [cat, result.accuracy] + [
- average_group_runtime[n] for n in runtime_result_row_names
- ]
- rows.append(row_entry)
-
- headers = general_stats + [r + " (sec)" for r in runtime_result_row_names]
- return headers, rows
-
-def process_per_result_entries(script_category: List[str], results: List[NetworkResult], max_output_char:int = 30):
- """Prints tabulations for each entry returned by the runtime result."""
- def _shorten_text(w):
- l = len(w)
- if l > max_output_char:
- return w[0:max_output_char // 2] + " ... " + w[-max_output_char//2:]
- return w
-
- headers = ["script", "network_part", "accuracy", "runtime", "input", "output"]
- row_data_by_input = defaultdict(list)
- for cat, result in zip(script_category, results):
- for nr in result.network_results:
- for runtime in nr.median_runtime:
- row_data_by_input[hash(nr.input)].append([
- cat,
- runtime.name,
- result.accuracy,
- runtime.runtime,
- _shorten_text(nr.input),
- _shorten_text(nr.semantic_output)
- ])
-
- return headers, dict(row_data_by_input)
-
-# IO #
-def confirm_folder_delete(
- fpath: str, prompt: str = "Confirm you want to delete entire folder?"
-) -> None:
- """
- Confirms whether or not user wants to delete given folder path.
-
- Args:
- fpath (str): Path to folder.
- prompt (str): Prompt to display
-
- Returns:
- None
- """
- msg = prompt + " {} [Y/n] ".format(fpath)
- confirm = input(msg)
- if confirm == "Y":
- rmtree(fpath)
- else:
- G_LOGGER.info("Skipping file removal.")
-
-
-def remove_if_empty(
- fpath: str,
- success_msg: str = "Folder successfully removed.",
- error_msg: str = "Folder cannot be removed, there are files.",
-) -> None:
- """
- Removes an entire folder if folder is empty. Provides print info statements.
-
- Args:
- fpath: Location to folder
- success_msg: Success message.
- error_msg: Error message.
-
- Returns:
- None
- """
- if len(os.listdir(fpath)) == 0:
- os.rmdir(fpath)
- G_LOGGER.info(success_msg + " {}".format(fpath))
- else:
- G_LOGGER.info(error_msg + " {}".format(fpath))
-
-
-def measure_python_inference_code(
- stmt: Union[Callable, str], timing_profile: TimingProfile
-) -> None:
- """
- Measures the time it takes to run Pythonic inference code.
- Statement given should be the actual model inference like forward() in torch.
-
- Args:
- stmt (Union[Callable, str]): Callable or string for generating numbers.
- timing_profile (TimingProfile): The timing profile settings with the following fields.
- warmup (int): Number of iterations to run as warm-up before actual measurement cycles.
- number (int): Number of times to call function per iteration.
- iterations (int): Number of measurement cycles.
- duration (float): Minimal duration for measurement cycles.
- percentile (int or list of ints): key percentile number(s) for measurement.
- """
-
- def simple_percentile(data, p):
- """
- Temporary replacement for numpy.percentile() because TRT CI/CD pipeline requires additional packages to be added at boot up in this general_utils.py file.
- """
-        assert p >= 0 and p <= 100, "Percentile must be between 0 and 100"
-
- rank = len(data) * p / 100
- if rank.is_integer():
- return sorted(data)[int(rank)]
- else:
- return sorted(data)[int(math.ceil(rank)) - 1]
-
- warmup = timing_profile.warmup
- number = timing_profile.number
- iterations = timing_profile.iterations
- duration = timing_profile.duration
- percentile = timing_profile.percentile
-
- G_LOGGER.debug(
- "Measuring inference call with warmup: {} and number: {} and iterations {} and duration {} secs".format(
- warmup, number, iterations, duration
- )
- )
- # Warmup
- warmup_mintime = timeit.repeat(stmt, number=number, repeat=warmup)
- G_LOGGER.debug("Warmup times: {}".format(warmup_mintime))
-
- # Actual measurement cycles
- results = []
- start_time = datetime.now()
- iter_idx = 0
- while iter_idx < iterations or (datetime.now() - start_time).total_seconds() < duration:
- iter_idx += 1
- results.append(timeit.timeit(stmt, number=number))
-
- if isinstance(percentile, int):
- return simple_percentile(results, percentile) / number
- else:
- return [simple_percentile(results, p) / number for p in percentile]
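A hedged usage sketch of the function above (the callable and the profile values are placeholders; the `TimingProfile` fields match those documented in the docstring):

```python
from NNDF.networks import TimingProfile

def dummy_inference():
    # Stand-in for an actual model forward() call.
    sum(i * i for i in range(10_000))

profile = TimingProfile(iterations=10, number=1, warmup=3, duration=0.0, percentile=50)
median_seconds = measure_python_inference_code(dummy_inference, profile)
```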
-
-class NNFolderWorkspace:
- """
- For keeping track of workspace folder and for cleaning them up.
- Due to potential corruption of ONNX model conversion, the workspace is split up by model variants.
- """
-
- def __init__(
- self, network_name: str, metadata: NetworkMetadata, working_directory: str
- ):
- self.rootdir = working_directory
- self.metadata = metadata
- self.network_name = network_name
- self.dpath = os.path.join(self.rootdir, self.network_name, metadata.variant)
- os.makedirs(self.dpath, exist_ok=True)
-
- def set_model_path(self, metadata_serialized, is_encoder_decoder: bool) -> str:
- '''
-        Create a subdirectory for models with different configs (e.g. KV cache)
- '''
- self.model_path = os.path.join(self.dpath, metadata_serialized)
- self.decoder_path = os.path.join(self.model_path, "decoder")
- os.makedirs(self.decoder_path, exist_ok=True)
- if is_encoder_decoder:
- self.encoder_path = os.path.join(self.model_path, "encoder")
- os.makedirs(self.encoder_path, exist_ok=True)
- # For decoder only models, there is no encoder
- else:
- self.encoder_path = None
-
-        # In KV-cache mode, the decoder needs separate non-KV and KV subdirectories.
- if self.metadata.other.kv_cache:
- self.decoder_non_kv_path = os.path.join(self.decoder_path, "non-kv")
- self.decoder_kv_path = os.path.join(self.decoder_path, "kv")
- os.makedirs(self.decoder_non_kv_path, exist_ok=True)
- os.makedirs(self.decoder_kv_path, exist_ok=True)
-
- return self.model_path, self.encoder_path, self.decoder_path
-
- def get_path(self) -> str:
- return self.dpath
-
- def get_model_path(self) -> str:
- return self.model_path
-
- def get_encoder_path(self) -> str:
- return self.encoder_path
-
- def get_decoder_path(self) -> str:
- return self.decoder_path
-
- def get_decoder_path_kv(self) -> (str, str):
- if not self.metadata.other.kv_cache:
- raise RuntimeError("Trying to access kv specific folder in non kv mode")
- else:
- return self.decoder_kv_path, self.decoder_non_kv_path
-
- def cleanup(self, force_remove: bool = False) -> None:
- '''
- Cleanup would remove all the contents in the workspace.
- '''
- if force_remove:
- return shutil.rmtree(self.dpath)
-
-        if hasattr(self, "model_path"):
- if self.encoder_path is not None:
- remove_if_empty(self.encoder_path)
- if self.metadata.other.kv_cache:
- remove_if_empty(
- self.decoder_kv_path
- )
- remove_if_empty(
- self.decoder_non_kv_path
- )
- remove_if_empty(
- self.decoder_path
- )
-
- remove_if_empty(self.model_path)
- remove_if_empty(self.dpath)
diff --git a/demo/HuggingFace/NNDF/interface.py b/demo/HuggingFace/NNDF/interface.py
deleted file mode 100644
index 8d1a739c..00000000
--- a/demo/HuggingFace/NNDF/interface.py
+++ /dev/null
@@ -1,531 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Interface classes required for each registered network script.
-"""
-
-import argparse
-
-from abc import ABCMeta, abstractmethod
-from typing import List, Tuple, Union
-
-# NNDF
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkResult,
- NetworkMetadata,
- NetworkCheckpointResult,
- NNConfig,
- NetworkModel,
- TimingProfile,
-)
-from NNDF.logger import G_LOGGER
-from NNDF.general_utils import NNFolderWorkspace
-
-# externals
-# None, there should be no external dependencies for testing purposes.
-
-# Program-wide constants for passing in valid frameworks.
-FRAMEWORK_NATIVE = "native"
-FRAMEWORK_TENSORRT = "trt"
-FRAMEWORK_ONNXRT = "onnxrt"
-VALID_FRAMEWORKS = [
- FRAMEWORK_NATIVE,
- FRAMEWORK_ONNXRT,
- FRAMEWORK_TENSORRT
-]
-
-class MetadataArgparseInteropMixin:
- """Add argparse support where the class can add new arguments to an argparse object."""
-
- @staticmethod
- @abstractmethod
- def add_args(parser):
- pass
-
- @staticmethod
- @abstractmethod
- def from_args(args):
- pass
-
- @staticmethod
- @abstractmethod
- def add_inference_args(parser):
- pass
-
- @staticmethod
- @abstractmethod
- def from_inference_args(args):
- pass
-
- @staticmethod
- @abstractmethod
- def add_benchmarking_args(parser):
- """
- Add args needed for perf benchmarking mode.
- """
- pass
-
-class NetworkCommand(metaclass=ABCMeta):
- """Base class that each network script's command module should inherit."""
-
- description = "NetworkCommand"
-
- DEFAULT_ITERATIONS = 10
- DEFAULT_NUMBER = 1
- DEFAULT_WARMUP = 3
- DEFAULT_DURATION = 0.0
- DEFAULT_PERCENTILE = 50
-
- def __init__(self, network_config: NNConfig, description: str):
- self.config = network_config()
- self.description = description
- self.framework_name = None
- self._parser = argparse.ArgumentParser(description=description, conflict_handler="resolve")
-
- def __call__(self):
- self.add_args(self._parser)
- self.config.MetadataClass.add_args(self._parser)
- self._args = self._parser.parse_args()
-
- if self._args.verbose:
- G_LOGGER.setLevel(level=G_LOGGER.DEBUG)
- elif self._args.info:
- G_LOGGER.setLevel(level=G_LOGGER.INFO)
-
- self.metadata = self.args_to_network_metadata(self._args)
- self.check_network_metadata_is_supported(self.metadata)
-
- @abstractmethod
- def run_benchmark(self):
- """
- Run inference in performance benchmarking mode for apples-to-apples perf comparisons across platforms.
- Differences with normal run mode include (but are not limited to):
-
- - Use random input data and disable accuracy checking.
- - Use fixed input/output sequence lengths and disable early stopping.
- - Provide better controls on the number of warm-ups and the number/duration of inference iterations.
-
- The derived class should override this method for the benchmarking implementation for the specific framework.
- """
- pass
-
- def add_args(self, parser) -> None:
- general_group = parser.add_argument_group("general")
- general_group.add_argument(
- "--verbose", help="Display verbose logs.", action="store_true"
- )
- general_group.add_argument(
- "--info", help="Display info logs.", action="store_true"
- )
- general_group.add_argument(
- "--cleanup",
-            help="Cleans up the user-specified workspace. The workspace cannot be cleaned if external files exist in it.",
- action="store_false",
- )
- general_group.add_argument(
- "--working-dir",
- help="Location of where to save the model and other downloaded files.",
- required=True,
- )
- general_group.add_argument(
- "--batch-size", "-b",
- help="Chosen batch size for given network",
- required=False,
- type=int,
- default=1
- )
-
- timing_group = parser.add_argument_group("inference measurement")
- timing_group.add_argument(
- "--iterations",
- type=int,
- help="Number of iterations to measure.",
- default=self.DEFAULT_ITERATIONS,
- )
- timing_group.add_argument(
- "--number",
- type=int,
-            help="Number of actual inference cycles per iteration.",
- default=self.DEFAULT_NUMBER,
- )
- timing_group.add_argument(
- "--warmup",
- type=int,
- help="Number of warmup iterations before actual measurement occurs.",
- default=self.DEFAULT_WARMUP,
- )
- timing_group.add_argument(
- "--duration",
- type=float,
- help="Minimal duration of inference iterations to measure.",
- default=self.DEFAULT_DURATION,
- )
- timing_group.add_argument(
- "--percentile",
- type=int,
- help="Key percentile number for time measurement.",
- default=self.DEFAULT_PERCENTILE,
- )
-
- def check_network_metadata_is_supported(self, metadata: NetworkMetadata) -> None:
- """
- Checks if current command supports the given metadata as defined by the NNConfig.
- Args:
- metadata (NetworkMetadata): NetworkMetadata to check if input is supported.
-
- Throws:
- NotImplementedError: If the given metadata is not a valid configuration for this network.
-
- Returns:
- None
- """
- if metadata not in self.config.variants:
- raise NotImplementedError(
- "The following network config is not yet supported by our scripts: {}".format(
- metadata
- )
- )
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- return self.config.MetadataClass.from_args(args)
-
- def load_nn_semantic_checkpoint(self) -> object:
- """
- Loads the NNSemanticCheckpoint instance from checkpoint.toml file.
- """
-        # Defer the import so that this interface file can be used without
-        # installing dependencies during our testing.
- from NNDF.checkpoints import NNSemanticCheckpoint
- checkpoint = NNSemanticCheckpoint(
- "checkpoint.toml",
- framework=self.framework_name,
- network_name=self.config.network_name,
- metadata=self.metadata,
- )
- return checkpoint
-
- def get_timing_profile(self) -> TimingProfile:
- """
- Get TimingProfile settings given current args.
- """
- return TimingProfile(
- iterations=int(self._args.iterations),
- number=int(self._args.number),
- warmup=int(self._args.warmup),
-            duration=float(self._args.duration),
- percentile=int(self._args.percentile),
- )
-
-
-class FrameworkCommand(NetworkCommand):
- """Base class that is associated with Frameworks related scripts."""
-
- def __init__(self, network_config: NNConfig, description: str):
- super().__init__(network_config, description)
- self.framework_name = FRAMEWORK_NATIVE
-
- @abstractmethod
- def run_framework(
- self,
- metadata: NetworkMetadata,
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_pytorch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int,
- args: object = None,
- benchmarking_mode: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
- pass
-
- def __call__(self):
- super().__call__()
-
- checkpoint = self.load_nn_semantic_checkpoint()
-
- network_results, ppl_results = self.run_framework(
- metadata=self.metadata,
- network_input=list(checkpoint.inputs()),
- working_directory=self._args.working_dir,
- keep_onnx_model=self._args.cleanup,
- keep_pytorch_model=self._args.cleanup,
- timing_profile=self.get_timing_profile(),
- use_cpu=self._args.cpu,
- batch_size=self._args.batch_size,
- args=self._args,
- benchmarking_mode=False,
- perplexity_reference=list(checkpoint.labels()),
- )
-
- return NetworkCheckpointResult(
- network_results=network_results,
- accuracy=checkpoint.accuracy(network_results),
- perplexity=(sum(ppl_results) / len(ppl_results) if ppl_results else None),
- )
-
- def run_benchmark(self):
- self.config.MetadataClass.add_benchmarking_args(self._parser)
- super().__call__()
-
- network_results = self.run_framework(
- metadata=self.metadata,
- network_input=None,
- working_directory=self._args.working_dir,
- keep_onnx_model=self._args.cleanup,
- keep_pytorch_model=self._args.cleanup,
- timing_profile=self.get_timing_profile(),
- use_cpu=self._args.cpu,
- batch_size=self._args.batch_size,
- args=self._args,
- benchmarking_mode=True,
- )
-
- return network_results
-
- def add_args(self, parser) -> argparse.ArgumentParser:
- super().add_args(parser)
- device_group = parser.add_argument_group("device")
- device_group.add_argument(
- "--cpu",
- help="Run inference using CPU for frameworks.",
- action="store_true",
- )
-
-class TRTInferenceCommand(NetworkCommand):
- """Base class that is associated with Polygraphy related scripts."""
-
- def __init__(
- self,
- network_config: NNConfig,
- description: str,
- frameworks_cmd: FrameworkCommand,
- ):
- super().__init__(network_config, description)
- self.framework_name = FRAMEWORK_TENSORRT
- # Should be set by
- self.frameworks_cmd = frameworks_cmd()
-
- def _setup_workspace(self, metadata: NetworkMetadata, working_directory: str) -> NNFolderWorkspace:
- return NNFolderWorkspace(
- self.frameworks_cmd.config.network_name, metadata, working_directory
- )
-
- def _download_models(
- self,
- workspace: NNFolderWorkspace,
- metadata: NetworkMetadata,
- ) -> Tuple[NetworkModel]:
- # No fpath provided for onnx files, download them from HuggingFace repo.
- return self.frameworks_cmd.generate_and_download_framework(
- metadata, workspace
- ).onnx
-
- @abstractmethod
- def run_trt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_trt_engine: bool,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- disable_preview_dynamic_shapes: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
- pass
-
- def __call__(self):
- self.config.MetadataClass.add_inference_args(self._parser)
- super().__call__()
- onnx_fpaths = self.args_to_network_models(self._args)
-
- checkpoint = self.load_nn_semantic_checkpoint()
-
- network_results, ppl_results = self.run_trt(
- metadata=self.metadata,
- onnx_fpaths=onnx_fpaths,
- network_input=list(checkpoint.inputs()),
- working_directory=self._args.working_dir,
- keep_trt_engine=self._args.cleanup,
- keep_onnx_model=self._args.cleanup,
- keep_torch_model=self._args.cleanup,
- timing_profile=self.get_timing_profile(),
- batch_size=self._args.batch_size,
- args=self._args,
- benchmarking_mode=False,
- disable_preview_dynamic_shapes=self._args.disable_preview_dynamic_shapes,
- perplexity_reference=list(checkpoint.labels()),
- )
-
- return NetworkCheckpointResult(
- network_results=network_results,
- accuracy=checkpoint.accuracy(network_results),
- perplexity=(sum(ppl_results) / len(ppl_results) if ppl_results else None),
- )
-
- def run_benchmark(self):
- self.config.MetadataClass.add_inference_args(self._parser)
- self.config.MetadataClass.add_benchmarking_args(self._parser)
- super().__call__()
- onnx_fpaths = self.args_to_network_models(self._args)
-
- network_results = self.run_trt(
- metadata=self.metadata,
- onnx_fpaths=onnx_fpaths,
- network_input=None,
- working_directory=self._args.working_dir,
- keep_trt_engine=self._args.cleanup,
- keep_onnx_model=self._args.cleanup,
- keep_torch_model=self._args.cleanup,
- timing_profile=self.get_timing_profile(),
- batch_size=self._args.batch_size,
- args=self._args,
- benchmarking_mode=True,
- disable_preview_dynamic_shapes=self._args.disable_preview_dynamic_shapes
- )
-
- return network_results
-
- def add_args(self, parser) -> argparse.ArgumentParser:
- super().add_args(parser)
- trt_group = parser.add_argument_group("trt")
- trt_group.add_argument(
- "--disable-preview-dynamic-shapes",
- help="Disable the FASTER_DYNAMIC_SHAPES_0805 preview feature when building the TensorRT engine",
- action="store_true",
- )
-
- trt_benchmarking_group = parser.add_argument_group("trt benchmarking group")
- trt_benchmarking_group.add_argument(
- "--input-profile-max-len",
- type=int,
- help="Specify max input sequence length in TRT engine profile. (default: max supported sequence length)",
- )
- trt_benchmarking_group.add_argument(
- "--output-profile-max-len",
- type=int,
- help="Specify max output sequence length in TRT engine profile. (default: max supported sequence length)",
- )
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- return self.config.MetadataClass.from_inference_args(args)
-
- @abstractmethod
- def args_to_network_models(self, args) -> Tuple[NetworkModel]:
- """
- Converts argparse arguments into a list of valid NetworkModel fpaths. Specifically for ONNX.
-        Invokes conversion scripts if the models are not already provided.
- Return:
- List[NetworkModel]: List of network model names.
- """
-
-class OnnxRTCommand(NetworkCommand):
- """ONNX Runtime command."""
-
- def __init__(
- self,
- network_config: NNConfig,
- description: str,
- frameworks_cmd: FrameworkCommand,
- ):
- super().__init__(network_config, description)
- self.framework_name = FRAMEWORK_ONNXRT
- # Should be set by
- self.frameworks_cmd = frameworks_cmd()
-
- @abstractmethod
- def run_onnxrt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- args: object = None,
- benchmarking_mode: bool = False,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
- pass
-
- def __call__(self):
- self.config.MetadataClass.add_inference_args(self._parser)
- super().__call__()
- onnx_fpaths = self.args_to_network_models(self._args)
-
- checkpoint = self.load_nn_semantic_checkpoint()
-
- network_results = self.run_onnxrt(
- metadata=self.metadata,
- onnx_fpaths=onnx_fpaths,
- network_input=list(checkpoint.inputs()),
- working_directory=self._args.working_dir,
- keep_onnx_model=self._args.cleanup,
- keep_torch_model=self._args.cleanup,
- timing_profile=self.get_timing_profile(),
- batch_size=self._args.batch_size,
- args=self._args,
- benchmarking_mode=False,
- )
-
- return NetworkCheckpointResult(
- network_results=network_results,
- accuracy=checkpoint.accuracy(network_results),
- perplexity=None,
- )
-
- def run_benchmark(self):
- self.config.MetadataClass.add_inference_args(self._parser)
- self.config.MetadataClass.add_benchmarking_args(self._parser)
- super().__call__()
- onnx_fpaths = self.args_to_network_models(self._args)
-
- network_results = self.run_onnxrt(
- metadata=self.metadata,
- onnx_fpaths=onnx_fpaths,
- network_input=None,
- working_directory=self._args.working_dir,
- keep_onnx_model=self._args.cleanup,
- keep_torch_model=self._args.cleanup,
- timing_profile=self.get_timing_profile(),
- batch_size=self._args.batch_size,
- args=self._args,
- benchmarking_mode=True,
- )
-
- return network_results
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- return self.config.MetadataClass.from_inference_args(args)
-
- @abstractmethod
- def args_to_network_models(self, args) -> Tuple[NetworkModel]:
- """
- Converts argparse arguments into a list of valid NetworkModel fpaths. Specifically for ONNX.
-        Invokes conversion scripts if the models are not already provided.
- Return:
- List[NetworkModel]: List of network model names.
- """
diff --git a/demo/HuggingFace/NNDF/models.py b/demo/HuggingFace/NNDF/models.py
deleted file mode 100644
index 8a51392b..00000000
--- a/demo/HuggingFace/NNDF/models.py
+++ /dev/null
@@ -1,514 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-File for containing model file abstraction. Useful for generating models.
-"""
-
-import os
-from abc import ABCMeta, abstractmethod
-from typing import Union, List
-from shutil import copytree, rmtree
-
-# polygraphy
-from polygraphy.backend.trt import (
- network_from_onnx_path,
- engine_from_network,
- save_engine,
- Profile,
-)
-
-from polygraphy.backend.trt import CreateConfig
-from polygraphy.logger import G_LOGGER as PG_LOGGER
-
-# torch
-from torch import load, save
-from torch.nn import Module
-
-# tensorrt
-from tensorrt import PreviewFeature, MemoryPoolType
-
-# TRT-HuggingFace
-from NNDF.networks import NetworkMetadata
-from NNDF.logger import G_LOGGER
-
-
-class ModelFileConverter:
- """Abstract class for converting one model format to another."""
-
- def __init__(self, onnx_class, torch_class, trt_engine_class):
- self.onnx_class = onnx_class
- self.torch_class = torch_class
- self.trt_engine_class = trt_engine_class
-
- def torch_to_onnx(
- self, output_fpath: str, model: Module, network_metadata: NetworkMetadata
- ):
- """
- Converts a torch.Model into an ONNX model on disk specified at output_fpath.
-
- Arg:
- output_fpath (str): File location of the generated ONNX file.
-            model (torch.nn.Module): The loaded torch model to export.
- network_metadata (NetworkMetadata): Network metadata of the network being converted.
-
- Returns:
- ONNXModelFile: Newly generated ONNXModelFile
- """
- raise NotImplementedError(
- "Current model does not support exporting to ONNX model."
- )
-
- def onnx_to_torch(
- self, output_fpath: str, input_fpath: str, network_metadata: NetworkMetadata
- ):
- """
- Converts ONNX file into torch.Model which is written to disk.
-
- Arg:
-            output_fpath (str): File location of the generated torch model.
-            input_fpath (str): File location of the ONNX file to convert.
- network_metadata (NetworkMetadata): Network metadata of the network being converted.
-
- Returns:
- TorchModelFile: Newly generated TorchModelFile
- """
- raise NotImplementedError(
- "Current model does not support exporting to torch model."
- )
-
- def onnx_to_trt(
- self,
- output_fpath: str,
- input_fpath: str,
- network_metadata: NetworkMetadata,
- profiles: List[Profile],
- preview_features: List[PreviewFeature],
- ):
- """
- Converts ONNX file to TRT engine.
- Since TensorRT already supplies converter functions and scripts,
- a default implementation is already provided.
-
- Arg:
-            output_fpath (str): File location of the generated TRT engine.
-            input_fpath (str): File location of the ONNX file to convert.
- network_metadata (NetworkMetadata): Network metadata of the network being converted.
- profiles (List[polygraphy.backend.trt.Profile]): The optimization profiles used to build the engine.
- preview_features (List[tensorrt.PreviewFeature]): The preview features to set when building the engine.
-
- Returns:
- TRTEngineFile: Newly generated engine.
- """
- result = self.trt_engine_class(output_fpath, network_metadata)
-
- G_LOGGER.info("Using optimization profiles: {:}".format(profiles))
-
- try:
- self.trt_inference_config = CreateConfig(
- tf32=True,
- fp16=network_metadata.precision.fp16,
- memory_pool_limits = {MemoryPoolType.WORKSPACE: result.max_trt_workspace * 1024 * 1024},
- profiles=profiles,
- precision_constraints=("obey" if result.use_obey_precision_constraints() else None),
- preview_features=preview_features
- )
- except TypeError as e:
- G_LOGGER.error(f"This demo may have an outdated polygraphy. Please see requirements.txt for more details.")
- raise e
-
- if G_LOGGER.level == G_LOGGER.DEBUG:
- g_logger_verbosity = PG_LOGGER.EXTRA_VERBOSE
- elif G_LOGGER.level == G_LOGGER.INFO:
- g_logger_verbosity = PG_LOGGER.INFO
- else:
- g_logger_verbosity = PG_LOGGER.WARNING
-
- with PG_LOGGER.verbosity(g_logger_verbosity):
- network_definition = result.get_network_definition(network_from_onnx_path(input_fpath))
-
- trt_engine = engine_from_network(
- network_definition, config=self.trt_inference_config
- )
- save_engine(trt_engine, output_fpath)
-
- return result
-
-
-class NNModelFile(metaclass=ABCMeta):
- """
- Model abstraction. Allows for loading model as various formats.
- The class assumes models live on the disk in order to reduce complexity of model loading into memory.
- The class guarantees that once export functions are called, models exist on the disk for other
- code to parse or use in other libraries.
- """
-
- def __init__(
- self,
- default_converter: ModelFileConverter = None,
- network_metadata: NetworkMetadata = None,
- ):
- """
- Args:
- default_converter (ModelFileConverter): Default converter used when none is supplied to the as_*() methods.
- network_metadata (NetworkMetadata): Network metadata of the network being wrapped.
- """
- if default_converter is not None:
- self.default_converter = default_converter()
- else:
- self.default_converter = NullConverter()
-
- self.network_metadata = network_metadata
-
- def as_torch_model(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- ):
- """
- Converts the current model into a torch model which is written to disk.
- Uses the provided converter, or the default_converter if none is given.
-
- Args:
- output_fpath (str): File location of the generated torch file.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- Since torch models are stored as folders, overwriting can erase entire folders.
-
- Returns:
- TorchModelFile: Newly generated TorchModelFile
- """
- raise NotImplementedError(
- "Current model does not support exporting to pytorch model."
- )
-
- def as_onnx_model(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- ):
- """
- Converts current model into an ONNX model.
- Uses the provided converter, or the default_converter if none is given.
-
- Args:
- output_fpath (str): File location of the generated ONNX file.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- Since torch models are stored as folders, overwriting can erase entire folders.
-
- Returns:
- ONNXModelFile: Newly generated ONNXModelFile
- """
- raise NotImplementedError(
- "Current model does not support exporting to onnx model."
- )
-
- def as_trt_engine(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- profiles: List[Profile] = [],
- preview_features: List[PreviewFeature] = []
- ):
- """
- Converts the current model into a TRT engine.
- Uses the provided converter, or the default_converter if none is given.
-
- Args:
- output_fpath (str): File location of the generated TRT engine.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- profiles (List[polygraphy.backend.trt.Profile]): The optimization profiles used to build the engine.
- preview_features (List[tensorrt.PreviewFeature]): The preview features to enable when building the engine.
-
- Returns:
- TRTEngineFile: Newly generated TRTEngineFile.
- """
- raise NotImplementedError(
- "Current model does not support exporting to trt engine."
- )
-
- @abstractmethod
- def cleanup(self) -> None:
- """Cleans up any saved models or loaded models from memory."""
-
-
-class TorchModelFile(NNModelFile):
- def __init__(
- self,
- model: Union[str, Module],
- default_converter: ModelFileConverter = None,
- network_metadata: NetworkMetadata = None,
- ):
- """
- Since torch functions often allow for models to either be from disk as fpath or from a loaded object,
- we provide a similar option here. Arguments can either be a path on disk or from model itself.
-
- Args:
- model (Union[str, torch.Model]): Location of the model as fpath OR loaded torch.Model object.
- """
- super().__init__(default_converter, network_metadata)
-
- if isinstance(model, Module):
- self.is_loaded = True
- self.fpath = None
- self.model = model
- else:
- self.is_loaded = False
- self.fpath = model
- self.model = None
-
- def load_model(self) -> Module:
- """
- Loads the model from disk if it isn't already loaded.
- Does not attempt to load if given model is already loaded and instead returns original instance.
- Use as_torch_model() instead to always guarantee a new instance and location on disk.
-
- Args:
- None
-
- Returns:
- torch.Model: Loaded torch model.
- """
- if self.is_loaded:
- return self.model
-
- return load(self.fpath)
-
- def as_onnx_model(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- ):
- """
- Converts the torch model into an onnx model.
-
- Args:
- output_fpath (str): File location of the generated ONNX file.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- Since torch models are stored as folders, overwriting can erase entire folders.
- Return:
- (converter.onnx_class): Returns a converted instance of ONNXModelFile.
- """
- converter = self.default_converter if converter is None else converter()
- if not force_overwrite and os.path.exists(output_fpath):
- return converter.onnx_class(output_fpath, self.network_metadata)
-
- return converter.torch_to_onnx(
- output_fpath, self.load_model(), self.network_metadata
- )
-
- def as_torch_model(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- ):
- """
- Since the model is already a torch model, forces a save to the specified folder and returns a new TorchModelFile object from that location.
-
- Args:
- output_fpath (str): File location of the generated torch file.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- Since torch models are stored as folders, overwriting can erase entire folders.
- Return:
- (converter.torch_class): Returns a converted instance of TorchModelFile.
- """
- converter = self.default_converter if converter is None else converter()
- if not force_overwrite and os.path.exists(output_fpath):
- return converter.torch_class(output_fpath, self.network_metadata)
-
- if self.is_loaded:
- save(self.model, output_fpath)
- else:
- copytree(self.fpath, output_fpath)
-
- return converter.torch_class(output_fpath, self.network_metadata)
-
- def cleanup(self) -> None:
- if self.model:
- G_LOGGER.debug("Freeing model from memory: {}".format(self.model))
- del self.model
-
- if self.fpath:
- G_LOGGER.debug("Removing saved torch model from location: {}".format(self.fpath))
- rmtree(self.fpath)
-
-
-class ONNXModelFile(NNModelFile):
- def __init__(
- self,
- model: str,
- default_converter: ModelFileConverter = None,
- network_metadata: NetworkMetadata = None,
- ):
- """
- Keeps track of ONNX model file. Does not support loading into memory. Only reads and writes to disk.
-
- Args:
- model (str): Location of the ONNX model on disk.
- """
- super().__init__(default_converter, network_metadata)
- self.fpath = model
-
- def as_onnx_model(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- ):
- """
- Since the model is already an ONNX model, forces a save to the specified folder and returns a new ONNXModelFile object from that location.
-
- Args:
- output_fpath (str): File location of the generated ONNX file.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
-
- Return:
- (converter.onnx_class): Returns a converted instance of ONNXModelFile.
- """
- converter = self.default_converter if converter is None else converter()
- if not force_overwrite and os.path.exists(output_fpath):
- return converter.onnx_class(output_fpath, self.network_metadata)
- else:
- copytree(self.fpath, output_fpath)
-
- return converter.onnx_class(output_fpath, self.network_metadata)
-
- def as_torch_model(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- ):
- """
- Converts the ONNX model into a torch model.
-
- Args:
- output_fpath (str): File location of the generated torch file.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- Since torch models are stored as folders, overwriting can erase entire folders.
- Return:
- (converter.torch_class): Returns a converted instance of TorchModelFile.
- """
- converter = self.default_converter if converter is None else converter()
- if not force_overwrite and os.path.exists(output_fpath):
- return converter.torch_class(output_fpath, self.network_metadata)
-
- return converter.onnx_to_torch(output_fpath, self.fpath, self.network_metadata)
-
- def _cleanup_onnx_folder(self, folder_dir):
- for d in os.listdir(folder_dir):
- fpath = os.path.join(folder_dir, d)
- # Remove everything related to onnx other than engine
- if (os.path.isfile(fpath)) and (".engine" not in d):
- os.remove(fpath)
-
- def cleanup(self) -> None:
- G_LOGGER.debug("Removing saved ONNX model from location: {}".format(self.fpath))
- if (not self.network_metadata.other.kv_cache) or ("encoder" in self.fpath):
- # Clean up any onnx external files by removing integer named values and weight files
- workspace_path = os.path.split(self.fpath)[0]
- self._cleanup_onnx_folder(workspace_path)
-
- else:
- # In kv-cache mode, the decoder ONNX files are split into "kv" and "non-kv" subfolders, so clean up each one separately as a temporary WAR.
- decoder_path = os.path.split(self.fpath)[0]
- decoder_non_kv_path = os.path.join(decoder_path, "non-kv")
- decoder_kv_path = os.path.join(decoder_path, "kv")
- # Remove kv and nonkv folder correspondingly.
- self._cleanup_onnx_folder(decoder_non_kv_path)
- self._cleanup_onnx_folder(decoder_kv_path)
-
- def as_trt_engine(
- self,
- output_fpath: str,
- converter: ModelFileConverter = None,
- force_overwrite: bool = False,
- profiles = [],
- preview_features = []
- ):
- """
- Converts the ONNX model into a TRT engine.
-
- Args:
- output_fpath (str): File location of the generated TRT engine.
- converter (ModelFileConverter): Class to convert current model instance into another.
- force_overwrite (bool): If the file already exists, tell whether or not to overwrite.
- profiles (List[polygraphy.backend.trt.Profile]): The optimization profiles used to build the engine.
- preview_features (List[tensorrt.PreviewFeature]): The preview features to set when building the engine.
- Return:
- (converter.trt_engine_class): Returns a converted instance of TRTEngineFile.
- """
- converter = self.default_converter if converter is None else converter()
-
- # TODO: Need to check if the old engine file is compatible with current setting
- if not force_overwrite and os.path.exists(output_fpath):
- return converter.trt_engine_class(output_fpath, self.network_metadata)
-
- return converter.onnx_to_trt(
- output_fpath,
- self.fpath,
- self.network_metadata,
- profiles,
- preview_features
- )
-
-
-class TRTEngineFile(NNModelFile):
-
- @abstractmethod
- def use_obey_precision_constraints(self):
- pass
-
- # get_network_definition can be overloaded to alter the network definition.
- # For example, this function can be used to change the precisions of ops or
- # data type of intermediate tensors.
- def get_network_definition(self, network_definition):
- return network_definition
-
- def __init__(
- self,
- model: str,
- default_converter: ModelFileConverter = None,
- network_metadata: NetworkMetadata = None,
- ):
- super().__init__(default_converter, network_metadata)
- self.fpath = model
- self.max_trt_workspace = 3072
-
- def cleanup(self) -> None:
- G_LOGGER.debug("Removing saved engine model from location: {}".format(self.fpath))
- os.remove(self.fpath)
-
-
-class NullConverter(ModelFileConverter):
- def __init__(self):
- super().__init__(ONNXModelFile, TorchModelFile, TRTEngineFile)
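For orientation, here is a minimal sketch of how these abstractions chain together, assuming the NNDF package is importable. The converter and engine-file classes and all paths below are illustrative only; the real converters live in the per-model `export.py` files.

```python
from NNDF.models import ModelFileConverter, ONNXModelFile, TorchModelFile, TRTEngineFile
from NNDF.networks import NetworkMetadata, Precision

class MyEngineFile(TRTEngineFile):
    # Hypothetical engine-file class; real ones may also override get_network_definition().
    def use_obey_precision_constraints(self):
        return False

class MyConverter(ModelFileConverter):
    # ONNX -> TRT reuses the default onnx_to_trt() implemented in ModelFileConverter.
    def __init__(self):
        super().__init__(ONNXModelFile, TorchModelFile, MyEngineFile)

# `other` would normally carry model-specific flags such as kv_cache.
metadata = NetworkMetadata(variant="t5-small", precision=Precision(fp16=True), other=None)

# Wrap an ONNX file that already exists on disk and build an engine from it.
onnx_model = ONNXModelFile("workspace/model.onnx", MyConverter, metadata)
engine = onnx_model.as_trt_engine("workspace/model.onnx.engine", profiles=[], preview_features=[])
```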
diff --git a/demo/HuggingFace/NNDF/networks.py b/demo/HuggingFace/NNDF/networks.py
deleted file mode 100644
index ff8700fc..00000000
--- a/demo/HuggingFace/NNDF/networks.py
+++ /dev/null
@@ -1,225 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Helpers for abstracting high-level network concepts. Different from 'models.py' which deals
-with IO abstraction.
-"""
-
-import string
-
-from typing import Dict, Union, Tuple
-from collections import namedtuple, OrderedDict
-
-# externals
-# None. Should not have any external dependencies.
-
-FILENAME_VALID_CHARS = "-~_.() {}{}".format(string.ascii_letters, string.digits)
-
-"""NetworkResult(input: str, output_tensor: np.array, semantic_output: np.array, median_runtime: NetworkRuntime, models: [str])"""
-NetworkResult = namedtuple(
- "NetworkResult",
- ["input", "output_tensor", "semantic_output", "median_runtime", "models"],
-)
-
-"""BenchmarkingResult(median_runtime: NetworkRuntime, models: [str])"""
-BenchmarkingResult = namedtuple(
- "BenchmarkingResult",
- ["median_runtime", "models"],
-)
-
-"""CheckpointResult(network_results: List[NetworkResult], accuracy: float, perplexity: float)"""
-NetworkCheckpointResult = namedtuple(
- "NetworkCheckpointResult", ["network_results", "accuracy", "perplexity"]
-)
-
-# Tracks TRT Precision Config
-"""Precision(fp16: Bool)"""
-Precision = namedtuple("Precision", ["fp16"])
-
-"""NetworkMetadata(variant: str, precision: Precision, other: Union[namedtuple, None])"""
-NetworkMetadata = namedtuple("NetworkMetadata", ["variant", "precision", "other"])
-
-"""TimingProfile(iterations: int, number: int, warmup: int, duration: int, percentile: int or [int])"""
-TimingProfile = namedtuple("TimingProfile", ["iterations", "number", "warmup", "duration", "percentile"])
-
-
-"""NetworkModel(name: str, fpath: str)"""
-NetworkModel = namedtuple("NetworkModel", ["name", "fpath"])
-
-"""
-String encodings of generated network models.
- NetworkModels(torch: Tuple[NetworkModel], onnx: Tuple[NetworkModel])
-"""
-NetworkModels = namedtuple("NetworkModels", ["torch", "onnx", "trt"])
-
-"""
-Args:
- name: Name of the network / parts of the network timed.
- runtime: Measured runtime of the named segment.
-
-NetworkRuntime(name: str, runtime: float)
-"""
-NetworkRuntime = namedtuple("NetworkRuntime", ["name", "runtime"])
-
-class Dims:
- """Helper class for interfacing dimension constructs with Polygraphy and PyTorch."""
-
- BATCH = "batch"
- SEQUENCE = "sequence"
-
- def __init__(self, encoding: OrderedDict):
- self.encoding = encoding
-
- @staticmethod
- def create_new_sequence_dim(dim_type: str) -> str:
- """
- Returns a new sequence dimension.
-
- Return:
- str: Returns a sequence dimension which Dims.SEQUENCE appended by dim_type.
- """
- return Dims.SEQUENCE + "_" + dim_type
-
- def get_dims(self):
- """
- Returns the encoding dimensions.
-
- Return:
- OrderedDict[str, Union[int, str]]: Returns dimensional encoding. Example: {'input_ids': (1, SEQUENCE_DIM)}
- """
- return self.encoding
-
- def get_names(self) -> Tuple[str]:
- return tuple(self.encoding.keys())
-
- def get_lengths(self) -> Tuple[Union[int, str]]:
- return tuple(self.encoding.values())
-
- def get_torch_dynamic_axis_encoding(self) -> dict:
- """
- Returns a Pytorch "dynamic_axes" encoding for onnx.export.
-
- Returns:
- dict: Returns a 'dynamic' index with corresponding names according to:
- https://pytorch.org/docs/stable/onnx.html
- """
-
- dynamic_axes = {}
- for k, v in self.encoding.items():
- encodings = []
- for idx, e in enumerate(v):
- if isinstance(e, str) and (e == self.BATCH or self.SEQUENCE in e):
- encodings.append((idx, e))
- dynamic_axes[k] = {idx: e for idx, e in encodings}
-
- return dynamic_axes
-
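A small worked example of how a `Dims` encoding maps onto the `dynamic_axes` argument of `torch.onnx.export`; the values in the comments are what the calls above produce.

```python
from collections import OrderedDict

# One input tensor with a dynamic batch and a dynamic sequence dimension.
dims = Dims(OrderedDict({"input_ids": (Dims.BATCH, Dims.SEQUENCE)}))

print(dims.get_names())                        # ('input_ids',)
print(dims.get_torch_dynamic_axis_encoding())  # {'input_ids': {0: 'batch', 1: 'sequence'}}
```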
-# Config Class
-class NNConfig:
- """Contains info for a given network that we support."""
-
- NETWORK_SEGMENTS = ["full"]
-
- def __init__(self, network_name, variants=None):
- assert self._is_valid_filename(
- network_name
- ), "Network name: {} is not filename friendly.".format(network_name)
-
- self.network_name = network_name
- self.variants = variants
-
- # Due to limitations of namedtuples and the pickle module, the namedtuple class must be tracked
- # as an instance attribute which refers to a global.
- if len(self.variants) > 0:
- self.MetadataClass = type(self.variants[0].other)
- else:
- self.MetadataClass = None
-
- def get_network_segments(self):
- """
- Returns exportable segments for the given network.
- Used in the case where a single network needs to
- be exported into multiple parts.
- """
- return self.NETWORK_SEGMENTS
-
- @staticmethod
- def get_output_dims(metadata) -> Dict:
- """
- Returns the output dimensions of the current network.
- Since some networks can have multiple parts, should be a dictionary encoding.
-
- Returns:
- (Dict): {"network_section": Dims}
- """
- raise NotImplementedError("Output dims not yet defined.")
-
- @staticmethod
- def get_input_dims(metadata) -> Dict:
- """
- Returns the input dimensions of the current network.
- Since some networks can have multiple parts, should be a dictionary encoding.
-
- Returns:
- (Dict): {"network_section": Dims} example:
- {"encoder": Dims(...), "decoder": Dims(...)}
- """
- raise NotImplementedError("Input dims not yet defined.")
-
- def _is_valid_filename(self, filename: str) -> bool:
- """
- Checks if a given filename is valid, helpful for cross platform dependencies.
- """
- return all(c in FILENAME_VALID_CHARS for c in filename)
-
- def get_python_requirements(self):
- return []
-
- def get_metadata_string(self, metadata: NetworkMetadata) -> str:
- """
- Serializes a NetworkMetadata object into a string.
- The string is checked to be filename friendly across Windows and Linux operating systems.
-
- Returns:
- str: Serialized metadata with components joined by "-".
- """
-
- precision_str = "-".join(
- [k for k, v in metadata.precision._asdict().items() if v]
- )
- result = [self.network_name, metadata.variant]
- if precision_str:
- result.append(precision_str)
-
- other_result = [
- "{}~{}".format(k, str(v)) for k, v in metadata.other._asdict().items()
- ]
- # Remove all boolean values that are False and remove True if exists
- true_length = len("~True")
- other_result_filtered = [v[:-true_length] if v.endswith("~True") else v for v in other_result if "~False" not in v]
-
- if len(other_result_filtered) != 0:
- result.append("-".join(other_result_filtered))
-
- final_str = "-".join(result)
- assert self._is_valid_filename(
- final_str
- ), "Metadata for current network {} is not filename friendly: {}.".format(
- self.network_name, final_str
- )
-
- return final_str
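As a concrete illustration of the serialization above, a metadata tuple with FP16 enabled and a kv-cache flag yields a filename-friendly string. The `KV` payload here is an illustrative stand-in; real models define their own `other` namedtuple.

```python
from collections import namedtuple

KV = namedtuple("KV", ["kv_cache"])  # illustrative model-specific payload

metadata = NetworkMetadata(variant="t5-small", precision=Precision(fp16=True), other=KV(kv_cache=True))
config = NNConfig("T5", variants=[metadata])

print(config.get_metadata_string(metadata))  # "T5-t5-small-fp16-kv_cache"
```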
diff --git a/demo/HuggingFace/NNDF/tensorrt_utils.py b/demo/HuggingFace/NNDF/tensorrt_utils.py
deleted file mode 100644
index 74226ae8..00000000
--- a/demo/HuggingFace/NNDF/tensorrt_utils.py
+++ /dev/null
@@ -1,316 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""Utilities related to Polygraphy"""
-
-from typing import Dict, List
-from functools import reduce
-from enum import Enum
-
-# polygraphy
-from polygraphy.backend.trt import engine_from_bytes, TrtRunner
-from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
-from polygraphy.backend.common import bytes_from_path
-from polygraphy.logger import G_LOGGER as PG_LOGGER
-
-# tensorrt
-import tensorrt as trt
-import os
-
-# ONNX
-import onnx
-import onnx_graphsurgeon as gs
-
-# numpy
-import numpy as np
-
-# NNDF
-from NNDF.networks import NetworkMetadata
-from NNDF.models import TRTEngineFile
-from NNDF.logger import G_LOGGER
-
-# PyTorch
-import torch
-
-# Helper Functions
-def setup_benchmark_arg(user_input, name, default):
- '''
- Set up benchmarking arguments for trt
- '''
- if user_input is None:
- G_LOGGER.warning("{} is not provided, default to {}".format(name, default))
- return default
- return user_input
-
-def allocate_binding_buffer(types_dict, shapes_dict):
- '''
- Allocate binding buffers for trt based on provided types and shapes dict
- '''
- return {
- k: torch.zeros(reduce(lambda v, a: v*a, shape), dtype=types_dict[k]).cuda()
- for k, shape in shapes_dict.items()
- }
-
-
-def set_kv_data(kv_dict, past_or_present, layer_id, segment_value_dict):
- '''
- Set the types and shapes dict for kv-cache based on the provided inputs:
- kv_dict: Dict[str, tuple/torch.dtype], the dict to modify within the function
- past_or_present: str, either "past" or "present"
- layer_id: int, need kv cache for each decoder layer
- segment_value_dict: Dict[str, tuple/torch.dtype], example:
- kvcache type: {"encoder": torch.float32, "decoder": torch.float32}
- kvcache shape: {"encoder": cross_attention_kv_shape, "decoder": self_attention_kv_shape}
- '''
- for segment, value in segment_value_dict.items():
- for code in ['key', 'value']:
- kv_dict[f"{past_or_present}_key_values.{layer_id}.{segment}.{code}"] = value
-
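A short sketch of how these two helpers combine to allocate kv-cache device buffers. The layer count, shapes, and dtypes below are illustrative, and a CUDA device is required because the buffers are allocated on the GPU.

```python
import torch

shapes, dtypes = {}, {}
for layer_id in range(2):  # a hypothetical 2-layer decoder
    # (batch, num_heads, past_length, embedding_size_per_head)
    set_kv_data(shapes, "past", layer_id, {"decoder": (1, 8, 4, 64), "encoder": (1, 8, 4, 64)})
    set_kv_data(dtypes, "past", layer_id, {"decoder": torch.float16, "encoder": torch.float16})

buffers = allocate_binding_buffer(dtypes, shapes)
print(sorted(buffers)[:2])  # ['past_key_values.0.decoder.key', 'past_key_values.0.decoder.value']
```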
-def clamp_weights_onnx(graph, min: float, max: float, ignore_nodes: List = None):
- """
- Clamps given onnx model to targeted upper and lower bounds.
- """
-
- if ignore_nodes is None:
- ignore_nodes = {}
- else:
- ignore_nodes = {k: True for k in ignore_nodes}
-
- for tensor in graph.tensors().values():
- if tensor.name in ignore_nodes or isinstance(tensor, gs.ir.tensor.Variable):
- continue
-
- np.clip(tensor.values, min, max, out=tensor.values)
-
- for tensor in graph.nodes:
- node_attr = tensor.attrs.get("value", None)
- if tensor.name in ignore_nodes:
- continue
-
- if node_attr is not None:
- np.clip(node_attr.values, min, max, out=node_attr.values)
-
- return graph
-
-
-def clamp_weights_onnx_to_fp16_bounds(graph, ignore_nodes: List = None):
- upper_bound = 65504
- return clamp_weights_onnx(graph, -upper_bound, upper_bound, ignore_nodes)
-
-
-def move_t5_cast_op(graph):
- """
- T5 encoder and decoder have cast ops after residual add operation.
- Moving the cast operation before add helps with FP16 accuracy as addition operation
- can cause overflow in FP16.
- """
-
- cast_nodes = [node for node in graph.nodes if node.op == "Cast"]
- # Version check for backward compatibility
- torch_version_major = int(torch.__version__.split('.')[0])
- torch_version_minor = int(torch.__version__.split('.')[1])
- version_check = torch_version_major == 1 and torch_version_minor > 12
- for n in cast_nodes:
- # Cast appears at the output of add and feeds into a Pow op.
- if n.i().op == "Add":
- found_pow = False
- for o in n.outputs:
- for o1 in o.outputs:
- if o1.op == "Pow":
- found_pow = True
-
- if found_pow:
- if version_check:
- # Using Clip would be the simplest way, but unfortunately TRT refuses to put "Clip" on Myelin. The WAR
- # is to insert a Max followed by a Min instead.
- # Replace the Cast with Max + Min
- n.op = "Max"
- n.name = n.name.replace("Cast", "Max")
- n.attrs = {}
- lower_bound = gs.Constant(n.name + "/lower_bound", np.array(-64000.0, dtype=np.float32))
- n.inputs = [n.inputs[0], lower_bound]
-
- max_node_output = n.outputs[0]
- # Max has already exist, avoid tensors with same names
- max_node_output.name = max_node_output.name.replace("Cast", "ClipMax")
-
- upper_bound = gs.Constant(n.name + "/upper_bound", np.array(64000.0, dtype=np.float32))
- min_node_inputs = [max_node_output, upper_bound]
-
- min_node_output = gs.Variable(max_node_output.name.replace("ClipMax", "ClipMin"), dtype = np.float32)
- min_node = gs.Node(op="Min", inputs = min_node_inputs, outputs = [min_node_output], attrs = {})
- graph.nodes.append(min_node)
-
- for o in max_node_output.outputs:
- # To avoid loop in graph
- if o.op != "Min":
- o.inputs = [min_node_output if i == max_node_output else i for i in o.inputs]
- else:
- n.i().outputs = n.outputs
- n.outputs.clear()
-
- graph.cleanup().toposort()
-
- add_nodes = [node for node in graph.nodes if node.op == "Add"]
- for n in add_nodes:
- if (version_check and (n.o().o().o().op == "Pow")) or ((not version_check) and (n.o().op == "Pow")):
- add_inputs = n.inputs
- outs = []
- for i in add_inputs:
- identity_out = gs.Variable("identity_out" + i.name, dtype=np.float32)
- new_cast = gs.Node(op="Cast", inputs=[i], outputs=[identity_out], attrs={"to": 1})
- outs.append(identity_out)
- graph.nodes.append(new_cast)
- n.inputs = outs
-
- graph.cleanup().toposort()
- return graph
-
-# These operations are applied together in a single pass, since running them separately would require loading/unloading the ONNX files twice.
-class OnnxProcessOperation(Enum):
- CLAMP_WEIGHTS = 1
- MOVE_CAST_OP = 2
-
-def process_onnx(config: List[OnnxProcessOperation], onnx_input_fpath, onnx_output_fpath, keep_input = False, **kwargs):
- graph = gs.import_onnx(onnx.load(onnx_input_fpath))
- folder = os.path.split(onnx_input_fpath)[0]
- for op in config:
- if op == OnnxProcessOperation.CLAMP_WEIGHTS:
- graph = clamp_weights_onnx_to_fp16_bounds(graph, **kwargs)
- elif op == OnnxProcessOperation.MOVE_CAST_OP:
- graph = move_t5_cast_op(graph)
-
- model = gs.export_onnx(graph)
- model_size = 0
- for filename in os.listdir(folder):
- file_path = os.path.join(folder, filename)
- try:
- if os.path.isfile(file_path) or os.path.islink(file_path):
- model_size += os.stat(file_path).st_size
- if not keep_input:
- os.unlink(file_path)
-
- except Exception as e:
- print('Failed to delete %s. Reason: %s' % (file_path, e))
-
- # Save the weights as external data only when model > 2GB
- if model_size >= 1.8 * 1024 * 1024 * 1024:
- onnx.save_model(model, onnx_output_fpath, save_as_external_data=True, all_tensors_to_one_file = False, convert_attribute=False)
- else:
- onnx.save_model(model, onnx_output_fpath, save_as_external_data=False)
-
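An illustrative invocation of the pass above, clamping FP16-overflowing weights and moving the T5 residual `Cast` ops in one load/save cycle; the paths are placeholders.

```python
process_onnx(
    [OnnxProcessOperation.MOVE_CAST_OP, OnnxProcessOperation.CLAMP_WEIGHTS],
    "workspace/t5_decoder.onnx",        # input ONNX file (removed afterwards unless keep_input=True)
    "workspace/t5_decoder_fp16.onnx",   # output ONNX file
    keep_input=False,
)
```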
-# Helper Classes
-class TRTNativeRunner:
- """TRTNativeRunner avoids the high overheads with Polygraphy runner providing performance comparable to C++ implementation."""
- def __init__(self, trt_engine_file: TRTEngineFile, network_metadata: NetworkMetadata):
- self.network_metadata = network_metadata
- self.trt_engine_file = trt_engine_file
- self.trt_logger = trt.Logger()
-
- if G_LOGGER.level == G_LOGGER.DEBUG:
- self.trt_logger.min_severity = trt.Logger.VERBOSE
- elif G_LOGGER.level == G_LOGGER.INFO:
- self.trt_logger.min_severity = trt.Logger.INFO
- else:
- self.trt_logger.min_severity = trt.Logger.WARNING
-
- G_LOGGER.info("Reading and loading engine file {} using trt native runner.".format(self.trt_engine_file.fpath))
- with open(self.trt_engine_file.fpath, "rb") as f:
- self.trt_runtime = trt.Runtime(self.trt_logger)
- self.trt_engine = self.trt_runtime.deserialize_cuda_engine(f.read())
- self.trt_context = self.trt_engine.create_execution_context()
-
- # By default set optimization profile to 0
- self.profile_idx = 0
-
- # Other metadata required by the profile
- self._num_bindings_per_profile = self.trt_engine.num_bindings // self.trt_engine.num_optimization_profiles
- G_LOGGER.debug("Number of profiles detected in engine: {}".format(self._num_bindings_per_profile))
-
- def release(self):
- pass
-
- def get_optimization_profile(self, batch_size, sequence_length):
- """Provided helper function to obtain a profile optimization."""
- # Select an optimization profile
- # inspired by demo/BERT/inference.py script
- selected_profile_idx = None
- for idx in range(self.trt_engine.num_optimization_profiles):
- profile_shape = self.trt_engine.get_profile_shape(profile_index=idx, binding=idx * self._num_bindings_per_profile)
-
- if profile_shape[0][0] <= batch_size and profile_shape[2][0] >= batch_size \
- and profile_shape[0][1] <= sequence_length and profile_shape[2][1] >= sequence_length:
- G_LOGGER.debug("Selected profile: {}".format(profile_shape))
- selected_profile_idx = idx
- break
-
- if selected_profile_idx is None:
- raise RuntimeError("Could not find any profile that matches batch_size={}, sequence_length={}".format(batch_size, sequence_length))
-
- return selected_profile_idx
-
- def __call__(self, *args, **kwargs):
- self.trt_context.active_optimization_profile = self.profile_idx
- return self.forward(*args, **kwargs)
-
-class PolygraphyOnnxRunner:
- def __init__(self, onnx_fpath: str, network_metadata: NetworkMetadata):
- self.network_metadata = network_metadata
- self.trt_session = SessionFromOnnx(onnx_fpath)
- self.trt_context = OnnxrtRunner(self.trt_session)
- self.trt_context.activate()
-
- def __call__(self, *args, **kwargs):
- # hook polygraphy verbosity for inference
- g_logger_verbosity = (
- G_LOGGER.EXTRA_VERBOSE
- if G_LOGGER.root.level == G_LOGGER.DEBUG
- else G_LOGGER.WARNING
- )
- with PG_LOGGER.verbosity(g_logger_verbosity):
- return self.forward(*args, **kwargs)
-
- def release(self):
- self.trt_context.deactivate()
-
-class TRTPolygraphyRunner:
- """
- TRT network interface, implemented with Polygraphy, that can be used to measure inference time.
- Easier to use but slower; TRTNativeRunner is recommended for better performance.
- """
-
- def __init__(self, engine_fpath: str, network_metadata: NetworkMetadata):
- self.network_metadata = network_metadata
-
- self.trt_engine = engine_from_bytes(bytes_from_path(engine_fpath))
- self.trt_context = TrtRunner(self.trt_engine.create_execution_context())
- self.trt_context.activate()
-
- def __call__(self, *args, **kwargs):
- # hook polygraphy verbosity for inference
- g_logger_verbosity = (
- G_LOGGER.EXTRA_VERBOSE
- if G_LOGGER.root.level == G_LOGGER.DEBUG
- else G_LOGGER.WARNING
- )
-
- with PG_LOGGER.verbosity(g_logger_verbosity):
- return self.forward(*args, **kwargs)
-
- def release(self):
- self.trt_context.deactivate()
diff --git a/demo/HuggingFace/NNDF/torch_utils.py b/demo/HuggingFace/NNDF/torch_utils.py
deleted file mode 100644
index f3b2fadc..00000000
--- a/demo/HuggingFace/NNDF/torch_utils.py
+++ /dev/null
@@ -1,96 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""Torch utils used by demo folder."""
-
-import inspect
-from typing import Callable
-
-# pytorch
-import torch
-
-# NNDF
-from NNDF.logger import G_LOGGER
-
-# Function Decorators #
-def use_cuda(func: Callable):
- """
- Tries to send all parameters of a given function to the CUDA device if the system supports it.
- Each object must provide a "to(device: str)" method that maps it to the target device "cuda";
- this relies on the standard torch implementation.
-
- Wrapped functions must have a keyword argument "use_cuda: bool" which enables
- or disables the CUDA toggling.
- """
-
- def _send_args_to_device(caller_kwargs, device):
- new_kwargs = {}
- for k, v in caller_kwargs.items():
- if getattr(v, "to", False):
- new_kwargs[k] = v.to(device)
- else:
- new_kwargs[k] = v
- return new_kwargs
-
- def wrapper(*args, **kwargs):
- caller_kwargs = inspect.getcallargs(func, *args, **kwargs)
- assert (
- "use_cuda" in caller_kwargs
- ), "Function must have 'use_cuda' as a parameter."
-
- if caller_kwargs["use_cuda"]:
- new_kwargs = {}
- used_cuda = False
- if torch.cuda.is_available() and caller_kwargs["use_cuda"]:
- new_kwargs = _send_args_to_device(caller_kwargs, "cuda")
- used_cuda = True
- else:
- new_kwargs = _send_args_to_device(caller_kwargs, "cpu")
-
- try:
- return func(**new_kwargs)
- except RuntimeError as e:
- # If a device has cuda installed but no compatible kernels, cuda.is_available() will still return True.
- # This exception is necessary to catch remaining incompat errors.
- if used_cuda:
- G_LOGGER.warning("Unable to execute program using cuda compatible device: {}".format(e))
- G_LOGGER.warning("Retrying using CPU only.")
- new_kwargs = _send_args_to_device(caller_kwargs, "cpu")
- new_kwargs["use_cuda"] = False
- cpu_result = func(**new_kwargs)
- G_LOGGER.warning("Successfully obtained result using CPU.")
- return cpu_result
- else:
- raise e
- else:
- return func(**caller_kwargs)
-
- return wrapper
-
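A minimal usage sketch of the decorator, assuming a hypothetical `scale` function; tensor arguments are moved to the GPU when one is usable and fall back to the CPU otherwise.

```python
import torch

@use_cuda
def scale(t: torch.Tensor, factor: float = 2.0, use_cuda: bool = True):
    # `t` has already been moved to the selected device by the decorator.
    return t * factor

result = scale(torch.ones(3), use_cuda=True)
```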
-def expand_inputs_for_beam_search(
- tensor,
- expand_size: int = 1,
-):
- """
- Interleaves the input tensor `expand_size` times along the batch dimension, similar to HuggingFace's _expand_inputs_for_generation() in generation_utils.py
- """
- expanded_return_idx = (
- torch.arange(tensor.shape[0]).view(-1, 1).repeat(1, expand_size).view(-1)
- )
- tensor = tensor.index_select(0, expanded_return_idx.to(tensor.device))
-
- return tensor
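For example, interleaving a batch of two sequences for a beam width of 3 repeats each row three times along the batch dimension.

```python
import torch

ids = torch.tensor([[11, 12], [21, 22]])  # batch of 2 sequences
expanded = expand_inputs_for_beam_search(ids, expand_size=3)
print(expanded.shape)  # torch.Size([6, 2]); rows 0-2 repeat the first sequence
```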
diff --git a/demo/HuggingFace/README.md b/demo/HuggingFace/README.md
deleted file mode 100644
index cbc8e9c2..00000000
--- a/demo/HuggingFace/README.md
+++ /dev/null
@@ -1,195 +0,0 @@
-# TensorRT Inference for HuggingFace Transformers 🤗
-
-This repository demonstrates TensorRT inference with models developed using [HuggingFace Transformers](https://huggingface.co/transformers/).
-
-Currently, this repository supports the following models:
-
-1. [GPT2 (text generation task)](https://huggingface.co/transformers/model_doc/gpt2.html). The sample supports following variants of GPT2:
-
- gpt2 (117M), gpt2-medium (345M), gpt2-large (774M), gpt2-xl (1558M), EleutherAI/gpt-j-6B (6053M)
-
-2. [T5 (translation, premise task)](https://huggingface.co/transformers/model_doc/t5.html). The sample supports following variants of T5:
-
- t5-small (60M), t5-base (220M), t5-large (770M), t5-3b(3B), t5-11b(11B)
-
-3. [BART (summarization task)](https://huggingface.co/docs/transformers/model_doc/bart.html). The sample supports the following variants of BART:
-
- facebook/bart-base (139M), facebook/bart-large (406M), facebook/bart-large-cnn (406M), facebook/mbart-large-50 (680M)
-
-## Setup
-
-
-Follow the setup steps in the TensorRT OSS repository. It is recommended to experiment inside a Docker container.
-For the smoothest setup experience, use [Poetry](https://python-poetry.org/) to install the requirements and execute:
-
-```bash
-poetry install # one-time setup
-poetry add # see top level repo README.md on how to get TensorRT wheels.
-poetry run python run.py # execute program
-```
-
-However, a requirements.txt is also provided:
-
-```bash
-pip3 install -r requirements.txt # install requirements
-python run.py # execute program
-```
-
-**Please note that Python <= 3.6 has reached end-of-life and is no longer supported.**
-
-## File Structure
-
-```bash
-.
-├── GPT2 # GPT2 directory
-│ └── ...
-├── T5 # T5 directory
-│ └── ...
-├── BART # BART directory
-│ ├── BartModelConfig.py # Model configuration and variant-specific parameters
-│ ├── checkpoint.toml # Example inputs and baseline outputs
-│ ├── export.py # Model conversions between Torch, TRT, ONNX
-│ ├── frameworks.py # PyTorch inference script
-│ ├── onnxrt.py # OnnxRT inference script
-│ ├── trt.py # TensorRT inference script
-│ ├── hf.py # HuggingFace inference script
-│ └── measurements.py # Performance measurement script
-├── NNDF # common high-level abstraction of classes and utilities
-├── notebooks # Jupyter notebooks for GPT2 and T5
-└── run.py # main entry script
-```
-
-## How to run comparison script
-
-`run.py` is the main entry point for the demos. `compare` and `run` are the two most common actions to use with `run.py`.
-
-The `compare` action will by default compare all implemented frameworks, e.g., PyTorch framework & TRT (for GPT2), and PyTorch framework & TRT & OnnxRT (for T5 and BART).
-
-```bash
-python3 run.py compare GPT2 --variant [gpt2 | gpt2-medium | gpt2-large | gpt2-xl | EleutherAI/gpt-j-6B] --working-dir temp
-```
-
-The above script compares the performance of PyTorch framework inference and TensorRT inference for GPT2:
-
-| script | accuracy | decoder (sec) | encoder (sec) | full (sec) |
-|------------|----------|---------------|---------------|------------|
-| frameworks | 1 | 0.0292865 | 0.0174382 | 0.122532 |
-| trt | 1 | 0.00494083 | 0.0068982 | 0.0239782 |
-
-Notes: `--variant` designates the pre-trained model to test. `--working-dir` specifies where downloaded pre-trained models, ONNX model files, and TRT engine files are saved. An accuracy of 1.0 indicates results consistent with the expected outputs in `checkpoint.toml`. By default, all reported running times are medians over 10 iterations.
-
-## How to run functional and performance benchmark
-
-The `run` action will run the specific script under the model directory.
-
-```bash
-python3 run.py run GPT2 [frameworks | trt] --variant [gpt2 | gpt2-medium | gpt2-large | gpt2-xl | EleutherAI/gpt-j-6B] --working-dir temp
-```
-
-Expected output:
-
-```properties
-NetworkCheckpointResult(network_results=[NetworkResult(
-input='TensorRT is a Deep Learning compiler used for deep learning.\n',
-output_tensor=tensor([ 51, 22854, ....], device='cuda:0'),
-semantic_output=['TensorRT is a Deep Learning compiler used for deep learning.\n\nThe main goal of the project is to create a tool that can be used to train deep learning algorithms.\n\n'],
-median_runtime=[NetworkRuntime(name='gpt2_decoder', runtime=0.002254825085401535), NetworkRuntime(name='full', runtime=0.10705459117889404)],
-models=NetworkModels(torch=None, onnx=[NetworkModel(name='gpt2_decoder', fpath='temp/GPT2/GPT2-gpt2-fp16.onnx')],
-trt=[NetworkModel(name='gpt2_decoder', fpath='temp/GPT2/GPT2-gpt2-fp16.onnx.engine')]))], accuracy=1.0)
-```
-
-## How to run with different precisions in TensorRT
-
-Frameworks (PyTorch) by default run TF32 on Ampere devices and fall back to FP32 on pre-Ampere devices. Accordingly, TF32 is also the default precision for TensorRT runs. To experiment with different precisions, use `--fp16` for FP16:
-
-```bash
-python3 run.py run BART trt --variant facebook/bart-base --working-dir temp [--fp16]
-```
-
-## How to customize parameters for time measurement
-Use `--iterations`, `--number`, `--warmup`, `--duration`, and `--percentile` to control the time measurement process. The most common parameters are explained below:
-* `--iterations`: number of iterations to measure (default 10)
-* `--warmup`: number of warmup iterations before the actual measurement (default 3)
-* `--percentile`: percentile to report (default 50, i.e. the median).
-
-```bash
-python3 run.py run BART trt --variant facebook/bart-base --working-dir temp --iterations 100 --percentile 99
-```
-
-Notes:
-* Percentile numbers are representative only if the number of iterations is sufficiently large. Please consider increasing `--iterations` when combined with `--percentile`.
-* To avoid conflict with the overall result printing structure, only one percentile number is allowed from the command line. If you need multiple timing statistics from one run (such as p50, p90, p99), either (1) run the command multiple times while changing `--percentile` -- engines are not re-built from run to run, so this is still efficient -- or (2) use the [Jupyter notebook demo](./notebooks) for more flexible measurement that can report all percentiles in one run.
-
-## How to run with K-V cache
-
-For all the models (GPT2/BART/T5), use the `--enable-kv-cache` option to get the same effect as HuggingFace's `use_cache` option. For encoder-decoder models, this option will use the key & value cache in the decoder for uni-directional self-attention and encoder-decoder cross-attention. The KV cache can reduce the size of `input_ids` and improve runtime performance when `input_ids` is long. Current benchmarking results show that at `input_seq_len = 1024` and `output_seq_len = 1024`, the t5-large model with kv cache can be about 3x faster than without kv cache on a single NVIDIA A100 GPU.
-
-```bash
-python3 run.py run BART [frameworks | trt] --variant facebook/bart-base --working-dir temp --enable-kv-cache
-```
-
-Notes:
-* For T5, the code has been optimized according to the latest TensorRT features. (1) Cross-attention kv does not change throughout the decoding session, so it is only calculated once at the first decoding step. `onnx.export` cannot handle this logic properly for HuggingFace, so we create a "cross-attention kv generator" that uses only `encoder_hidden_states`. (2) TensorRT's "zero tensor" feature is used so the self-attention kv cache can grow from empty. (3) Self-attention input and output share the same memory location to avoid a D2D copy for the kv cache. A similar optimization will be ported to BART.
-
-* For BART, we will be porting a similar optimization from T5, but currently the K-V cache decoder with TensorRT requires exporting 2 ONNX files and building separate engines, called "non-kv" and "kv" respectively. For the first decoder run, the KV cache is generated from only `input_ids` and `encoder_hidden_states` (if encoder-decoder), which is named "non-kv". For the other decoder iterations, the previous KV cache and other inputs are passed into the model to generate the updated KV cache and decoder_hidden_states, which is named "kv". Because the current ONNX export cannot handle a dynamic number of inputs, 2 ONNX files with slightly different configurations are used together.
-
-* For GPT2, since it is decoder only, only the self-attention kv cache is needed, and it has 2 modes, corresponding to 2 optimization profiles for a single TensorRT engine: context mode, which takes in `input_ids` of variable length only and outputs `hidden_states` and the self-attention cache; and generation mode, which takes in `input_ids` with seq_len = 1 and the entire self-attention kv cache, and outputs `hidden_states` with seq_len = 1 and the kv cache grown to cum_seq_len (`past_decoder_length`) + 1. A memory concurrency issue prevents the self-attention input and output from pointing to the same memory location, so it requires a dual cache.
-
-## How to run with beam search
-
-In addition to greedy search, beam search is another widely used decoding method. For all the models, use `--num-beams` to enable beam search during decoding.
-
-```bash
-python3 run.py run BART [frameworks | trt] --variant facebook/bart-base --working-dir temp --num-beams 3
-```
-
-Notes:
-* K-V cache with beam search has memory concurrency issues with TensorRT optimization. We are currently working on this issue.
-
-
-## How to run without the TensorRT `FASTER_DYNAMIC_SHAPES_0805` preview feature
-
-`FASTER_DYNAMIC_SHAPES_0805` significantly improves TensorRT engine build time and is enabled by default in TRT 8.6+. In rare cases the runtime may increase, so the `--disable-preview-dynamic-shapes` option is provided to disable this preview feature for BART, GPT2, and T5:
-
-```bash
-python3 run.py run BART trt --variant facebook/bart-base --working-dir temp --disable-preview-dynamic-shapes
-```
-
-Notes:
-* The preview argument applies only to TensorRT runs. Hence, avoid using the `compare` action with `--disable-preview-dynamic-shapes`, since the flag does not exist for `frameworks` and `onnxrt` runs. Instead, run the TensorRT `run` command separately to obtain the performance without this preview feature.
-
-## How to run in performance benchmarking mode
-
-The `benchmark` action will benchmark the specific script under the model directory using random input data with specified input/output sequence lengths. Note that since the input data is random, the accuracy is not guaranteed, but the benchmarking mode is useful for performance measurement since it allows arbitrary and controllable input/output sequence lengths with early stopping being disabled and allows apples-to-apples performance comparisons across different frameworks.
-
-```bash
-python3 run.py benchmark GPT2 [frameworks | trt] --variant [gpt2 | gpt2-medium | gpt2-large | gpt2-xl | EleutherAI/gpt-j-6B] --working-dir temp --input-seq-len 128 --output-seq-len 256
-```
-
-## Testing
-
-```bash
-pytest
-```
-
-It is recommended to use pytest `4.6.x`. Your Python environment must already have the setup completed.
-
-
-## Troubleshooting
-
-### cuBLAS Errors
-
-```
-CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
-```
-
-It is possible that your LD_LIBRARY_PATH has a competing CUDA version stored inside, causing PyTorch to read the incorrect library.
-Consider modifying LD_LIBRARY_PATH and removing your CUDA path.
diff --git a/demo/HuggingFace/T5/T5ModelConfig.py b/demo/HuggingFace/T5/T5ModelConfig.py
deleted file mode 100644
index 5490fb4b..00000000
--- a/demo/HuggingFace/T5/T5ModelConfig.py
+++ /dev/null
@@ -1,293 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import argparse
-
-from collections import namedtuple, OrderedDict
-from itertools import product
-from typing import Dict
-
-# TRT-HuggingFace
-from NNDF.networks import Precision, NetworkMetadata, NNConfig, Dims
-from NNDF.interface import MetadataArgparseInteropMixin
-
-# A limitation of namedtuples: they must be declared at module scope and not inside classes,
-# otherwise pickle does not work.
-# See: https://stackoverflow.com/questions/4677012/python-cant-pickle-type-x-attribute-lookup-failed
-_T5Metadata = namedtuple("T5Metadata", ["kv_cache"])
-
-
-class T5Metadata(_T5Metadata, MetadataArgparseInteropMixin):
- @staticmethod
- def add_args(parser: argparse.ArgumentParser) -> None:
- """Add commandline interface parser."""
- network_group = parser.add_argument_group("T5 network")
- network_group.add_argument(
- "--variant",
- help="T5 variant to generate",
- choices=T5ModelTRTConfig.TARGET_MODELS,
- required=True,
- )
- network_group.add_argument(
- "--enable-kv-cache",
- help="T5 enable KV cache",
- action="store_true",
- default=False,
- )
- network_group.add_argument(
- "--num-beams", type=int, default=1, help="Enables beam search during decoding."
- )
-
- network_group.add_argument(
- "--fp16", action="store_true", help="Enables fp16 TensorRT tactics."
- )
-
- @staticmethod
- def from_args(args: argparse.Namespace):
- return NetworkMetadata(
- variant=args.variant,
- precision=Precision(fp16=args.fp16),
- other=T5Metadata(kv_cache=args.enable_kv_cache),
- )
-
- @staticmethod
- def add_benchmarking_args(parser: argparse.ArgumentParser) -> None:
- benchmarking_group = parser.add_argument_group("benchmarking group")
- benchmarking_group.add_argument(
- "--input-seq-len",
- type=int,
- help="Specify fixed input sequence length for perf benchmarking. Required for benchmark except when both input_profile_max and output_profile_max are provided for trt",
- )
- benchmarking_group.add_argument(
- "--output-seq-len",
- type=int,
- help="Specify fixed output sequence length for perf benchmarking. Required for benchmark except when both input_profile_max and output_profile_max are provided for trt",
- )
-
-T5BenchmarkingArgs = namedtuple("T5BenchmarkingArgs", ["input_seq_len", "output_seq_len"])
-
-# trt has more benchmarking arguments
-T5TRTBenchmarkingArgs = namedtuple("T5TRTBenchmarkingArgs", ["input_seq_len", "output_seq_len", "input_profile_max_len", "output_profile_max_len"])
-
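A small sketch of how the mixin above turns command-line flags into a `NetworkMetadata` instance; the argument values are illustrative.

```python
import argparse

parser = argparse.ArgumentParser()
T5Metadata.add_args(parser)
args = parser.parse_args(["--variant", "t5-small", "--fp16", "--enable-kv-cache"])

metadata = T5Metadata.from_args(args)
# NetworkMetadata(variant='t5-small', precision=Precision(fp16=True), other=T5Metadata(kv_cache=True))
```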
-class T5ModelTRTConfig(NNConfig):
-
- TARGET_MODELS = ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]
-
- # TensorRT maximum workspace size for each model variant. Set by TensorRT memory_pool_limits API
- MAX_ENCODER_WORKSPACE_MB = {
- TARGET_MODELS[0]: 512,
- TARGET_MODELS[1]: 1024,
- TARGET_MODELS[2]: 2048,
- TARGET_MODELS[3]: 3072,
- TARGET_MODELS[4]: 4096,
- }
-
- MAX_DECODER_WORKSPACE_MB = {
- TARGET_MODELS[0]: 1024,
- TARGET_MODELS[1]: 2048,
- TARGET_MODELS[2]: 3072,
- TARGET_MODELS[3]: 4096,
- TARGET_MODELS[4]: 5120,
- }
-
- MAX_SEQUENCE_LENGTH = {
- TARGET_MODELS[0]: 512,
- TARGET_MODELS[1]: 768,
- TARGET_MODELS[2]: 1024,
- TARGET_MODELS[3]: 1024,
- TARGET_MODELS[4]: 1024,
- }
-
- # To achieve identical results with original HuggingFace implementation, the min_length in model config should be consistent with each model variant
- # see task-specific params in config.json of each variant model
- MIN_OUTPUT_LENGTH = {
- TARGET_MODELS[0]: 0,
- TARGET_MODELS[1]: 0,
- TARGET_MODELS[2]: 0,
- TARGET_MODELS[3]: 0,
- TARGET_MODELS[4]: 0,
- }
-
- #TODO: this might better be an inference time input like the `max_length` arg in generate() and greedy_search(). The change needed is in NNDF/interface.py:__call__ so it's a fundamental change affecting GPT2 and T5 code. Here I just put this option in T5 model config for now. But it's also reasonable to treat this as a model config, because the TRT engine building may need this to have fixed dimension (e.g., to enable KV-cache)
- # see task-specific params in config.json of each variant model
- MAX_OUTPUT_LENGTH = {
- TARGET_MODELS[0]: 512,
- TARGET_MODELS[1]: 768,
- TARGET_MODELS[2]: 1024,
- TARGET_MODELS[3]: 1024,
- TARGET_MODELS[4]: 1024,
- }
-
- # This parameter should be using HuggingFace config, but this file is locked by test and cannot import transformers, so hardcoded here
- NUM_DECODER_LAYERS = {
- TARGET_MODELS[0]: 6,
- TARGET_MODELS[1]: 12,
- TARGET_MODELS[2]: 24,
- TARGET_MODELS[3]: 24,
- TARGET_MODELS[4]: 24,
- }
- NETWORK_FULL_NAME = "full"
- NETWORK_DECODER_SEGMENT_NAME = "decoder"
- NETWORK_ENCODER_SEGMENT_NAME = "encoder"
- NETWORK_SEGMENTS = [NETWORK_DECODER_SEGMENT_NAME, NETWORK_ENCODER_SEGMENT_NAME]
-
- def __init__(self):
- precision_fp16 = [False, True]
- kv_caches = [False, True]
-
- variants = []
- for variant, fp16, kv_cache in product(
- T5ModelTRTConfig.TARGET_MODELS, precision_fp16, kv_caches
- ):
- variants.append(
- NetworkMetadata(
- variant=variant,
- precision=Precision(fp16=fp16),
- other=T5Metadata(kv_cache=kv_cache),
- )
- )
-
- super().__init__("T5", variants=variants)
-
- def get_python_requirements(self):
- base_requirements = super().get_python_requirements()
- base_requirements.append("transformers==4.8.0")
- return base_requirements
-
- def get_network_segments(self):
- """
- Returns exportable segments for the given network.
- Used in the case where a single network needs to
- be exported into multiple parts.
- """
- return T5ModelTRTConfig.NETWORK_SEGMENTS
-
- def get_metadata_string(self, metadata: NetworkMetadata) -> str:
- # Remove redundant t5 name
- metadata = metadata._replace(variant=metadata.variant.lstrip("t5-"))
- return super().get_metadata_string(metadata)
-
- @staticmethod
- def get_input_dims(metadata) -> Dict:
- """
- Returns dictionary encoding of input dimensions.
- Keys will be equal to get_model_segments()
-
- Returns:
- (Dict[str, Dims]): {"decoder": Dims, "encoder": Dims}
- """
- if metadata.other.kv_cache:
- decoder_inputs_dict = OrderedDict(
- {
- "input_ids": (Dims.BATCH, 1),
- "encoder_hidden_states": (
- Dims.BATCH,
- Dims.create_new_sequence_dim("encoder_hidden_length"),
- "encoder_hidden_size"
- ),
- }
- )
- context_inputs_dict = OrderedDict(
- {"encoder_hidden_states": (
- Dims.BATCH,
- Dims.create_new_sequence_dim("encoder_hidden_length"),
- "encoder_hidden_size"
- ),
- }
- )
- # for KV cache version, we need add per-layer KV cache inputs. `past_key_values` at each layer is (self-attention K, self-attention V, cross-attention K, cross-attention V)
- for i in range(T5ModelTRTConfig.NUM_DECODER_LAYERS[metadata.variant]):
- # decoder self-attention KV cache (dim[0] & dim[2] are dynamic, and dim[2] varies at each decoding timestep)
- self_attention_past_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("past_decoder_length"), "embedding_size_per_head")
- decoder_inputs_dict[f"past_key_values.{i}.decoder.key"] = self_attention_past_kv_dims
- decoder_inputs_dict[f"past_key_values.{i}.decoder.value"] = self_attention_past_kv_dims
-
- # encoder-decoder cross-attention KV cache (dim[0] & dim[2] are dynamic, but dim[2] is constant at each decoding timestep)
- cross_attention_past_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("encoder_length"), "embedding_size_per_head")
- decoder_inputs_dict[f"past_key_values.{i}.encoder.key"] = cross_attention_past_kv_dims
- decoder_inputs_dict[f"past_key_values.{i}.encoder.value"] = cross_attention_past_kv_dims
-
- decoder_inputs = [Dims(context_inputs_dict), Dims(decoder_inputs_dict)]
- else:
- decoder_inputs_dict = OrderedDict(
- {
- "input_ids": (Dims.BATCH, Dims.SEQUENCE),
- "encoder_hidden_states": (
- Dims.BATCH,
- Dims.create_new_sequence_dim("encoder_hidden_length"),
- "encoder_hidden_size"
- ),
- }
- )
- decoder_inputs = Dims(decoder_inputs_dict)
-
- encoder_inputs = Dims(OrderedDict({"input_ids": (Dims.BATCH, Dims.SEQUENCE)}))
-
- return {
- T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME: decoder_inputs,
- T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME: encoder_inputs,
- }
-
- @staticmethod
- def get_output_dims(metadata) -> Dict:
- """
- Returns dictionary encoding of output dimensions.
- Keys will be equal to get_model_segments()
-
- Returns:
- (Dict[str, Dims]): {"decoder": Dims, "encoder": Dims}
- """
- if metadata.other.kv_cache:
- decoder_outputs_dict = OrderedDict(
- {"hidden_states": (Dims.BATCH, 1)}
- )
- context_outputs_dict = OrderedDict({})
- # for KV cache version, we need add per-layer KV cache inputs. `past_key_values` at each layer is (self-attention K, self-attention V, cross-attention K, cross-attention V)
- for i in range(T5ModelTRTConfig.NUM_DECODER_LAYERS[metadata.variant]):
- # decoder self-attention KV cache (dim[0] & dim[2] are dynamic, and dim[2] varies at each decoding timestep)
- self_attention_present_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("past_decoder_length"), "embedding_size_per_head")
- decoder_outputs_dict[f"present_key_values.{i}.decoder.key"] = self_attention_present_kv_dims
- decoder_outputs_dict[f"present_key_values.{i}.decoder.value"] = self_attention_present_kv_dims
-
- # encoder-decoder cross-attention KV cache (dim[0] & dim[2] are dynamic, but dim[2] is constant at each decoding timestep)
- cross_attention_present_kv_dims = (Dims.BATCH, "num_heads", Dims.create_new_sequence_dim("encoder_length"), "embedding_size_per_head")
- context_outputs_dict[f"present_key_values.{i}.encoder.key"] = cross_attention_present_kv_dims
- context_outputs_dict[f"present_key_values.{i}.encoder.value"] = cross_attention_present_kv_dims
-
- decoder_outputs = [Dims(context_outputs_dict), Dims(decoder_outputs_dict)]
- else:
- decoder_outputs_dict = OrderedDict(
- {"hidden_states": (Dims.BATCH, Dims.SEQUENCE)}
- )
- decoder_outputs = Dims(decoder_outputs_dict)
-
- encoder_outputs = Dims(
- OrderedDict(
- {
- "hidden_states": (
- Dims.BATCH,
- Dims.SEQUENCE,
- "encoder_hidden_size"
- )
- }
- )
- )
-
- return {
- T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME: decoder_outputs,
- T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME: encoder_outputs,
- }
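
For reference, the per-layer KV cache naming convention built by `get_input_dims()`/`get_output_dims()` above can be illustrated in isolation. The sketch below assumes a hypothetical 2-layer decoder and substitutes plain strings for the `Dims` helpers; it is not part of the deleted demo code.

```python
# Illustration only: assumes a hypothetical 2-layer decoder and plain strings
# instead of the Dims helpers used by get_input_dims()/get_output_dims().
from collections import OrderedDict

num_layers = 2
kv_inputs = OrderedDict()
for i in range(num_layers):
    # Decoder self-attention cache: dim 0 (batch) and dim 2 (past length) are dynamic.
    self_attn_dims = ("batch", "num_heads", "past_decoder_length", "embedding_size_per_head")
    kv_inputs[f"past_key_values.{i}.decoder.key"] = self_attn_dims
    kv_inputs[f"past_key_values.{i}.decoder.value"] = self_attn_dims
    # Encoder-decoder cross-attention cache: dim 2 is the (constant) encoder length.
    cross_attn_dims = ("batch", "num_heads", "encoder_length", "embedding_size_per_head")
    kv_inputs[f"past_key_values.{i}.encoder.key"] = cross_attn_dims
    kv_inputs[f"past_key_values.{i}.encoder.value"] = cross_attn_dims

for name, dims in kv_inputs.items():
    print(name, dims)
```
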
diff --git a/demo/HuggingFace/T5/checkpoint.toml b/demo/HuggingFace/T5/checkpoint.toml
deleted file mode 100644
index 4dd7b134..00000000
--- a/demo/HuggingFace/T5/checkpoint.toml
+++ /dev/null
@@ -1,47 +0,0 @@
-# Default requirements
-[T5.all.default.all.premise_a]
-
-input = '''
-premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.
-'''
-
-label = "entailment"
-
-
-[T5.all.default.all.translate_a]
-
-input = '''
-translate English to German: That is good.
-'''
-
-label = "Das ist gut."
-
-[T5.all.default.all.cola_a]
-
-input = '''
-cola sentence: All your base are belong to us.
-'''
-
-label = "unacceptable"
-
-[T5.all.default.all.premise_b]
-
-input = '''
-premise: If I fall asleep then I am going to wake up in 8 hours. hypothesis: I fell asleep but did not wake up in 8 hours.
-'''
-
-label = "contradiction"
-
-# t5-small gets some results differently
-[T5.all.t5-small.all.premise_a]
-
-label = "contradiction"
-
-[T5.all.t5-small.all.cola_a]
-
-label = "acceptable"
-
-# t5-base also gets results differently
-[T5.all.t5-base.all.translate_a]
-
-label = "Das ist gut so."
diff --git a/demo/HuggingFace/T5/export.py b/demo/HuggingFace/T5/export.py
deleted file mode 100644
index 63b7a73e..00000000
--- a/demo/HuggingFace/T5/export.py
+++ /dev/null
@@ -1,483 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Contains logic that captures T5 HuggingFace models into ONNX models.
-Inspired by https://github.com/onnx/models/blob/master/text/machine_comprehension/t5/dependencies/T5-export.py
-"""
-
-from typing import List
-
-from json import encoder
-import os
-from collections import OrderedDict
-
-# tensorrt
-import tensorrt as trt
-from tensorrt import PreviewFeature
-
-# polygraphy
-from polygraphy.backend.trt import Profile
-
-# torch
-import torch
-from torch.nn import Module
-
-# huggingface
-from transformers.generation_utils import GenerationMixin
-from transformers.modeling_outputs import Seq2SeqLMOutput
-from transformers import T5ForConditionalGeneration
-
-# TRT-HuggingFace
-from T5.T5ModelConfig import T5ModelTRTConfig
-from NNDF.tensorrt_utils import OnnxProcessOperation, process_onnx
-from NNDF.networks import NetworkMetadata, Precision, Dims
-from NNDF.logger import G_LOGGER
-from NNDF.models import (
- TRTEngineFile,
- TorchModelFile,
- ONNXModelFile,
- ModelFileConverter,
-)
-
-def add_extra_fp32(network_definition):
- """
- Force operations involved in layer norm to run in FP32 precision.
- """
- pow_ops = {}
- for layer_index, layer in enumerate(network_definition[1]):
- if layer.type == trt.LayerType.IDENTITY:
- all_fp32 = all([layer.output_type_is_set(o) and layer.get_output_type(o) == trt.float32 for o in range(layer.num_outputs)])
- if all_fp32:
- if layer.get_input(0).dtype == trt.float32:
- layer.precision = trt.float32
-
- if layer.type == trt.LayerType.ELEMENTWISE:
- layer.__class__ = getattr(trt, "IElementWiseLayer")
- if layer.op == trt.ElementWiseOperation.POW:
- pow_ops[layer] = layer_index
- layer.precision = trt.float32
- layer.set_output_type(0, trt.float32)
-
- for _, index in pow_ops.items():
- # Iterate from few layers before pow to include residual add and cast op.
- # Iterate till 10 layers after pow op to include all operations included in layer norm.
- START_OFFSET = 4
- END_OFFSET = 12
- for i in range(index-START_OFFSET, index+END_OFFSET):
- l = network_definition[1].get_layer(i)
- if l.type == trt.LayerType.REDUCE:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.ELEMENTWISE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.SUM:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.UNARY:
- l.__class__ = getattr(trt, "IUnaryLayer")
- if l.op == trt.UnaryOperation.SQRT:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.ELEMENTWISE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.DIV:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- if l.type == trt.LayerType.ELEMENTWISE:
- l.__class__ = getattr(trt, "IElementWiseLayer")
- if l.op == trt.ElementWiseOperation.PROD:
- l.precision = trt.float32
- l.set_output_type(0, trt.float32)
-
- return network_definition
-
-# Torch File Encoding #
-class T5DecoderTorchFile(TorchModelFile):
- class TorchModule(Module, GenerationMixin):
- """
-        A simplified definition of the T5 decoder without support for loss.
- Decoder with lm-head attached.
- """
-
- def __init__(self, decoder, lm_head, config, is_trt = False):
- super().__init__()
- self.decoder = decoder
- self.lm_head = lm_head
- self.config = config
-            # HuggingFace's beam search requires self.device to be set. Set it to avoid an application crash.
- self.device = torch.device('cuda')
- # Use hardcoded value to extend compatibility with older HF versions.
- self.main_input_name = "input_ids"
-            # TRT uses cached, precomputed cross-attention, whereas the framework outputs the entire KV cache, so the two need to be treated differently.
- self.is_trt = is_trt
-
- def prepare_inputs_for_generation(
- self,
- input_ids,
- past=None,
- use_cache=None,
- **kwargs
- ):
- # cut decoder_input_ids if past is used
- if past is not None:
- input_ids = input_ids[:, -1:]
-
- return {
- "input_ids": input_ids,
- "encoder_hidden_states": kwargs["encoder_outputs"].last_hidden_state,
- "use_cache": use_cache,
- "past_key_values": past
- }
-
- def forward(
- self,
- input_ids,
- encoder_hidden_states,
- use_cache = None,
- past_key_values = None,
- return_dict = None,
- **kwargs,
- ):
- # self.decoder is the HuggingFace t5 decoder
- decoder_outputs = self.decoder(
- input_ids=input_ids,
- encoder_hidden_states=encoder_hidden_states,
- use_cache=use_cache,
- past_key_values=past_key_values,
- return_dict=return_dict,
- **kwargs
- )
-
- # self.config.d_model ** -0.5 for rescaling output on vocab.
- # as seen in https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5ForConditionalGeneration
- sequence_output = decoder_outputs[0] * self.config.d_model ** -0.5
- logits = self.lm_head(sequence_output)
- if use_cache:
- if self.is_trt:
- past_key_values = ()
- past_key_values_output = decoder_outputs[1]
- for layer_past_states in past_key_values_output:
- past_key_values = past_key_values + (layer_past_states[:2],)
- else:
- past_key_values = decoder_outputs[1]
-
- if not return_dict:
- return (logits, past_key_values)
-
- return Seq2SeqLMOutput(
- logits=logits,
- past_key_values=past_key_values
- )
-
- def __init__(self, model, network_metadata):
- super().__init__(model, T5DecoderConverter, network_metadata)
-
-class T5DecoderCrossAttentionKVGenerator(Module):
- def __init__(self, decoder, device = "cpu"):
- super().__init__()
- self.decoder = decoder
- self.device = device
-
- def forward(self, encoder_hidden_states):
- '''
-        Use the same (but simplified) logic as HF modeling_t5.py to generate the cross-attention KV cache from the provided encoder_hidden_states
- '''
- present_key_values = ()
- for layer_module in self.decoder.block:
-            # hidden_states and position_bias are required for the forward call but irrelevant to the cross-attention KV cache calculation, so generate dummy variables
- dummy_hidden_states = torch.zeros(1,1).to(self.device)
- dummy_position_bias = torch.zeros(1, layer_module.layer[1].EncDecAttention.n_heads, 1, encoder_hidden_states.shape[1]).to(self.device)
- cross_attention_outputs = layer_module.layer[1](
- hidden_states=dummy_hidden_states,
- key_value_states=encoder_hidden_states,
- use_cache=True,
- past_key_value=None,
- position_bias=dummy_position_bias
- )
- present_key_values = present_key_values + cross_attention_outputs[1]
-
- return present_key_values
-
- def __call__(self, *args, **kwargs):
- return self.forward(*args, **kwargs)
-
-class T5EncoderTorchFile(TorchModelFile):
- """Creation of a class to output only the last hidden state from the encoder."""
-
- class TorchModule(Module, GenerationMixin):
- def __init__(self, encoder):
- super().__init__()
- self.encoder = encoder
- # Use hardcoded value to extend compatibility with older HF versions.
- self.main_input_name = "input_ids"
-
- def forward(self, *input, **kwargs):
- return self.encoder(*input, **kwargs)[0]
-
- def __call__(self, *args, **kwargs):
- return self.forward(*args, **kwargs)
-
- def __init__(self, model, network_metadata):
- super().__init__(model, T5EncoderConverter, network_metadata)
-
-
-# ONNX File Encoding #
-class T5EncoderONNXFile(ONNXModelFile):
- def __init__(self, model, network_metadata):
- super().__init__(model, T5EncoderConverter, network_metadata)
-
-
-class T5DecoderONNXFile(ONNXModelFile):
- def __init__(self, model, network_metadata):
- super().__init__(model, T5DecoderConverter, network_metadata)
-
-
-# TRT Engine File Encoding #
-class T5DecoderTRTEngine(TRTEngineFile):
-
- def __init__(self, model, network_metadata):
- super().__init__(model, T5DecoderConverter, network_metadata)
- self.max_trt_workspace = T5ModelTRTConfig.MAX_DECODER_WORKSPACE_MB[network_metadata.variant]
-
-
- def get_network_definition(self, network_definition):
- if self.network_metadata.precision.fp16:
- for i in range(network_definition[1].num_inputs):
- t = network_definition[1].get_input(i)
- if t.dtype == trt.float32:
- t.dtype = trt.float16
-
- for i in range(network_definition[1].num_outputs):
- t = network_definition[1].get_output(i)
- if t.dtype == trt.float32:
- t.dtype = trt.float16
-
- return add_extra_fp32(network_definition)
-
- def use_obey_precision_constraints(self):
- return self.network_metadata.precision.fp16
-
-
-class T5EncoderTRTEngine(TRTEngineFile):
-
- def __init__(self, model, network_metadata):
- super().__init__(model, T5EncoderConverter, network_metadata)
- self.max_trt_workspace = T5ModelTRTConfig.MAX_ENCODER_WORKSPACE_MB[network_metadata.variant]
-
- def get_network_definition(self, network_definition):
- return add_extra_fp32(network_definition)
-
- def use_obey_precision_constraints(self):
- return self.network_metadata.precision.fp16
-
-# Converters #
-class T5DecoderConverter(ModelFileConverter):
- def __init__(self):
- super().__init__(T5DecoderTorchFile, T5DecoderONNXFile, T5DecoderTRTEngine)
-
- def torch_to_onnx(
- self, output_fpath: str, model: Module, network_metadata: NetworkMetadata
- ):
- """
- Exports a given huggingface T5 to decoder architecture only.
- Inspired by https://github.com/onnx/models/blob/master/text/machine_comprehension/t5/dependencies/T5-export.py
-
- Args:
-            output_fpath (str): Path to the ONNX file
- model (torch.Model): Model loaded torch class
-
- Returns:
- T5DecoderONNXFile: ONNX decoder object.
- """
- # TODO: CPU and GPU PyTorch models may use different operations and might perform differently.
- # Adding a device parameter to the class may help
- device = model.device
- input_ids = torch.tensor([[42] * 10]).to(device)
- # Exporting the decoder requires a basic instance of the encoder
- # Create one temporarily
- simplified_encoder = T5EncoderTorchFile.TorchModule(model.encoder)
- # Exports to ONNX
- decoder_with_lm_head = T5DecoderTorchFile.TorchModule(
- model.decoder, model.lm_head, model.config, is_trt = True
- )
-
- inputs = T5ModelTRTConfig.get_input_dims(network_metadata)["decoder"]
- outputs = T5ModelTRTConfig.get_output_dims(network_metadata)["decoder"]
-
- # Exports to ONNX
- opt_args={}
-
- version_major = int((torch.__version__).split('.')[0])
- version_minor = int((torch.__version__).split('.')[1])
- if version_major < 1 or (version_major == 1 and version_minor < 11):
- opt_args['use_external_data_format'] = True
-
- if not network_metadata.other.kv_cache:
-            # This code allows the HuggingFace-compatible torch class to be used with the ONNX exporter
- old_forward = decoder_with_lm_head.forward
- def _export_forward(input_ids, encoder_hidden_states, **kwargs):
- result = old_forward(input_ids, encoder_hidden_states, use_cache=False, **kwargs)
- return result[0]
- decoder_with_lm_head.forward = _export_forward
-
- torch.onnx.export(
- decoder_with_lm_head,
- (input_ids, simplified_encoder(input_ids)),
- output_fpath,
- do_constant_folding=True,
- opset_version=13,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
- else:
- encoder_hidden_states = simplified_encoder(input_ids).to(device)
- kv_decoder_input_ids = input_ids[:,-1:].to(device)
- decoder_output = decoder_with_lm_head.decoder(input_ids=kv_decoder_input_ids, encoder_hidden_states=encoder_hidden_states, use_cache=True, past_key_values=None) # decoder output at t-1 step (logits, past_key_values from 0 to t-1)
- past_key_values = decoder_output[1]
-            # This code allows the HuggingFace-compatible torch class to be used with the ONNX exporter (change made just before onnx.export)
- old_forward = decoder_with_lm_head.forward
- def _export_forward(input_ids, encoder_hidden_states, past_key_values):
- result = old_forward(input_ids, encoder_hidden_states, past_key_values=past_key_values, use_cache=True)
- return result
- decoder_with_lm_head.forward = _export_forward
-
- torch.onnx.export(
- decoder_with_lm_head,
- (kv_decoder_input_ids, encoder_hidden_states, past_key_values),
- output_fpath,
- do_constant_folding=True,
- opset_version=13,
- input_names=inputs[1].get_names(),
- output_names=outputs[1].get_names(),
- dynamic_axes={
- **inputs[1].get_torch_dynamic_axis_encoding(),
- **outputs[1].get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- cross_attention_kv_generator = T5DecoderCrossAttentionKVGenerator(decoder_with_lm_head.decoder, device)
- decoder_folder, decoder_name = os.path.split(output_fpath)
- decoder_name, decoder_ext = os.path.splitext(decoder_name)
- output_fpath_kv_generator_folder = os.path.join(decoder_folder, "cross_attention_kv_generator")
- os.makedirs(output_fpath_kv_generator_folder, exist_ok = True)
- output_fpath_kv_generator = os.path.join(output_fpath_kv_generator_folder, decoder_name + "-cross_attention_kv_generator" + decoder_ext)
- torch.onnx.export(
- cross_attention_kv_generator,
- (encoder_hidden_states),
- output_fpath_kv_generator,
- do_constant_folding=True,
- opset_version=13,
- input_names=inputs[0].get_names(),
- output_names=outputs[0].get_names(),
- dynamic_axes={
- **inputs[0].get_torch_dynamic_axis_encoding(),
- **outputs[0].get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- if network_metadata.precision.fp16:
- process_onnx([OnnxProcessOperation.CLAMP_WEIGHTS], output_fpath_kv_generator, output_fpath_kv_generator)
-
- if network_metadata.precision.fp16:
- process_onnx([OnnxProcessOperation.MOVE_CAST_OP, OnnxProcessOperation.CLAMP_WEIGHTS], output_fpath, output_fpath)
-
- return T5DecoderONNXFile(output_fpath, network_metadata)
-
-
-class T5EncoderConverter(ModelFileConverter):
- def __init__(self):
- super().__init__(T5EncoderTorchFile, T5EncoderONNXFile, T5EncoderTRTEngine)
-
- def onnx_to_trt(
- self, output_fpath: str, input_fpath: str, network_metadata: NetworkMetadata, profiles: List[Profile], preview_features: List[PreviewFeature]
- ):
- """
- Override onnx_to_trt function from base.
-        Workaround: models larger than t5-small are too large and cause FP16 to overflow, so the encoder should not use FP16 tactics even in FP16 mode.
-        End-to-end perf decreases by less than 10%, and the speedup with TRT over frameworks remains substantial.
- """
- # Force encoder to FP32 only if variants are anything larger than small
- # because of overflow and underflow issues
- if network_metadata.precision.fp16 and network_metadata.variant != "t5-small":
- network_metadata_cp_dct = network_metadata._asdict()
- del network_metadata_cp_dct["precision"]
- network_metadata = NetworkMetadata(**network_metadata_cp_dct, precision=Precision(fp16=False))
-
- return super().onnx_to_trt(output_fpath, input_fpath, network_metadata, profiles, preview_features)
-
- def torch_to_onnx(
- self, output_fpath: str, model: Module, network_metadata: NetworkMetadata
- ):
- """
- Exports a given huggingface T5 to encoder architecture only.
- Inspired by https://github.com/onnx/models/blob/master/text/machine_comprehension/t5/dependencies/T5-export.py
-
- Args:
-            output_fpath (str): Path to the ONNX file
- model (torch.Model): Model loaded torch class
-
- Returns:
-            T5EncoderONNXFile: ONNX encoder object.
- """
- device = model.device
- input_ids = torch.tensor([[42] * 10]).to(device)
- simplified_encoder = T5EncoderTorchFile.TorchModule(model.encoder)
- inputs = T5ModelTRTConfig.get_input_dims(network_metadata)["encoder"]
- outputs = T5ModelTRTConfig.get_output_dims(network_metadata)["encoder"]
-
- # Exports to ONNX
- opt_args={}
-
- version_major = int((torch.__version__).split('.')[0])
- version_minor = int((torch.__version__).split('.')[1])
- if version_major < 1 or (version_major == 1 and version_minor < 11):
- opt_args['use_external_data_format'] = True
- torch.onnx.export(
- simplified_encoder,
- input_ids,
- output_fpath,
- do_constant_folding=True,
- opset_version=13,
- input_names=inputs.get_names(),
- output_names=outputs.get_names(),
- dynamic_axes={
- **inputs.get_torch_dynamic_axis_encoding(),
- **outputs.get_torch_dynamic_axis_encoding(),
- },
- training=torch.onnx.TrainingMode.EVAL,
- **opt_args
- )
-
- if network_metadata.precision.fp16:
- process_onnx([OnnxProcessOperation.MOVE_CAST_OP, OnnxProcessOperation.CLAMP_WEIGHTS], output_fpath, output_fpath)
-
- return T5EncoderONNXFile(output_fpath, network_metadata)
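
The exporters above share a common pattern: gate `use_external_data_format` on the installed torch version and declare dynamic axes when calling `torch.onnx.export`. Below is a minimal, self-contained sketch of that pattern; the toy `Linear` model, output file name, and tensor names are illustrative stand-ins, not part of the demo.

```python
# Minimal sketch (assumed toy model and names): version-gated use_external_data_format
# plus dynamic axes, mirroring the export pattern used above.
import torch
from torch.nn import Linear

model = Linear(16, 4).eval()       # stand-in for the real T5 wrapper modules
dummy_input = torch.randn(2, 16)   # (batch, features)

opt_args = {}
version_major, version_minor = (int(v) for v in torch.__version__.split(".")[:2])
if version_major < 1 or (version_major == 1 and version_minor < 11):
    # Older torch versions need this flag set explicitly for large models.
    opt_args["use_external_data_format"] = True

torch.onnx.export(
    model,
    (dummy_input,),
    "toy_model.onnx",
    do_constant_folding=True,
    opset_version=13,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    training=torch.onnx.TrainingMode.EVAL,
    **opt_args,
)
```
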
diff --git a/demo/HuggingFace/T5/frameworks.py b/demo/HuggingFace/T5/frameworks.py
deleted file mode 100644
index 2f06128d..00000000
--- a/demo/HuggingFace/T5/frameworks.py
+++ /dev/null
@@ -1,340 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import os
-import sys
-
-from typing import List, Union
-
-# huggingface
-from transformers import (
- T5ForConditionalGeneration,
- T5Tokenizer,
- T5Config,
-)
-
-# torch
-import torch
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# TRT-HuggingFace
-from NNDF.interface import FrameworkCommand
-from NNDF.torch_utils import expand_inputs_for_beam_search
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkResult,
- NetworkMetadata,
- NetworkRuntime,
- NetworkModels,
- NetworkModel,
- TimingProfile,
-)
-from T5.export import T5EncoderTorchFile, T5DecoderTorchFile
-from T5.T5ModelConfig import T5ModelTRTConfig, T5BenchmarkingArgs
-from T5.measurements import decoder_inference, encoder_inference, full_inference, calculate_perplexity
-from NNDF.general_utils import confirm_folder_delete, NNFolderWorkspace
-
-
-class T5FHuggingFace(FrameworkCommand):
- def __init__(self):
- super().__init__(
- T5ModelTRTConfig, description="Runs framework results for T5 model."
- )
-
- self.onnx_t5_encoder = None
- self.onnx_t5_decoder = None
- self.torch_t5_dir = None
-
- def generate_and_download_framework(
- self, metadata: NetworkMetadata, workspace: NNFolderWorkspace
- ) -> NetworkModels:
-
- trt_t5_config = self.config
- metadata_serialized = trt_t5_config.get_metadata_string(metadata)
- workspace_dir, encoder_onnx_root, decoder_onnx_root = workspace.set_model_path(metadata_serialized, is_encoder_decoder = True)
- pytorch_model_dir = os.path.join(workspace_dir, "pytorch_model")
- # We keep track of the generated torch location for cleanup later
- self.torch_t5_dir = pytorch_model_dir
-
- model = None
- if not os.path.exists(pytorch_model_dir):
- # Generate the pre-trained weights
- model = T5ForConditionalGeneration.from_pretrained(
- metadata.variant, use_cache = metadata.other.kv_cache
- )
- model.save_pretrained(pytorch_model_dir)
- print("Pytorch Model saved to {}".format(pytorch_model_dir))
- else:
- print(
- "Frameworks file already exists, skipping generation and loading from file instead."
- )
- model = T5ForConditionalGeneration.from_pretrained(
- pytorch_model_dir,
- use_cache = metadata.other.kv_cache
- )
-
- # These ONNX models can be converted using special encoder and decoder classes.
- encoder_onnx_model_fpath = os.path.join(encoder_onnx_root, metadata_serialized + "-encoder.onnx")
- decoder_onnx_model_fpath = os.path.join(decoder_onnx_root, metadata_serialized + "-decoder-with-lm-head.onnx")
-
- t5_encoder = T5EncoderTorchFile(model, metadata)
- t5_decoder = T5DecoderTorchFile(model, metadata)
- self.onnx_t5_encoder = t5_encoder.as_onnx_model(
- encoder_onnx_model_fpath, force_overwrite=False
- )
- self.onnx_t5_decoder = t5_decoder.as_onnx_model(
- decoder_onnx_model_fpath, force_overwrite=False
- )
-
- onnx_models = [
- NetworkModel(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=self.onnx_t5_decoder.fpath,
- ),
- NetworkModel(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=self.onnx_t5_encoder.fpath,
- ),
- ]
- torch_models = [
- NetworkModel(
- name=T5ModelTRTConfig.NETWORK_FULL_NAME, fpath=pytorch_model_dir
- )
- ]
-
- return NetworkModels(torch=torch_models, onnx=onnx_models, trt=None)
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_onnx_model: bool = True,
- keep_pytorch_model: bool = True,
- ) -> None:
- """
- Cleans up the working directory and leaves models if available.
-        Should not assume any functions from the framework class have been called.
- Return:
- None
- """
- # Clean-up generated files
- if not keep_onnx_model:
- if self.onnx_t5_decoder is not None:
- self.onnx_t5_decoder.cleanup()
- if self.onnx_t5_encoder is not None:
- self.onnx_t5_encoder.cleanup()
-
- if not keep_pytorch_model:
- # Using rmtree can be dangerous, have user confirm before deleting.
- confirm_folder_delete(
- self.torch_t5_dir,
- prompt="Confirm you want to delete downloaded pytorch model folder?",
- )
-
- if not keep_pytorch_model and not keep_onnx_model:
- workspace.cleanup(force_remove=False)
-
- def setup_tokenizer_and_model(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- ):
- tokenizer = T5Tokenizer.from_pretrained(metadata.variant)
-
-        # By default, the HuggingFace model structure is one giant file.
- t5_torch_fpath = network_fpaths.torch[0].fpath
- t5_model = T5ForConditionalGeneration.from_pretrained(t5_torch_fpath, use_cache=metadata.other.kv_cache)
- if metadata.precision.fp16:
- t5_model = t5_model.cuda().half()
-
- t5_torch_encoder = T5EncoderTorchFile.TorchModule(t5_model.encoder)
- t5_torch_decoder = T5DecoderTorchFile.TorchModule(
- t5_model.decoder, t5_model.lm_head, t5_model.config
- )
-
- return tokenizer, t5_torch_encoder, t5_torch_decoder
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- inference_input: str,
- timing_profile: TimingProfile,
- use_cpu: bool,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: T5BenchmarkingArgs = None,
- ) -> Union[NetworkResult, BenchmarkingResult]:
-
- tokenizer, t5_torch_encoder, t5_torch_decoder = self.setup_tokenizer_and_model(metadata, network_fpaths)
- hf_config = T5Config.from_pretrained(metadata.variant, use_cache = metadata.other.kv_cache)
-        # Prepare the input tokens and find out the output sequence length.
- if not benchmarking_mode:
- output_seq_len = T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- max_seq_len = T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- input_seq_len = benchmarking_args.input_seq_len if benchmarking_args.input_seq_len > 0 else max_seq_len
- output_seq_len = benchmarking_args.output_seq_len if benchmarking_args.output_seq_len > 0 else max_seq_len
- input_ids = torch.randint(0, hf_config.vocab_size, (batch_size, input_seq_len))
-
- encoder_last_hidden_state, encoder_e2e_time = encoder_inference(
- t5_torch_encoder, input_ids, timing_profile, use_cuda=(not use_cpu)
- )
-
- # Need to feed the decoder a new empty input_ids for text generation.
- decoder_output_len = output_seq_len // 2 if (not metadata.other.kv_cache) else 1
-
- decoder_input_ids = torch.full(
- (batch_size, decoder_output_len), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
- )
-
- _, decoder_e2e_time = decoder_inference(
- t5_torch_decoder,
- expand_inputs_for_beam_search(decoder_input_ids, num_beams) if num_beams > 1 else decoder_input_ids,
- expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state,
- timing_profile,
- use_cache=metadata.other.kv_cache,
- )
-
- decoder_output, full_e2e_runtime = full_inference(
- t5_torch_encoder,
- t5_torch_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- num_beams=num_beams,
- max_length=output_seq_len,
- min_length=T5ModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- use_cuda=(not use_cpu),
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- )
-
- # Prepare runtime results.
- runtime=[
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- runtime=encoder_e2e_time,
- ),
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=network_fpaths)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- decoder_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=decoder_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=network_fpaths,
- )
-
- def execute_calculate_perplexity(
- self,
- metadata: NetworkMetadata,
- network_fpaths: NetworkModels,
- encoder_input: str,
- decoder_input: str,
- ):
- tokenizer, t5_torch_encoder, t5_torch_decoder = self.setup_tokenizer_and_model(metadata, network_fpaths)
- encoder_input_ids = tokenizer([encoder_input], padding=True, return_tensors="pt").input_ids
- decoder_input_ids = tokenizer([decoder_input], padding=True, return_tensors="pt").input_ids
- perplexity = calculate_perplexity(
- t5_torch_encoder, t5_torch_decoder, tokenizer, encoder_input_ids, decoder_input_ids,
- T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant],
- )
- return perplexity
-
- def run_framework(
- self,
- metadata: NetworkMetadata,
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_pytorch_model: bool,
- timing_profile: TimingProfile,
- use_cpu: bool = False,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult]:
- """
-        Main entry point of the command, which compiles and generates our model data.
- """
- inference_results = []
- ppl_results = []
- workspace = NNFolderWorkspace(
- self.config.network_name, metadata, working_directory
- )
- try:
- network_fpaths = self.generate_and_download_framework(metadata, workspace)
- if not benchmarking_mode:
- for ninput in network_input:
- inference_results.append(
- self.execute_inference(
- metadata, network_fpaths, ninput, timing_profile, use_cpu, batch_size, args.num_beams
- )
- )
- if perplexity_reference is not None:
- assert len(network_input) == len(perplexity_reference), "Encoder and decoder inputs must pair up"
- for ei, di in zip(network_input, perplexity_reference):
- ppl_results.append(
- self.execute_calculate_perplexity(
- metadata, network_fpaths, ei, di
- )
- )
- else:
- benchmarking_args = T5BenchmarkingArgs(args.input_seq_len, args.output_seq_len)
- inference_results = self.execute_inference(
- metadata, network_fpaths, None, timing_profile, use_cpu, batch_size, args.num_beams, True, benchmarking_args
- )
- finally:
- self.cleanup(workspace, keep_onnx_model, keep_pytorch_model)
-
- return inference_results, ppl_results
-
-
-# Entry point
-RUN_CMD = T5FHuggingFace()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/T5/measurements.py b/demo/HuggingFace/T5/measurements.py
deleted file mode 100644
index 3b30e8c1..00000000
--- a/demo/HuggingFace/T5/measurements.py
+++ /dev/null
@@ -1,136 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Utils specific to T5 network.
-"""
-
-# torch
-import torch
-
-# TRT-HuggingFace
-from NNDF.general_utils import measure_python_inference_code
-from NNDF.torch_utils import use_cuda, expand_inputs_for_beam_search
-from NNDF.tensorrt_utils import TRTNativeRunner
-from NNDF.logger import G_LOGGER
-from transformers.modeling_outputs import BaseModelOutput
-
-@use_cuda
-def decoder_inference(
- t5_decoder, input_ids, encoder_last_hidden_state, timing_profile, use_cuda=True, use_cache=False, past_key_values=None
-):
-    # This implementation is a bit ugly. Moving the model-type check into HFRunner would be cleaner.
- if isinstance(t5_decoder, TRTNativeRunner):
-        # The function technically lives in T5TRTDecoder; however, due to a circular import, a TRTNativeRunner in this module's scope
- # implies the existence of this function.
- t5_decoder.set_return_device("cuda" if use_cuda else "cpu")
-
- def decoder_stmt():
- t5_decoder(
- input_ids=input_ids, encoder_hidden_states=encoder_last_hidden_state, use_cache=use_cache,
- past_key_values=past_key_values
- )
-
- decoder_e2e_time = measure_python_inference_code(decoder_stmt, timing_profile)
-
- return (decoder_stmt(), decoder_e2e_time)
-
-
-@use_cuda
-def encoder_inference(t5_encoder, input_ids, timing_profile, use_cuda=True):
- encoder_stmt = lambda: t5_encoder(input_ids=input_ids)
- encoder_e2e_time = measure_python_inference_code(encoder_stmt, timing_profile)
-
- return (encoder_stmt(), encoder_e2e_time)
-
-@use_cuda
-def full_inference(
- t5_encoder,
- t5_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length,
- min_length=0,
- num_beams=1,
- batch_size=1,
- use_cuda=True,
- early_stopping=True,
- use_cache=False
-):
-
- G_LOGGER.info(f"Running full inference...")
- encoder_last_hidden_state = t5_encoder(input_ids=input_ids)
-
- def _e2e():
- with torch.no_grad():
- decoder_output = t5_decoder.generate(
- input_ids,
- max_length = max_length,
- min_length = min_length,
- num_beams = num_beams,
- early_stopping = early_stopping,
- eos_token_id = t5_decoder.config.eos_token_id,
- pad_token_id = t5_decoder.config.pad_token_id,
- use_cache = use_cache,
- encoder_outputs = BaseModelOutput(last_hidden_state = encoder_last_hidden_state),
- )
- return decoder_output
-
- if isinstance(t5_decoder, TRTNativeRunner):
- t5_decoder.set_return_device("cuda" if use_cuda else "cpu")
-
- measurement_function = _e2e
-
- full_e2e_time = measure_python_inference_code(measurement_function, timing_profile)
-
- return (measurement_function(), full_e2e_time)
-
-@use_cuda
-def calculate_perplexity(
- t5_encoder,
- t5_decoder,
- tokenizer,
- input_ids,
- decoder_input_ids,
- max_seq_len=None,
- use_cuda=True,
-):
- encoder_last_hidden_state = t5_encoder(input_ids=input_ids)
- if isinstance(t5_decoder, TRTNativeRunner):
- t5_decoder.set_return_device("cuda" if use_cuda else "cpu")
-
- # Set the first token to be pad token
- decoder_input_ids_padded = torch.full(
- decoder_input_ids.size()[:-1] + (decoder_input_ids.size()[-1] + 1,),
- tokenizer.convert_tokens_to_ids(tokenizer.pad_token),
- dtype=decoder_input_ids.dtype,
- )
- decoder_input_ids_padded[..., 1:] = decoder_input_ids
-
- if use_cuda:
- encoder_last_hidden_state = encoder_last_hidden_state.to("cuda")
- decoder_input_ids_padded = decoder_input_ids_padded.to("cuda")
-
- with torch.no_grad():
- if max_seq_len is not None:
- decoder_input_ids_padded = decoder_input_ids_padded[:, :max_seq_len]
- logits = t5_decoder(decoder_input_ids_padded, encoder_last_hidden_state, return_dict=True).logits
- # Truncate the last prediction
- logits = logits[:, :-1, :]
- loss = torch.nn.CrossEntropyLoss()(logits.permute((0, 2, 1)), decoder_input_ids)
- return torch.exp(loss).item()
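
The core of `calculate_perplexity` above is a standard next-token cross-entropy followed by an exponential. The standalone sketch below uses random logits and made-up shapes purely to show that arithmetic.

```python
# Standalone sketch (made-up shapes): drop the last prediction, compute cross-entropy
# against the reference ids, then exponentiate the mean loss to get perplexity.
import torch

vocab_size = 32
decoder_input_ids = torch.randint(0, vocab_size, (1, 7))                  # reference sequence
logits = torch.randn(1, decoder_input_ids.shape[1] + 1, vocab_size)       # one extra step from the pad-prefixed input

logits = logits[:, :-1, :]                                                 # truncate the last prediction
loss = torch.nn.CrossEntropyLoss()(logits.permute(0, 2, 1), decoder_input_ids)
perplexity = torch.exp(loss).item()
print(perplexity)
```
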
diff --git a/demo/HuggingFace/T5/onnxrt.py b/demo/HuggingFace/T5/onnxrt.py
deleted file mode 100644
index 499ba2a5..00000000
--- a/demo/HuggingFace/T5/onnxrt.py
+++ /dev/null
@@ -1,342 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Executes ONNX Runtime framework code. See README.md for more information.
-"""
-
-import os
-import sys
-from typing import Dict, List, Tuple
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# huggingface
-from transformers import T5Tokenizer, T5Config, PretrainedConfig
-from transformers.generation_utils import GenerationMixin
-from transformers.modeling_outputs import Seq2SeqLMOutput
-
-# torch
-import torch
-
-# TRT-HuggingFace
-from NNDF.interface import OnnxRTCommand
-from NNDF.torch_utils import expand_inputs_for_beam_search
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkMetadata,
- NetworkModels,
- NetworkModel,
- NetworkResult,
- NetworkRuntime,
- Precision,
- TimingProfile,
-)
-
-from NNDF.general_utils import NNFolderWorkspace
-from NNDF.tensorrt_utils import PolygraphyOnnxRunner
-from T5.frameworks import T5FHuggingFace
-from T5.T5ModelConfig import T5ModelTRTConfig, T5BenchmarkingArgs
-from T5.measurements import decoder_inference, encoder_inference, full_inference
-from NNDF.logger import G_LOGGER
-
-class OnnxHFRunner(PolygraphyOnnxRunner, GenerationMixin):
- """Runner that adds interop support for HF and HF provided greedy_search functions."""
-
- def __init__(self, engine_fpath: str, network_metadata: NetworkMetadata, hf_config: PretrainedConfig):
- super().__init__(engine_fpath, network_metadata)
- # required for greedy search used by generation mixin
- self.main_input_name = "input_ids"
- self.config = hf_config
-
-class T5OnnxEncoder(OnnxHFRunner):
- """OnnxRT implemented network interface that is mainly to check correctness."""
-
- def forward(self, input_ids, *args, **kwargs):
- # Unoptimized unconditional transfer to numpy for interfacing with polygraphy
- input_ids = input_ids.cpu().numpy().astype("int64")
- return torch.from_numpy(self.trt_context.infer({"input_ids": input_ids})["hidden_states"])
-
-class T5OnnxDecoder(OnnxHFRunner):
- def prepare_inputs_for_generation(self, input_ids, **kwargs):
- return {
- "input_ids": input_ids,
- "encoder_hidden_states": kwargs["encoder_outputs"].last_hidden_state,
- }
-
- def forward(self, input_ids, encoder_hidden_states, *args, **kwargs):
- # Unoptimized unconditional transfer to numpy for interfacing with polygraphy
- input_ids = input_ids.cpu().numpy().astype("int64")
- data_type = "float32"
- encoder_hidden_states = encoder_hidden_states.cpu().numpy().astype(data_type)
-
- logits = self.trt_context.infer(
- {"input_ids": input_ids, "encoder_hidden_states": encoder_hidden_states}
- )["hidden_states"]
-
- return Seq2SeqLMOutput(logits=torch.from_numpy(logits))
-
-class T5ONNXRT(OnnxRTCommand):
- def __init__(self):
- super().__init__(
- T5ModelTRTConfig,
- "Runs polygraphy results for T5 model.",
- T5FHuggingFace,
- )
- self.t5_ort_decoder = None
- self.t5_ort_encoder = None
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_onnx_model: bool = False,
- keep_torch_model: bool = False,
- ) -> None:
- # Deactivates context
- if self.t5_ort_encoder:
- self.t5_ort_encoder.release()
- if self.t5_ort_decoder:
- self.t5_ort_decoder.release()
-
- self.frameworks_cmd.cleanup(workspace, keep_onnx_model, keep_torch_model)
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Dict[str, NetworkModel],
- inference_input: str,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: T5BenchmarkingArgs = None,
- ) -> NetworkResult:
-
- hf_config = T5Config.from_pretrained(metadata.variant)
- tokenizer = T5Tokenizer.from_pretrained(metadata.variant)
-        # Prepare the input tokens and find out the output sequence length.
- if not benchmarking_mode:
- output_seq_len = T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- max_seq_len = T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- input_seq_len = benchmarking_args.input_seq_len if benchmarking_args.input_seq_len > 0 else max_seq_len
- output_seq_len = benchmarking_args.output_seq_len if benchmarking_args.output_seq_len > 0 else max_seq_len
- input_ids = torch.randint(0, hf_config.vocab_size, (batch_size, input_seq_len))
-
- encoder_last_hidden_state, encoder_e2e_time = encoder_inference(
- self.t5_ort_encoder, input_ids, timing_profile
- )
-
- # Need to feed the decoder a new empty input_ids for text generation.
- decoder_output_len = output_seq_len // 2
-
- decoder_input_ids = torch.full(
- (batch_size, decoder_output_len), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
- )
- # OnnxRT currently does not enable kv cache
- _, decoder_e2e_time = decoder_inference(
- self.t5_ort_decoder,
- expand_inputs_for_beam_search(decoder_input_ids, num_beams) if num_beams > 1 else decoder_input_ids,
- expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state,
- timing_profile,
- use_cache=metadata.other.kv_cache,
- )
-
- decoder_output, full_e2e_runtime = full_inference(
- self.t5_ort_encoder,
- self.t5_ort_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=T5ModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- use_cuda=False,
- num_beams=num_beams,
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- )
-
- # Prepare runtime results.
- runtime = [
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- runtime=encoder_e2e_time,
- ),
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
- models=NetworkModels(
- torch=None,
- onnx=list(onnx_fpaths.values()),
- trt=None
- )
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=models)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- decoder_output[-1, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=decoder_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=models,
- )
-
- def run_onnxrt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- ) -> List[NetworkResult]:
- workspace = NNFolderWorkspace(
- self.frameworks_cmd.config.network_name, metadata, working_directory
- )
-
- results = []
- try:
- if metadata.other.kv_cache:
- assert False, "OnnxRT currently does not support kv cache."
- # no fpath provided for onnx files, download them
- if len(onnx_fpaths) == 0:
- onnx_fpaths = self.frameworks_cmd.generate_and_download_framework(
- metadata, workspace
- ).onnx
- else:
- keep_onnx_model = True
- keep_torch_model = True
-
-            # Output networks shall not exceed the number of network segments explicitly defined by the configuration file.
- assert len(onnx_fpaths) == len(
- T5ModelTRTConfig.NETWORK_SEGMENTS
- ), "There should only be {} exported ONNX segments in T5 model.".format(
- len(T5ModelTRTConfig.NETWORK_SEGMENTS)
- )
-
- lookup_onnx_table = {v.name: v for v in onnx_fpaths}
-
- hf_config = T5Config.from_pretrained(
- metadata.variant,
- use_cache=metadata.other.kv_cache
- )
- self.t5_ort_encoder = T5OnnxEncoder(
- lookup_onnx_table["encoder"].fpath, metadata, hf_config
- )
- self.t5_ort_decoder = T5OnnxDecoder(
- lookup_onnx_table["decoder"].fpath, metadata, hf_config
- )
-
- if not benchmarking_mode:
- for ninput in network_input:
- results.append(
- self.execute_inference(
- metadata, lookup_onnx_table, ninput, timing_profile, batch_size, args.num_beams
- )
- )
- else:
- benchmarking_args = T5BenchmarkingArgs(args.input_seq_len, args.output_seq_len)
- results = self.execute_inference(
- metadata, lookup_onnx_table, None, timing_profile, batch_size, args.num_beams, True, benchmarking_args
- )
-
- finally:
- self.cleanup(workspace, keep_onnx_model, keep_torch_model)
- # TODO: Add perplexity calculation for OnnxRT
- G_LOGGER.warning("perplexity calculation is disabled for OnnxRT.")
- return results
-
- def add_args(self, parser) -> None:
- super().add_args(parser)
- onnx_group = parser.add_argument_group("onnx models")
- onnx_group.add_argument(
- "--onnx-decoder-fpath",
- default=None,
- help="Path to ONNX decoder. If None is supplied, scripts will generate them from HuggingFace.",
- )
- onnx_group.add_argument(
- "--onnx-encoder-fpath",
- default=None,
- help="Path to ONNX encoder. If None is supplied, scripts will generate them from HuggingFace.",
- )
-
- def args_to_network_models(self, args) -> List[NetworkModel]:
- # Check if both flags are given otherwise error out
- decoder_fpath_check = args.onnx_decoder_fpath is None
- encoder_fpath_check = args.onnx_encoder_fpath is None
-
- network_models = None
- if decoder_fpath_check and encoder_fpath_check:
- network_models = tuple()
- elif decoder_fpath_check or encoder_fpath_check:
- raise self._parser.error(
- "Both --onnx-decoder-fpath and --onnx-encoder-fpath must be given. Otherwise neither should be provided for script to download them."
- )
- else:
- onnx_decoder = NetworkModel(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=args.onnx_decoder_fpath,
- )
- onnx_encoder = NetworkModel(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=args.onnx_encoder_fpath,
- )
- network_models = (onnx_decoder, onnx_encoder)
-
- return network_models
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- """Override args to metadata to use export subroutine."""
- frameworks_parsed_metadata = self.frameworks_cmd.args_to_network_metadata(args)
-
- return NetworkMetadata(
- variant=frameworks_parsed_metadata.variant,
- precision=Precision(fp16=args.fp16),
- other=frameworks_parsed_metadata.other,
- )
-
-
-RUN_CMD = T5ONNXRT()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/T5/trt.py b/demo/HuggingFace/T5/trt.py
deleted file mode 100644
index 3a2decc2..00000000
--- a/demo/HuggingFace/T5/trt.py
+++ /dev/null
@@ -1,953 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-import os
-import sys
-import copy
-from typing import Dict, List, Tuple, Union
-from functools import reduce
-
-# Add syspath for custom library
-if __name__ == "__main__":
- filepath = os.path.dirname(os.path.abspath(__file__))
- project_root = os.path.join(filepath, os.pardir)
- sys.path.append(project_root)
-
-# polygraphy
-from polygraphy.backend.trt import Profile
-
-# tensorrt
-import tensorrt as trt
-
-# torch
-import torch
-
-# huggingface
-from transformers import T5Tokenizer, T5Config
-from transformers.modeling_outputs import Seq2SeqLMOutput
-from transformers.configuration_utils import PretrainedConfig
-from transformers.generation_utils import GenerationMixin
-from transformers.modeling_outputs import BaseModelOutput
-
-# tensorrt
-from tensorrt import PreviewFeature
-
-# TRT-HuggingFace
-from NNDF.interface import TRTInferenceCommand
-from NNDF.networks import (
- BenchmarkingResult,
- NetworkMetadata,
- NetworkModels,
- NetworkModel,
- NetworkResult,
- NetworkRuntime,
- Precision,
- TimingProfile,
-)
-
-from NNDF.tensorrt_utils import TRTNativeRunner, set_kv_data, allocate_binding_buffer, setup_benchmark_arg
-from NNDF.torch_utils import expand_inputs_for_beam_search
-from NNDF.general_utils import NNFolderWorkspace
-from T5.frameworks import T5FHuggingFace
-from T5.T5ModelConfig import T5ModelTRTConfig, T5TRTBenchmarkingArgs
-from T5.measurements import decoder_inference, encoder_inference, full_inference, calculate_perplexity
-from T5.export import T5DecoderONNXFile, T5EncoderONNXFile, T5DecoderTRTEngine, T5EncoderTRTEngine
-from NNDF.models import TRTEngineFile
-from NNDF.logger import G_LOGGER
-
-
-class TRTHFRunner(TRTNativeRunner, GenerationMixin):
- """Runner that adds interop support for HF and HF provided greedy_search functions."""
-
- # Stores the encoder input length received at runtime, which is used to slice decoder inputs.
- ENCODER_LENGTH = 0
- def _allocate_memory(self,
- input_shapes: Dict[str, tuple],
- input_types: Dict[str, torch.dtype],
- output_shapes: Dict[str, tuple],
- output_types: Dict[str, torch.dtype]):
- """Helper function for binding several inputs at once and pre-allocating the results."""
- # Allocate memories as 1D linear buffers for simpler handling of dynamic shapes.
- self.inputs = allocate_binding_buffer(input_types, input_shapes)
- self.outputs = allocate_binding_buffer(output_types, output_shapes)
-
- bindings = [None] * self.trt_engine.num_bindings
-
- for input_name, input_array in self.inputs.items():
- # Allocate memory for inputs
- input_idx = self.trt_engine.get_binding_index(input_name)
- self.trt_context.set_binding_shape(input_idx, input_shapes[input_name])
- bindings[input_idx] = input_array.data_ptr()
-
- assert self.trt_context.all_binding_shapes_specified
-
- for output_name, output_array in self.outputs.items():
- # Output shape should be allocated from context size
- output_idx = self.trt_engine.get_binding_index(output_name)
- bindings[output_idx] = output_array.data_ptr()
-
- return bindings
-
- def __init__(
- self,
- trt_engine_file: TRTEngineFile,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1
- ):
- super().__init__(trt_engine_file, network_metadata)
- self.config = hf_config
- self.batch_size = batch_size
-
-class T5TRTEncoder(TRTHFRunner):
- """TRT implemented network interface that can be used to measure inference time."""
-
- def __init__(
- self,
- trt_engine_file: str,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1,
- benchmarking_args: T5TRTBenchmarkingArgs = None
- ):
- super().__init__(trt_engine_file, network_metadata, hf_config, batch_size = batch_size)
- self.data_type = torch.float32
- # In benchmarking mode, the max_sequence_length should be the designated input_profile_max_len
- if benchmarking_args is not None and benchmarking_args.input_profile_max_len is not None:
- self.max_sequence_length = benchmarking_args.input_profile_max_len
- else:
- self.max_sequence_length = hf_config.d_model
- self.encoder_hidden_size = hf_config.d_model
- self.main_input_name = "input_ids"
- # We only have one profile to select so we can just grab the profile at the start of the class
- self.profile_idx = self.get_optimization_profile(batch_size=self.batch_size, sequence_length=1)
-
- self.input_shapes = {
- "input_ids": (self.batch_size, self.max_sequence_length)
- }
- self.input_types = {
- "input_ids": torch.int32
- }
- self.output_shapes = {
- "hidden_states": (self.batch_size, self.max_sequence_length, self.encoder_hidden_size)
- }
- self.output_types = {
- "hidden_states": self.data_type
- }
-
- self.bindings = self._allocate_memory(self.input_shapes, self.input_types, self.output_shapes, self.output_types)
-
- def forward(self, input_ids, *args, **kwargs):
- bs = self.batch_size
- max_length = self.max_sequence_length
- TRTHFRunner.ENCODER_LENGTH = input_ids.shape[1]
- input_length = input_ids.shape[1]
- encoder_hidden_size = self.encoder_hidden_size
-
-        # Check if the input data is on CPU (which usually means PyTorch does not support the current GPU).
- is_cpu_mode = (input_ids.device == torch.device("cpu"))
-
-        # We allocate the buffers using max_length, but we only need the first portion of it, so copy the data into the
- # first portion of the input buffer.
- # TODO: Could we just reuse input_ids' data_ptr() as the first binding when input_ids is already contiguous to
- # avoid an additional D2D?
- if is_cpu_mode:
- self.inputs["input_ids"] = input_ids.int().flatten().contiguous().cuda()
- self.bindings[0] = self.inputs["input_ids"].data_ptr()
- else:
- self.inputs["input_ids"][:bs * input_length] = input_ids.flatten()
-
- # Set the binding shape of input_ids, which should be (bs, input_length).
- self.trt_context.set_binding_shape(0, input_ids.shape)
-
- # Launch TRT inference.
- # TODO: Could we use execute_v2_async() instead of execute_v2()?
- self.trt_context.execute_v2(bindings=self.bindings)
-
-        # We allocate the buffers using max_length, but we only need the first portion of it, so get only the first
- # portion of the output buffer and return that.
- # TODO: Could we construct a Torch tensor using given data_ptr() to avoid this D2D copy?
- hidden_states_output = self.outputs["hidden_states"]
- if is_cpu_mode:
- hidden_states_output = hidden_states_output.cpu()
-
- folded = hidden_states_output[:bs * input_length * encoder_hidden_size].view(bs, input_length, encoder_hidden_size)
-
- return folded
-
-class T5TRTDecoder(TRTHFRunner):
-
- def __init__(
- self,
- trt_engine_file: TRTEngineFile,
- network_metadata: NetworkMetadata,
- hf_config: PretrainedConfig,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_args: T5TRTBenchmarkingArgs = None,
- ):
- super().__init__(trt_engine_file, network_metadata, hf_config, batch_size = batch_size)
- self.data_type = torch.float32 if not network_metadata.precision.fp16 else torch.float16
-
- # In benchmarking mode, the max_sequence_length should be the user-provided input_profile_max_len
- if benchmarking_args is not None and benchmarking_args.input_profile_max_len is not None:
- self.max_input_length = benchmarking_args.input_profile_max_len
- else:
- self.max_input_length = hf_config.d_model
-
- # Similarly, the max_output_length should be the user-provided output_profile_max_len
- if benchmarking_args is not None and benchmarking_args.output_profile_max_len is not None:
- self.max_output_length = benchmarking_args.output_profile_max_len
- else:
- self.max_output_length = hf_config.d_model
-
- self.device = torch.device('cuda')
- self.main_input_name = "input_ids"
- self.encoder_hidden_size = hf_config.d_model
- self.num_heads = hf_config.num_heads
- self.embedding_size_per_head = hf_config.d_kv
- self.num_decoder_layers = hf_config.num_decoder_layers
- self.profile_idx = 0
- self.bindings = [0] * self.trt_engine.num_bindings
-
- hidden_states_profile_length = self.max_output_length if not self.config.use_cache else 1
- # Construct buffer for hidden states outputs
- self.hidden_states = torch.zeros((self.batch_size * num_beams, hidden_states_profile_length, hf_config.vocab_size), dtype = self.data_type).cuda()
- self.bindings[self.trt_engine.get_binding_index("hidden_states")] = self.hidden_states.data_ptr()
-
- if self.config.use_cache:
-
- self.self_attention_cache = {}
- self.cross_attention_cache = {}
-
-            # We are using cached cross-attention and not outputting redundant cross-attention information. We only output the self-attention cache increment.
- self_attention_kv_shape = (self.batch_size * num_beams, self.num_heads, self.max_output_length - 1, self.embedding_size_per_head)
- cross_attention_kv_shape = (self.batch_size * num_beams, self.num_heads, self.max_input_length, self.embedding_size_per_head)
-
- # Set self attention kv cache shape and type
- for i in range(self.num_decoder_layers):
- for code in ["key", "value"]:
- # Allocate self attention buffer. The buffer is used both as inputs and outputs
- self_attention_name = f"key_values.{i}.decoder.{code}"
- input_buffer = torch.zeros(self_attention_kv_shape, dtype = self.data_type).cuda()
- input_idx = self.trt_engine.get_binding_index("past_" + self_attention_name)
- self.self_attention_cache[self_attention_name] = input_buffer
- self.bindings[input_idx] = input_buffer.data_ptr()
-
- output_idx = self.trt_engine.get_binding_index("present_" + self_attention_name)
- self.bindings[output_idx] = input_buffer.data_ptr()
-
- # Allocate cross attention buffer
- cross_attention_past_name = f"past_key_values.{i}.encoder.{code}"
- cross_attention_buffer = torch.zeros(cross_attention_kv_shape, dtype = self.data_type).cuda()
- cross_attention_idx = self.trt_engine.get_binding_index(cross_attention_past_name)
- self.cross_attention_cache[cross_attention_past_name] = cross_attention_buffer
- self.bindings[cross_attention_idx] = cross_attention_buffer.data_ptr()
-
- self.kv_cache_binding_offset = 2 # 0: input_ids, 1: encoder_hidden_states, kv cache input indices start from 2
- self.past_decoder_length = 0
-
- # Optimization bit
- self.persist_encoder_hidden_states = False
- self.encoder_hidden_states = torch.zeros((self.batch_size * num_beams * self.max_input_length * self.encoder_hidden_size), dtype=self.data_type).cuda()
- self.bindings[1] = self.encoder_hidden_states.data_ptr()
- self.persist_cross_attention_kv_cache = False
-
- self.return_device = torch.device('cuda')
- self.variant = network_metadata.variant # record variant name to later index the vocab_size in forward()
-
- def set_encoder_hidden_states_for_inference_cycle(self, encoder_hidden_states):
-        """Caches the encoder hidden states so they are set once and reused across decoding steps of the same encoder session."""
-
- # Use in-place assignment so that the memory location of self.encoder_hidden_states will never change.
- # PyTorch will handle the FP32->FP16 conversion automatically if that is needed.
- self.encoder_hidden_states[:encoder_hidden_states.numel()] = encoder_hidden_states.flatten()
- self.persist_encoder_hidden_states = True
- self.trt_context.set_binding_shape(1, encoder_hidden_states.shape)
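The in-place assignment above is what keeps the buffer's device address stable, so the pointer registered with the TensorRT binding never has to be updated. A minimal sketch of that pattern, assuming a CUDA-capable PyTorch install and using made-up sizes:

```python
import torch

# Fixed-address buffer sketch: allocate once for the max profile, copy new data
# into the leading slice, and the pointer handed to the binding stays valid.
MAX_ELEMS = 4 * 128 * 768                              # hypothetical batch * seq * hidden bound
buffer = torch.zeros(MAX_ELEMS, dtype=torch.float16, device="cuda")
ptr_before = buffer.data_ptr()

new_states = torch.randn(4, 57, 768, device="cuda")   # a shorter sequence for this step
buffer[:new_states.numel()] = new_states.flatten()     # in-place copy; PyTorch casts FP32 -> FP16
assert buffer.data_ptr() == ptr_before                 # the registered pointer is unchanged
```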
-
- def set_cross_attention_kv_cache_engine(self, cross_attention_kv_generator):
- self.cross_attention_kv_generator = cross_attention_kv_generator
- with open(self.cross_attention_kv_generator.fpath, "rb") as f:
- trt_runtime = trt.Runtime(self.trt_logger)
- self.cross_attention_kv_generator_trt_engine = trt_runtime.deserialize_cuda_engine(f.read())
- self.cross_attention_kv_generator_trt_context = self.cross_attention_kv_generator_trt_engine.create_execution_context()
- self.cross_attention_bindings = [None] * self.cross_attention_kv_generator_trt_engine.num_bindings
- self.cross_attention_bindings[0] = self.encoder_hidden_states.data_ptr()
- # Cross attention cache as outputs
- for i in range(self.num_decoder_layers):
- self.cross_attention_bindings[2*i+1] = self.cross_attention_cache[f"past_key_values.{i}.encoder.key"].data_ptr()
- self.cross_attention_bindings[2*i+2] = self.cross_attention_cache[f"past_key_values.{i}.encoder.value"].data_ptr()
-
- def set_cross_attention_kv_cache_for_inference_cycle(self, encoder_hidden_states):
- """
- Used to cache encoder-decoder cross attention kv caches across same encoder sessions.
-
-        Unlike the self-attention cache, the cross attention cache is constant during the decoding process, so we only need to set its bindings once at the first decoding step and skip it in all later steps (guarded by the self.persist_cross_attention_kv_cache flag).
- """
- self.cross_attention_kv_generator_trt_context.set_binding_shape(0, encoder_hidden_states.shape)
- assert self.cross_attention_kv_generator_trt_context.all_binding_shapes_specified
- self.cross_attention_kv_generator_trt_context.execute_v2(bindings=self.cross_attention_bindings)
- self.persist_cross_attention_kv_cache = True
-
- def set_return_device(self, return_device):
- """
-        Sets the device that outputs are returned on via to(). Device names follow torch conventions: cuda, cpu, etc.
- This is used in our measurement code.
- """
- self.return_device = return_device
- self.device = return_device
-
- def _reorder_cache(self, past, beam_idx):
- # Reference: https://huggingface.co/transformers/v4.11.3/_modules/transformers/models/t5/modeling_t5.html
- # Note that for BART, this function is static, but for T5, it is not
- # if decoder past is not included in output
- # speedy decoding is disabled and no need to reorder
- if past is None:
- print("You might want to consider setting `use_cache=True` to speed up decoding")
- return past
-
- reordered_decoder_past = ()
- for layer_past_states in past:
- # get the correct batch idx from layer past batch dim
- # batch dim of `past` is at 2nd position
- reordered_layer_past_states = ()
- for layer_past_state in layer_past_states:
- if layer_past_state is not None:
- # need to set correct `past` for each of the four key / value states
- reordered_layer_past_states = reordered_layer_past_states + (
- layer_past_state.index_select(0, beam_idx.to(layer_past_state.device)),
- )
- else:
- reordered_layer_past_states = reordered_layer_past_states + (None,)
-
- assert reordered_layer_past_states[0].shape == layer_past_states[0].shape
- assert len(reordered_layer_past_states) == len(layer_past_states)
-
- reordered_decoder_past = reordered_decoder_past + (reordered_layer_past_states,)
- return reordered_decoder_past
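`_reorder_cache` carries the surviving beams' key/value rows forward by indexing the batch dimension. A toy sketch of the `index_select` step, with shapes invented for the example:

```python
import torch

# Beam reordering on a dummy cache tensor of shape (batch * num_beams, heads, seq, head_dim).
batch_size, num_beams, num_heads, seq_len, head_dim = 2, 3, 4, 5, 8
layer_past = torch.randn(batch_size * num_beams, num_heads, seq_len, head_dim)

# beam_idx[i] names the beam whose cache should be carried forward into slot i.
beam_idx = torch.tensor([0, 0, 2, 3, 5, 4])
reordered = layer_past.index_select(0, beam_idx)

assert reordered.shape == layer_past.shape
assert torch.equal(reordered[1], layer_past[0])   # slot 1 now holds beam 0's cache
```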
-
- def forward(self, input_ids, encoder_hidden_states, encoder_outputs=None, *args, **kwargs):
- # Get the batch size.
- bs = input_ids.shape[0] # in beam search mode, bs is batch_size * num_beams
-
- # Actual sequence length of the input_ids and the output hidden_states.
- input_length = input_ids.shape[1]
-
- # The sequence length of the encoder_hidden_states.
- encoder_length = TRTHFRunner.ENCODER_LENGTH
-
- is_cpu_mode = (input_ids.device == torch.device("cpu")) or (self.return_device == "cpu")
-
- if is_cpu_mode:
- input_ids = input_ids.int().cuda()
-
-        # input_ids needs to be of int type.
- self.bindings[0] = input_ids.int().data_ptr()
- self.trt_context.set_binding_shape(0, input_ids.shape)
-
- # If encoder hidden states have not been copied yet, copy the hidden states to the input buffer.
- if not self.persist_encoder_hidden_states:
- self.set_encoder_hidden_states_for_inference_cycle(encoder_hidden_states)
-
- if self.config.use_cache:
- if (kwargs.get("past_key_values") is None):
- self.past_decoder_length = 0
- if not self.persist_cross_attention_kv_cache:
- self.set_cross_attention_kv_cache_for_inference_cycle(encoder_hidden_states)
- cross_attention_kv_shape = (bs, self.num_heads, encoder_length, self.embedding_size_per_head)
- for i in range(self.num_decoder_layers):
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i + 2, cross_attention_kv_shape)
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i + 3, cross_attention_kv_shape)
-
- # When switching trt profiles, the binding shape needs to be reset, so we set binding shape at each forward pass
- self_attention_kv_shape = (bs, self.num_heads, self.past_decoder_length, self.embedding_size_per_head)
- for i in range(self.num_decoder_layers):
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i, self_attention_kv_shape)
- self.trt_context.set_binding_shape(self.kv_cache_binding_offset+4*i + 1, self_attention_kv_shape)
-
- # Launch TRT inference.
- assert self.trt_context.all_binding_shapes_specified
- self.trt_context.execute_v2(bindings=self.bindings)
-
-        # For bs > 1, this slicing is required, so this D2D copy cannot be avoided.
- logits_length = bs * input_length * self.config.vocab_size
- logits = self.hidden_states.flatten()[:logits_length].view(bs, input_length, self.config.vocab_size)
- if is_cpu_mode:
- logits = logits.cpu()
-
- present_key_values = None
- if self.config.use_cache:
- present_key_values = ()
- num_heads = self.num_heads
- embedding_size_per_head = self.embedding_size_per_head
-
- for i in range(self.num_decoder_layers):
- self_attention_k_output = self.self_attention_cache[f"key_values.{i}.decoder.key"]
- self_attention_v_output = self.self_attention_cache[f"key_values.{i}.decoder.value"]
- if is_cpu_mode:
- self_attention_k_output = self_attention_k_output.cpu()
- self_attention_v_output = self_attention_v_output.cpu()
-
- present_key_values += ((self_attention_k_output, self_attention_v_output),)
-
- self.past_decoder_length += 1
-
- # Transfer predictions back from GPU to do greedy search
- return Seq2SeqLMOutput(logits=logits.to(self.return_device), past_key_values=present_key_values,)
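Because `self.hidden_states` is allocated for the maximum profile, only its leading elements are valid after a run, and the flatten/slice/view above recovers the `(bs, input_length, vocab_size)` logits. A small sketch of that recovery, with illustrative sizes rather than the real T5 dimensions:

```python
import torch

# The engine writes a (bs, input_length, vocab) result contiguously into the front of an
# over-allocated flat buffer; flatten/slice/view recovers it. Sizes below are made up.
vocab_size = 320
bs, input_length, max_elems = 2, 3, 4 * 16 * 320

out_buffer = torch.zeros(max_elems)                  # preallocated for the max profile
result = torch.randn(bs, input_length, vocab_size)
out_buffer[:result.numel()] = result.flatten()       # stands in for the TensorRT output write

logits = out_buffer[:bs * input_length * vocab_size].view(bs, input_length, vocab_size)
assert torch.equal(logits, result)
```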
-
- def prepare_inputs_for_generation(self, input_ids, past=None, use_cache=None, **kwargs):
- # In HuggingFace generation_utils.py, this function will be called at each decoding step, before running the decoder's forward().
-
- if past is not None:
- input_ids = input_ids[:, -1:]
-
- ret = {
- "input_ids": input_ids,
- "encoder_hidden_states": kwargs["encoder_outputs"].get("last_hidden_state"),
- }
-
- if self.config.use_cache:
- ret["use_cache"] = use_cache
- ret["past_key_values"] = past
-
- return ret
-
- def reset(self):
- '''
-        Always call this function after each use, because T5TRTDecoder does not clear the cached encoder_hidden_states or cross_attention itself.
- '''
- self.persist_encoder_hidden_states = False
- self.encoder_hidden_states.zero_()
- if self.config.use_cache:
- self.persist_cross_attention_kv_cache = False
-
-class T5TRT(TRTInferenceCommand):
- def __init__(self):
- super().__init__(
- T5ModelTRTConfig,
- "Runs trt results for T5 model.",
- T5FHuggingFace,
- )
- self.t5_trt_decoder = None
- self.t5_trt_encoder = None
-
- def cleanup(
- self,
- workspace: NNFolderWorkspace,
- keep_trt_engine: bool = False,
- keep_onnx_model: bool = False,
- keep_torch_model: bool = False,
- ) -> None:
- # Deactivates context
- if self.t5_trt_encoder:
- self.t5_trt_encoder.release()
- if self.t5_trt_decoder:
- self.t5_trt_decoder.release()
-
- if not keep_trt_engine:
- self.t5_trt_encoder_engine.cleanup()
- self.t5_trt_decoder_engine.cleanup()
- # TODO: Avoid using workspace.metadata to handle additional removals.
- if workspace.metadata.other.kv_cache:
- self.t5_trt_cross_attention_kv_generator.cleanup()
-
- self.frameworks_cmd.cleanup(workspace, keep_onnx_model, keep_torch_model)
-
- def generate(
- self,
- input_ids,
- min_length: int = None,
- max_length: int = None,
- num_beams: int = 1,
- use_cache: bool = False,
- early_stopping: bool = True,
- ):
- batch_size = input_ids.shape[0]
- hf_config = self.t5_trt_decoder.config
-
- if max_length is None:
- max_length = T5ModelTRTConfig.MAX_OUTPUT_LENGTH[self.metadata.variant]
-
- if min_length is None:
- min_length = T5ModelTRTConfig.MIN_OUTPUT_LENGTH[self.metadata.variant]
-
- encoder_last_hidden_state = self.t5_trt_encoder(input_ids=input_ids).to("cuda")
-
- decoder_output = self.t5_trt_decoder.generate(
- input_ids,
- max_length = max_length,
- min_length = min_length,
- num_beams = num_beams,
- early_stopping = early_stopping,
- eos_token_id = self.t5_trt_decoder.config.eos_token_id,
- pad_token_id = self.t5_trt_decoder.config.pad_token_id,
- use_cache = use_cache,
- encoder_outputs = BaseModelOutput(last_hidden_state = encoder_last_hidden_state),
- )
-
- self.t5_trt_decoder.reset()
- return decoder_output
-
- def execute_inference(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Dict[str, NetworkModel],
- inference_input: str,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- num_beams: int = 1,
- benchmarking_mode: bool = False,
- benchmarking_args: T5TRTBenchmarkingArgs = None,
- ) -> Union[NetworkResult, BenchmarkingResult]:
-
- tokenizer = T5Tokenizer.from_pretrained(metadata.variant)
- hf_config = self.t5_trt_decoder.config
-        # Prepare the input tokens and determine the output sequence length.
- if not benchmarking_mode:
- output_seq_len = T5ModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
- input_ids = tokenizer([inference_input] * batch_size, padding=True, return_tensors="pt").input_ids
- else:
- input_seq_len = benchmarking_args.input_seq_len
- output_seq_len = benchmarking_args.output_seq_len
-
- input_ids = torch.randint(0, hf_config.vocab_size, (batch_size, input_seq_len))
-
- encoder_last_hidden_state, encoder_e2e_time = encoder_inference(
- self.t5_trt_encoder, input_ids, timing_profile
- )
-
- # Need to feed the decoder a new empty input_ids for text generation.
- decoder_output_len = output_seq_len // 2 if (not metadata.other.kv_cache) else 1
-
- decoder_input_ids = torch.full(
- (batch_size, decoder_output_len), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32
- )
-
- _, decoder_e2e_time = decoder_inference(
- self.t5_trt_decoder,
- expand_inputs_for_beam_search(decoder_input_ids, num_beams) if num_beams > 1 else decoder_input_ids,
- expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state,
- timing_profile,
- use_cache=metadata.other.kv_cache,
- )
-
- self.t5_trt_decoder.reset()
-
- decoder_output, full_e2e_runtime = full_inference(
- self.t5_trt_encoder,
- self.t5_trt_decoder,
- input_ids,
- tokenizer,
- timing_profile,
- max_length=output_seq_len,
- min_length=T5ModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant] if not benchmarking_mode else output_seq_len,
- batch_size=batch_size,
- use_cache=metadata.other.kv_cache,
- num_beams = num_beams,
- )
-
- # Prepare runtime results.
- runtime = [
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- runtime=decoder_e2e_time,
- ),
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- runtime=encoder_e2e_time,
- ),
- NetworkRuntime(
- name=T5ModelTRTConfig.NETWORK_FULL_NAME,
- runtime=full_e2e_runtime,
- ),
- ]
- models=NetworkModels(
- torch=None,
- onnx=list(onnx_fpaths.values()),
- trt=[
- NetworkModel(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=self.t5_trt_decoder_engine.fpath,
- ),
- NetworkModel(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=self.t5_trt_encoder_engine.fpath,
- ),
- ],
- )
-
- # Skip result checking in benchmarking mode since the input data is random.
- if benchmarking_mode:
- return BenchmarkingResult(median_runtime=runtime, models=models)
-
- # Remove the padding and end tokens.
- semantic_outputs = tokenizer.decode(
- decoder_output[0, :], skip_special_tokens=True
- )
-
- if isinstance(semantic_outputs, list):
- semantic_outputs = " ".join(semantic_outputs).strip()
-
- return NetworkResult(
- input=inference_input,
- output_tensor=decoder_output,
- semantic_output=semantic_outputs,
- median_runtime=runtime,
- models=models,
- )
-
- def execute_calculate_perplexity(
- self,
- metadata: NetworkMetadata,
- encoder_input: str,
- decoder_input: str,
- batch_size: int,
- ):
- tokenizer = T5Tokenizer.from_pretrained(metadata.variant)
- encoder_input_ids = tokenizer([encoder_input] * batch_size, padding=True, return_tensors="pt").input_ids
- decoder_input_ids = tokenizer([decoder_input] * batch_size, padding=True, return_tensors="pt").input_ids
-
- perplexity = calculate_perplexity(
- self.t5_trt_encoder, self.t5_trt_decoder, tokenizer, encoder_input_ids, decoder_input_ids,
- T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant],
- )
- return perplexity
-
- def _setup_engines(
- self,
- metadata: NetworkMetadata,
- hash_onnx_fpath: Dict[str, NetworkModel],
- batch_size: int,
- num_beams: int,
- disable_preview_dynamic_shapes: bool,
- benchmarking_args: T5TRTBenchmarkingArgs = None,
- seq_tag: bool = False, # whether the benchmark engine tag format should be seq or max
- ) -> None:
-
-        # The number of exported ONNX networks must not exceed the number of network segments explicitly defined by the configuration file.
- assert len(hash_onnx_fpath) == len(
- T5ModelTRTConfig.NETWORK_SEGMENTS
- ), "There should only be {} exported ONNX segments in T5 model.".format(
- len(T5ModelTRTConfig.NETWORK_SEGMENTS)
- )
-
- decoder_onnx_fpath = hash_onnx_fpath[
- T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME
- ].fpath
- encoder_onnx_fpath = hash_onnx_fpath[
- T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME
- ].fpath
-
-        # Use HuggingFace T5Config to set up parameters instead of hard-coded values.
- hf_config = T5Config.from_pretrained(
- metadata.variant,
- use_cache=metadata.other.kv_cache
- )
-
- # Generate optimization profiles.
- # non-benchmarking mode: opt profile length is by default half of the max profile
- # benchmarking mode: user can specify opt and max profile by flags. If no additional benchmarking flags are provided, it will just use the non-benchmarking mode defaults
- max_input_length = hf_config.d_model
- max_output_length = hf_config.d_model
- opt_input_seq_len = max_input_length // 2
- opt_output_seq_len = max_output_length // 2
-
- # benchmarking flags
- if benchmarking_args is not None:
- max_input_length = benchmarking_args.input_profile_max_len
- max_output_length = benchmarking_args.output_profile_max_len
- opt_input_seq_len = benchmarking_args.input_seq_len
- opt_output_seq_len = benchmarking_args.output_seq_len
-
- encoder_hidden_size = hf_config.d_model
-
- encoder_profiles = [
- Profile().add(
- "input_ids",
- min=(batch_size, 1),
- opt=(batch_size, opt_input_seq_len),
- max=(batch_size, max_input_length),
- )
- ]
-
- # Set up the non kv engine, used for non-kv mode and kv mode generation phase (1st decoder run uses the non-kv profile to generate kv cache)
- dec_profiles = Profile()
-
- # for beam search, decoder engine's inputs are expanded `num_beams` times
- # optimization profiles should be changed accordingly, but onnx models can be shared across greedy/beam because the first dim (batch size) is already a dynamic value, so no change needed in export.py
- if not hf_config.use_cache:
- dec_profiles = dec_profiles.add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, opt_output_seq_len),
- max=(batch_size * num_beams, max_output_length),
- )
- else:
- dec_profiles = dec_profiles.add(
- "input_ids",
- min=(batch_size * num_beams, 1),
- opt=(batch_size * num_beams, 1),
- max=(batch_size * num_beams, 1),
- )
-
- dec_profiles = dec_profiles.add(
- "encoder_hidden_states",
- min=(batch_size * num_beams, 1, encoder_hidden_size),
- opt=(batch_size * num_beams, opt_input_seq_len, encoder_hidden_size),
- max=(batch_size * num_beams, max_input_length, encoder_hidden_size),
- )
-
- if hf_config.use_cache:
-
- num_heads = hf_config.num_heads
- embedding_size_per_head = hf_config.d_kv
- num_decoder_layers = hf_config.num_decoder_layers
-            # Use the TensorRT zero-tensor feature for the 1st decoder run; the self attention cache grows as the sequence length increases.
- self_attention_profile = {
- "min": (batch_size * num_beams, num_heads, 0, embedding_size_per_head),
- "opt": (batch_size * num_beams, num_heads, opt_output_seq_len - 1, embedding_size_per_head),
- "max": (batch_size * num_beams, num_heads, max_output_length - 1, embedding_size_per_head),
- }
-
- # Cross attention kv cache does not change during single decoder iteration.
- cross_attention_profile = {
- "min": (batch_size * num_beams, num_heads, 1, embedding_size_per_head),
- "opt": (batch_size * num_beams, num_heads, opt_input_seq_len, embedding_size_per_head),
- "max": (batch_size * num_beams, num_heads, max_input_length, embedding_size_per_head),
- }
-
- for i in range(num_decoder_layers):
- dec_profiles = dec_profiles.add(
- f"past_key_values.{i}.decoder.key",
- **self_attention_profile
- ).add(
- f"past_key_values.{i}.decoder.value",
- **self_attention_profile
- ).add(
- f"past_key_values.{i}.encoder.key",
- **cross_attention_profile
- ).add(
- f"past_key_values.{i}.encoder.value",
- **cross_attention_profile
- )
-
- decoder_profiles = [dec_profiles]
-
- # Convert ONNX models to TRT engines.
- if benchmarking_args is None:
- engine_tag = "bs{}".format(batch_size)
-        # When the user does not provide any profile_max_len, use seq as the tag; both max values fall back to the config max.
- elif seq_tag:
- engine_tag = "bs{}-inseq{}-outseq{}".format(batch_size, benchmarking_args.input_seq_len, benchmarking_args.output_seq_len)
-        # When the user provides profile_max_len, the engine can be reused later with different seq_len values.
- else:
- engine_tag = "bs{}-inmax{}-outmax{}".format(batch_size, benchmarking_args.input_profile_max_len, benchmarking_args.output_profile_max_len)
-
- if num_beams > 1:
- engine_tag += "-beam{}".format(num_beams)
-
- preview_features = [PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
- if disable_preview_dynamic_shapes:
- engine_tag += "-noPreviewFasterDynamicShapes"
- else:
- preview_features.append(PreviewFeature.FASTER_DYNAMIC_SHAPES_0805)
-
- self.t5_trt_encoder_engine = T5EncoderONNXFile(
- encoder_onnx_fpath, metadata
- ).as_trt_engine(
- os.path.splitext(encoder_onnx_fpath)[0] + "-{}.engine".format(engine_tag).replace(f"-beam{num_beams}", ""), # encoder engine name not affected by beam search
- profiles=encoder_profiles,
- preview_features=preview_features
- )
-
- self.t5_trt_decoder_engine = T5DecoderONNXFile(
- decoder_onnx_fpath, metadata
- ).as_trt_engine(
- os.path.splitext(decoder_onnx_fpath)[0] + "-{}.engine".format(engine_tag),
- profiles=decoder_profiles,
- preview_features=preview_features
- )
-
- # Create T5TRTEncoder and T5TRTDecoder instances.
- self.t5_trt_encoder = T5TRTEncoder(
- self.t5_trt_encoder_engine, metadata, hf_config, batch_size=batch_size, benchmarking_args=benchmarking_args
- )
- self.t5_trt_decoder = T5TRTDecoder(
- self.t5_trt_decoder_engine, metadata, hf_config, batch_size=batch_size, num_beams=num_beams, benchmarking_args=benchmarking_args
- )
-
- if metadata.other.kv_cache:
- # Set up context phase profile. Context phase will use encoder_hidden_states to generate cross attention kv cache.
- cross_attention_kv_generation_profiles = [Profile().add(
- "encoder_hidden_states",
- min=(batch_size * num_beams, 1, encoder_hidden_size),
- opt=(batch_size * num_beams, opt_input_seq_len, encoder_hidden_size),
- max=(batch_size * num_beams, max_input_length, encoder_hidden_size),
- )]
- decoder_folder, decoder_name = os.path.split(decoder_onnx_fpath)
- decoder_name, decoder_ext = os.path.splitext(decoder_name)
- decoder_onnx_fpath_kv_generator = os.path.join(decoder_folder, "cross_attention_kv_generator", decoder_name + "-cross_attention_kv_generator" + decoder_ext)
- self.t5_trt_cross_attention_kv_generator = T5DecoderONNXFile(
- decoder_onnx_fpath_kv_generator, metadata
- ).as_trt_engine(
- os.path.splitext(decoder_onnx_fpath_kv_generator)[0] + "-{}.engine".format(engine_tag),
- profiles=cross_attention_kv_generation_profiles,
- preview_features=preview_features
- )
-
- self.t5_trt_decoder.set_cross_attention_kv_cache_engine(self.t5_trt_cross_attention_kv_generator)
-
- def run_trt(
- self,
- metadata: NetworkMetadata,
- onnx_fpaths: Tuple[NetworkModel],
- network_input: List[str],
- working_directory: str,
- keep_trt_engine: bool,
- keep_onnx_model: bool,
- keep_torch_model: bool,
- timing_profile: TimingProfile,
- batch_size: int = 1,
- args: object = None,
- benchmarking_mode: bool = False,
- disable_preview_dynamic_shapes: bool = False,
- perplexity_reference: List[str] = None,
- ) -> Union[List[NetworkResult], BenchmarkingResult] :
-
- workspace = self._setup_workspace(metadata, working_directory)
-
- # Keep onnx and Torch models if they are provided by users.
- if len(onnx_fpaths) == 0:
- onnx_fpaths = self._download_models(workspace, metadata)
- else:
- keep_onnx_model = True
- keep_torch_model = True
-
- hash_onnx_fpath = {v.name: v for v in onnx_fpaths}
-
- inference_results = []
- ppl_results = []
- try:
- if not benchmarking_mode:
- self._setup_engines(metadata, hash_onnx_fpath, batch_size, args.num_beams, disable_preview_dynamic_shapes)
- for ninput in network_input:
- inference_results.append(
- self.execute_inference(
- metadata, hash_onnx_fpath, ninput, timing_profile, batch_size, args.num_beams
- )
- )
- self.t5_trt_decoder.reset()
-
- if perplexity_reference is not None:
- assert len(network_input) == len(perplexity_reference), "Encoder and decoder inputs must pair up"
- if metadata.other.kv_cache or (args.num_beams > 1):
- G_LOGGER.warning("Skipping perplexity calculation for TRT with KV cache or beam search because it is not supported yet.")
- else:
- for ei, di in zip(network_input, perplexity_reference):
- ppl_results.append(
- self.execute_calculate_perplexity(metadata, ei, di, batch_size)
- )
- self.t5_trt_decoder.reset()
-
- else:
-                # Check that input_seq_len and output_seq_len are valid and within the required range.
- max_input_seq_len = T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant]
- max_output_seq_len = T5ModelTRTConfig.MAX_OUTPUT_LENGTH[metadata.variant]
-
- seq_tag = args.input_profile_max_len is None and args.output_profile_max_len is None
-                # The user must provide either a pair of profile_max_len values or a pair of seq_len values for input/output.
- if args.input_profile_max_len is None or args.output_profile_max_len is None:
- if args.input_seq_len is None or args.output_seq_len is None:
- assert False, "Please provide at least one pair of inputs: [input/output]_seq_len or [input/output]_profile_max_len"
-
- input_profile_max_len = setup_benchmark_arg(args.input_profile_max_len, "input_profile_max_len", max_input_seq_len)
- output_profile_max_len = setup_benchmark_arg(args.output_profile_max_len, "output_profile_max_len", max_output_seq_len)
- input_seq_len = setup_benchmark_arg(args.input_seq_len, "input_seq_len", input_profile_max_len // 2)
- output_seq_len = setup_benchmark_arg(args.output_seq_len, "output_seq_len", output_profile_max_len // 2)
-
- benchmarking_args = T5TRTBenchmarkingArgs(input_seq_len, output_seq_len, input_profile_max_len, output_profile_max_len)
-
- # Assert to ensure the validity of benchmarking arguments
- assert benchmarking_args.input_seq_len <= benchmarking_args.input_profile_max_len, "input_seq_len should <= input_profile_max_len = {} for benchmarking mode".format(benchmarking_args.input_profile_max_len)
- assert benchmarking_args.output_seq_len <= benchmarking_args.output_profile_max_len, "output_seq_len should <= output_profile_max_len = {} for benchmarking mode".format(benchmarking_args.output_profile_max_len)
- assert benchmarking_args.input_profile_max_len <= max_input_seq_len, "Model config restrict input_profile_max_len <= {} for benchmark mode".format(max_input_seq_len)
- assert benchmarking_args.output_profile_max_len <= max_output_seq_len, "Model config restrict output_profile_max_len <= {} for benchmark mode".format(max_output_seq_len)
-
- self._setup_engines(metadata, hash_onnx_fpath, batch_size, args.num_beams, disable_preview_dynamic_shapes, benchmarking_args, seq_tag)
- inference_results = self.execute_inference(
- metadata, hash_onnx_fpath, None, timing_profile, batch_size, args.num_beams, True, benchmarking_args
- )
-
- finally:
- self.cleanup(workspace, keep_trt_engine, keep_onnx_model, keep_torch_model)
-
- return inference_results, ppl_results
-
- def add_args(self, parser) -> None:
- super().add_args(parser)
- polygraphy_group = parser.add_argument_group("polygraphy models")
- polygraphy_group.add_argument(
- "--onnx-decoder-fpath",
- default=None,
- help="Path to ONNX decoder. If None is supplied, scripts will generate them from HuggingFace.",
- )
- polygraphy_group.add_argument(
- "--onnx-encoder-fpath",
- default=None,
- help="Path to ONNX encoder. If None is supplied, scripts will generate them from HuggingFace.",
- )
-
- def args_to_network_models(self, args) -> List[NetworkModel]:
- # Check if both flags are given otherwise error out
- decoder_fpath_check = args.onnx_decoder_fpath is None
- encoder_fpath_check = args.onnx_encoder_fpath is None
-
- network_models = None
- if decoder_fpath_check and encoder_fpath_check:
- network_models = tuple()
- elif decoder_fpath_check or encoder_fpath_check:
- raise self._parser.error(
- "Both --onnx-decoder-fpath and --onnx-encoder-fpath must be given. Otherwise neither should be provided for script to download them."
- )
- else:
- onnx_decoder = NetworkModel(
- name=T5ModelTRTConfig.NETWORK_DECODER_SEGMENT_NAME,
- fpath=args.onnx_decoder_fpath,
- )
- onnx_encoder = NetworkModel(
- name=T5ModelTRTConfig.NETWORK_ENCODER_SEGMENT_NAME,
- fpath=args.onnx_encoder_fpath,
- )
- network_models = (onnx_decoder, onnx_encoder)
-
- return network_models
-
- def args_to_network_metadata(self, args) -> NetworkMetadata:
- frameworks_parsed_metadata = self.frameworks_cmd.args_to_network_metadata(args)
-
- return NetworkMetadata(
- variant=frameworks_parsed_metadata.variant,
- precision=Precision(fp16=args.fp16),
- other=frameworks_parsed_metadata.other,
- )
-
-
-RUN_CMD = T5TRT()
-
-if __name__ == "__main__":
- result = RUN_CMD()
- print("Results: {}".format(result))
diff --git a/demo/HuggingFace/notebooks/.gitignore b/demo/HuggingFace/notebooks/.gitignore
deleted file mode 100644
index 899448b7..00000000
--- a/demo/HuggingFace/notebooks/.gitignore
+++ /dev/null
@@ -1,2 +0,0 @@
-**/.ipynb_checkpoints
-models/
diff --git a/demo/HuggingFace/notebooks/README.md b/demo/HuggingFace/notebooks/README.md
deleted file mode 100644
index a08cdd15..00000000
--- a/demo/HuggingFace/notebooks/README.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# TensorRT Demo with HuggingFace Models
-
-To run the demo Jupyter notebooks in this folder, follow the instructions in the [TRT setup guide](../../../README.md) to build and launch the docker container, e.g. `./docker/build.sh --file docker/ubuntu-20.04.Dockerfile --tag tensorrt-ubuntu20.04-cuda11.7` and `./docker/launch.sh --tag tensorrt-ubuntu20.04-cuda11.7 --gpus all --jupyter <port>`, specifying the Jupyter port number.
-
-Then, use your browser to start the Jupyter lab interface by opening the token-protected link provided in the terminal, e.g. `http://<host>:<port>/lab?token=...`.
-
-Notebook list:
-
-- [gpt2.ipynb](gpt2.ipynb): Step by step walkthrough for building the GPT-2 TensorRT engine.
-- [gpt2-playground.ipynb](gpt2-playground.ipynb): GUI for benchmarking GPT-2 TensorRT engines.
-- [t5.ipynb](t5.ipynb): Step by step walkthrough for building the T5 TensorRT engine.
-- [t5-playground.ipynb](t5-playground.ipynb): GUI for benchmarking T5 TensorRT engines.
-- [bart.ipynb](bart.ipynb): Step by step walkthrough for building the BART TensorRT engine.
-- [bart-playground.ipynb](bart-playground.ipynb): GUI for benchmarking BART TensorRT engines.
diff --git a/demo/HuggingFace/notebooks/bart-playground.ipynb b/demo/HuggingFace/notebooks/bart-playground.ipynb
deleted file mode 100644
index 59e0e20f..00000000
--- a/demo/HuggingFace/notebooks/bart-playground.ipynb
+++ /dev/null
@@ -1,317 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "64974d33-d028-440c-86fa-1a0633b3d31d",
- "metadata": {},
- "outputs": [],
- "source": [
- "# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n",
- "# SPDX-License-Identifier: Apache-2.0\n",
- "#\n",
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# http://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License.\n",
- "# =============================================================================="
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c3f0ff46-9958-4d57-9067-a64be34e75da",
- "metadata": {},
- "source": [
- "##### \n",
- "\n",
- "# BART Playground\n",
- "\n",
-    "This notebook demonstrates the BART model on the tasks of text summarization and mask filling.\n",
-    "\n",
-    "The TensorRT HuggingFace BART model is a plug-in replacement for the original PyTorch modules in the HuggingFace BART model.\n",
- "\n",
- "**Notes**: \n",
- " - For \"CPU - PyTorch\" and \"GPU - PyTorch\", a BART-base model from HuggingFace model repository is employed. Inference is carried out in FP32 for CPU-PyTorch, and FP16 for GPU-PyTorch and TensorRT. All models run with batch size 1.\n",
- "Average run time across 5 runs is reported.\n",
- " - Prior to running this notebook, run [bart.ipynb](bart.ipynb) to download the BART model and generate the TensorRT engine."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a005d22d-5b54-4e0c-866e-6eee6a6f98e4",
- "metadata": {},
- "outputs": [],
- "source": [
- "import ipywidgets as widgets\n",
- "\n",
- "model_selection = widgets.RadioButtons(\n",
- " options=['facebook/bart-base', \n",
- " 'facebook/bart-large', \n",
- " 'facebook/bart-large-cnn', \n",
- " 'facebook/mbart-large-50'],\n",
- " description='Model:',\n",
- " disabled=False\n",
- ")\n",
- "\n",
- "display(model_selection)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d35a33fd-4e85-4a1e-9989-af5adf903f79",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "import os\n",
- "import sys\n",
- "import glob\n",
- "ROOT_DIR = os.path.abspath(\"../\")\n",
- "sys.path.append(ROOT_DIR)\n",
- "\n",
- "import torch \n",
- "\n",
- "# huggingface\n",
- "from transformers import (\n",
- " AutoModelForPreTraining,\n",
- " AutoTokenizer,\n",
- " MBartForConditionalGeneration, \n",
- " MBart50Tokenizer,\n",
- " AutoConfig,\n",
- ")\n",
- "\n",
-    "# download HuggingFace model and tokenizer\n",
- "BART_VARIANT = model_selection.value\n",
- "\n",
- "# mbart variant can't be recognized by HF AutoClass yet\n",
- "if \"mbart\" not in BART_VARIANT: \n",
- " bart_model = AutoModelForPreTraining.from_pretrained(BART_VARIANT) # BartForConditionalGeneration\n",
- " tokenizer = AutoTokenizer.from_pretrained(BART_VARIANT) # BartTokenizer\n",
- "else:\n",
- " bart_model = MBartForConditionalGeneration.from_pretrained(BART_VARIANT)\n",
- " tokenizer = MBart50Tokenizer.from_pretrained(BART_VARIANT, src_lang=\"en_XX\")\n",
- "\n",
- "config = AutoConfig.from_pretrained(BART_VARIANT)\n",
- "\n",
- "# load TensorRT engine\n",
- "from BART.trt import BARTTRTEncoder, BARTTRTDecoder, TRTHFRunner\n",
- "from BART.BARTModelConfig import BARTModelTRTConfig, BARTMetadata\n",
- "from BART.export import BARTDecoderTRTEngine, BARTEncoderTRTEngine\n",
- "from NNDF.networks import NetworkMetadata, Precision\n",
- "\n",
- "from transformers.generation_logits_process import (\n",
- " NoRepeatNGramLogitsProcessor,\n",
- " MinLengthLogitsProcessor,\n",
- " ForcedBOSTokenLogitsProcessor,\n",
- " ForcedEOSTokenLogitsProcessor,\n",
- " LogitsProcessorList,\n",
- ")\n",
- "from transformers.generation_stopping_criteria import (\n",
- " MaxLengthCriteria,\n",
- " StoppingCriteriaList,\n",
- ")\n",
- "\n",
- "trt_config = AutoConfig.from_pretrained(BART_VARIANT)\n",
- "trt_config.use_cache = False\n",
- "trt_config.num_layers = BARTModelTRTConfig.NUMBER_OF_LAYERS[BART_VARIANT]\n",
- "\n",
- "metadata=NetworkMetadata(variant=BART_VARIANT, precision=Precision(fp16=True), other=BARTMetadata(kv_cache=False))\n",
- "metadata_string = BARTModelTRTConfig().get_metadata_string(metadata)\n",
- "\n",
- "encoder_stem = metadata_string + \"-encoder.onnx\"\n",
- "decoder_stem = metadata_string + \"-decoder-with-lm-head.onnx\"\n",
- "\n",
- "encoder_path = glob.glob(f'./models/{BART_VARIANT}/tensorrt/{encoder_stem}*')[0]\n",
- "decoder_path = glob.glob(f'./models/{BART_VARIANT}/tensorrt/{decoder_stem}*')[0]\n",
- "\n",
- "if not os.path.exists(encoder_path) or not os.path.exists(decoder_path):\n",
- " print(f\"Error: TensorRT engine not found at ./models/{BART_VARIANT}/tensorrt/. Please run bart.ipynb to generate the TensorRT engines first!\")\n",
- "else:\n",
- " encoder_engine = BARTEncoderTRTEngine(encoder_path, metadata)\n",
- " decoder_engine = BARTDecoderTRTEngine(decoder_path, metadata)\n",
- "\n",
- "bart_trt_encoder = BARTTRTEncoder(encoder_engine, metadata, trt_config)\n",
- "bart_trt_decoder = BARTTRTDecoder(decoder_engine, metadata, trt_config)\n",
- "\n",
- "decoder_input_ids = torch.full(\n",
- " (1, 1), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32\n",
- ").to(\"cuda:0\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "766b8c94-ba8e-47c8-8624-57da462a0496",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import time\n",
- "\n",
- "device = widgets.RadioButtons(\n",
- " options=['CPU - PyTorch', \n",
- " 'GPU - PyTorch', \n",
- " 'GPU - TensorRT'],\n",
- " description='Device:',\n",
- " disabled=False\n",
- ")\n",
- "\n",
- "task = widgets.RadioButtons(\n",
- " options=['Summarization', \n",
- " 'Mask Filling', \n",
- " ],\n",
- " description='Task:',\n",
- " disabled=False\n",
- ")\n",
- "\n",
- "example_text = {\n",
- " task.options[0]:\n",
- " \"NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms.\",\n",
- " task.options[1]: \n",
-    "    \"My friends are <mask> but they eat too many carbs.\"\n",
- " }\n",
- " \n",
- "paragraph_text = widgets.Textarea(\n",
- " value=example_text[task.options[0]],\n",
- " placeholder='Type something',\n",
- " description='Context:',\n",
- " disabled=False,\n",
- " layout=widgets.Layout(width=\"auto\"),\n",
- " rows=5, \n",
- ")\n",
- "\n",
- "generated_text = widgets.Textarea(\n",
- " value='...',\n",
- " placeholder='Context',\n",
- " description='BART output:',\n",
- " disabled=False,\n",
- " layout=widgets.Layout(width=\"auto\"),\n",
- " rows=5,\n",
- ")\n",
- "button = widgets.Button(description=\"Generate\")\n",
- "\n",
- "display(paragraph_text)\n",
- "display(generated_text)\n",
- "display(device)\n",
- "display(task)\n",
- "\n",
- "from IPython.display import display\n",
- "box_layout = widgets.Layout(display='flex',\n",
- " flex_flow='column',\n",
- " align_items='center',\n",
- " width='100%')\n",
- "N_RUN = 6\n",
- "progress_bar = widgets.IntProgress(\n",
- " value=0,\n",
- " min=0,\n",
- " max=N_RUN,\n",
- " description='Progress:',\n",
- " bar_style='', # 'success', 'info', 'warning', 'danger' or ''\n",
- " style={'bar_color': 'green'},\n",
- " orientation='horizontal', \n",
- " layout=widgets.Layout(width='100%', height='50px')\n",
- ")\n",
- "\n",
- "box = widgets.HBox(children=[button],layout=box_layout)\n",
- "output = widgets.Output()\n",
- "display(box)\n",
- "display(progress_bar)\n",
- "display(output)\n",
- "\n",
- "max_output_length = BARTModelTRTConfig.MAX_OUTPUT_LENGTH[BART_VARIANT]\n",
- "\n",
- "stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_output_length)])\n",
- "no_repeat_ngram_size = BARTModelTRTConfig.NO_REPEAT_NGRAM_SIZE\n",
- "min_length = BARTModelTRTConfig.MIN_OUTPUT_LENGTH[BART_VARIANT]\n",
- "logits_processor = LogitsProcessorList([\n",
- " NoRepeatNGramLogitsProcessor(no_repeat_ngram_size), \n",
- " MinLengthLogitsProcessor(min_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token)),\n",
- " ForcedBOSTokenLogitsProcessor(tokenizer.convert_tokens_to_ids(tokenizer.bos_token)),\n",
- " ForcedEOSTokenLogitsProcessor(max_output_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token))\n",
- "])\n",
- "\n",
- "def generate(b):\n",
- " progress_bar.value = 0\n",
- " inference_time_arr = []\n",
- " inputs = tokenizer(paragraph_text.value, return_tensors=\"pt\")\n",
- " \n",
- " with output:\n",
- " if device.value == 'GPU - TensorRT':\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " encoder_last_hidden_state = bart_trt_encoder(input_ids=inputs.input_ids)\n",
- " outputs = bart_trt_decoder.greedy_search(\n",
- " input_ids=decoder_input_ids,\n",
- " encoder_hidden_states=encoder_last_hidden_state,\n",
- " stopping_criteria = stopping_criteria,\n",
- " logits_processor=logits_processor,\n",
- " )\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- " print(\"GPU - TensorRT - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:]))) \n",
- " \n",
- " elif device.value == 'CPU - PyTorch':\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " outputs = bart_model.float().to('cpu').generate(inputs.input_ids.to('cpu'), num_beams=1, max_length=max_output_length)\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- " print(\"CPU - PyTorch - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:])))\n",
- " \n",
- " elif device.value == 'GPU - PyTorch': \n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " outputs = bart_model.half().to('cuda:0').generate(inputs.input_ids.to('cuda:0'), num_beams=1, max_length=max_output_length)\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- " print(\"GPU - PyTorch - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:]))) \n",
- " \n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- "\n",
- "\n",
- "def switch_task(change):\n",
- " with output:\n",
- " paragraph_text.value = example_text[task.value]\n",
- "\n",
- "task.observe(switch_task, 'value')\n",
- "\n",
- "button.on_click(generate)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.10"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/demo/HuggingFace/notebooks/bart.ipynb b/demo/HuggingFace/notebooks/bart.ipynb
deleted file mode 100644
index 5a9dd70b..00000000
--- a/demo/HuggingFace/notebooks/bart.ipynb
+++ /dev/null
@@ -1,1206 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "28e6e614-e360-4292-965e-0d255027e9b9",
- "metadata": {},
- "outputs": [],
- "source": [
- "# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n",
- "# SPDX-License-Identifier: Apache-2.0\n",
- "#\n",
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# http://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License.\n",
- "# =============================================================================="
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9b88dc1a-a92d-44cc-9fb7-d9e2ef20c8e2",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "# Accelerating HuggingFace BART Inference with TensorRT\n",
- "\n",
-    "BART is an encoder-decoder (sequence-to-sequence) model pretrained as a denoising autoencoder: input text is corrupted during pretraining and the model learns to reconstruct it. Once fine-tuned, it handles a wide variety of NLP tasks such as translation, classification, Q&A and summarization.\n",
- "\n",
- "This notebook shows easy steps to convert a [HuggingFace PyTorch BART model](https://huggingface.co/docs/transformers/model_doc/bart) to a TensorRT engine for high-performance inference, with performance comparison between PyTorch and TensorRT inference.\n",
- "\n",
- "1. [Download HuggingFace BART model](#1)\n",
- "1. [PyTorch HuggingFace Inference](#2)\n",
- "1. [TensorRT Engine Building](#3)\n",
- "1. [TensorRT Inference](#4)\n",
- "\n",
- "\n",
- "## Prerequisites\n",
- "\n",
- "Follow the instructions at https://github.com/NVIDIA/TensorRT to build the TensorRT-OSS docker container required to run this notebook.\n",
- "\n",
- "Next, we install some extra dependencies."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "0c36ecb7-c622-4d95-a851-b9a6eb18e81b",
- "metadata": {},
- "outputs": [],
- "source": [
- "#%%capture\n",
- "!pip3 install -r ../requirements.txt\n",
- "!pip3 install ipywidgets"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a1bbdafb",
- "metadata": {},
- "source": [
- "**Note:** After this step, you should restart the Jupyter kernel for the change to take effect."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "235d2f1b-439e-4cd0-8286-1d63a13f2cf3",
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "import sys\n",
- "ROOT_DIR = os.path.abspath(\"../\")\n",
- "sys.path.append(ROOT_DIR)\n",
- "\n",
- "# disable warning in notebook\n",
- "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",
- "\n",
- "# notebook widgets\n",
- "import ipywidgets as widgets\n",
- "widget_style = {'description_width': 'initial'}\n",
- "widget_layout = widgets.Layout(width='auto')\n",
- "\n",
- "import torch\n",
- "import tensorrt as trt\n",
- "from tensorrt import PreviewFeature\n",
- "from polygraphy.backend.trt import Profile\n",
- "\n",
- "import numpy as np\n",
- "import time\n",
- "\n",
- "# huggingface\n",
- "from transformers import (\n",
- " AutoModelForPreTraining,\n",
- " AutoTokenizer,\n",
- " AutoConfig,\n",
- ")\n",
- "\n",
- "# BART\n",
- "from BART.BARTModelConfig import BARTModelTRTConfig, BARTMetadata\n",
- "from BART.measurements import encoder_inference, decoder_inference, full_inference_greedy, full_inference_beam\n",
- "from BART.export import BARTEncoderTorchFile, BARTDecoderTorchFile, BARTEncoderONNXFile, BARTDecoderONNXFile, BARTEncoderTRTEngine, BARTDecoderTRTEngine\n",
- "from BART.trt import BARTTRTEncoder, BARTTRTDecoder\n",
- "\n",
- "# NNDF\n",
- "from NNDF.networks import NetworkMetadata, Precision\n",
- "from NNDF.networks import TimingProfile\n",
- "from NNDF.general_utils import measure_python_inference_code\n",
- "from NNDF.torch_utils import expand_inputs_for_beam_search"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "af4254e2-11fd-4bc7-ac0b-60b1a9e07c4e",
- "metadata": {
- "tags": []
- },
- "source": [
- "\n",
- "\n",
- "## 1. Download HuggingFace BART model\n",
- "\n",
-    "First, we download the original HuggingFace PyTorch BART model from the HuggingFace model hub, together with its associated tokenizer.\n",
- "\n",
-    "The BART variants that are supported by TensorRT are: facebook/bart-base (139M), facebook/bart-large (406M), facebook/bart-large-cnn (406M), facebook/mbart-large-50 (680M)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6a14eabc-d863-454d-9078-849acc857bb0",
- "metadata": {
- "tags": []
- },
- "source": [
- "### Model and Inference Configuration"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "774c89f3-7dbb-423d-88b2-1de693324389",
- "metadata": {},
- "outputs": [],
- "source": [
- "# UI\n",
- "model_widget = widgets.Select(\n",
- " options=['facebook/bart-base', 'facebook/bart-large', 'facebook/bart-large-cnn', 'facebook/mbart-large-50'],\n",
- " value='facebook/bart-base',\n",
- " description='Model variant:',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "display(model_widget)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "ed04130e-7f20-4a3e-bf76-52aa335f402d",
- "metadata": {},
- "outputs": [],
- "source": [
- "BART_VARIANT = model_widget.value\n",
- "\n",
- "disable_preview_dynamic_feature_widget = widgets.Checkbox(\n",
- " value=False,\n",
- " description='Disable 8.6 EA faster dynamic shapes feature',\n",
- " disabled=False,\n",
- " indent=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "FP16_widget = widgets.Checkbox(\n",
- " value=False,\n",
- " description='FP16',\n",
- " disabled=False,\n",
- " indent=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "HF_KV_widget = widgets.Checkbox(\n",
- " value=True,\n",
- " description='HuggingFace KV cache',\n",
- " disabled=False,\n",
- " indent=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "TRT_KV_widget = widgets.Checkbox(\n",
- " value=False,\n",
- " description='TensorRT KV cache (disabled due to performance improvements in progress, not beating non-KV version yet)', # \n",
- " disabled=True,\n",
- " indent=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "KV_widgets = widgets.HBox([HF_KV_widget,TRT_KV_widget])\n",
- "\n",
- "batch_size_widget = widgets.BoundedIntText(\n",
- " value=1,\n",
- " min=1,\n",
- " max=100000,\n",
- " step=1,\n",
- " description='Batch size',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "max_input_len_widget = widgets.BoundedIntText(\n",
- " value=BARTModelTRTConfig.MAX_SEQUENCE_LENGTH[BART_VARIANT],\n",
- " min=1,\n",
- " max=100000,\n",
- " step=1,\n",
- " description='Max input length',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "min_output_len_widget = widgets.BoundedIntText(\n",
- " value=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[BART_VARIANT],\n",
- " min=0,\n",
- " max=100000,\n",
- " step=1,\n",
- " description='Min output length',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "max_output_len_widget = widgets.BoundedIntText(\n",
- " value=BARTModelTRTConfig.MAX_OUTPUT_LENGTH[BART_VARIANT],\n",
- " min=1,\n",
- " max=100000,\n",
- " step=1,\n",
- " description='Max output length',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "encoder_hidden_size_widget = widgets.BoundedIntText(\n",
- " value=BARTModelTRTConfig.ENCODER_HIDDEN_SIZE[BART_VARIANT],\n",
- " min=1,\n",
- " max=100000,\n",
- " step=1,\n",
- " description='Encoder hidden size',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "num_beam_widget = widgets.BoundedIntText(\n",
- " value=1,\n",
- " min=1,\n",
- " max=100000,\n",
- " step=1,\n",
- " description='Number of beams',\n",
- " disabled=False,\n",
- " style=widget_style,\n",
- " layout=widget_layout\n",
- ")\n",
- "\n",
- "widgets_all = widgets.VBox([\n",
- " FP16_widget, \n",
- " disable_preview_dynamic_feature_widget,\n",
- " KV_widgets,\n",
- " batch_size_widget, \n",
- " max_input_len_widget,\n",
- " min_output_len_widget,\n",
- " max_output_len_widget, \n",
- " encoder_hidden_size_widget,\n",
- " num_beam_widget\n",
- "])\n",
- "\n",
- "display(widgets_all)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "077dd494-e8d8-42f9-bdbd-0362f1213118",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Inference config\n",
- "FP16 = FP16_widget.value # flag to use FP16 precision in PyTorch & TRT\n",
- "disable_preview_dynamic_shapes = disable_preview_dynamic_feature_widget.value # flag to disable 8.5 EA feature\n",
- "HF_KV = HF_KV_widget.value # flag to use KV cache in HF\n",
- "TRT_KV = TRT_KV_widget.value # flag to use KV cache in TRT\n",
- "\n",
- "# Model config\n",
- "batch_size = batch_size_widget.value\n",
- "max_input_len = max_input_len_widget.value\n",
- "min_output_len = min_output_len_widget.value\n",
- "max_output_len = max_output_len_widget.value\n",
- "encoder_hidden_size = encoder_hidden_size_widget.value\n",
- "num_beams = num_beam_widget.value\n",
- "\n",
- "# Benchmark config\n",
-    "# `TimingProfile` is a named tuple that specifies the number of experiments, the number of times to call the function per iteration, the number of warm-up calls, percentiles, etc.\n",
- "timing_profile = TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=[50,99])\n",
- "\n",
- "def percentile_print(timing):\n",
- " return ', '.join(['p{} {:.2f}ms'.format(timing_profile.percentile[i], p*1000) for i,p in enumerate(timing)])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "fae66d58-f994-4987-8f1d-1fa8ac2ec8b4",
- "metadata": {},
- "outputs": [],
- "source": [
- "# mbart variant can't be recognized by HF AutoClass yet\n",
- "if \"mbart\" not in BART_VARIANT: \n",
- " bart_model = AutoModelForPreTraining.from_pretrained(BART_VARIANT) # BartForConditionalGeneration\n",
- " tokenizer = AutoTokenizer.from_pretrained(BART_VARIANT) # BartTokenizer\n",
- "else:\n",
- " from transformers import MBartForConditionalGeneration, MBart50Tokenizer\n",
- " bart_model = MBartForConditionalGeneration.from_pretrained(BART_VARIANT)\n",
- " tokenizer = MBart50Tokenizer.from_pretrained(BART_VARIANT, src_lang=\"en_XX\")\n",
- "\n",
- "config = AutoConfig.from_pretrained(BART_VARIANT)\n",
- "\n",
- "bart_model = bart_model.to('cuda').eval()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7252ca90-1104-40dc-8e72-f51c07a4cd11",
- "metadata": {},
- "outputs": [],
- "source": [
- "# save model locally\n",
- "pytorch_model_dir = './models/{}/pytorch'.format(BART_VARIANT)\n",
- "!mkdir -p $pytorch_model_dir\n",
- "\n",
- "if os.path.exists(pytorch_model_dir) and len(os.listdir(pytorch_model_dir)) != 0:\n",
- " print('PyTorch model already exists. Skipping...')\n",
- "else:\n",
- " bart_model.save_pretrained(pytorch_model_dir)\n",
- " print(\"PyTorch model saved to {}\".format(pytorch_model_dir))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8e4d1d6e-1cad-43a2-a8c3-4bc221070dc2",
- "metadata": {
- "tags": []
- },
- "source": [
- "### Test Input Data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "fd1d0d09-be28-42a3-9135-46b796e5be79",
- "metadata": {},
- "outputs": [],
- "source": [
- "# input sequence\n",
- "inputs = \"NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms. TensorRT, built on the NVIDIA CUDA parallel programming model, enables developers to optimize inference by leveraging libraries, development tools, and technologies in CUDA-X for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, TensorRT also uses sparse tensor cores for an additional performance boost.\"\n",
- "\n",
- "input_ids = tokenizer(inputs, padding=True, return_tensors=\"pt\").input_ids.to('cuda')"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "11ea023d-c4d4-43bb-9d77-c76684e0b06f",
- "metadata": {
- "tags": []
- },
- "source": [
- "\n",
- "\n",
- "## 2. PyTorch HuggingFace Inference\n",
- "\n",
- "Next, we will carry out inference with the HuggingFace PyTorch model as a baseline."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fdb1d921-db47-4c45-bdcc-08ccc500ad99",
- "metadata": {},
- "source": [
- "### End-to-End HuggingFace Inference"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "10168132",
- "metadata": {},
- "outputs": [],
- "source": [
- "# WAR: Using an ugly representation because cuda 11.4 does not support GPU models due to cublas errors\n",
- "cuda_114_mode = \"cuda-11.4\" in os.environ[\"LD_LIBRARY_PATH\"]\n",
- "if cuda_114_mode:\n",
- " bart_model = bart_model.cpu()\n",
- " input_ids = input_ids.cpu()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d886e29a-1d1d-49e0-a351-3e4418f4bf28",
- "metadata": {},
- "outputs": [],
- "source": [
- "# encoder-decoder inference \n",
- "with torch.no_grad():\n",
- " output_ids = bart_model.generate(input_ids, max_length=max_output_len, min_length=min_output_len, num_beams=num_beams, use_cache=False) \n",
- " outputs = tokenizer.decode(output_ids[-1,:], skip_special_tokens=True) \n",
- "outputs_hf = outputs\n",
- "\n",
- "# timing\n",
- "# FP32\n",
- "bart_model.float()\n",
- "hf_nonkv_time = measure_python_inference_code(lambda: bart_model.generate(input_ids, max_length=max_output_len, min_length=min_output_len, num_beams=num_beams, use_cache=False), timing_profile)\n",
- "hf_kv_time = measure_python_inference_code(lambda: bart_model.generate(input_ids, max_length=max_output_len, min_length=min_output_len, num_beams=num_beams, use_cache=True), timing_profile)\n",
- "\n",
-    "# FP16: cuda 11.4 has a cublas error that will fail for BART in both CPU and GPU modes\n",
- "if not cuda_114_mode:\n",
- " bart_model.half()\n",
- "hf_nonkv_time_fp16 = measure_python_inference_code(lambda: bart_model.generate(input_ids, max_length=max_output_len, min_length=min_output_len, num_beams=num_beams, use_cache=False), timing_profile)\n",
- "hf_kv_time_fp16 = measure_python_inference_code(lambda: bart_model.generate(input_ids, max_length=max_output_len, min_length=min_output_len, num_beams=num_beams, use_cache=True), timing_profile)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "dab5c682-049a-48b3-830c-e1eecccbd553",
- "metadata": {},
- "outputs": [],
- "source": [
- "# print results and timing statistics\n",
- "print(f'Input length: {input_ids.size(1)}')\n",
- "print(inputs)\n",
- "print('\\n') \n",
- "print(f'Output length: {output_ids[-1,:].size(0)}')\n",
- "print(outputs_hf)\n",
- "print('\\n') \n",
- "print(f'Device: {torch.cuda.get_device_name()}')\n",
- "print(f\"Precision: FP32, Number of Beams: {num_beams}\")\n",
- "print(f\"HF time (no KV cache): {percentile_print(hf_nonkv_time)}\")\n",
- "print(f\"HF time (w/ KV cache): {percentile_print(hf_kv_time)}\")\n",
- "print(f\"Precision: FP16, Number of Beams: {num_beams}\")\n",
- "print(f\"HF time (no KV cache): {percentile_print(hf_nonkv_time_fp16)}\")\n",
- "print(f\"HF time (w/ KV cache): {percentile_print(hf_kv_time_fp16)}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "667fcacc-02cb-415d-a9ff-2d2ec44ef225",
- "metadata": {
- "tags": []
- },
- "source": [
- "### Time Measurement of Encoder, Decoder, and Full E2E\n",
- "For benchmarking purposes, we will employ helper functions `encoder_inference`, `decoder_inference`, and `full_inference_greedy` which execute the inference repeatedly for the BART encoder and decoder stacks separately as well as end-to-end for the entire output sequence, and measure the execution time. These execution times can be later on compared with TensorRT counterpart to demonstrate the speedup. \n",
- "\n",
- "Encoder and decoder of BART are wrapped as standalone PyTorch module for testing."
- ]
- },
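- {
- "cell_type": "markdown",
- "id": "timing-helper-sketch-md",
- "metadata": {},
- "source": [
- "As a rough illustration of what these timing helpers do (the real implementations live in the demo's measurement utilities, e.g. `measure_python_inference_code`), a minimal percentile-based timing loop might look like the sketch below. The function and argument names here are made up for illustration only."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "timing-helper-sketch-code",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Illustrative sketch only -- not the demo's actual measure_python_inference_code implementation.\n",
- "import time\n",
- "import numpy as np\n",
- "import torch\n",
- "\n",
- "def measure_inference_sketch(func, iterations=10, warmup=3, percentiles=(50, 99)):\n",
- "    # warm-up runs are executed but excluded from the statistics\n",
- "    for _ in range(warmup):\n",
- "        func()\n",
- "    latencies = []\n",
- "    for _ in range(iterations):\n",
- "        start = time.perf_counter()\n",
- "        func()\n",
- "        if torch.cuda.is_available():\n",
- "            torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock\n",
- "        latencies.append(time.perf_counter() - start)\n",
- "    # report latency percentiles in seconds, e.g. p50 and p99\n",
- "    return [np.percentile(latencies, p) for p in percentiles]"
- ]
- },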
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "2c07516f-b02b-4722-b0bd-06b632259702",
- "metadata": {},
- "outputs": [],
- "source": [
- "# FP32\n",
- "bart_model.float()\n",
- "bart_torch_encoder = BARTEncoderTorchFile.TorchModule(bart_model.get_encoder())\n",
- "bart_torch_decoder = BARTDecoderTorchFile.TorchModule(bart_model.get_decoder(), bart_model.lm_head, bart_model.final_logits_bias, bart_model.config)\n",
- "\n",
- "with torch.no_grad():\n",
- "\n",
- " encoder_last_hidden_state, encoder_pytorch_time = encoder_inference(bart_torch_encoder, input_ids, timing_profile)\n",
- " _, decoder_pytorch_time = decoder_inference(bart_torch_decoder, expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids, expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state, timing_profile, use_cache=HF_KV)\n",
- " if num_beams == 1:\n",
- " output_ids, full_pytorch_time = full_inference_greedy(bart_torch_encoder,bart_torch_decoder,input_ids,tokenizer,timing_profile,max_length=max_output_len, min_length=min_output_len, use_cache=HF_KV)\n",
- " else:\n",
- " output_ids, full_pytorch_time = full_inference_beam(bart_torch_encoder,bart_torch_decoder,input_ids,tokenizer,timing_profile,num_beams=num_beams,max_length=max_output_len, min_length=min_output_len, use_cache=HF_KV)\n",
- " outputs = tokenizer.decode(output_ids[0], skip_special_tokens=True) \n",
- "\n",
- "outputs_pytorch = outputs\n",
- "\n",
- "# FP16\n",
- "if not cuda_114_mode:\n",
- " bart_model.half()\n",
- "else:\n",
- " print(\"CUDA 11.4 is incompatible with current PyTorch version, using fp32 instead of fp16\")\n",
- "bart_torch_encoder_fp16 = BARTEncoderTorchFile.TorchModule(bart_model.get_encoder())\n",
- "bart_torch_decoder_fp16 = BARTDecoderTorchFile.TorchModule(bart_model.get_decoder(), bart_model.lm_head, bart_model.final_logits_bias, bart_model.config)\n",
- "\n",
- "with torch.no_grad():\n",
- "\n",
- " encoder_last_hidden_state, encoder_pytorch_time_fp16 = encoder_inference(bart_torch_encoder_fp16, input_ids, timing_profile)\n",
- " _, decoder_pytorch_time_fp16 = decoder_inference(bart_torch_decoder_fp16, expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids, expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state, timing_profile, use_cache=HF_KV)\n",
- " if num_beams == 1:\n",
- " output_ids_fp16, full_pytorch_time_fp16 = full_inference_greedy(bart_torch_encoder_fp16,bart_torch_decoder_fp16,input_ids,tokenizer,timing_profile,max_length=max_output_len, min_length=min_output_len, use_cache=HF_KV)\n",
- " else:\n",
- " output_ids_fp16, full_pytorch_time_fp16 = full_inference_beam(bart_torch_encoder_fp16,bart_torch_decoder_fp16,input_ids,tokenizer,timing_profile,num_beams=num_beams,max_length=max_output_len, min_length=min_output_len, use_cache=HF_KV)\n",
- " outputs_fp16 = tokenizer.decode(output_ids_fp16[0], skip_special_tokens=True) \n",
- "\n",
- "outputs_pytorch_fp16 = outputs_fp16"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a103e3a6-920b-4c97-818e-6140654abc5e",
- "metadata": {},
- "outputs": [],
- "source": [
- "# print\n",
- "print(f'PyTorch FP32 Output identical to HF results? {outputs_pytorch == outputs_hf}')\n",
- "print(f'PyTorch FP16 Output identical to HF results? {outputs_pytorch_fp16 == outputs_hf}')\n",
- "print('\\n') \n",
- "print(f'Device: {torch.cuda.get_device_name()}')\n",
- "print(f\"Precision: FP32, Number of Beams: {num_beams}\")\n",
- "print(f\"Encoder time: {percentile_print(encoder_pytorch_time)}\")\n",
- "print(f\"Decoder time: {percentile_print(decoder_pytorch_time)}\")\n",
- "print(f\"Full E2E time: {percentile_print(full_pytorch_time)}\")\n",
- "print(f\"Precision: FP16, Number of Beams: {num_beams}\")\n",
- "print(f\"Encoder time: {percentile_print(encoder_pytorch_time_fp16)}\")\n",
- "print(f\"Decoder time: {percentile_print(decoder_pytorch_time_fp16)}\")\n",
- "print(f\"Full E2E time: {percentile_print(full_pytorch_time_fp16)}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0d662701-e430-4fdc-ad46-1f296defcf8f",
- "metadata": {
- "tags": []
- },
- "source": [
- "\n",
- "\n",
- "## 3. TensorRT Engine Building\n",
- "\n",
- "### Convert PyTorch to ONNX\n",
- "\n",
- "Prior to converting the model to a TensorRT engine, we will first convert the PyTorch model to an intermediate universal format.\n",
- "\n",
- "ONNX is an open format for machine learning and deep learning models. It allows you to convert deep learning and machine learning models from different frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras to a single format.\n",
- "\n",
- "The steps to convert a PyTorch model to TensorRT are as follows:\n",
- "- Convert the pretrained PyTorch model into ONNX.\n",
- "- Import the ONNX model into TensorRT, apply optimizations and generate a TensorRT engine.\n",
- "- Perform inference on the GPU using the engine. \n",
- "\n",
- "For the BART model, we will convert the encoder and decoder to ONNX and build each engine seperately. The logistics of this separate building approach come from the nature of sequence-to-sequence models. BART and T5 are good examples of sequence-to-sequence models which use encoder-decoder architecture. The encoder is only executed once on the input and generates hidden states. Next, the decoder is executed repeatedly in an auto-regressive manner until the entire output finishes generating, i.e. the output sequence length is the number of times the decoder runs. The most efficient way to run encoder-decoder models with TensorRT is to have two separate engines."
- ]
- },
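- {
- "cell_type": "markdown",
- "id": "separate-engine-sketch-md",
- "metadata": {},
- "source": [
- "To make the reasoning above concrete, the following simplified greedy-decoding sketch shows why the encoder runs once while the decoder runs once per generated token. It is an illustration only; the actual demo uses the helper classes and HuggingFace generation utilities shown later in this notebook."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "separate-engine-sketch-code",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Simplified greedy decoding sketch (illustration only, not the demo's implementation).\n",
- "# `encoder` and `decoder` stand for any callables, e.g. the two TensorRT engines built below.\n",
- "def greedy_generate_sketch(encoder, decoder, input_ids, start_token_id, eos_token_id, max_output_len):\n",
- "    encoder_hidden_states = encoder(input_ids)  # the encoder runs exactly once per input\n",
- "    output_ids = [start_token_id]\n",
- "    for _ in range(max_output_len):  # the decoder runs once per generated token\n",
- "        logits = decoder(output_ids, encoder_hidden_states)\n",
- "        next_token = int(logits[-1].argmax())  # pick the most likely next token (greedy)\n",
- "        output_ids.append(next_token)\n",
- "        if next_token == eos_token_id:\n",
- "            break\n",
- "    return output_ids"
- ]
- },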
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "4ea48be5-1dae-4e93-92a4-840d7017ad9b",
- "metadata": {},
- "outputs": [],
- "source": [
- "onnx_model_path = './models/{}/onnx'.format(BART_VARIANT)\n",
- "!mkdir -p $onnx_model_path\n",
- "\n",
- "# FP32\n",
- "bart_model.float()\n",
- "metadata = NetworkMetadata(variant=BART_VARIANT, precision=Precision(fp16=False), other=BARTMetadata(kv_cache=TRT_KV))\n",
- "trt_config = BARTModelTRTConfig()\n",
- "metadata_string = trt_config.get_metadata_string(metadata)\n",
- "\n",
- "encoder_onnx_model_fpath = metadata_string + \"-encoder.onnx\"\n",
- "decoder_onnx_model_fpath = metadata_string + \"-decoder-with-lm-head.onnx\"\n",
- "\n",
- "# for onnx conversion, ensure model is on CPU and FP32 precision in this step\n",
- "bart_torchfile_encoder = BARTEncoderTorchFile(bart_model.to('cpu'), metadata)\n",
- "bart_torchfile_decoder = BARTDecoderTorchFile(bart_model.to('cpu'), metadata)\n",
- "\n",
- "onnx_bart_encoder = bart_torchfile_encoder.as_onnx_model(os.path.join(onnx_model_path, encoder_onnx_model_fpath), force_overwrite=False)\n",
- "onnx_bart_decoder = bart_torchfile_decoder.as_onnx_model(os.path.join(onnx_model_path, decoder_onnx_model_fpath), force_overwrite=False)\n",
- "\n",
- "# FP16\n",
- "metadata_fp16 = NetworkMetadata(variant=BART_VARIANT, precision=Precision(fp16=True), other=BARTMetadata(kv_cache=TRT_KV))\n",
- "trt_config_fp16 = BARTModelTRTConfig()\n",
- "metadata_string_fp16 = trt_config.get_metadata_string(metadata_fp16)\n",
- "\n",
- "encoder_onnx_model_fpath_fp16 = metadata_string_fp16 + \"-encoder.onnx\"\n",
- "decoder_onnx_model_fpath_fp16 = metadata_string_fp16 + \"-decoder-with-lm-head.onnx\"\n",
- "\n",
- "# for onnx conversion, ensure model is on CPU and FP32 precision in this step\n",
- "bart_torchfile_encoder = BARTEncoderTorchFile(bart_model.to('cpu'), metadata)\n",
- "bart_torchfile_decoder = BARTDecoderTorchFile(bart_model.to('cpu'), metadata)\n",
- "\n",
- "onnx_bart_encoder_fp16 = bart_torchfile_encoder.as_onnx_model(os.path.join(onnx_model_path, encoder_onnx_model_fpath_fp16), force_overwrite=False)\n",
- "onnx_bart_decoder_fp16 = bart_torchfile_decoder.as_onnx_model(os.path.join(onnx_model_path, decoder_onnx_model_fpath_fp16), force_overwrite=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7baf007e-5508-485c-a87f-9bfe16260452",
- "metadata": {},
- "source": [
- "### Convert ONNX to TensorRT\n",
- "\n",
- "Now we are ready to parse the ONNX encoder and decoder models and convert them to optimized TensorRT engines.\n",
- "\n",
- "Since the models contains dynamic input shapes, we can specify a valid input range with a TensorRT optimization profile."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6bd6e3fc-6797-46b0-a211-ce42d3769105",
- "metadata": {},
- "outputs": [],
- "source": [
- "tensorrt_model_path = './models/{}/tensorrt'.format(BART_VARIANT)\n",
- "!mkdir -p $tensorrt_model_path\n",
- "\n",
- "# Encoder optimization profiles\n",
- "encoder_profile = Profile()\n",
- "encoder_profile.add(\n",
- " \"input_ids\",\n",
- " min=(batch_size, 1),\n",
- " opt=(batch_size, max_input_len // 2),\n",
- " max=(batch_size, max_input_len),\n",
- ")\n",
- "\n",
- "# Decoder optimization profiles\n",
- "decoder_profile = Profile()\n",
- "decoder_profile.add(\n",
- " \"input_ids\",\n",
- " min=(batch_size * num_beams, 1),\n",
- " opt=(batch_size * num_beams, max_output_len // 2),\n",
- " max=(batch_size * num_beams, max_output_len),\n",
- ")\n",
- "decoder_profile.add(\n",
- " \"encoder_hidden_states\",\n",
- " min=(batch_size * num_beams, 1, encoder_hidden_size),\n",
- " opt=(batch_size * num_beams, max_input_len // 2, encoder_hidden_size),\n",
- " max=(batch_size * num_beams, max_input_len, encoder_hidden_size),\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "aa5738ff-790e-47a0-ba03-27af87742646",
- "metadata": {},
- "outputs": [],
- "source": [
- "engine_tag = f\"bs{batch_size}\"\n",
- "\n",
- "if num_beams > 1:\n",
- " engine_tag += \"-beam{}\".format(num_beams)\n",
- "\n",
- "preview_features = [PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]\n",
- "if disable_preview_dynamic_shapes:\n",
- " engine_tag += \"-noPreviewFasterDynamicShapes\"\n",
- "else:\n",
- " preview_features.append(PreviewFeature.FASTER_DYNAMIC_SHAPES_0805)\n",
- "\n",
- "# FP32\n",
- "encoder_engine_name = os.path.join(tensorrt_model_path, encoder_onnx_model_fpath) + f\"-{engine_tag}.engine\".replace(f\"-beam{num_beams}\", \"\") # encoder engine not affected by beam search\n",
- "decoder_engine_name = os.path.join(tensorrt_model_path, decoder_onnx_model_fpath) + f\"-{engine_tag}.engine\"\n",
- "\n",
- "if not os.path.exists(encoder_engine_name):\n",
- " bart_trt_encoder_engine = BARTEncoderONNXFile(os.path.join(onnx_model_path, encoder_onnx_model_fpath), metadata).as_trt_engine(\n",
- " encoder_engine_name, \n",
- " profiles=[encoder_profile], \n",
- " preview_features=preview_features\n",
- " )\n",
- "else:\n",
- " bart_trt_encoder_engine = BARTEncoderTRTEngine(encoder_engine_name, metadata)\n",
- " \n",
- "if not os.path.exists(decoder_engine_name):\n",
- " bart_trt_decoder_engine = BARTDecoderONNXFile(os.path.join(onnx_model_path, decoder_onnx_model_fpath), metadata).as_trt_engine(\n",
- " decoder_engine_name, \n",
- " profiles=[decoder_profile], \n",
- " preview_features=preview_features\n",
- " )\n",
- "else:\n",
- " bart_trt_decoder_engine = BARTDecoderTRTEngine(decoder_engine_name, metadata)\n",
- "\n",
- "# FP16\n",
- "encoder_engine_name_fp16 = os.path.join(tensorrt_model_path, encoder_onnx_model_fpath_fp16) + f\"-{engine_tag}.engine\".replace(f\"-beam{num_beams}\", \"\") # encoder engine not affected by beam search\n",
- "decoder_engine_name_fp16 = os.path.join(tensorrt_model_path, decoder_onnx_model_fpath_fp16) + f\"-{engine_tag}.engine\"\n",
- "\n",
- "if not os.path.exists(encoder_engine_name_fp16):\n",
- " bart_trt_encoder_engine_fp16 = BARTEncoderONNXFile(os.path.join(onnx_model_path, encoder_onnx_model_fpath_fp16), metadata_fp16).as_trt_engine(\n",
- " encoder_engine_name_fp16, \n",
- " profiles=[encoder_profile], \n",
- " preview_features=preview_features\n",
- " )\n",
- "else:\n",
- " bart_trt_encoder_engine_fp16 = BARTEncoderTRTEngine(encoder_engine_name_fp16, metadata_fp16)\n",
- " \n",
- "if not os.path.exists(decoder_engine_name_fp16):\n",
- " bart_trt_decoder_engine_fp16 = BARTDecoderONNXFile(os.path.join(onnx_model_path, decoder_onnx_model_fpath_fp16), metadata_fp16).as_trt_engine(\n",
- " decoder_engine_name_fp16, \n",
- " profiles=[decoder_profile], \n",
- " preview_features=preview_features\n",
- " )\n",
- "else:\n",
- " bart_trt_decoder_engine_fp16 = BARTDecoderTRTEngine(decoder_engine_name_fp16, metadata_fp16)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "74f7f6fc-1e6a-4ddc-8e9b-543d9e8dab4d",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 4. TensorRT Inference\n",
- "\n",
- "Great, if you have reached this stage, it means we now have successfully built optimized TensorRT engines for the BART model, ready for us to carry out inference. The BART model with TensorRT backend can now be employed in place of the original HuggingFace BART model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3954f2f4-c393-463b-a44b-3e5335032b57",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Initialize TensorRT engines\n",
- "trt_config = AutoConfig.from_pretrained(BART_VARIANT, use_cache = metadata.other.kv_cache)\n",
- "\n",
- "# FP32\n",
- "bart_trt_encoder = BARTTRTEncoder(bart_trt_encoder_engine, metadata, trt_config, batch_size=batch_size)\n",
- "bart_trt_decoder = BARTTRTDecoder(bart_trt_decoder_engine, metadata, trt_config, batch_size=batch_size, num_beams=num_beams)\n",
- "\n",
- "# FP16\n",
- "bart_trt_encoder_fp16 = BARTTRTEncoder(bart_trt_encoder_engine_fp16, metadata_fp16, trt_config, batch_size=batch_size)\n",
- "bart_trt_decoder_fp16 = BARTTRTDecoder(bart_trt_decoder_engine_fp16, metadata_fp16, trt_config, batch_size=batch_size, num_beams=num_beams)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f7025246-4f14-4449-bb93-6c1566f48773",
- "metadata": {},
- "source": [
- "### End-to-End TensorRT Inference"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "92a5bbfe-a576-4a94-99d1-f0862b31fdb4",
- "metadata": {},
- "outputs": [],
- "source": [
- "from transformers.generation_logits_process import (\n",
- " NoRepeatNGramLogitsProcessor,\n",
- " MinLengthLogitsProcessor,\n",
- " ForcedBOSTokenLogitsProcessor,\n",
- " ForcedEOSTokenLogitsProcessor,\n",
- " LogitsProcessorList,\n",
- ")\n",
- "from transformers.generation_stopping_criteria import (\n",
- " MaxLengthCriteria,\n",
- " StoppingCriteriaList,\n",
- ")\n",
- "from transformers.generation_beam_search import (\n",
- " BeamSearchScorer,\n",
- ")\n",
- "\n",
- "stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_output_len)])\n",
- "no_repeat_ngram_size = BARTModelTRTConfig.NO_REPEAT_NGRAM_SIZE\n",
- "min_length = BARTModelTRTConfig.MIN_OUTPUT_LENGTH[BART_VARIANT]\n",
- "logits_processor = LogitsProcessorList([\n",
- " NoRepeatNGramLogitsProcessor(no_repeat_ngram_size), \n",
- " MinLengthLogitsProcessor(min_length, tokenizer.convert_tokens_to_ids(tokenizer.eos_token)),\n",
- " ForcedBOSTokenLogitsProcessor(tokenizer.convert_tokens_to_ids(tokenizer.bos_token)),\n",
- " ForcedEOSTokenLogitsProcessor(max_output_len, tokenizer.convert_tokens_to_ids(tokenizer.eos_token))\n",
- "]) # by checking HuggingFace's generate() implementation carefully, the default logits processor for BART has no_repeat_ngram_size = 3 and forced_eos_token_id = 2. In this way we can ensure identical results with raw HuggingFace\n",
- "\n",
- "decoder_initial_input = torch.full(\n",
- " (batch_size, 1), tokenizer.convert_tokens_to_ids(tokenizer.eos_token), dtype=torch.int32\n",
- ").to('cuda')\n",
- "\n",
- "if num_beams > 1:\n",
- " decoder_initial_input = expand_inputs_for_beam_search(decoder_initial_input, expand_size=num_beams)\n",
- " \n",
- "# FP32\n",
- "def e2e_trt():\n",
- " with torch.no_grad():\n",
- " encoder_last_hidden_states = bart_trt_encoder(input_ids=input_ids)\n",
- " \n",
- " if num_beams > 1:\n",
- " # prepare input for beam search\n",
- " encoder_last_hidden_states = expand_inputs_for_beam_search(encoder_last_hidden_states, expand_size=num_beams)\n",
- "\n",
- " # beam scorer must be reset before each beam search run, otherwise beam search will be skipped due to scorer cache\n",
- " beam_scorer = BeamSearchScorer(\n",
- " batch_size=batch_size,\n",
- " num_beams=num_beams,\n",
- " device=\"cuda\",\n",
- " do_early_stopping=True,\n",
- " )\n",
- " \n",
- " bart_trt_decoder.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_states)\n",
- " \n",
- " if num_beams == 1:\n",
- " decoder_output = bart_trt_decoder.greedy_search(\n",
- " input_ids=decoder_initial_input,\n",
- " encoder_hidden_states=encoder_last_hidden_states,\n",
- " stopping_criteria=stopping_criteria,\n",
- " logits_processor=logits_processor,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " use_cuda=True\n",
- " )\n",
- " else:\n",
- " decoder_output = bart_trt_decoder.beam_search(\n",
- " input_ids=decoder_initial_input,\n",
- " beam_scorer=beam_scorer,\n",
- " encoder_hidden_states=encoder_last_hidden_states,\n",
- " stopping_criteria=stopping_criteria,\n",
- " logits_processor=logits_processor,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " use_cuda=True\n",
- " )\n",
- " return decoder_output\n",
- "\n",
- "output_ids = e2e_trt()\n",
- "outputs_trt = tokenizer.decode(output_ids[0], skip_special_tokens=True)\n",
- "trt_time = measure_python_inference_code(e2e_trt, timing_profile)\n",
- "\n",
- "# FP16\n",
- "def e2e_trt_fp16():\n",
- " with torch.no_grad():\n",
- " encoder_last_hidden_states = bart_trt_encoder_fp16(input_ids=input_ids)\n",
- " \n",
- " if num_beams > 1:\n",
- " # prepare input for beam search\n",
- " encoder_last_hidden_states = expand_inputs_for_beam_search(encoder_last_hidden_states, expand_size=num_beams)\n",
- " \n",
- " # beam scorer must be reset before each beam search run, otherwise beam search will be skipped due to scorer cache\n",
- " beam_scorer = BeamSearchScorer(\n",
- " batch_size=batch_size,\n",
- " num_beams=num_beams,\n",
- " device=\"cuda\",\n",
- " do_early_stopping=True,\n",
- " )\n",
- " \n",
- " bart_trt_decoder_fp16.set_encoder_hidden_states_for_inference_cycle(encoder_last_hidden_states)\n",
- " \n",
- " if num_beams == 1:\n",
- " decoder_output = bart_trt_decoder_fp16.greedy_search(\n",
- " input_ids=decoder_initial_input,\n",
- " encoder_hidden_states=encoder_last_hidden_states,\n",
- " stopping_criteria=stopping_criteria,\n",
- " logits_processor=logits_processor,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " use_cuda=True\n",
- " )\n",
- " else:\n",
- " decoder_output = bart_trt_decoder_fp16.beam_search(\n",
- " input_ids=decoder_initial_input,\n",
- " beam_scorer=beam_scorer,\n",
- " encoder_hidden_states=encoder_last_hidden_states,\n",
- " stopping_criteria=stopping_criteria,\n",
- " logits_processor=logits_processor,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " use_cuda=True\n",
- " )\n",
- " return decoder_output\n",
- "\n",
- "output_ids_fp16 = e2e_trt_fp16()\n",
- "outputs_trt_fp16 = tokenizer.decode(output_ids_fp16[0], skip_special_tokens=True)\n",
- "trt_time_fp16 = measure_python_inference_code(e2e_trt_fp16, timing_profile)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6198afcf-70d1-46ef-a515-dcf5ea4c17b6",
- "metadata": {},
- "outputs": [],
- "source": [
- "# print results and timing statistics\n",
- "print(f'Device: {torch.cuda.get_device_name()}')\n",
- "print(f\"Using engine: {metadata_string + '-' + engine_tag}\") \n",
- "print(f'Output identical to HF results? {outputs_trt == outputs_hf}')\n",
- "print(f\"Precision: FP32\")\n",
- "print(f'TRT time: {percentile_print(trt_time)}')\n",
- "print()\n",
- "print(f\"Using engine: {metadata_string_fp16 + '-' + engine_tag}\") \n",
- "print(f'Output identical to HF results? {outputs_trt_fp16 == outputs_hf}')\n",
- "print(f\"Precision: FP16\")\n",
- "print(f'TRT time: {percentile_print(trt_time_fp16)}')"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ed9d4a98-b034-470e-a9f8-096d4100b8d4",
- "metadata": {},
- "source": [
- "### Time Measurement of Encoder, Decoder, and Full E2E\n",
- "We will benchmark the encoder, decoder, and full end-to-end as we did for HuggingFace before."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "2320e4bf-94f2-40d8-9a86-3a1ea352fca2",
- "metadata": {},
- "outputs": [],
- "source": [
- "# FP32\n",
- "encoder_last_hidden_states, encoder_trt_time = encoder_inference(bart_trt_encoder, input_ids, timing_profile)\n",
- "_, decoder_trt_time = decoder_inference(bart_trt_decoder, expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids, expand_inputs_for_beam_search(encoder_last_hidden_states, num_beams) if num_beams > 1 else encoder_last_hidden_states, timing_profile)\n",
- "\n",
- "if num_beams == 1:\n",
- " _, full_trt_time = full_inference_greedy(\n",
- " bart_trt_encoder,\n",
- " bart_trt_decoder,\n",
- " input_ids,\n",
- " tokenizer,\n",
- " timing_profile,\n",
- " max_length=max_output_len,\n",
- " min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant],\n",
- " batch_size=batch_size,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " )\n",
- "else:\n",
- " _, full_trt_time = full_inference_beam(\n",
- " bart_trt_encoder,\n",
- " bart_trt_decoder,\n",
- " input_ids,\n",
- " tokenizer,\n",
- " timing_profile,\n",
- " num_beams=num_beams,\n",
- " max_length=max_output_len,\n",
- " min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant],\n",
- " batch_size=batch_size,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " early_stopping=True,\n",
- " )\n",
- " \n",
- "print(f'Encoder time: {percentile_print(encoder_trt_time)}')\n",
- "print(f'Decoder time: {percentile_print(decoder_trt_time)}')\n",
- "print(f'Full E2E time: {percentile_print(full_trt_time)}')\n",
- "\n",
- "# FP16\n",
- "encoder_last_hidden_states, encoder_trt_time_fp16 = encoder_inference(bart_trt_encoder_fp16, input_ids, timing_profile)\n",
- "_, decoder_trt_time_fp16 = decoder_inference(bart_trt_decoder_fp16, expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids, expand_inputs_for_beam_search(encoder_last_hidden_states, num_beams) if num_beams > 1 else encoder_last_hidden_states, timing_profile)\n",
- "\n",
- "if num_beams == 1:\n",
- " _, full_trt_time_fp16 = full_inference_greedy(\n",
- " bart_trt_encoder_fp16,\n",
- " bart_trt_decoder_fp16,\n",
- " input_ids,\n",
- " tokenizer,\n",
- " timing_profile,\n",
- " max_length=max_output_len,\n",
- " min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant],\n",
- " batch_size=batch_size,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " )\n",
- "else:\n",
- " _, full_trt_time_fp16 = full_inference_beam(\n",
- " bart_trt_encoder_fp16,\n",
- " bart_trt_decoder_fp16,\n",
- " input_ids,\n",
- " tokenizer,\n",
- " timing_profile,\n",
- " num_beams=num_beams,\n",
- " max_length=max_output_len,\n",
- " min_length=BARTModelTRTConfig.MIN_OUTPUT_LENGTH[metadata.variant],\n",
- " batch_size=batch_size,\n",
- " use_cache=metadata.other.kv_cache,\n",
- " early_stopping=True,\n",
- " )\n",
- "print(f'Encoder FP16 time: {percentile_print(encoder_trt_time_fp16)}')\n",
- "print(f'Decoder FP16 time: {percentile_print(decoder_trt_time_fp16)}')\n",
- "print(f'Full E2E FP16 time: {percentile_print(full_trt_time_fp16)}')"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "e27cda12-7e56-4a87-935d-ce598557cf26",
- "metadata": {},
- "source": [
- "## Comparison"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "1090b46c-adec-4684-8c53-a54a196dedb1",
- "metadata": {},
- "outputs": [],
- "source": [
- "from tabulate import tabulate\n",
- "\n",
- "data = [\n",
- " ['Framework', 'Precision', 'Encoder p50 (ms)', 'Decoder p50 (ms)', 'Full E2E p50 (ms)', 'Accuracy'],\n",
- " ['HuggingFace (w/o cache)', 'FP32', '-', '-', f'{hf_nonkv_time[0]*1000:.2f}', '-'],\n",
- " ['HuggingFace (w/ cache)', 'FP32', '-', '-', f'{hf_kv_time[0]*1000:.2f}', '-'],\n",
- " ['HuggingFace (w/o cache)', 'FP16', '-', '-', f'{hf_nonkv_time_fp16[0]*1000:.2f}', '-'],\n",
- " ['HuggingFace (w/ cache)', 'FP16', '-', '-', f'{hf_kv_time_fp16[0]*1000:.2f}', '-'],\n",
- " ['PyTorch', 'FP32', f'{encoder_pytorch_time[0]*1000:.2f}', f'{decoder_pytorch_time[0]*1000:.2f}', f'{full_pytorch_time[0]*1000:.2f}', outputs_pytorch == outputs_hf],\n",
- " ['PyTorch', 'FP16', f'{encoder_pytorch_time_fp16[0]*1000:.2f}', f'{decoder_pytorch_time_fp16[0]*1000:.2f}', f'{full_pytorch_time_fp16[0]*1000:.2f}', outputs_pytorch_fp16 == outputs_hf],\n",
- " ['TensorRT', 'FP32', f'{encoder_trt_time[0]*1000:.2f}', f'{decoder_trt_time[0]*1000:.2f}', f'{full_trt_time[0]*1000:.2f}', outputs_trt == outputs_hf],\n",
- " ['TensorRT', 'FP16', f'{encoder_trt_time_fp16[0]*1000:.2f}', f'{decoder_trt_time_fp16[0]*1000:.2f}', f'{full_trt_time_fp16[0]*1000:.2f}', outputs_trt_fp16 == outputs_hf],\n",
- "]\n",
- "\n",
- "print(tabulate(data, headers='firstrow', tablefmt='github'))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "92031643-8ee8-4d50-864b-a08e4d551dc6",
- "metadata": {},
- "source": [
- "We can now compare the original HuggingFace model and the TensorRT engine, from both separate encoder/decoder and end-to-end speed difference. For bart-base variant on an NVIDIA Titan V GPU and input/output sequence length around 130, this results in about 2x performance improvement with FP16 inference."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9a498672-ba25-42b0-b89e-79e0b869943a",
- "metadata": {},
- "source": [
- "## Variable Input/Output Length\n",
- "\n",
- "We can run more tests by varying input/output length, while using the same engines.\n",
- "\n",
- "Note that TensorRT performance depends on optimal selection of the kernels in the engine. The variable length test here uses the same engine built with max input/output length profile, therefore may not represent the best perf. If the use case has known input/output length ranges, it is highly recommended to specify in the TensorRT engine profiles to ensure optimized kernel selection."
- ]
- },
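- {
- "cell_type": "markdown",
- "id": "tight-profile-sketch-md",
- "metadata": {},
- "source": [
- "As a sketch of that recommendation: if inputs were known to stay between, say, 256 and 512 tokens for a summarization workload, the encoder profile could be tightened accordingly before rebuilding the engine. The numbers below are illustrative only; the engines used in this notebook were built with the wider profiles defined earlier."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "tight-profile-sketch-code",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Illustrative only: a tighter encoder profile for a hypothetical 256-512 token input range.\n",
- "tight_encoder_profile = Profile()\n",
- "tight_encoder_profile.add(\n",
- "    \"input_ids\",\n",
- "    min=(batch_size, 256),\n",
- "    opt=(batch_size, 384),\n",
- "    max=(batch_size, 512),\n",
- ")\n",
- "# Engines rebuilt with such a profile let TensorRT pick kernels tuned for this narrower range."
- ]
- },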
- {
- "cell_type": "markdown",
- "id": "36f25217-be31-45bf-8652-0e18162fa360",
- "metadata": {},
- "source": [
- "### Single example"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "985d8a01-e5b7-449e-9e43-7c8315a2578d",
- "metadata": {},
- "outputs": [],
- "source": [
- "# ensure HF model are on GPU for testing (cells above moved it CPU). For cuda 11.4, disable this block\n",
- "if not cuda_114_mode:\n",
- " bart_model = bart_model.to('cuda').eval()\n",
- "\n",
- " in_len, out_len = 24, 24\n",
- "\n",
- " data = [\n",
- " ['(input_len, output_len)', 'HF FP32 p50 (s)', 'HF FP16 p50 (s)', 'TRT FP32 p50 (s)', 'TRT FP16 p50 (s)'],\n",
- " ]\n",
- "\n",
- " assert in_len <= max_input_len and out_len <= max_output_len\n",
- "\n",
- " in_ids = torch.randint(0, BARTModelTRTConfig.VOCAB_SIZE[BART_VARIANT], (batch_size, in_len)).to('cuda')\n",
- "\n",
- " # HF\n",
- " bart_model.float()\n",
- " hf_32 = measure_python_inference_code(lambda: bart_model.generate(in_ids, min_length=out_len, max_length=out_len, num_beams=num_beams, use_cache=True), timing_profile)\n",
- " bart_model.half()\n",
- " hf_16 = measure_python_inference_code(lambda: bart_model.generate(in_ids, min_length=out_len, max_length=out_len, num_beams=num_beams, use_cache=True), timing_profile)\n",
- "\n",
- " # TRT\n",
- " if num_beams == 1:\n",
- " _, trt_32 = full_inference_greedy(bart_trt_encoder, bart_trt_decoder, in_ids, tokenizer, timing_profile, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, use_cuda=True,)\n",
- " _, trt_16 = full_inference_greedy(bart_trt_encoder_fp16, bart_trt_decoder_fp16, in_ids, tokenizer, timing_profile, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, use_cuda=True,)\n",
- " else:\n",
- " _, trt_32 = full_inference_beam(bart_trt_encoder, bart_trt_decoder, in_ids, tokenizer, timing_profile, num_beams=num_beams, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, early_stopping=True,)\n",
- " _, trt_16 = full_inference_beam(bart_trt_encoder_fp16, bart_trt_decoder_fp16, in_ids, tokenizer, timing_profile, num_beams=num_beams, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, early_stopping=True,)\n",
- "\n",
- " data.append([(in_len, out_len), hf_32[0], hf_16[0], trt_32[0], trt_16[0]])\n",
- "\n",
- " print(tabulate(data, headers='firstrow', tablefmt='github'))\n",
- "else:\n",
- " print(\"CUDA 11.4 is currently incompatible with GPU models, skipping\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8edf4f5c-49a0-4509-a4d7-8b561dba3f88",
- "metadata": {},
- "source": [
- "### Several representative examples"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "9e335010-ff7f-4822-85ae-bca8d235de1b",
- "metadata": {},
- "outputs": [],
- "source": [
- "# ensure HF model are on GPU for testing (cells above moved it CPU). For cuda 11.4, disable this block\n",
- "if not cuda_114_mode:\n",
- " bart_model = bart_model.to('cuda').eval()\n",
- "\n",
- " input_output_len_list = [\n",
- " (64, 128), # generation task\n",
- " (64, 512),\n",
- " (512, 64), # summarization task\n",
- " (128, 64),\n",
- " (32, 32), # translation task\n",
- " (128, 128),\n",
- " (512, 512),\n",
- " ]\n",
- "\n",
- " data = [\n",
- " ['(input_len, output_len)', 'HF FP32 p50 (s)', 'HF FP16 p50 (s)', 'TRT FP32 p50 (s)', 'TRT FP16 p50 (s)'],\n",
- " ]\n",
- "\n",
- " for (in_len, out_len) in input_output_len_list:\n",
- " assert in_len <= max_input_len and out_len <= max_output_len\n",
- "\n",
- " in_ids = torch.randint(0, BARTModelTRTConfig.VOCAB_SIZE[BART_VARIANT], (batch_size, in_len)).to('cuda')\n",
- "\n",
- " # HF\n",
- " bart_model.float()\n",
- " hf_32 = measure_python_inference_code(lambda: bart_model.generate(in_ids, min_length=out_len, max_length=out_len, num_beams=num_beams, use_cache=True), timing_profile)\n",
- " bart_model.half()\n",
- " hf_16 = measure_python_inference_code(lambda: bart_model.generate(in_ids, min_length=out_len, max_length=out_len, num_beams=num_beams, use_cache=True), timing_profile)\n",
- "\n",
- " # TRT\n",
- " if num_beams == 1:\n",
- " _, trt_32 = full_inference_greedy(bart_trt_encoder, bart_trt_decoder, in_ids, tokenizer, timing_profile, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, use_cuda=True,)\n",
- " _, trt_16 = full_inference_greedy(bart_trt_encoder_fp16, bart_trt_decoder_fp16, in_ids, tokenizer, timing_profile, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, use_cuda=True,)\n",
- " else:\n",
- " _, trt_32 = full_inference_beam(bart_trt_encoder, bart_trt_decoder, in_ids, tokenizer, timing_profile, num_beams=num_beams, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, early_stopping=True,)\n",
- " _, trt_16 = full_inference_beam(bart_trt_encoder_fp16, bart_trt_decoder_fp16, in_ids, tokenizer, timing_profile, num_beams=num_beams, max_length=out_len, min_length=out_len, batch_size=batch_size, use_cache=metadata.other.kv_cache, early_stopping=True,)\n",
- "\n",
- " data.append([(in_len, out_len), hf_32[0], hf_16[0], trt_32[0], trt_16[0]])\n",
- "\n",
- " print(tabulate(data, headers='firstrow', tablefmt='github'))\n",
- "else:\n",
- " print(\"CUDA 11.4 is currently incompatible with GPU models, skipping\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a598a0ae-2e21-4898-ae56-8429a5d00760",
- "metadata": {},
- "source": [
- "It shows around 2x speedup comparing to HuggingFace's KV-cache optimized timing, for relatively short output sequence length. For long output sequence length, due to memory copies overhead between the decoding steps, TensorRT may not provide significant speedup at the current stage."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2a1f5dca-397c-4c8c-9200-61b30cdba824",
- "metadata": {},
- "source": [
- "## Conclusion\n",
- "\n",
- "This notebook has walked you through the process of converting a HuggingFace PyTorch BART model to an optimized TensorRT engine for inference in easy steps. The TensorRT inference engine can be conviniently used as a drop-in replacement for the orginial HuggingFace BART model while providing speed up. \n",
- "\n",
- "If you are interested in further details of the conversion process, check out [BART/trt.py](../BART/trt.py)"
- ]
- }
- ],
- "metadata": {
- "interpreter": {
- "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
- },
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.10"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/demo/HuggingFace/notebooks/gpt2-playground.ipynb b/demo/HuggingFace/notebooks/gpt2-playground.ipynb
deleted file mode 100644
index 76c92d19..00000000
--- a/demo/HuggingFace/notebooks/gpt2-playground.ipynb
+++ /dev/null
@@ -1,243 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "64974d33-d028-440c-86fa-1a0633b3d31d",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n",
- "#\n",
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# http://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License.\n",
- "# =============================================================================="
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c3f0ff46-9958-4d57-9067-a64be34e75da",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "# GPT-2 Playground\n",
- "\n",
- "This notebook demonstrates the GPT-2 model for open-end text generation.\n",
- "\n",
- "The TensorRT HuggingFace GPT-2 model is a plug-in replacement for the original PyTorch HuggingFace GPT-2 model.\n",
- "\n",
- "\n",
- "**Notes**: \n",
- " - For \"CPU - PyTorch\" and \"GPU - PyTorch\", a GPT-2 small model from HuggingFace model repository is employed. Inference is carried out with PyTorch in FP32 precision. All models run with batch size 1.\n",
- "Average run time across 5 runs is reported.\n",
- " - Prior to running this notebook, run [gpt2.ipynb](gpt2.ipynb) to download the GPT-2 model and generate the TensorRT engine."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3530e767-7050-4329-a4bc-e2221b9eb578",
- "metadata": {
- "jupyter": {
- "source_hidden": true
- },
- "tags": []
- },
- "outputs": [],
- "source": [
- "import os\n",
- "import sys\n",
- "ROOT_DIR = os.path.abspath(\"../\")\n",
- "sys.path.append(ROOT_DIR)\n",
- "\n",
- "import warnings\n",
- "warnings.filterwarnings('ignore')\n",
- "\n",
- "import torch \n",
- "\n",
- "# huggingface\n",
- "from transformers import (\n",
- " GPT2LMHeadModel,\n",
- " GPT2Tokenizer,\n",
- " GPT2Config,\n",
- ")\n",
- "\n",
- "from GPT2.trt import GPT2TRTDecoder, GPT2TRTEngine\n",
- "from NNDF.networks import NetworkMetadata, Precision\n",
- "from collections import namedtuple \n",
- "from GPT2.GPT2ModelConfig import GPT2ModelTRTConfig, GPT2Metadata\n",
- "\n",
- "# download HuggingFace model and tokernizer\n",
- "GPT2_VARIANT = 'gpt2' # choices: gpt2 | gpt2-large\n",
- "model = GPT2LMHeadModel.from_pretrained(GPT2_VARIANT)\n",
- "config = GPT2Config(GPT2_VARIANT)\n",
- "tokenizer = GPT2Tokenizer.from_pretrained(GPT2_VARIANT)\n",
- "\n",
- "# load TensorRT engine\n",
- "metadata = NetworkMetadata(variant=GPT2_VARIANT, precision=Precision(fp16=False), other=GPT2Metadata(kv_cache=False))\n",
- "from os.path import exists\n",
- "if not exists('./models/gpt2/trt-engine/gpt2.onnx.engine'):\n",
- " print(\"Error: TensorRT engine not found at ./models/gpt2/trt-engine/gpt2.onnx.engine. Please run gpt2.ipynb to generate the TensorRT engine first!\")\n",
- "else:\n",
- " gpt2_engine = GPT2TRTEngine('./models/gpt2/trt-engine/gpt2.onnx.engine', metadata)\n",
- " gpt2_trt = GPT2TRTDecoder(gpt2_engine, metadata, config)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "766b8c94-ba8e-47c8-8624-57da462a0496",
- "metadata": {
- "jupyter": {
- "source_hidden": true
- },
- "tags": []
- },
- "outputs": [],
- "source": [
- "import ipywidgets as widgets\n",
- "import numpy as np\n",
- "import time\n",
- "\n",
- "device = widgets.RadioButtons(\n",
- " options=['CPU - PyTorch', \n",
- " 'GPU - PyTorch', \n",
- " 'GPU - TensorRT'],\n",
- " description='Device:',\n",
- " disabled=False\n",
- ")\n",
- "\n",
- "paragraph_text = widgets.Textarea(\n",
- " value='TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps '\\\n",
- "'such as recommenders, speech and image/video on NVIDIA GPUs. ',\n",
- " placeholder='Type something',\n",
- " description='Context:',\n",
- " disabled=False,\n",
- " layout=widgets.Layout(width=\"auto\"),\n",
- " rows=5, \n",
- ")\n",
- "\n",
- "generated_text = widgets.Textarea(\n",
- " value='...',\n",
- " placeholder='GPT-2 generated text',\n",
- " description='GPT-2:',\n",
- " disabled=False,\n",
- " layout=widgets.Layout(width=\"auto\"),\n",
- " rows=5,\n",
- ")\n",
- "button = widgets.Button(description=\"Generate\")\n",
- "\n",
- "display(paragraph_text)\n",
- "display(generated_text)\n",
- "display(device)\n",
- "\n",
- "from IPython.display import display\n",
- "box_layout = widgets.Layout(display='flex',\n",
- " flex_flow='column',\n",
- " align_items='center',\n",
- " width='100%')\n",
- "N_RUN = 6\n",
- "progress_bar = widgets.IntProgress(\n",
- " value=0,\n",
- " min=0,\n",
- " max=N_RUN,\n",
- " description='Progress:',\n",
- " bar_style='', # 'success', 'info', 'warning', 'danger' or ''\n",
- " style={'bar_color': 'green'},\n",
- " orientation='horizontal', \n",
- " layout=widgets.Layout(width='100%', height='50px')\n",
- ")\n",
- "\n",
- "box = widgets.HBox(children=[button],layout=box_layout)\n",
- "output = widgets.Output()\n",
- "display(box)\n",
- "display(progress_bar)\n",
- "display(output)\n",
- "\n",
- "def generate(b):\n",
- " progress_bar.value = 0\n",
- " inference_time_arr = []\n",
- " with output:\n",
- " if device.value == 'GPU - TensorRT':\n",
- " inputs = tokenizer(paragraph_text.value, return_tensors=\"pt\")\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " sample_output = gpt2_trt.generate(inputs.input_ids.to('cuda:0'), max_length=GPT2ModelTRTConfig.MAX_SEQUENCE_LENGTH[GPT2_VARIANT])\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- "\n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(sample_output[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- " print(\"GPU - TensorRT - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:]))) \n",
- " \n",
- " elif device.value == 'CPU - PyTorch':\n",
- " inputs = tokenizer(paragraph_text.value, return_tensors=\"pt\")\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " sample_output = model.to('cpu').generate(inputs.input_ids.to('cpu'), max_length=GPT2ModelTRTConfig.MAX_SEQUENCE_LENGTH[GPT2_VARIANT])\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- "\n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(sample_output[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- " print(\"CPU - PyTorch - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:])))\n",
- " \n",
- " elif device.value == 'GPU - PyTorch': \n",
- " inputs = tokenizer(paragraph_text.value, return_tensors=\"pt\")\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " sample_output = model.to('cuda:0').generate(inputs.input_ids.to('cuda:0'), max_length=GPT2ModelTRTConfig.MAX_SEQUENCE_LENGTH[GPT2_VARIANT])\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- "\n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(sample_output[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- " print(\"GPU - PyTorch - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:]))) \n",
- " \n",
- "button.on_click(generate)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "58f473c0-6682-41af-8040-72f0a9472b0f",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.9"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/demo/HuggingFace/notebooks/gpt2.ipynb b/demo/HuggingFace/notebooks/gpt2.ipynb
deleted file mode 100644
index 745b996b..00000000
--- a/demo/HuggingFace/notebooks/gpt2.ipynb
+++ /dev/null
@@ -1,1218 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "28e6e614-e360-4292-965e-0d255027e9b9",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n",
- "#\n",
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# http://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License.\n",
- "# =============================================================================="
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9b88dc1a-a92d-44cc-9fb7-d9e2ef20c8e2",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "# Accelerating HuggingFace GPT-2 Inference with TensorRT\n",
- "\n",
- "GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. The model was pretrained on the raw texts to guess the next word in sentences. As no human labeling was required, GPT-2 pretraining can use lots of publicly available data with an automatic process to generate inputs and labels from those data.\n",
- "\n",
- "This notebook shows 3 easy steps to convert a [HuggingFace PyTorch GPT-2 model](https://huggingface.co/gpt2) to a TensorRT engine for high-performance inference.\n",
- "\n",
- "1. [Download HuggingFace GPT-2 model ](#1)\n",
- "1. [Convert to ONNX format](#2)\n",
- "1. [Convert to TensorRT engine](#3)\n",
- "1. [Advanced Topic: KV Cache](#4)\n",
- "1. [Advanced Topic: Beam Search](#5)\n",
- "\n",
- "## Prerequisite\n",
- "\n",
- "Follow the instruction at https://github.com/NVIDIA/TensorRT to build the TensorRT-OSS docker container required to run this notebook.\n",
- "\n",
- "Next, we install some extra dependencies."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "79281ed9-4855-4ade-a810-a2899a5872b9",
- "metadata": {
- "custom": {
- "metadata": {
- "tags": [
- "skip-execution"
- ]
- }
- },
- "language": "python",
- "tags": []
- },
- "outputs": [],
- "source": [
- "%%capture\n",
- "!pip3 install -r ../requirements.txt"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "d3e57ece",
- "metadata": {},
- "source": [
- "**Note:** After this step, you should restart the Jupyter kernel for the change to take effect."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "235d2f1b-439e-4cd0-8286-1d63a13f2cf3",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "import os\n",
- "import sys\n",
- "ROOT_DIR = os.path.abspath(\"../\")\n",
- "sys.path.append(ROOT_DIR)\n",
- "\n",
- "import torch \n",
- "\n",
- "# huggingface\n",
- "from transformers import (\n",
- " GPT2LMHeadModel,\n",
- " GPT2Tokenizer,\n",
- " GPT2Config,\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "af4254e2-11fd-4bc7-ac0b-60b1a9e07c4e",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 1. Download HuggingFace GPT-2 model \n",
- "\n",
- "First, we download the original HuggingFace PyTorch GPT-2 model from HuggingFace model hubs, together with its associated tokernizer.\n",
- "\n",
- "The GPT-2 variants supported by TensorRT 8 are: gpt2 (117M), gpt2-medium (355M), gpt2-large (774M), gpt2-xl (1.5B)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "fae66d58-f994-4987-8f1d-1fa8ac2ec8b4",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# download model and tokernizer\n",
- "GPT2_VARIANT = 'gpt2' # choices: gpt2 | gpt2-medium | gpt2-large | gpt2-xl\n",
- "config = GPT2Config(GPT2_VARIANT)\n",
- "\n",
- "model = GPT2LMHeadModel.from_pretrained(GPT2_VARIANT, force_download = False)\n",
- "tokenizer = GPT2Tokenizer.from_pretrained(GPT2_VARIANT)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7252ca90-1104-40dc-8e72-f51c07a4cd11",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# save model locally\n",
- "pytorch_model_dir = './models/{}/pytorch'.format(GPT2_VARIANT)\n",
- "!mkdir -p $pytorch_model_dir\n",
- "\n",
- "model.save_pretrained(pytorch_model_dir)\n",
- "print(\"Pytorch Model saved to {}\".format(pytorch_model_dir))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a84c5766-97ed-4d04-bab5-7fa18e89dee8",
- "metadata": {},
- "source": [
- "### Inference with PyTorch model"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e43067c2-ecd9-4bd6-9047-a3f74621931b",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# carry out inference with a single sample\n",
- "input_str = \"Hello, my dog is \"\n",
- "inputs = tokenizer(input_str, return_tensors=\"pt\")\n",
- "input_ids = inputs.input_ids"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6d347ddf-4504-4ab7-b15b-29d218bdd7a8",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "input_ids, input_ids.shape"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "cf83454f",
- "metadata": {},
- "outputs": [],
- "source": [
- "# WAR: Using an ugly representation because cuda 11.4 does not support GPU models due to cublas errors\n",
- "if \"cuda-11.4\" in os.environ[\"LD_LIBRARY_PATH\"]:\n",
- " model = model.cpu()\n",
- " input_ids = input_ids.cpu()\n",
- " inputs = inputs.to('cpu')\n",
- "else:\n",
- " model = model.cuda()\n",
- " input_ids = input_ids.cuda()\n",
- " inputs = inputs.to('cuda:0')"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "39d6c2ea-3450-4b8b-9cc8-09943d967ece",
- "metadata": {},
- "source": [
- "#### Single example inference"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b844f057-e768-467d-9185-68fb4c74b5ab",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "model.eval()\n",
- "with torch.no_grad():\n",
- " outputs = model(**inputs, labels=inputs['input_ids'], use_cache = False)\n",
- "\n",
- "logits = outputs.logits"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "717b2f68-9d92-474e-9937-8b42a1c60d14",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "logits, logits.shape"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a6c0468b-976a-4a08-98d3-e87578ec067f",
- "metadata": {},
- "source": [
- "For benchmarking purposes, we will employ a helper function `gpt2_inference` which executes the inference on a single batch repeatedly and measures end to end execution time. Let's take note of this execution time for later comparison with TensorRT. \n",
- " \n",
- "`TimingProfile` is a named tuple that specifies the number of experiments and number of times to call the function per iteration (and number of warm-up calls although it is not used here)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "ecdf8f00-0562-482b-9bec-b0b7596aec48",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "from GPT2.measurements import gpt2_inference\n",
- "from NNDF.networks import TimingProfile\n",
- "\n",
- "# Benchmarking TensorRT performance on single batch\n",
- "_, decoder_e2e_median_time = gpt2_inference(\n",
- " model, input_ids, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50)\n",
- " )\n",
- "decoder_e2e_median_time"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "4805756f-81f9-43cf-88f6-b205ecd23034",
- "metadata": {},
- "source": [
- "#### Open-end text generation\n",
- "Next, we will employ the PyTorch model for the open-end text generation task, which GPT-2 is particularly good at. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "1bb282bf-a8f4-47c4-830e-f2fb69d9d8d5",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "from GPT2.GPT2ModelConfig import GPT2ModelTRTConfig\n",
- "# MAX_LENGTH represents the maximum length that GPT2 could be used in text generation. \n",
- "# This corresponds to max_length in task_specific_params for text-generation, which = 50 for each model config.\n",
- "# If the length exceeds max_length, the output becomes meaningless for the specific task.\n",
- "max_length = GPT2ModelTRTConfig.MAX_LENGTH[GPT2_VARIANT]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8c3d01fc-9928-486b-9d15-de84d46528e5",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "sample_output = model.generate(input_ids, max_length=max_length, use_cache = False)\n",
- "\n",
- "# de-tokenize model output to raw text\n",
- "tokenizer.decode(sample_output[0], skip_special_tokens=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0b016c2f-7982-44ac-81e5-d3854391a8b6",
- "metadata": {},
- "source": [
- "For benchmarking purposes, we will employ a helper function `full_inference` which executes the inference repeatedly and measures end to end execution time. Let's take note of this execution time for later comparison with TensorRT. \n",
- "\n",
- "TimingProfile is a named tuple that specifies the number of experiments and number of times to call the function per iteration (and number of warm-up calls although it is not used here)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "93aea249-529e-4b5e-9759-e0c8370391a3",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "from GPT2.measurements import full_inference\n",
- "\n",
- "# get complete decoder inference result and its timing profile\n",
- "_, full_e2e_median_runtime = full_inference(\n",
- " model, input_ids, tokenizer, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=max_length\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0d662701-e430-4fdc-ad46-1f296defcf8f",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 2. Convert to ONNX format\n",
- "\n",
- "Prior to converting the model to a TensorRT engine, we will first convert the PyTorch model to an intermediate universal format: ONNX.\n",
- "\n",
- "ONNX is an open format for machine learning and deep learning models. It allows you to convert deep learning and machine learning models from different frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras to a single format.\n",
- "\n",
- "At a high level, the steps to convert a PyTorch model to TensorRT are as follows:\n",
- "- Convert the pretrained image segmentation PyTorch model into ONNX.\n",
- "- Import the ONNX model into TensorRT.\n",
- "- Apply optimizations and generate an engine.\n",
- "- Perform inference on the GPU with the TensorRT engine. "
- ]
- },
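- {
- "cell_type": "markdown",
- "id": "onnx-export-sketch-md",
- "metadata": {},
- "source": [
- "The next cells use the demo's `GPT2TorchFile` helper to perform this export. As a rough illustration of what such an export involves under the hood, a plain `torch.onnx.export` call might look like the sketch below; the file name, axis names, and opset version are illustrative and not the demo's exact settings."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "onnx-export-sketch-code",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Illustrative sketch of a direct ONNX export (the demo's GPT2TorchFile helper handles this for you).\n",
- "model.config.use_cache = False  # avoid exporting past key/value cache tensors in this sketch\n",
- "dummy_input = torch.randint(0, 50257, (1, 8))  # (batch, sequence) of token ids; 50257 is the GPT-2 vocab size\n",
- "torch.onnx.export(\n",
- "    model.to('cpu'),\n",
- "    (dummy_input,),\n",
- "    \"gpt2_sketch.onnx\",\n",
- "    input_names=[\"input_ids\"],\n",
- "    output_names=[\"logits\"],\n",
- "    dynamic_axes={\"input_ids\": {0: \"batch\", 1: \"sequence\"}, \"logits\": {0: \"batch\", 1: \"sequence\"}},\n",
- "    opset_version=13,\n",
- ")"
- ]
- },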
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c2b2be1a-021c-4f6c-957d-2ff7d1b95976",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "from NNDF.networks import NetworkMetadata, Precision\n",
- "from GPT2.export import GPT2TorchFile\n",
- "from GPT2.GPT2ModelConfig import GPT2Metadata"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7144d206-c690-4d4c-b590-3eb25e31d106",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "metadata = NetworkMetadata(variant=GPT2_VARIANT, precision=Precision(fp16=False), other=GPT2Metadata(kv_cache=False)) # kv_cache is disabled because it exports extra input/output to the model\n",
- "gpt2 = GPT2TorchFile(model.to('cpu'), metadata)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "dbaa89e4-e83d-4380-a6f8-932fcfeb64d3",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "!mkdir -p ./models/$GPT2_VARIANT/ONNX\n",
- "\n",
- "onnx_path = ('./models/{}/ONNX/{}.onnx'.format(GPT2_VARIANT, GPT2_VARIANT))\n",
- "gpt2.as_onnx_model(onnx_path, force_overwrite=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "88b04de1-e887-445c-9bc8-e2a7e0fca7ea",
- "metadata": {},
- "source": [
- "Let's take a look at the onnx file and investigate its input and output. You should see that \"input_ids\" as the input, and \"logits\" as the output."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7e4fff25-97da-4f9f-ae98-e918745faebb",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "import onnx"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "03c409e6-d312-4cc7-b13f-4621609d5633",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "onnx_model = onnx.load(onnx_path)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "2314caaf-836d-4140-93e4-4b3f4c931347",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "onnx_model.graph.input"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "4fe7a8d4-2bc3-49fc-863a-0e7f4be6565e",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "onnx_model.graph.output"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7baf007e-5508-485c-a87f-9bfe16260452",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 3. Convert to TensorRT engine\n",
- "\n",
- "Now we are ready to parse the ONNX model and convert it to an optimized TensorRT model.\n",
- "\n",
- "Since the model contains dynamic input shapes, we can specify a valid input range with a TensorRT optimization profile.\n",
- "\n",
- "Note: As TensorRT carries out many optimization, this conversion process for the larger model might take a while."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "037ac958-2627-439c-9db5-27640e3f7967",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "from polygraphy.backend.trt import Profile\n",
- "from tensorrt import PreviewFeature\n",
- "from GPT2.export import GPT2ONNXFile, GPT2TRTEngine"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6bd6e3fc-6797-46b0-a211-ce42d3769105",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "!mkdir -p ./models/$GPT2_VARIANT/trt-engine\n",
- "trt_engine_folder = './models/{}/trt-engine'.format(GPT2_VARIANT)\n",
- "\n",
- "# Create optimization profile for dynamic shape input. Can modify batch_size / max_sequence_length to build engines for different shapes\n",
- "batch_size = 1\n",
- "disable_preview_dynamic_shapes = False # preview_dynamic_shapes optimizes the trt engine building time\n",
- "# We can either use input length as the optimal length, or use max_length // 2. \n",
- "# In T5 or BART, input_length is better, but in GPT-2, max_length // 2 is better because we need to generate max_length number of tokens\n",
- "\n",
- "use_input_length = False\n",
- "opt_length = input_id.shape[1] if use_input_length else max_length // 2 \n",
- "# Create different engine tags for different configurations\n",
- "engine_tag = f\"bs{batch_size}\"\n",
- "preview_features = [PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]\n",
- "if disable_preview_dynamic_shapes:\n",
- " engine_tag += \"-noPreviewFasterDynamicShapes\"\n",
- "else:\n",
- " preview_features += [PreviewFeature.FASTER_DYNAMIC_SHAPES_0805]\n",
- "\n",
- "profiles = [Profile().add(\n",
- " \"input_ids\",\n",
- " min=(batch_size, 1),\n",
- " opt=(batch_size, opt_length), # Optimized based on the inputs. \n",
- " max=(batch_size, max_length),\n",
- ")]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "5538106b-3ae4-4d5f-b0ee-1f76174dcecc",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "profiles"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3a5934f0-46d3-45d7-8dd5-6cf81de61e66",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "engine_path = os.path.join(trt_engine_folder, f\"{GPT2_VARIANT}-{engine_tag}.engine\")\n",
- "if not os.path.exists(engine_path):\n",
- " gpt2_engine = GPT2ONNXFile(onnx_path, metadata).as_trt_engine(output_fpath=engine_path, profiles=profiles, preview_features=preview_features)\n",
- "else:\n",
- " gpt2_engine = GPT2TRTEngine(engine_path, metadata)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "74f7f6fc-1e6a-4ddc-8e9b-543d9e8dab4d",
- "metadata": {},
- "source": [
- "### Inference with TensorRT engine\n",
- "\n",
- "Great, if you have reached this stage, it means we now have an optimized TensorRT engine for the GPT-2 model, ready for us to carry out inference. \n",
- "\n",
- "The GPT-2 model with TensorRT backend can now be employed in place of the original HuggingFace GPT-2 model."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "54ae13aa-bf6f-4eb7-a453-389865562ae4",
- "metadata": {},
- "source": [
- "#### Single batch inference\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "343b58f1-3d9f-4844-85c9-73058bd36a83",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "from GPT2.trt import GPT2TRTDecoder\n",
- "config = GPT2Config.from_pretrained(GPT2_VARIANT, use_cache = False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "0cfda583-b684-48b1-9046-15ab022ef982",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "gpt2_trt = GPT2TRTDecoder(gpt2_engine, metadata, config)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "28fc60ad-73a7-46df-85d7-a292a8abbd80",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# Benchmarking TensorRT performance on single batch\n",
- "_, decoder_e2e_median_time = gpt2_inference(\n",
- " gpt2_trt, input_ids, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50)\n",
- " )\n",
- "decoder_e2e_median_time"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "01d86d29-1c7b-4020-9ef2-b77ea5e52764",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "with torch.no_grad():\n",
- " outputs = gpt2_trt(input_ids=input_ids)\n",
- "logits = outputs.logits"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d32e0162-c9eb-473d-ace6-c4c61ff578b5",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "logits, logits.shape"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "22122064-5a17-4990-bd6b-073fca5a3e9b",
- "metadata": {},
- "source": [
- "#### Open-end text generation\n",
- "Let's generate the same task again. Since GPT-2 is an open-ended model, a small turbulent in the model might have a very different result. Since we have done some format changes and input/output restriction while exporting the model, you might see a different result compared to raw HuggingFace model. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "848bffb8-a7a4-4fcb-91c9-f4e9f7263e6c",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "sample_output = gpt2_trt.generate(input_ids.cuda(), max_length=max_length)\n",
- "\n",
- "# de-tokenize model output to raw text\n",
- "tokenizer.decode(sample_output[0], skip_special_tokens=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b4c8bc4c-bf3e-4cb5-afc6-c0bd7d8655cb",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# get complete decoder inference result and its timing profile\n",
- "_, full_e2e_median_runtime = full_inference(\n",
- " gpt2_trt, input_ids.cuda(), tokenizer, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=max_length\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6b68a915-2c32-49e5-b1f6-e93d7618f637",
- "metadata": {},
- "source": [
- "You can now compare the output of the original PyTorch model and the TensorRT engine. Notice the speed difference. On an NVIDIA V100 32GB GPU, this results in about ~5x performance improvement for the GPT-2 model (from an average of 0.704s to 0.134s)."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b2562388-d97b-45dd-8569-3f6c053f4e98",
- "metadata": {},
- "source": [
- "Now you have known how to convert a model to onnx, build TRT engine and optimize it. As you might have recalled, using kv cache and beam search are two important ways to improve the performance of the decoder models. We have recently added thse support to our HuggingFace demo. "
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a4132e54-aba7-42ec-8324-c68d82c17296",
- "metadata": {
- "tags": []
- },
- "source": [
- "\n",
- "\n",
- "## 4. Advanced Topic: KV Cache\n",
- "\n",
- "As you have seen above, we put `use_cache = False` in some code blocks. This is because in the simplified model, we only take `input_ids` as input and `logits` as output. `input_ids` is growing as the sequence goes longer. In reality, we sometimes cache the self-attentions for each layer and reuse them in the later computations. This allows us to only take the last generated `input_ids`. This is a trade-off between space and time. When the model is small or the sequence is small, the D2D data copy time usually outweights the performance improvement of the model. However, performance improvements have been found in larger models with larger sequence length like 512. "
- ]
- },
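- {
- "cell_type": "markdown",
- "id": "c7a91f2e-3b6d-4e0a-9d2f-5a1b8c4e7f90",
- "metadata": {},
- "source": [
- "Before moving on, the next cell is a small optional sketch of what the HuggingFace kv cache looks like. It is an illustration only and assumes the `model` and `tokenizer` objects loaded earlier in this notebook are still in scope: with `use_cache=True`, the model returns `past_key_values`, one `(key, value)` pair per layer with shape `(batch, num_heads, seq_len, head_dim)`, so the next step only needs to feed the newest token."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e4b2d7a1-8c5f-4f3a-b6e9-2d7c1a0f5b34",
- "metadata": {},
- "outputs": [],
- "source": [
- "import torch\n",
- "\n",
- "# Optional illustration of the HuggingFace kv cache (assumes `model` and `tokenizer` are in scope).\n",
- "probe_ids = tokenizer(\"TensorRT is\", return_tensors=\"pt\").input_ids.to(model.device)\n",
- "with torch.no_grad():\n",
- "    out = model(input_ids=probe_ids, use_cache=True)\n",
- "\n",
- "# One (key, value) pair per layer; each tensor is (batch, num_heads, seq_len, head_dim).\n",
- "print(len(out.past_key_values), out.past_key_values[0][0].shape)\n",
- "\n",
- "# The next step only feeds the newest token together with the cached keys/values.\n",
- "next_token = out.logits[:, -1:].argmax(-1)\n",
- "with torch.no_grad():\n",
- "    out2 = model(input_ids=next_token, past_key_values=out.past_key_values, use_cache=True)\n",
- "print(out2.past_key_values[0][0].shape)  # the cached sequence length grows by 1"
- ]
- },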
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e33d1dcb-250f-4d86-9726-b114d4962fd4",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "use_cache = True\n",
- "kv_config = GPT2Config.from_pretrained(GPT2_VARIANT, use_cache = use_cache)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fd8fdf0f-2da0-46c0-a948-e4e6e16b898a",
- "metadata": {},
- "source": [
- "#### Raw HuggingFace\n",
- "\n",
- "The model that we download from `GPT2LMHeadModel.from_pretrained` is dynamic in its inputs. It can take both kv and non-kv configurations. Changing `use_cache` will do it. You can see that changing this configuration, the output is changed. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "26b3c51a-07ee-4936-b620-50766a45b945",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "# get complete decoder inference result and its timing profile\n",
- "_, full_e2e_median_runtime = full_inference(\n",
- " model, input_ids, tokenizer, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=max_length, use_cache = use_cache\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a14607bf-f449-4151-9076-d099ae1a3ae1",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "sample_output = model.generate(input_ids, max_length=max_length, use_cache = use_cache)\n",
- "\n",
- "# de-tokenize model output to raw text\n",
- "tokenizer.decode(sample_output[0], skip_special_tokens=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9057ef83-0cdc-4631-9958-66d04fc7fc22",
- "metadata": {},
- "source": [
- "#### TensorRT\n",
- "\n",
- "For the 1st decoding step, we take `input_ids` and generate both `logits` and the kv cache. In other steps, we take the new `input_ids` with `past` kv-cache and the outputs are `logits` and the updated `present` kv-cache. Taking dynamic number of inputs for trt is not currently supported in our demo, so we need to output 2 onnx files and build 2 engines."
- ]
- },
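- {
- "cell_type": "markdown",
- "id": "1d8f3c6a-2e7b-4a9d-8f0c-6b3e5a2d7c18",
- "metadata": {},
- "source": [
- "To make the control flow concrete, the next cell is a conceptual sketch of how the two engines cooperate during greedy decoding. It is not the demo's actual implementation (that lives in `GPT2TRTDecoder`), and `run_context` / `run_generation` are hypothetical stand-ins for the context and generation engines described above."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7a5e9c2b-4d1f-4b8a-a3c6-9e0d2f7b5c41",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Conceptual sketch only: greedy decoding with a context engine and a generation engine.\n",
- "# `run_context(prompt_ids)` -> (logits, kv_cache) and `run_generation(token_id, kv_cache)` -> (logits, kv_cache)\n",
- "# are hypothetical stand-ins for the two TensorRT engines; the real logic lives in GPT2.trt.GPT2TRTDecoder.\n",
- "def greedy_decode_two_engines(prompt_ids, run_context, run_generation, max_new_tokens):\n",
- "    # Context phase: consume the whole prompt once and produce the initial kv cache.\n",
- "    logits, kv_cache = run_context(prompt_ids)\n",
- "    generated = list(prompt_ids)\n",
- "    for _ in range(max_new_tokens):\n",
- "        next_token = int(logits[-1].argmax())  # greedy pick at the last position\n",
- "        generated.append(next_token)\n",
- "        # Generation phase: feed only the newest token plus the 'past' kv cache.\n",
- "        logits, kv_cache = run_generation(next_token, kv_cache)\n",
- "    return generated"
- ]
- },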
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "f1fbcfad-9c9c-47e2-894a-731c7a3a04df",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "kv_metadata = NetworkMetadata(variant=GPT2_VARIANT, precision=Precision(fp16=False), other=GPT2Metadata(kv_cache=use_cache))\n",
- "kv_gpt2 = GPT2TorchFile(model.to('cpu'), kv_metadata)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3fe680c5-d9ff-466f-87fe-a7bb0cbee944",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "kv_onnx_path = ('./models/{}/ONNX/{}-kv_cache.onnx'.format(GPT2_VARIANT, GPT2_VARIANT))\n",
- "kv_gpt2.as_onnx_model(kv_onnx_path, force_overwrite=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "4f0f6824-286d-4afa-926b-7eed4cafafc7",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "kv_onnx_model = onnx.load(kv_onnx_path)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b71f0012-7a2d-41be-a8d8-c818dcb7c244",
- "metadata": {},
- "source": [
- "We could see that the kv model has #inputs = #outputs = num_layers * 2 + 1"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7579aeec-2c7a-43de-b8f7-beff8d3d7784",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "len(kv_onnx_model.graph.input), len(kv_onnx_model.graph.output)"
- ]
- },
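- {
- "cell_type": "markdown",
- "id": "5b0d8e7f-6a2c-4c1b-9f3e-8d4a1c6b2e57",
- "metadata": {},
- "source": [
- "As an optional check (this simply reuses `kv_onnx_model` from the cell above), we can also print the tensor names to see which inputs and outputs correspond to the kv cache."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "9c3f1a8d-0b6e-4d2a-b7c5-3e8f2a1d6b90",
- "metadata": {},
- "outputs": [],
- "source": [
- "# List the I/O tensor names of the kv-cache ONNX model.\n",
- "# Besides \"input_ids\" and \"logits\", there should be one past/present key and value per layer.\n",
- "print([inp.name for inp in kv_onnx_model.graph.input])\n",
- "print([out.name for out in kv_onnx_model.graph.output])"
- ]
- },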
- {
- "cell_type": "markdown",
- "id": "9add1139-aab0-4531-b2ac-c3aca90e5d49",
- "metadata": {},
- "source": [
- "The next blocks will set up the profile and build the engine. The only difference is that we now have the profile for kv cache"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "9ae055cb-41b7-4523-86bc-490bc9edf204",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "batch_size = 1\n",
- "disable_preview_dynamic_shapes = False\n",
- "\n",
- "engine_tag = \"bs{}\".format(batch_size)\n",
- "\n",
- "preview_features = [PreviewFeature.FASTER_DYNAMIC_SHAPES_0805]\n",
- "if disable_preview_dynamic_shapes:\n",
- " engine_tag += \"-disableFasterDynamicShapes\"\n",
- " preview_features = []\n",
- "\n",
- "use_input_length = False\n",
- "num_heads = kv_config.n_head\n",
- "embedding_size_per_head = kv_config.n_embd // num_heads\n",
- "num_layers = kv_config.n_layer\n",
- "\n",
- "max_sequence_length = max_length\n",
- "max_output_length = max_length\n",
- "if not use_input_length:\n",
- " opt_input_seq_len = max_sequence_length // 2\n",
- "else:\n",
- " opt_input_seq_len = input_ids.shape[1]\n",
- "\n",
- "opt_output_seq_len = max_output_length // 2\n",
- "\n",
- "# context phase uses the provided input_ids to generate hidden states and self attention kv cache\n",
- "# It is only used in the 1st decoder run.\n",
- "dec_profiles_context = Profile().add(\n",
- " \"input_ids\",\n",
- " min=(batch_size, 1),\n",
- " opt=(batch_size, opt_output_seq_len),\n",
- " max=(batch_size, max_output_length),\n",
- ")\n",
- "self_attention_profile_context = {\n",
- " \"min\": (batch_size, num_heads, 0, embedding_size_per_head),\n",
- " \"opt\": (batch_size, num_heads, 0, embedding_size_per_head),\n",
- " \"max\": (batch_size, num_heads, 0, embedding_size_per_head),\n",
- "}\n",
- "\n",
- "# generation phase uses previous self attention kv cache with the last input_ids token to generate the next hidden states and self attention kv cache\n",
- "# This optimization profile is used after the 1st decoder run.\n",
- "dec_profiles_generation = Profile().add(\n",
- " \"input_ids\",\n",
- " min=(batch_size, 1),\n",
- " opt=(batch_size, 1),\n",
- " max=(batch_size, 1),\n",
- ")\n",
- "\n",
- "self_attention_profile_generation = {\n",
- " \"min\": (batch_size, num_heads, 1, embedding_size_per_head),\n",
- " \"opt\": (batch_size, num_heads, opt_output_seq_len - 1, embedding_size_per_head),\n",
- " \"max\": (batch_size, num_heads, max_output_length - 1, embedding_size_per_head),\n",
- "}\n",
- "\n",
- "for i in range(num_layers):\n",
- " dec_profiles_context = dec_profiles_context.add(\n",
- " f\"past_key_values.{i}.decoder.key\",\n",
- " **self_attention_profile_context\n",
- " ).add(\n",
- " f\"past_key_values.{i}.decoder.value\",\n",
- " **self_attention_profile_context\n",
- " )\n",
- "\n",
- " dec_profiles_generation = dec_profiles_generation.add(\n",
- " f\"past_key_values.{i}.decoder.key\",\n",
- " **self_attention_profile_generation\n",
- " ).add(\n",
- " f\"past_key_values.{i}.decoder.value\",\n",
- " **self_attention_profile_generation\n",
- " )\n",
- "\n",
- "# TensorRT accepts multiple optimization engines for the same model.\n",
- "# Profile 1 is only used in the first decoder iterations.\n",
- "decoder_profiles = [dec_profiles_generation, dec_profiles_context]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "4eadf843-9f60-41c7-90a9-098b33ce3603",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "kv_engine_path = os.path.join(trt_engine_folder, f\"{GPT2_VARIANT}-kv_cache_{engine_tag}.engine\")\n",
- "\n",
- "# Set up the trt engine with both kv input/output augmented\n",
- "if not os.path.exists(kv_engine_path):\n",
- " kv_gpt2_engine = GPT2ONNXFile(kv_onnx_path, kv_metadata).as_trt_engine(kv_engine_path,profiles=decoder_profiles, preview_features=preview_features)\n",
- "else:\n",
- " kv_gpt2_engine = GPT2TRTEngine(kv_engine_path, kv_metadata)\n",
- "\n",
- " \n",
- "kv_gpt2_trt = GPT2TRTDecoder(\n",
- " kv_gpt2_engine, kv_metadata, kv_config, batch_size=batch_size\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "090007db-9a09-4b6d-95ed-8a688ea05798",
- "metadata": {},
- "source": [
- "Since we have 2 profiles, benchmarking single-run runtime does not make sense. We instead use `full_inference` to measure the time for the entire inference cycle."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a3b93d88-21bb-4f87-9ff6-709d0babdf34",
- "metadata": {},
- "outputs": [],
- "source": [
- "# get complete decoder inference result and its timing profile\n",
- "_, full_e2e_median_runtime = full_inference(\n",
- " kv_gpt2_trt, input_ids.cuda(), tokenizer, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=max_length, use_cache = use_cache\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d89ab217-9ee4-435c-b689-69d98cef1cc4",
- "metadata": {},
- "outputs": [],
- "source": [
- "kv_gpt2_trt.reset()\n",
- "kv_sample_output = kv_gpt2_trt.generate(input_ids.cuda(), max_length=max_length)\n",
- "tokenizer.decode(kv_sample_output[0], skip_special_tokens=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2b614fb8-63d6-4711-84cf-c69ca8b3f141",
- "metadata": {},
- "source": [
- "In this short example, kv cache performance does not improve the performance, and may even be slightly worse than non kv cache mode. However, when we have larger input sequences for the model, it will be better."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f764049f-0578-4305-b010-4e7a3156a377",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 5. Advanced Topic: Beam Search\n",
- "\n",
- "Beam search is a way to increase the model quality. It looks for the top `num_beams` number of possible words and pick the one that conditions the best to the current position. Similarly, the original HuggingFace PyTorch model supports beam search natively, while we need to build separate trt engine for different `num_beams`."
- ]
- },
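- {
- "cell_type": "markdown",
- "id": "2f7b4d9a-1c8e-4a3f-b0d6-5e2c9a7f4b13",
- "metadata": {},
- "source": [
- "To make the idea concrete, here is a tiny self-contained toy example of beam search over a made-up 3-word vocabulary with a hand-written scoring table. It is unrelated to GPT-2 itself and only illustrates how the top `num_beams` partial sequences are kept at every step."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8e1c6b3f-7d4a-4f9b-a2e8-0c5d3b9f1a26",
- "metadata": {},
- "outputs": [],
- "source": [
- "import math\n",
- "\n",
- "# Toy next-token log-probabilities over a 3-word vocabulary (context-independent for simplicity).\n",
- "toy_logprobs = {\"cat\": math.log(0.5), \"sat\": math.log(0.3), \"mat\": math.log(0.2)}\n",
- "\n",
- "def toy_beam_search(num_beams=2, steps=3):\n",
- "    beams = [([], 0.0)]  # each beam is (sequence, cumulative log-probability)\n",
- "    for _ in range(steps):\n",
- "        candidates = []\n",
- "        for seq, score in beams:\n",
- "            for word, lp in toy_logprobs.items():\n",
- "                candidates.append((seq + [word], score + lp))\n",
- "        # Keep only the num_beams best-scoring partial sequences.\n",
- "        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]\n",
- "    return beams\n",
- "\n",
- "toy_beam_search(num_beams=2)"
- ]
- },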
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "5a5808db-2cc0-4d88-aebe-1b6e17a023e7",
- "metadata": {},
- "outputs": [],
- "source": [
- "beam_config = GPT2Config.from_pretrained(GPT2_VARIANT, use_cache = False)\n",
- "beam_metadata = NetworkMetadata(variant=GPT2_VARIANT, precision=Precision(fp16=False), other=GPT2Metadata(kv_cache=False))\n",
- "num_beams = 3"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "d1403609-b24d-4e10-a8eb-852d3eab6fa0",
- "metadata": {},
- "source": [
- "#### HuggingFace"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "cfd992c8-1eeb-427c-ae32-2c63766c6a69",
- "metadata": {},
- "outputs": [],
- "source": [
- "# get complete decoder inference result and its timing profile\n",
- "_, full_e2e_median_runtime = full_inference(\n",
- " model, input_ids, tokenizer, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=max_length, num_beams = num_beams\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "09418760-84bd-4308-b06b-8540945a6dcf",
- "metadata": {},
- "outputs": [],
- "source": [
- "sample_output = model.generate(input_ids, max_length=max_length, num_beams = num_beams)\n",
- "\n",
- "# de-tokenize model output to raw text\n",
- "tokenizer.decode(sample_output[0], skip_special_tokens=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "71b8d9fa-d74a-40dd-94ce-d98551d24608",
- "metadata": {},
- "source": [
- "You could see that the output is very different from the original one. If you change `num_beams`, the result will also change significantly."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ba01e0ec-68ad-4682-8ca4-2ecde7d70f7f",
- "metadata": {},
- "source": [
- "#### TensorRT\n",
- "It uses the same onnx file as the original configuration, but the engine set up is differently, because it expands the inputs by `num_beams` for the first dimension of inputs."
- ]
- },
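- {
- "cell_type": "markdown",
- "id": "6d2a9f4c-3e7b-4b1d-8a0f-9c5e2b7d4a38",
- "metadata": {},
- "source": [
- "The small snippet below is only an illustration of why the optimization profile in the next cell uses `batch_size * num_beams` for the first dimension: beam search duplicates each input sequence `num_beams` times along the batch axis."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "0f8b5c2d-9a4e-4c7f-b1d3-6e2a8c5f9b07",
- "metadata": {},
- "outputs": [],
- "source": [
- "import torch\n",
- "\n",
- "# Illustration only: beam search expands the batch dimension by num_beams.\n",
- "example = torch.tensor([[101, 102, 103]])               # shape (1, 3): one sequence\n",
- "expanded = example.repeat_interleave(num_beams, dim=0)  # shape (num_beams, 3)\n",
- "example.shape, expanded.shape"
- ]
- },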
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "055fb314-8e0f-4edd-bf78-16890d196de4",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Create optimization profile for dynamic shape input. Can modify batch_size / max_sequence_length to build engines for different shapes\n",
- "batch_size = 1\n",
- "disable_preview_dynamic_shapes = False # preview_dynamic_shapes optimizes the trt engine building time\n",
- "# We can either use input length as the optimal length, or use max_length // 2. \n",
- "# In T5 or BART, input_length is better, but in GPT-2, max_length // 2 is better because we need to generate max_length number of tokens\n",
- "\n",
- "use_input_length = False\n",
- "opt_length = input_id.shape[1] if use_input_length else max_length // 2 \n",
- "# Create different engine tags for different configurations\n",
- "engine_tag = f\"bs{batch_size}-beam{num_beams}\"\n",
- "\n",
- "preview_features = [PreviewFeature.FASTER_DYNAMIC_SHAPES_0805]\n",
- "if disable_preview_dynamic_shapes:\n",
- " engine_tag += \"-disableFasterDynamicShapes\"\n",
- " preview_features = []\n",
- " \n",
- "\n",
- "beam_profiles = [Profile().add(\n",
- " \"input_ids\",\n",
- " min=(batch_size * num_beams, 1),\n",
- " opt=(batch_size * num_beams, opt_length), # Optimized based on the inputs. \n",
- " max=(batch_size * num_beams, max_length),\n",
- ")]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "18986d0f-9509-463f-a489-a76dd4d28a88",
- "metadata": {},
- "outputs": [],
- "source": [
- "beam_profiles"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "cfd04a7b-8aa6-4c97-8d85-96f14b06abbc",
- "metadata": {},
- "outputs": [],
- "source": [
- "beam_engine_path = os.path.join(trt_engine_folder, f\"{GPT2_VARIANT}-{engine_tag}.engine\")\n",
- "if not os.path.exists(beam_engine_path):\n",
- " beam_gpt2_engine = GPT2ONNXFile(onnx_path, beam_metadata).as_trt_engine(output_fpath=beam_engine_path, profiles=beam_profiles, preview_features=preview_features)\n",
- "else:\n",
- " beam_gpt2_engine = GPT2TRTEngine(beam_engine_path, beam_metadata)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "18fe1dba-4e84-478e-9ea7-07c21856e6bd",
- "metadata": {},
- "outputs": [],
- "source": [
- "beam_gpt2_trt = GPT2TRTDecoder(beam_gpt2_engine, beam_metadata, beam_config, num_beams = num_beams)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "42614e14-c962-4c31-a469-7e0343efbdbb",
- "metadata": {},
- "outputs": [],
- "source": [
- "# get complete decoder inference result and its timing profile\n",
- "_, full_e2e_median_runtime = full_inference(\n",
- " beam_gpt2_trt, input_ids.cuda(), tokenizer, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=max_length, num_beams=num_beams\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "391ab05a-fe0d-42c3-9591-605ddab389ce",
- "metadata": {},
- "outputs": [],
- "source": [
- "beam_sample_output = beam_gpt2_trt.generate(input_ids.cuda(), max_length=max_length, num_beams=num_beams)\n",
- "tokenizer.decode(beam_sample_output[0], skip_special_tokens=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9543dbfd-4650-46f5-8f77-587dcb05785a",
- "metadata": {},
- "source": [
- "We could see that because of larger batch size, beam search will take slightly longer, but for most sequences, it will generate more meaningful outputs."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "cbfc6c04-ca47-4fc6-9a12-ed500722bb4a",
- "metadata": {},
- "source": [
- "## Conclusion and where-to next?\n",
- "\n",
- "This notebook has walked you through the process of converting a HuggingFace PyTorch GPT-2 model to an optimized TensorRT engine for inference in 3 easy steps. The TensorRT inference engine can be conviniently used as a drop-in replacement for the orginial HuggingFace GPT-2 model while providing significant speed up. \n",
- "\n",
- "If you are interested in further details of the conversion process, check out [GPT2/trt.py](../GPT2/trt.py)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "14079b8f-738e-4137-9ca3-6a4254e8f006",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "celltoolbar": "Edit Metadata",
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.10"
- },
- "vscode": {
- "interpreter": {
- "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/demo/HuggingFace/notebooks/t5-playground.ipynb b/demo/HuggingFace/notebooks/t5-playground.ipynb
deleted file mode 100644
index d17a761c..00000000
--- a/demo/HuggingFace/notebooks/t5-playground.ipynb
+++ /dev/null
@@ -1,272 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "64974d33-d028-440c-86fa-1a0633b3d31d",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n",
- "#\n",
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# http://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License.\n",
- "# =============================================================================="
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c3f0ff46-9958-4d57-9067-a64be34e75da",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "# T5 Playground\n",
- "\n",
- "This notebook demonstrates T5 model on the task of translation and text summarization.\n",
- "\n",
- "The TensorRT HuggingFace T5 model is a plug-in replacement for the original PyTorch HuggingFace T5 model.\n",
- "\n",
- "\n",
- "\n",
- "**Notes**: \n",
- " - For \"CPU - PyTorch\" and \"GPU - PyTorch\", a T5 small model from HuggingFace model repository is employed. Inference is carried out with PyTorch in FP32 precision. All models run with batch size 1.\n",
- "Average run time across 5 runs is reported.\n",
- " - Prior to running this notebook, run [t5.ipynb](t5.ipynb) to download the T5 model and generate the TensorRT engine."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3530e767-7050-4329-a4bc-e2221b9eb578",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "import os\n",
- "import sys\n",
- "ROOT_DIR = os.path.abspath(\"../\")\n",
- "sys.path.append(ROOT_DIR)\n",
- "\n",
- "import torch \n",
- "\n",
- "# huggingface\n",
- "from transformers import (\n",
- " T5ForConditionalGeneration,\n",
- " T5Tokenizer,\n",
- " T5Config,\n",
- ")\n",
- "from transformers.modeling_outputs import BaseModelOutput\n",
- "\n",
- "# download HuggingFace model and tokernizer\n",
- "T5_VARIANT = 't5-small'\n",
- "\n",
- "t5_model = T5ForConditionalGeneration.from_pretrained(T5_VARIANT)\n",
- "tokenizer = T5Tokenizer.from_pretrained(T5_VARIANT)\n",
- "config = T5Config.from_pretrained(T5_VARIANT, use_cache = False)\n",
- "\n",
- "# load TensorRT engine\n",
- "from T5.trt import T5TRTEncoder, T5TRTDecoder, TRTHFRunner\n",
- "from T5.T5ModelConfig import T5ModelTRTConfig, T5Metadata\n",
- "from T5.export import T5DecoderTRTEngine, T5EncoderTRTEngine\n",
- "from NNDF.networks import NetworkMetadata, Precision\n",
- "\n",
- "from transformers.generation_stopping_criteria import (\n",
- " MaxLengthCriteria,\n",
- " StoppingCriteriaList,\n",
- ")\n",
- "\n",
- "metadata=NetworkMetadata(variant=T5_VARIANT, precision=Precision(fp16=True), other=T5Metadata(kv_cache=False))\n",
- "\n",
- "from os.path import exists\n",
- "encoder_path = './models/{}/tensorrt/{}-encoder.onnx-bs1-previewFasterDynamicShapes.engine'.format(T5_VARIANT,T5_VARIANT)\n",
- "if not exists(encoder_path):\n",
- " print(\"Error: TensorRT engine not found at {}. Please run t5.ipynb to generate the TensorRT engine first!\".format(encoder_path))\n",
- "else:\n",
- " encoder_engine = T5EncoderTRTEngine('./models/{}/tensorrt/{}-encoder.onnx-bs1-previewFasterDynamicShapes.engine'.format(T5_VARIANT,T5_VARIANT), metadata)\n",
- " decoder_engine = T5DecoderTRTEngine('./models/{}/tensorrt/{}-decoder-with-lm-head.onnx-bs1-previewFasterDynamicShapes.engine'.format(T5_VARIANT,T5_VARIANT), metadata)\n",
- "\n",
- "t5_trt_encoder = T5TRTEncoder(encoder_engine, metadata, config)\n",
- "t5_trt_decoder = T5TRTDecoder(decoder_engine, metadata, config)\n",
- "\n",
- "decoder_input_ids = torch.full(\n",
- " (1, 1), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32\n",
- ").to(\"cuda:0\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "766b8c94-ba8e-47c8-8624-57da462a0496",
- "metadata": {
- "tags": []
- },
- "outputs": [],
- "source": [
- "import ipywidgets as widgets\n",
- "import numpy as np\n",
- "import time\n",
- "\n",
- "device = widgets.RadioButtons(\n",
- " options=['CPU - PyTorch', \n",
- " 'GPU - PyTorch', \n",
- " 'GPU - TensorRT'],\n",
- " description='Device:',\n",
- " disabled=False\n",
- ")\n",
- "\n",
- "task = widgets.RadioButtons(\n",
- " options=['En -> German', \n",
- " 'Summarize', \n",
- " ],\n",
- " description='Task:',\n",
- " disabled=False\n",
- ")\n",
- "\n",
- "paragraph_text = widgets.Textarea(\n",
- " value='TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps'\\\n",
- " 'such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops'\\\n",
- " 'and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep'\\\n",
- " 'learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps.',\n",
- " placeholder='Type something',\n",
- " description='Context:',\n",
- " disabled=False,\n",
- " layout=widgets.Layout(width=\"auto\"),\n",
- " rows=5, \n",
- ")\n",
- "\n",
- "\n",
- "generated_text = widgets.Textarea(\n",
- " value='...',\n",
- " placeholder='Context',\n",
- " description='T5 output:',\n",
- " disabled=False,\n",
- " layout=widgets.Layout(width=\"auto\"),\n",
- " rows=5,\n",
- ")\n",
- "button = widgets.Button(description=\"Generate\")\n",
- "\n",
- "display(paragraph_text)\n",
- "display(generated_text)\n",
- "display(device)\n",
- "display(task)\n",
- "\n",
- "from IPython.display import display\n",
- "box_layout = widgets.Layout(display='flex',\n",
- " flex_flow='column',\n",
- " align_items='center',\n",
- " width='100%')\n",
- "N_RUN = 6\n",
- "progress_bar = widgets.IntProgress(\n",
- " value=0,\n",
- " min=0,\n",
- " max=N_RUN,\n",
- " description='Progress:',\n",
- " bar_style='', # 'success', 'info', 'warning', 'danger' or ''\n",
- " style={'bar_color': 'green'},\n",
- " orientation='horizontal', \n",
- " layout=widgets.Layout(width='100%', height='50px')\n",
- ")\n",
- "\n",
- "box = widgets.HBox(children=[button],layout=box_layout)\n",
- "output = widgets.Output()\n",
- "display(box)\n",
- "display(progress_bar)\n",
- "display(output)\n",
- "\n",
- "MAX_LENGTH = 256\n",
- "\n",
- "def generate(b):\n",
- " progress_bar.value = 0\n",
- " inference_time_arr = []\n",
- " prefix = 'translate English to German' if task.value=='En -> German' else 'summarize'\n",
- " inputs = tokenizer(\"{}: {}\".format(prefix, paragraph_text.value), return_tensors=\"pt\")\n",
- " with output:\n",
- " if device.value == 'GPU - TensorRT':\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " encoder_last_hidden_state = t5_trt_encoder(input_ids=inputs.input_ids.to('cuda:0'))\n",
- " outputs = t5_trt_decoder.generate(\n",
- " inputs.input_ids.to('cuda:0'),\n",
- " max_length = MAX_LENGTH,\n",
- " min_length = 1,\n",
- " eos_token_id = t5_trt_decoder.config.eos_token_id,\n",
- " pad_token_id = t5_trt_decoder.config.pad_token_id,\n",
- " encoder_outputs = BaseModelOutput(last_hidden_state = encoder_last_hidden_state.to('cuda:0')),\n",
- " )\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- "\n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- " print(\"GPU - TensorRT - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:]))) \n",
- " \n",
- " elif device.value == 'CPU - PyTorch':\n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " outputs = t5_model.to('cpu').generate(inputs.input_ids.to('cpu'), max_length=MAX_LENGTH)\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- "\n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- " print(\"CPU - PyTorch - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:])))\n",
- " \n",
- " elif device.value == 'GPU - PyTorch': \n",
- " for _ in range(N_RUN):\n",
- " start_time = time.time()\n",
- " outputs = t5_model.to('cuda:0').generate(inputs.input_ids.to('cuda:0'), max_length=MAX_LENGTH)\n",
- " inference_time_arr.append(time.time()-start_time)\n",
- " progress_bar.value += 1\n",
- "\n",
- " # de-tokenize model output to raw text\n",
- " text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
- " generated_text.value = text\n",
- " print(\"GPU - PyTorch - Average inference time: %.2f (ms)\"%(1000*np.mean(inference_time_arr[1:]))) \n",
- " \n",
- "button.on_click(generate)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "58f473c0-6682-41af-8040-72f0a9472b0f",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.8.10"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/demo/HuggingFace/notebooks/t5.ipynb b/demo/HuggingFace/notebooks/t5.ipynb
deleted file mode 100644
index c708e04e..00000000
--- a/demo/HuggingFace/notebooks/t5.ipynb
+++ /dev/null
@@ -1,664 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "28e6e614-e360-4292-965e-0d255027e9b9",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n",
- "#\n",
- "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
- "# you may not use this file except in compliance with the License.\n",
- "# You may obtain a copy of the License at\n",
- "#\n",
- "# http://www.apache.org/licenses/LICENSE-2.0\n",
- "#\n",
- "# Unless required by applicable law or agreed to in writing, software\n",
- "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
- "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
- "# See the License for the specific language governing permissions and\n",
- "# limitations under the License.\n",
- "# =============================================================================="
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9b88dc1a-a92d-44cc-9fb7-d9e2ef20c8e2",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "# Accelerating HuggingFace T5 Inference with TensorRT\n",
- "\n",
- "T5 is an encoder-decoder model that converts all NLP problems into a text-to-text format. More specifically, it does so by encoding different tasks as text directives in the input stream. This enables a single model to be trained supervised on a wide variety of NLP tasks such as translation, classification, Q&A and summarization.\n",
- "\n",
- "This notebook shows 3 easy steps to convert a [HuggingFace PyTorch T5 model](https://huggingface.co/transformers/model_doc/t5.html) to a TensorRT engine for high-performance inference.\n",
- "\n",
- "1. [Download HuggingFace T5 model](#1)\n",
- "1. [Convert to ONNX format](#2)\n",
- "1. [Convert to TensorRT engine](#3)\n",
- "\n",
- "## Prerequisite\n",
- "\n",
- "Follow the instruction at https://github.com/NVIDIA/TensorRT to build the TensorRT-OSS docker container required to run this notebook.\n",
- "\n",
- "Next, we install some extra dependencies."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "0c36ecb7-c622-4d95-a851-b9a6eb18e81b",
- "metadata": {},
- "outputs": [],
- "source": [
- "%%capture\n",
- "!pip3 install -r ../requirements.txt"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a1bbdafb",
- "metadata": {},
- "source": [
- "**Note:** After this step, you should restart the Jupyter kernel for the change to take effect."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "235d2f1b-439e-4cd0-8286-1d63a13f2cf3",
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "import sys\n",
- "ROOT_DIR = os.path.abspath(\"../\")\n",
- "sys.path.append(ROOT_DIR)\n",
- "\n",
- "import torch\n",
- "import tensorrt as trt\n",
- "\n",
- "# huggingface\n",
- "from transformers import (\n",
- " T5ForConditionalGeneration,\n",
- " T5Tokenizer,\n",
- " T5Config,\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "af4254e2-11fd-4bc7-ac0b-60b1a9e07c4e",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 1. Download HuggingFace T5 model\n",
- "\n",
- "First, we download the original HuggingFace PyTorch T5 model from HuggingFace model hubs, together with its associated tokernizer.\n",
- "\n",
- "The T5 variants that are suported by TensorRT 8 are: t5-small (60M), t5-base (220M), t5-large (770M), t5-3b(3B), t5-11b(11B)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "fae66d58-f994-4987-8f1d-1fa8ac2ec8b4",
- "metadata": {},
- "outputs": [],
- "source": [
- "T5_VARIANT = 't5-small' # choices: t5-small | t5-base | t5-large | t5-3b | t5-11b\n",
- "\n",
- "t5_model = T5ForConditionalGeneration.from_pretrained(T5_VARIANT)\n",
- "tokenizer = T5Tokenizer.from_pretrained(T5_VARIANT)\n",
- "config = T5Config.from_pretrained(T5_VARIANT, use_cache = False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7252ca90-1104-40dc-8e72-f51c07a4cd11",
- "metadata": {},
- "outputs": [],
- "source": [
- "# save model locally\n",
- "pytorch_model_dir = './models/{}/pytorch'.format(T5_VARIANT)\n",
- "!mkdir -p $pytorch_model_dir\n",
- "\n",
- "t5_model.save_pretrained(pytorch_model_dir)\n",
- "print(\"Pytorch Model saved to {}\".format(pytorch_model_dir))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "11ea023d-c4d4-43bb-9d77-c76684e0b06f",
- "metadata": {},
- "source": [
- "### Inference with PyTorch model\n",
- "\n",
- "Next, we will carry out inference with the PyTorch model.\n",
- "\n",
- "#### Single example inference"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "544dea73",
- "metadata": {},
- "outputs": [],
- "source": [
- "inputs = tokenizer(\"translate English to German: That is good.\", return_tensors=\"pt\")\n",
- "input_ids = inputs.input_ids\n",
- "num_beams = 1"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "ed1edf8a",
- "metadata": {},
- "outputs": [],
- "source": [
- "# WAR: Using an ugly representation because cuda 11.4 does not support GPU models due to cublas errors\n",
- "if \"cuda-11.4\" in os.environ[\"LD_LIBRARY_PATH\"]:\n",
- " t5_model = t5_model.cpu()\n",
- " input_ids = input_ids.cpu()\n",
- " inputs = inputs.to('cpu')\n",
- "else:\n",
- " t5_model = t5_model.cuda()\n",
- " input_ids = input_ids.cuda()\n",
- " inputs = inputs.to('cuda:0')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "13913fd9",
- "metadata": {},
- "outputs": [],
- "source": [
- "# inference on a single example\n",
- "t5_model.eval()\n",
- "with torch.no_grad():\n",
- " outputs = t5_model(**inputs, labels=inputs[\"input_ids\"])\n",
- "\n",
- "logits = outputs.logits"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "98f7fd8b-2ee3-4d25-9204-7713eb7e90b3",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Generate sequence for an input\n",
- "outputs = t5_model.generate(input_ids, num_beams=num_beams)\n",
- "print(tokenizer.decode(outputs[0], skip_special_tokens=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "667fcacc-02cb-415d-a9ff-2d2ec44ef225",
- "metadata": {},
- "source": [
- "#### Model inference benchmark: encoder and decoder stacks\n",
- "\n",
- "For benchmarking purposes, we will employ a helper functions `encoder_inference` and `decoder_inference` which execute the inference repeatedly for the T5 encoder and decoder stacks separately, and measure end to end execution time. Let's take note of this execution time for comparison with TensorRT. \n",
- " \n",
- "`TimingProfile` is a named tuple that specifies the number of experiments and number of times to call the function per iteration (and number of warm-up calls although it is not used here)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "596ea542-d9e5-4367-b643-d60027fa05e6",
- "metadata": {},
- "outputs": [],
- "source": [
- "from T5.measurements import decoder_inference, encoder_inference, full_inference\n",
- "from T5.export import T5EncoderTorchFile, T5DecoderTorchFile, T5EncoderTRTEngine, T5DecoderTRTEngine\n",
- "from NNDF.networks import TimingProfile\n",
- "from NNDF.torch_utils import expand_inputs_for_beam_search\n",
- "\n",
- "t5_torch_encoder = T5EncoderTorchFile.TorchModule(t5_model.encoder)\n",
- "t5_torch_decoder = T5DecoderTorchFile.TorchModule(\n",
- " t5_model.decoder, t5_model.lm_head, t5_model.config\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "be755fbc-c53e-4f8d-a9c2-4817167cf93a",
- "metadata": {},
- "outputs": [],
- "source": [
- "input_ids = inputs.input_ids\n",
- "\n",
- "encoder_last_hidden_state, encoder_e2e_median_time = encoder_inference(\n",
- " t5_torch_encoder, input_ids, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50)\n",
- ")\n",
- "encoder_e2e_median_time"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "960f05fc-f572-4832-ad82-8a75823866b1",
- "metadata": {},
- "outputs": [],
- "source": [
- "_, decoder_e2e_median_time = decoder_inference(\n",
- " t5_torch_decoder, input_ids, encoder_last_hidden_state, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50)\n",
- ")\n",
- "decoder_e2e_median_time"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a99d5a06-a8f5-4ce7-a34c-bc42f07ac706",
- "metadata": {},
- "source": [
- "#### Full model inference and benchmark\n",
- "\n",
- "Next, we will try the T5 model for the task of translation from English to German.\n",
- "\n",
- "For benchmarking purposes, we will employ a helper function `full_inference` which executes the inference repeatedly and measures end to end execution time. Let's take note of this execution time for comparison with TensorRT. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "39d511cf-d963-4629-be54-22e9a258716d",
- "metadata": {},
- "outputs": [],
- "source": [
- "from T5.T5ModelConfig import T5ModelTRTConfig, T5Metadata\n",
- "decoder_output, full_e2e_median_runtime = full_inference(\n",
- " t5_torch_encoder,\n",
- " t5_torch_decoder,\n",
- " input_ids,\n",
- " tokenizer,\n",
- " TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " num_beams=num_beams,\n",
- " max_length=T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[T5_VARIANT],\n",
- ")\n",
- "full_e2e_median_runtime"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8cff48fc-b792-4852-b638-6e2c54099cb2",
- "metadata": {},
- "source": [
- "Let us decode the model's output back into text."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "839bc6bc-65dc-499d-ac26-81456dbc1748",
- "metadata": {},
- "outputs": [],
- "source": [
- "# De-tokenize output to raw text\n",
- "print(tokenizer.decode(decoder_output[0], skip_special_tokens=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0d662701-e430-4fdc-ad46-1f296defcf8f",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 2. Convert to ONNX\n",
- "\n",
- "Prior to converting the model to a TensorRT engine, we will first convert the PyTorch model to an intermediate universal format.\n",
- "\n",
- "ONNX is an open format for machine learning and deep learning models. It allows you to convert deep learning and machine learning models from different frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras to a single format.\n",
- "\n",
- "The steps to convert a PyTorch model to TensorRT are as follows:\n",
- "- Convert the pretrained image segmentation PyTorch model into ONNX.\n",
- "- Import the ONNX model into TensorRT.\n",
- "- Apply optimizations and generate an engine.\n",
- "- Perform inference on the GPU. \n",
- "\n",
- "For the T5 model, we will convert the encoder and decoder seperately."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c2b2be1a-021c-4f6c-957d-2ff7d1b95976",
- "metadata": {},
- "outputs": [],
- "source": [
- "# helpers\n",
- "from NNDF.networks import NetworkMetadata, Precision"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c50346f7-6c2c-4e4b-ba70-875688947b75",
- "metadata": {},
- "outputs": [],
- "source": [
- "onnx_model_path = './models/{}/ONNX'.format(T5_VARIANT)\n",
- "\n",
- "metadata=NetworkMetadata(variant=T5_VARIANT, precision=Precision(fp16=True), other=T5Metadata(kv_cache=False))\n",
- "\n",
- "encoder_onnx_model_path = os.path.join(onnx_model_path, \"encoder\")\n",
- "decoder_onnx_model_path = os.path.join(onnx_model_path, \"decoder\")\n",
- "!mkdir -p $encoder_onnx_model_path\n",
- "!mkdir -p $decoder_onnx_model_path\n",
- "\n",
- "encoder_onnx_model_fpath = T5_VARIANT + \"-encoder.onnx\"\n",
- "decoder_onnx_model_fpath = T5_VARIANT + \"-decoder-with-lm-head.onnx\"\n",
- "\n",
- "t5_encoder = T5EncoderTorchFile(t5_model.to('cpu'), metadata)\n",
- "t5_decoder = T5DecoderTorchFile(t5_model.to('cpu'), metadata)\n",
- "\n",
- "onnx_t5_encoder = t5_encoder.as_onnx_model(\n",
- " os.path.join(encoder_onnx_model_path, encoder_onnx_model_fpath), force_overwrite=False\n",
- ")\n",
- "onnx_t5_decoder = t5_decoder.as_onnx_model(\n",
- " os.path.join(decoder_onnx_model_path, decoder_onnx_model_fpath), force_overwrite=False\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7baf007e-5508-485c-a87f-9bfe16260452",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "## 3. Convert to TensorRT\n",
- "\n",
- "Now we are ready to parse the ONNX encoder and decoder models and convert them to optimized TensorRT engines.\n",
- "\n",
- "Since the models contains dynamic input shapes, we can specify a valid input range with a TensorRT optimization profile."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "037ac958-2627-439c-9db5-27640e3f7967",
- "metadata": {},
- "outputs": [],
- "source": [
- "from T5.export import T5DecoderONNXFile, T5EncoderONNXFile\n",
- "from polygraphy.backend.trt import Profile\n",
- "from tensorrt import PreviewFeature"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6bd6e3fc-6797-46b0-a211-ce42d3769105",
- "metadata": {},
- "outputs": [],
- "source": [
- "tensorrt_model_path = './models/{}/tensorrt'.format(T5_VARIANT)\n",
- "!mkdir -p tensorrt_model_path\n",
- "# Decoder optimization profiles\n",
- "batch_size = 1\n",
- "max_sequence_length = T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[T5_VARIANT]\n",
- "decoder_profile = Profile()\n",
- "decoder_profile.add(\n",
- " \"input_ids\",\n",
- " min=(batch_size * num_beams, 1),\n",
- " opt=(batch_size * num_beams, max_sequence_length // 2),\n",
- " max=(batch_size * num_beams, max_sequence_length),\n",
- ")\n",
- "decoder_profile.add(\n",
- " \"encoder_hidden_states\",\n",
- " min=(batch_size * num_beams, 1, max_sequence_length),\n",
- " opt=(batch_size * num_beams, max_sequence_length // 2, max_sequence_length),\n",
- " max=(batch_size * num_beams, max_sequence_length, max_sequence_length),\n",
- ")\n",
- "\n",
- "# Encoder optimization profiles\n",
- "encoder_profile = Profile()\n",
- "encoder_profile.add(\n",
- " \"input_ids\",\n",
- " min=(batch_size, 1),\n",
- " opt=(batch_size, max_sequence_length // 2),\n",
- " max=(batch_size, max_sequence_length),\n",
- ")\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "cfb64120-9012-40c8-b1e2-4a6366b71294",
- "metadata": {},
- "outputs": [],
- "source": [
- "disable_preview_dynamic_shapes = False\n",
- "engine_tag = f\"bs{batch_size}\"\n",
- "\n",
- "if num_beams > 1:\n",
- " engine_tag += \"-beam{}\".format(num_beams)\n",
- "\n",
- "preview_features = [PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]\n",
- "if disable_preview_dynamic_shapes:\n",
- " engine_tag += \"-noFasterDynamicShapes\"\n",
- "else:\n",
- " preview_features += [PreviewFeature.FASTER_DYNAMIC_SHAPES_0805]\n",
- "\n",
- "encoder_engine_name = os.path.join(tensorrt_model_path, encoder_onnx_model_fpath) + f\"-{engine_tag}.engine\".replace(f\"-beam{num_beams}\", \"\") # encoder engine not affected by beam search\n",
- "decoder_engine_name = os.path.join(tensorrt_model_path, decoder_onnx_model_fpath) + f\"-{engine_tag}.engine\"\n",
- "\n",
- "if not os.path.exists(encoder_engine_name):\n",
- " t5_trt_encoder_engine = T5EncoderONNXFile(os.path.join(encoder_onnx_model_path, encoder_onnx_model_fpath), metadata).as_trt_engine(\n",
- " encoder_engine_name,\n",
- " profiles=[encoder_profile],\n",
- " preview_features=preview_features)\n",
- "else:\n",
- " t5_trt_encoder_engine = T5EncoderTRTEngine(encoder_engine_name, metadata)\n",
- "\n",
- "if not os.path.exists(decoder_engine_name):\n",
- " t5_trt_decoder_engine = T5DecoderONNXFile(os.path.join(decoder_onnx_model_path, decoder_onnx_model_fpath), metadata).as_trt_engine(\n",
- " decoder_engine_name,\n",
- " profiles=[decoder_profile],\n",
- " preview_features=preview_features)\n",
- "else:\n",
- " t5_trt_decoder_engine = T5DecoderTRTEngine(decoder_engine_name, metadata)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "74f7f6fc-1e6a-4ddc-8e9b-543d9e8dab4d",
- "metadata": {
- "tags": []
- },
- "source": [
- "### Inference with TensorRT engine\n",
- "\n",
- "Great, if you have reached this stage, it means we now have an optimized TensorRT engine for the T5 model, ready for us to carry out inference. \n",
- "\n",
- "#### Single example inference\n",
- "The T5 model with TensorRT backend can now be employed in place of the original HuggingFace T5 model.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3954f2f4-c393-463b-a44b-3e5335032b57",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Initialize TensorRT engines\n",
- "from T5.trt import T5TRTEncoder, T5TRTDecoder\n",
- "\n",
- "t5_trt_encoder = T5TRTEncoder(\n",
- " t5_trt_encoder_engine, metadata, config\n",
- " )\n",
- "t5_trt_decoder = T5TRTDecoder(\n",
- " t5_trt_decoder_engine, metadata, config, num_beams=num_beams\n",
- " )"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "a9544ecb-2671-4b53-a544-08f13424cefe",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Inference on a single sample\n",
- "encoder_last_hidden_state = t5_trt_encoder(input_ids=input_ids)\n",
- "outputs = t5_trt_decoder(\n",
- " expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids, \n",
- " expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8d71a327-546f-4b5b-bd42-caaffcceafc7",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Generate sequence for an input\n",
- "max_length = 64\n",
- "\n",
- "decoder_input_ids = torch.full(\n",
- " (1, 1), tokenizer.convert_tokens_to_ids(tokenizer.pad_token), dtype=torch.int32\n",
- ").to(\"cuda:0\")\n",
- "\n",
- "encoder_last_hidden_state = t5_trt_encoder(input_ids=input_ids)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ed9d4a98-b034-470e-a9f8-096d4100b8d4",
- "metadata": {},
- "source": [
- "#### TRT engine inference benchmark: encoder and decoder stacks\n",
- "First, we will bechmark the encoder and decoder stacks as before."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "70b37591-4398-40ff-8a39-5f75347192dc",
- "metadata": {},
- "outputs": [],
- "source": [
- "encoder_last_hidden_state, encoder_e2e_median_time = encoder_inference(\n",
- " t5_trt_encoder, input_ids, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50)\n",
- ")\n",
- "encoder_e2e_median_time\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "7e5459da-a01b-4894-88dc-01b3637ded53",
- "metadata": {},
- "outputs": [],
- "source": [
- "_, decoder_e2e_median_time = decoder_inference(\n",
- " t5_trt_decoder, expand_inputs_for_beam_search(input_ids, num_beams) if num_beams > 1 else input_ids, \n",
- " expand_inputs_for_beam_search(encoder_last_hidden_state, num_beams) if num_beams > 1 else encoder_last_hidden_state, TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50)\n",
- ")\n",
- "decoder_e2e_median_time"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "62ebfe03-7a60-4dd0-ad32-4e53d6012b07",
- "metadata": {},
- "source": [
- "### Full model inference benchmark\n",
- "\n",
- "Next, we will try the full TensorRT T5 engine for the task of translation. As before, note the time difference."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "f31cb550-24b9-48cd-a4ec-0bf18ac5e40c",
- "metadata": {},
- "outputs": [],
- "source": [
- "decoder_output, full_e2e_median_runtime = full_inference(\n",
- " t5_trt_encoder,\n",
- " t5_trt_decoder,\n",
- " input_ids,\n",
- " tokenizer,\n",
- " TimingProfile(iterations=10, number=1, warmup=1, duration=0, percentile=50),\n",
- " max_length=T5ModelTRTConfig.MAX_SEQUENCE_LENGTH[metadata.variant],\n",
- " num_beams=num_beams,\n",
- " use_cuda=True,\n",
- ")\n",
- "\n",
- "print(tokenizer.decode(decoder_output[0], skip_special_tokens=True))\n",
- "full_e2e_median_runtime\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "92031643-8ee8-4d50-864b-a08e4d551dc6",
- "metadata": {},
- "source": [
- "You can now compare the output of the original PyTorch model and the TensorRT engine. Notice the speed difference. On an NVIDIA V100 32GB GPU, this results in upto ~10x performance improvement (from 0.0802s to 0.0082s for the T5-small variant)."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2a1f5dca-397c-4c8c-9200-61b30cdba824",
- "metadata": {},
- "source": [
- "## Conclusion and where-to next?\n",
- "\n",
- "This notebook has walked you through the process of converting a HuggingFace PyTorch T5 model to an optimized TensorRT engine for inference in 3 easy steps. The TensorRT inference engine can be conviniently used as a drop-in replacement for the orginial HuggingFace T5 model while providing significant speed up. \n",
- "\n",
- "If you are interested in further details of the conversion process, check out [T5/trt.py](../T5/trt.py)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b6a8b7c8",
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.10.6"
- },
- "vscode": {
- "interpreter": {
- "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/demo/HuggingFace/requirements.txt b/demo/HuggingFace/requirements.txt
deleted file mode 100644
index 30d9cdb1..00000000
--- a/demo/HuggingFace/requirements.txt
+++ /dev/null
@@ -1,31 +0,0 @@
-#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-huggingface-hub==0.11.0; python_version>="3.7"
-huggingface-hub==0.4.0; python_version<"3.7"
-transformers==4.20.0; python_version>="3.7"
-transformers==4.18.0; python_version<"3.7"
-torch==1.13.1; python_version>="3.7"
-torch==1.10; python_version<"3.7"
-sentencepiece==0.1.95; python_version<"3.10"
-sentencepiece==0.1.97; python_version>="3.10"
---extra-index-url https://pypi.ngc.nvidia.com
-onnx==1.9.0; python_version<"3.8"
-onnx==1.13.1; python_version>="3.8"
-polygraphy>=0.42.2
-tabulate
-toml
-onnx_graphsurgeon
diff --git a/demo/HuggingFace/run.py b/demo/HuggingFace/run.py
deleted file mode 100644
index 3521b57f..00000000
--- a/demo/HuggingFace/run.py
+++ /dev/null
@@ -1,312 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Demonstrates TensorRT capabilities with networks located in HuggingFace repository.
-Requires Python 3.5+
-"""
-
-import os
-import sys
-import pickle
-import argparse
-import importlib
-
-from abc import abstractmethod
-from typing import List
-
-# tabulate
-from tabulate import tabulate
-
-ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
-sys.path.append(ROOT_DIR)
-
-# Wrapper actions supported
-WRAPPER_RUN_ACTION = "run"
-WRAPPER_LIST_ACTION = "list"
-WRAPPER_COMPARE_ACTION = "compare"
-WRAPPER_BENCHMARK_ACTION = "benchmark"
-WRAPPER_ACTIONS = [WRAPPER_RUN_ACTION, WRAPPER_LIST_ACTION, WRAPPER_COMPARE_ACTION, WRAPPER_BENCHMARK_ACTION]
-
-# NNDF
-from NNDF.general_utils import process_per_result_entries, process_results, register_network_folders, RANDOM_SEED
-from NNDF.logger import G_LOGGER
-from NNDF.cuda_bootstrapper import bootstrap_ld_library_path
-
-# huggingface
-from transformers import set_seed
-
-# Force seed to 42 for reproducibility.
-set_seed(RANDOM_SEED)
-
-class Action:
- def __init__(self, networks: List[str], parser: argparse.ArgumentParser):
- self.networks = networks
- self.parser = parser
- self.add_args(self.parser)
-
- @abstractmethod
- def execute(self, args: argparse.Namespace):
- pass
-
- @abstractmethod
- def add_args(self, parser: argparse.ArgumentParser):
- pass
-
-
-class NetworkScriptAction(Action):
-
- # Reserved files names for each network folder
- FRAMEWORKS_SCRIPT_NAME = "frameworks"
- TRT_SCRIPT_NAME = "trt"
- ONNX_SCRIPT_NAME = "onnxrt"
- PER_NETWORK_SCRIPTS = [FRAMEWORKS_SCRIPT_NAME, TRT_SCRIPT_NAME, ONNX_SCRIPT_NAME]
-
- def add_args(self, parser):
- network_group = parser.add_argument_group("specify network")
- network_group.add_argument(
- "network", help="Network to run.", choices=self.networks
- )
-
- def load_script(self, script_name: str, args: argparse.Namespace):
- """Helper for loading a specific script for given network."""
- assert (
- script_name in self.PER_NETWORK_SCRIPTS
- ), "Script must be a reserved name."
-
- # Load the specific commandline script
- return importlib.import_module("{}.{}".format(args.network, script_name))
-
-
-class RunAction(NetworkScriptAction):
- def execute(self, args: argparse.Namespace):
- module = self.load_script(args.script, args)
- module.RUN_CMD._parser = self.parser
-
- old_path = os.getcwd()
- # Execute script in each relevant folder
- try:
- os.chdir(args.network)
- results = module.RUN_CMD()
- finally:
- os.chdir(old_path)
-
- # Output to terminal
- print(results)
-
- # Dump results as a pickle file if applicable.
- # Useful for testing or post-processing.
- if args.save_output_fpath:
- with open(args.save_output_fpath, "wb") as f:
- pickle.dump(results, f)
-
- return 0
-
- def add_args(self, parser: argparse.ArgumentParser):
- super().add_args(parser)
- run_group = parser.add_argument_group("run args")
- run_group.add_argument("script", choices=self.PER_NETWORK_SCRIPTS)
- run_group.add_argument("--save-output-fpath", "-o", default=None, help="Outputs a pickled NetworkResult object. See networks.py for definition.")
-
-
-class BenchmarkAction(NetworkScriptAction):
- def execute(self, args: argparse.Namespace):
- module = self.load_script(args.script, args)
- module.RUN_CMD._parser = self.parser
-
- old_path = os.getcwd()
- # Execute script in each relevant folder
- try:
- os.chdir(args.network)
- results = module.RUN_CMD.run_benchmark()
- finally:
- os.chdir(old_path)
-
- # Output to terminal
- print(results)
-
- return 0
-
- def add_args(self, parser: argparse.ArgumentParser):
- super().add_args(parser)
- run_group = parser.add_argument_group("benchmark args")
- run_group.add_argument("script", choices=self.PER_NETWORK_SCRIPTS)
-
-
-class CompareAction(NetworkScriptAction):
- GENERAL_HEADERS = ["script", "accuracy"]
-
- def execute(self, args: argparse.Namespace):
- compare_group = []
- if args.compare is None:
- compare_group = self.PER_NETWORK_SCRIPTS
- else:
- compare_group = args.compare
-
- if len(compare_group) <= 1:
- G_LOGGER.error(
-                "Comparison command must have at least two groups to compare."
- )
- exit()
-
- results = []
- # Get the parser for inference script which is a superset
- module = None
- try:
- module = self.load_script(self.TRT_SCRIPT_NAME, args)
- except ModuleNotFoundError as e:
- print("Unable to do comparison. TRT script not yet supported.")
- exit(1)
-
- nconfig = module.RUN_CMD.config
- nconfig.MetadataClass.add_inference_args(self.parser)
- self.parser.parse_known_args()
-
- results = []
- # It is possible certain scripts are not implemented
- # Allow the results to generate even if script does not exist.
- modified_compare_group = []
- for g in compare_group:
- cwd = os.getcwd()
- try:
- print()
- print("Collecting Data for {}".format(g))
- os.chdir(args.network)
- module = self.load_script(g, args)
- module.RUN_CMD._parser = self.parser
- results.append(module.RUN_CMD())
- modified_compare_group.append(g)
- except ModuleNotFoundError as e:
- print("{} is not valid, the demo does not support this script yet. Ignoring.".format(g))
-
- finally:
- os.chdir(cwd)
-
- headers, rows = process_per_result_entries(modified_compare_group, results)
- # Rows are grouped by input, flatten to show as one large table
- flattened_rows = [r for input_row in rows.values() for r in input_row]
- print()
- print(tabulate(flattened_rows, headers=headers))
-
- headers, rows = process_results(modified_compare_group, results, nconfig)
- print()
- print(tabulate(rows, headers=headers))
-
- return 0
-
- def add_args(self, parser: argparse.ArgumentParser):
- super().add_args(parser)
- compare_group = parser.add_argument_group("compare args")
- compare_group.add_argument(
- "--compare",
- "-c",
- nargs="+",
- default=None,
- choices=self.PER_NETWORK_SCRIPTS,
- help="Specific frameworks to compare. If none is specified, all are compared.",
- )
-
-
-class ListAction(Action):
- def __init__(self, networks: List[str], parser: argparse.ArgumentParser):
- super().__init__(networks, parser)
- self.networks = networks
-
- def execute(self, args: argparse.Namespace):
- print("Networks that are supported by HuggingFace Demo:")
- [print(n) for n in self.networks]
- return 0
-
-
-def get_action(
- action_name: str, networks: List[str], parser: argparse.ArgumentParser
-) -> Action:
- return {
- WRAPPER_COMPARE_ACTION: CompareAction,
- WRAPPER_LIST_ACTION: ListAction,
- WRAPPER_RUN_ACTION: RunAction,
- WRAPPER_BENCHMARK_ACTION: BenchmarkAction,
- }[action_name](networks, parser)
-
-
-def get_default_parser(
- networks: List[str], description: str = "", add_default_help=False
-) -> argparse.ArgumentParser:
- """
- Returns argparser for use by main(). Allows the ability to toggle default help message with a custom help flag
- so that argparser does not throw SystemExit when --help is passed in. Useful for custom --help functionality.
-
- Returns:
- (argparse.ArgumentParser): argparser used by main()
- """
- # This variable is set so that usage errors don't show up in wrapper
- parser = argparse.ArgumentParser(
- conflict_handler="resolve",
- description=description,
- add_help=add_default_help,
- prog="run.py",
- )
- required_group = parser.add_argument_group("required wrapper arguments")
-
- required_group.add_argument("action", choices=WRAPPER_ACTIONS)
-
- if not add_default_help:
- parser.add_argument(
- "--help",
- "-h",
- help="Shows help message. If --network is supplied, returns help for specific script.",
- action="store_true",
- )
- return parser
-
-
-def verify_python_version():
- if sys.version_info.major < 3 or sys.version_info.minor <= 6:
- raise RuntimeError("HuggingFace OSS Demo does not support Python <= 3.6 due to end-of-life.")
-
-
-def main() -> None:
- """
- Parses network folders and responsible for passing --help flags to subcommands if --network is provided.
- """
- # Verify python version support
- verify_python_version()
-
- # Get all available network scripts
- networks = register_network_folders(os.getcwd())
-
- # Add network folder for entry point
- description = "Runs TensorRT networks that are based-off of HuggingFace variants."
- parser = get_default_parser(networks, description, add_default_help=False)
-
- # Get the general network wrapper help
- known_args, _ = parser.parse_known_args()
-
- # Delegate parser to action specifics
- action = get_action(known_args.action, networks, parser)
- known_args, _ = parser.parse_known_args()
-
- # If bootstrap occurs, then the spawned process completes the rest of demo.
- # We can exit safely. We spawn after parsing basic args to reduce loading churn on rudimentary help commands.
- if bootstrap_ld_library_path():
- sys.exit(0)
-
- return action.execute(known_args)
-
-
-if __name__ == "__main__":
- main()
diff --git a/demo/HuggingFace/tests/test_interface.py b/demo/HuggingFace/tests/test_interface.py
deleted file mode 100644
index 9dda902f..00000000
--- a/demo/HuggingFace/tests/test_interface.py
+++ /dev/null
@@ -1,62 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-Tests and verifies our interface objects
-"""
-
-# std
-import os
-import sys
-
-# pytest
-import pytest
-
-# Add library path
-TEST_DIR = os.path.dirname(os.path.abspath(__file__))
-sys.path.append(os.path.join(TEST_DIR, os.pardir))
-
-
-@pytest.fixture(scope="session")
-def inetwork():
- import NNDF.networks as mod
- return mod
-
-
-def test_network_result(inetwork):
- # Test the API by explicit flags
- inetwork.NetworkResult(
- input="example",
- output_tensor=[],
- semantic_output="hello",
- median_runtime=9001,
- models=[],
- )
-
-
-def test_network_checkpoint_result(inetwork):
- inetwork.NetworkCheckpointResult(network_results=[], accuracy=9001.0, perplexity=5.0)
-
-
-def test_precision(inetwork):
- inetwork.Precision(fp16=True)
-
-
-def test_network_metadata(inetwork):
- inetwork.NetworkMetadata(
- variant="gpt2", precision=inetwork.Precision(fp16=True), other=None
- )
diff --git a/demo/NeMo/.gitignore b/demo/NeMo/.gitignore
new file mode 100644
index 00000000..af9bae11
--- /dev/null
+++ b/demo/NeMo/.gitignore
@@ -0,0 +1,5 @@
+apex/
+Megatron-LM/
+NeMo/
+temp/
+__pycache__/
diff --git a/demo/NeMo/GPT3/GPT3ModelConfig.py b/demo/NeMo/GPT3/GPT3ModelConfig.py
new file mode 100644
index 00000000..0e50d6ce
--- /dev/null
+++ b/demo/NeMo/GPT3/GPT3ModelConfig.py
@@ -0,0 +1,87 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Base Class
+import sys
+sys.path.append('../../HuggingFace') # Include HuggingFace directory
+from NNDF.networks import NNConfig, NetworkMetadata
+
+class GPT3ModelTRTConfig(NNConfig):
+
+ NETWORK_FULL_NAME = "full"
+ TARGET_MODELS = [
+ "gpt-126m",
+ "gpt-1.3b",
+ "gpt-5b",
+ ]
+
+ def __init__(
+ self,
+ metadata,
+ **kwargs
+ ):
+ super().__init__(
+ network_name="GPT3",
+ **kwargs
+ )
+ self.nemo_config = None
+ self.use_mask = False
+ self.metadata = metadata
+ self.variant = metadata.variant
+
+ def from_nemo_config(self, nemo_config):
+ self.nemo_config = nemo_config
+
+ def get_metadata_string(self, metadata: NetworkMetadata) -> str:
+ """
+ Serializes a Metadata object into string.
+        The string is checked to ensure it is friendly to filenames across Windows and Linux operating systems.
+ This function is a modified version from HuggingFace/NNDF/networks.py.
+
+ returns:
+ string: -[-]*-
+ """
+
+ enabled_precisions = self.nemo_config.trt_export_options
+ precision_str = "-".join(
+ [
+ k for k, v in {
+ "fp8": enabled_precisions.use_fp8,
+ "fp16": enabled_precisions.use_fp16,
+ "bf16": enabled_precisions.use_bf16,
+ }.items() if v
+ ]
+ )
+
+ result = [self.network_name, metadata.variant]
+ if precision_str:
+ result.append(precision_str)
+
+ # Append max sequence length
+ result.append("ms" + str(self.nemo_config.model.max_seq_len))
+
+ if metadata.use_cache:
+ result.append("kv_cache")
+
+ final_str = "-".join(result)
+ assert self._is_valid_filename(
+ final_str
+ ), "Metadata for current network {} is not filename friendly: {}.".format(
+ self.network_name, final_str
+ )
+
+ return final_str
diff --git a/demo/NeMo/GPT3/decoding.py b/demo/NeMo/GPT3/decoding.py
new file mode 100644
index 00000000..2edf66e7
--- /dev/null
+++ b/demo/NeMo/GPT3/decoding.py
@@ -0,0 +1,453 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from collections.abc import Iterable
+import sys
+from typing import List
+
+from apex.transformer.pipeline_parallel.utils import _reconfigure_microbatch_calculator
+from megatron.core import parallel_state
+from nemo.collections.nlp.modules.common.text_generation_strategy import GPTModelTextGenerationStrategy
+from nemo.utils import AppState
+import torch
+import torch.nn.functional as F
+
+from GPT3.trt_utils import GPTTRTDecoder
+
+sys.path.append('../../HuggingFace') # Include HuggingFace
+from NNDF.logger import G_LOGGER
+
+
+def sample_sequence_batch(
+ model,
+ inference_strategy,
+ context_tokens,
+ context_lengths,
+ tokens_to_generate,
+ all_probs=False,
+ temperature=None,
+ extra={},
+):
+ def repetition_penalty(logits, repetition_penalty, used_tokens):
+ """ Implement the repetition penalty, check paper
+ https://arxiv.org/pdf/1909.05858.pdf
+ """
+ if used_tokens is not None and repetition_penalty != 1.0:
+ logits_update = torch.gather(logits, 1, used_tokens)
+ logits = torch.scatter(logits, 1, used_tokens, logits_update / repetition_penalty)
+ return logits
+
+ def top_k_logits(logits, top_k=0, top_p=0.0, filter_value=-float('Inf'), started=None):
+ """
+ This function has been mostly taken from huggingface conversational
+ ai code at
+ https://medium.com/huggingface/how-to-build-a-state-of-the-art-
+ conversational-ai-with-transfer-learning-2d818ac26313
+
+ @param logits: logits tensor
+ @param top_k: keep only top k tokens with highest probability
+ @param top_p: keep the top tokens with cumulative probability
+ @filter_value: value to set filtered tokens to
+ @started: a tensor of bools indicating whether the text generation starts for the batch
+ returns the filtered logits
+ """
+ if top_k > 0:
+ # Remove all tokens with a probability less than the
+ # last token of the top-k
+ indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
+ if started is not None:
+ for i in torch.arange(indices_to_remove.size(0))[started]:
+ logits[i, indices_to_remove[i]] = filter_value
+ else:
+ logits[indices_to_remove] = filter_value
+
+ if top_p > 0.0:
+            # Convert to 1D
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+
+ # Remove tokens with cumulative probability above the threshold
+ sorted_indices_to_remove = cumulative_probs > top_p
+ # Shift the indices to the right to keep also the first token
+ # above the threshold
+ sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+ sorted_indices_to_remove[..., 0] = 0
+ if started is not None:
+ for i in torch.arange(sorted_indices.size(0))[started]:
+ indices_to_remove = sorted_indices[i][sorted_indices_to_remove[i]]
+ logits[i, indices_to_remove] = filter_value
+ else:
+ for i in range(sorted_indices.size(0)):
+ indices_to_remove = sorted_indices[i][sorted_indices_to_remove[i]]
+ logits[i, indices_to_remove] = filter_value
+
+ return logits
+
+ app_state = AppState()
+ batch_size = context_tokens.shape[0]
+ if not (hasattr(model, "trt") or hasattr(model, "onnx")):
+ _reconfigure_microbatch_calculator(
+ rank=app_state.global_rank,
+ rampup_batch_size=None,
+ global_batch_size=batch_size,
+ micro_batch_size=batch_size,
+ data_parallel_size=1,
+ )
+
+ tokenizer = model.tokenizer
+ # initialize the batch
+ with torch.no_grad():
+ context_length = context_lengths.min().item()
+ context_lengths_cpu = context_lengths.cpu()
+ inference_strategy.init_batch(context_tokens, context_length)
+ # added eos_id to support the function generate_samples_eval that passes
+        # eos_id as an argument and needs termination when that id is found.
+ eod_id = tokenizer.eos_id
+ counter = 0
+
+ tokens = context_tokens
+ output_logits = None
+ all_generated_indices = None # used to track all generated indices
+ # Generate enough tokens for the longest sequence
+ maxlen = tokens_to_generate + context_lengths.max().item()
+ maxlen = inference_strategy.clip_max_len(maxlen)
+
+ is_done = torch.zeros([batch_size]).byte()
+ lengths = torch.ones([batch_size]).long() * maxlen
+
+ use_cache = extra.get("use_cache", False)
+ is_onnx = hasattr(model, "onnx")
+ is_trt = hasattr(model, "trt")
+
+ if is_trt:
+ assert isinstance(model.trt, GPTTRTDecoder)
+ input_ids_name = model.trt.get_input_ids_name()
+ input_ids_type = model.trt.get_torch_type(input_ids_name)
+ position_ids_name = model.trt.get_position_ids_name()
+ position_ids_type = model.trt.get_torch_type(position_ids_name)
+ attention_mask_name = model.trt.get_attention_mask_name()
+            if attention_mask_name is not None:
+ attention_mask_type = model.trt.get_torch_type(attention_mask_name)
+
+ position_ids = inference_strategy.position_ids
+ attention_mask = inference_strategy.attention_mask
+
+ torch.cuda.nvtx.range_pop() # "Prepare Batch"
+ while context_length < maxlen:
+ torch.cuda.nvtx.range_push("I/O Setup")
+
+ output = None
+ if is_onnx and use_cache:
+ G_LOGGER.warn(f"ONNX runtime path does not support KV-cache.")
+
+ # Modify counter based on using cache or not.
+ if is_trt:
+ # TRT input preprocessing doesn't use nemo function
+ pass
+ elif not is_onnx and use_cache:
+ batch, tensor_shape = inference_strategy.prepare_batch_at_step(
+ tokens, maxlen, batch_size, counter, context_length
+ )
+ else:
+ batch, tensor_shape = inference_strategy.prepare_batch_at_step(
+ tokens, maxlen, batch_size, 0, context_length # step is always 0
+ )
+
+ # inputs input_ids: [BS, SEQ], position_ids: [BS, SEQ], attention_mask: [1, 1, SEQ, SEQ]
+ if is_trt:
+ context_mode = (use_cache and counter == 0) or not use_cache
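+                # With KV cache enabled, only the first iteration runs in context mode over the
+                # full prompt; later iterations feed just the newest token and reuse the cached keys/values.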
+ if context_mode or not use_cache:
+ # context mode
+ batch_tokens = tokens[:, :context_length]
+ batch_position_ids = position_ids[:, :context_length]
+ else:
+ # generate mode
+ batch_tokens = tokens[:, context_length - 1].view(batch_size, -1)
+ batch_position_ids = position_ids[:, context_length - 1].view(batch_size, -1)
+ seq_len = batch_tokens.shape[1]
+ batch_attention_mask = attention_mask[0:1, 0:1, :seq_len, :seq_len]
+ input_ids = batch_tokens.type(input_ids_type).contiguous().cuda()
+ tensor_dict = {input_ids_name : (input_ids.data_ptr(), input_ids.shape)}
+                if position_ids_name is not None:
+ batch_position_ids = batch_position_ids.type(position_ids_type).contiguous().cuda()
+ tensor_dict[position_ids_name] = (batch_position_ids.data_ptr(), batch_position_ids.shape)
+                if attention_mask_name is not None:
+ batch_attention_mask = batch_attention_mask.type(attention_mask_type).contiguous().cuda()
+ tensor_dict[attention_mask_name] = (batch_attention_mask.data_ptr(), batch_attention_mask.shape)
+
+ logits_name = model.trt.get_output_name()
+ torch.cuda.nvtx.range_pop() # "I/O Setup"
+ output = model.trt.run(logits_name, tensor_dict, seq_len, context_mode)
+
+ elif is_onnx:
+ assert len(batch) == 5, "Length of batch must be 5."
+ (
+ batch_tokens,
+ attention_mask,
+ position_ids,
+ set_inference_key_value_memory,
+ _,
+ ) = batch
+ seq_len = batch_tokens.shape[1]
+ attention_mask = attention_mask[0:1, 0:1, 0:seq_len, 0:seq_len]
+
+ from onnxruntime import InferenceSession
+ assert isinstance(model.onnxrt, InferenceSession)
+                # Currently only ONNX Runtime on CPU is supported
+ # Our fp8 models don't currently use a user-provided attention_mask
+ tensor_dict = {'input_ids': batch_tokens.cpu().detach().numpy(),
+ 'position_ids': position_ids.cpu().detach().numpy()}
+
+ def have_attention_mask(sess):
+                    return any(inp.name == 'attention_mask' for inp in sess.get_inputs())
+
+ if have_attention_mask(model.onnxrt):
+ tensor_dict['attention_mask'] = attention_mask.cpu().detach().numpy()
+ torch.cuda.nvtx.range_pop() # "I/O Setup"
+ output = model.onnxrt.run(['logits'], tensor_dict)[0]
+ output = torch.Tensor(output).cuda()
+ # output logits: [BS, SEQ, 50304]
+ else:
+ # nemo path
+ torch.cuda.nvtx.range_pop() # "I/O Setup"
+ output = inference_strategy.forward_step(batch, tensor_shape)
+ output = output[0]['logits'].float()
+
+ assert output is not None
+ torch.cuda.nvtx.range_push("Output Sampling")
+ output = output.float()
+ logits = output[:, -1].view(batch_size, -1).contiguous()
+
+ # make sure it will generate at least min_length
+ min_length = extra.get('min_tokens_to_generate', 0)
+ if min_length > 0:
+ within_min_length = (context_length - context_lengths) < min_length
+ logits[within_min_length, eod_id] = -float('Inf')
+
+ # make sure it won't sample outside the vocab_size range
+ logits[:, tokenizer.vocab_size :] = -float('Inf')
+
+ # started indicates whether the current token step passes the context_length, so we make sure not to overwrite the context tokens
+ started = context_lengths_cpu <= context_length
+ if extra.get('greedy', False):
+ prev = torch.argmax(logits, dim=-1).view(-1)
+ else:
+ logits = logits.float()
+ logits /= temperature
+                # handle repetition penalty
+ logits = repetition_penalty(logits, extra.get('repetition_penalty', 1.0), all_generated_indices)
+ logits = top_k_logits(
+ logits, top_k=extra.get('top_k', 0), top_p=extra.get('top_p', 0.9), started=started
+ )
+ probs = F.softmax(logits, dim=-1)
+ prev = torch.multinomial(probs, num_samples=1).view(-1)
+
+ prev = prev.cpu()
+ # Clamp the predicted out of vocabulary tokens
+ prev = torch.clamp(prev, max=tokenizer.vocab_size - 1)
+ # Replace sampled tokens w/ done token if EOD has already been sampled
+ new_tokens = torch.where(is_done, eod_id, prev)
+ # post process the inference tokens based on the strategy
+ inference_strategy.post_process(tokens, new_tokens, context_length)
+
+ # Insert either new predicted or next prompt token
+ if extra.get("accuracy_mode", False):
+ # We only update the last token for accuracy mode.
+ at_prediction_index = (context_lengths + tokens_to_generate - 1 == context_length)
+ tokens[:, context_length] = torch.where(at_prediction_index, new_tokens.cuda(), tokens[:, context_length])
+ else:
+ tokens[:, context_length] = torch.where(started.cuda(), new_tokens.cuda(), tokens[:, context_length])
+
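+            # Track per-token log-probabilities and generated indices for later perplexity/accuracy
+            # evaluation; benchmark mode skips this bookkeeping to measure raw generation speed.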
+ if not extra.get("benchmark_mode", False):
+ if output_logits is None:
+ output = F.log_softmax(output[:, :context_length, :], 2)
+ indices = torch.unsqueeze(tokens[:, 1 : context_length + 1], 2)
+ output_logits = torch.gather(output, 2, indices).squeeze(2)
+ all_generated_indices = indices[:, :, 0]
+ if all_probs:
+ full_logits = output
+ else:
+ output = F.log_softmax(output, 2)
+ indices = torch.unsqueeze(new_tokens.cuda(), 1).unsqueeze(2)
+ new_output_logits = torch.gather(output, 2, indices).squeeze(2)
+
+ # This copy can be optimized out by pre-allocating the memory.
+ output_logits = torch.cat([output_logits, new_output_logits], 1)
+ all_generated_indices = torch.cat([all_generated_indices, indices[:, :, 0]], 1)
+ if all_probs:
+ if extra.get("use_cache", False):
+ full_logits = torch.cat([full_logits, output], 1)
+ else:
+ full_logits = output
+
+ done_token = (prev == eod_id)
+ done_token = done_token.byte() & started.byte()
+
+ just_finished = (done_token & ~is_done).bool()
+ lengths[just_finished.view(-1)] = context_length
+ is_done = is_done | done_token
+
+ done = torch.all(is_done)
+ torch.cuda.nvtx.range_pop() # "Output Sampling"
+
+ context_length += 1
+ counter += 1
+ if done and not extra.get("benchmark_mode", False):
+ break
+
+ if all_probs:
+ return tokens, context_length, lengths, output_logits, full_logits
+ return tokens, context_length, lengths, output_logits, None
+
+def initialize_ddp(model, cfg):
+ # check whether the DDP is initialized
+ if cfg.runtime == "nemo" and parallel_state.is_unitialized():
+ def dummy():
+ return
+ if model.trainer.strategy.launcher is not None:
+ model.trainer.strategy.launcher.launch(dummy, trainer=model.trainer)
+ model.trainer.strategy.setup_environment()
+
+ if model.cfg.get('transformer_engine', False):
+ model.setup_transformer_engine_tp_groups()
+
+def get_special_tokens(tokenizer):
+ special_tokens = set()
+ if hasattr(tokenizer, 'pad_token') and tokenizer.pad_token is not None:
+ special_tokens.add(tokenizer.pad_token)
+ if hasattr(tokenizer, 'eos_token') and tokenizer.eos_token is not None:
+ special_tokens.add(tokenizer.eos_token)
+ if hasattr(tokenizer, 'bos_token') and tokenizer.bos_token is not None:
+ special_tokens.add(tokenizer.bos_token)
+ if hasattr(tokenizer, 'cls_token') and tokenizer.cls_token is not None:
+ special_tokens.add(tokenizer.cls_token)
+ if hasattr(tokenizer, 'unk_token') and tokenizer.unk_token is not None:
+ special_tokens.add(tokenizer.unk_token)
+ if hasattr(tokenizer, 'sep_token') and tokenizer.sep_token is not None:
+ special_tokens.add(tokenizer.sep_token)
+ if hasattr(tokenizer, 'mask_token') and tokenizer.mask_token is not None:
+ special_tokens.add(tokenizer.mask_token)
+ return special_tokens
+
+def process_output(model, output, return_segments=False):
+ torch.cuda.nvtx.range_push("Process Output")
+ inference_strategy = GPTModelTextGenerationStrategy(model)
+ tokenizer = model.tokenizer
+ if output is not None:
+ decode_tokens, output_logits, full_logits = output
+ decode_tokens = decode_tokens.cpu().numpy().tolist()
+
+ # convert ids to text by applying tokenizer
+ resp_sentences = list(map(tokenizer.ids_to_text, decode_tokens))
+
+ all_offsets = []
+ resp_sentences_seg = []
+ if return_segments:
+ # segments sentences into words.
+ for decode_token in decode_tokens:
+ words = []
+ for token in decode_token:
+ if not isinstance(token, Iterable):
+ token = [token]
+ word = tokenizer.ids_to_tokens(token)
+ if isinstance(word, Iterable):
+ word = word[0]
+ if hasattr(tokenizer.tokenizer, 'byte_decoder'):
+ word = bytearray([tokenizer.tokenizer.byte_decoder[c] for c in word]).decode(
+ 'utf-8', errors='replace'
+ )
+ words.append(word)
+ resp_sentences_seg.append(words)
+
+ # offsets calculation
+ special_tokens = get_special_tokens(tokenizer)
+ for item in resp_sentences_seg:
+ offsets = [0]
+ for index, token in enumerate(item):
+ if index != len(item) - 1:
+ if token in special_tokens:
+ offsets.append(offsets[-1])
+ else:
+ offsets.append(len(token) + offsets[-1])
+ all_offsets.append(offsets)
+
+ output = {}
+ output['sentences'] = resp_sentences
+ output['tokens'] = resp_sentences_seg
+ output['logprob'] = output_logits
+ output['full_logprob'] = full_logits
+ output['token_ids'] = decode_tokens
+ output['offsets'] = all_offsets
+ output = inference_strategy.post_generation_process(output)
+ torch.cuda.nvtx.range_pop() # "Process Output"
+ return output
+
+def generate(model, inputs, cfg):
+ torch.cuda.nvtx.range_push("Prepare Batch")
+ initialize_ddp(model, cfg)
+
+ tokens_to_generate = cfg.inference.tokens_to_generate
+ min_tokens_to_generate = cfg.inference.min_tokens_to_generate
+ add_BOS = cfg.inference.add_BOS
+ all_probs = cfg.inference.all_probs
+ temperature = cfg.inference.temperature
+    is_benchmark_mode = cfg.mode == "benchmark"
+    is_accuracy_mode = cfg.mode == "accuracy"
+
+ inference_strategy = GPTModelTextGenerationStrategy(model)
+ if isinstance(inputs, tuple):
+ context_tokens_tensor, context_length_tensor = inputs
+ else:
+ context_tokens_tensor, context_length_tensor = inference_strategy.tokenize_batch(
+ inputs, tokens_to_generate, add_BOS
+ )
+
+ context_length = context_length_tensor.min().item()
+
+ batch_token_result = sample_sequence_batch(
+ model,
+ inference_strategy,
+ context_tokens_tensor,
+ context_length_tensor,
+ tokens_to_generate,
+ all_probs,
+ temperature=temperature,
+ extra={
+ "top_p": cfg.inference.top_p,
+ "top_k": cfg.inference.top_k,
+ "greedy": cfg.inference.greedy,
+ "repetition_penalty": cfg.inference.repetition_penalty,
+ "min_tokens_to_generate": min_tokens_to_generate,
+ "use_cache": cfg.use_cache,
+ "benchmark_mode": is_benchmark_mode,
+ "accuracy_mode": is_accuracy_mode,
+ "use_fp8_storage": cfg.onnx_export_options.use_fp8_storage,
+ },
+ )
+
+ tokens, context_length, _, output_logits, full_logits = batch_token_result
+
+ output = None
+ if tokens is not None:
+ output = tokens[:, :context_length], output_logits, full_logits
+ return output
+
+def full_inference(model, inputs, cfg):
+ output = generate(model, inputs, cfg)
+ if output is not None:
+        output = process_output(model, output, return_segments=(cfg.mode != "benchmark"))
+ return output
diff --git a/demo/NeMo/GPT3/frameworks.py b/demo/NeMo/GPT3/frameworks.py
new file mode 100644
index 00000000..851f4cdf
--- /dev/null
+++ b/demo/NeMo/GPT3/frameworks.py
@@ -0,0 +1,81 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import sys
+
+import omegaconf
+
+# Add syspath for custom library
+if __name__ == "__main__":
+ filepath = os.path.dirname(os.path.abspath(__file__))
+ project_root = os.path.join(filepath, os.pardir)
+ sys.path.append(project_root)
+
+from GPT3.nemo_utils import load_nemo_model
+from GPT3.GPT3ModelConfig import GPT3ModelTRTConfig
+from interface import NeMoCommand
+
+sys.path.append('../../HuggingFace') # Include HuggingFace
+from NNDF.interface import FRAMEWORK_NATIVE
+from NNDF.networks import (
+ NetworkModel,
+ NetworkModels,
+)
+
+class GPT3NeMoTorch(NeMoCommand):
+ def __init__(
+ self,
+ nemo_cfg,
+ config_class=GPT3ModelTRTConfig,
+ description="Runs framework results for GPT3 model with NeMo.",
+ **kwargs
+ ):
+ super().__init__(nemo_cfg, config_class, description, model_classes=None, **kwargs)
+ self.framework_name = FRAMEWORK_NATIVE
+
+ def setup_tokenizer_and_model(self):
+ self.nemo_cfg.runtime = 'nemo'
+ self.model = load_nemo_model(self.nemo_cfg)
+ self.tokenizer = self.model.tokenizer
+
+ torch_models = [
+ NetworkModel(
+ name=GPT3ModelTRTConfig.NETWORK_FULL_NAME, fpath=self.workspace.torch_path
+ )
+ ]
+ return NetworkModels(torch=torch_models, onnx=None, trt=None)
+
+ def process_framework_specific_arguments(self, onnx_model: str = None, **kwargs):
+ if onnx_model:
+ raise RuntimeError(
+ "native framework does not support loading an ONNX file via `onnx-model` yet. Please specify the NeMo model using `nemo-model` instead."
+ )
+
+
+# Entry point
+def getGPT3NeMoTorch():
+ config_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "../config.yaml")
+ nemo_cfg = omegaconf.OmegaConf.load(config_path)
+ return GPT3NeMoTorch(nemo_cfg)
+
+# Entry point
+RUN_CMD = getGPT3NeMoTorch()
+
+if __name__ == "__main__":
+ result = RUN_CMD()
+ print("Results: {}".format(result))
diff --git a/demo/NeMo/GPT3/lambada_dataset.py b/demo/NeMo/GPT3/lambada_dataset.py
new file mode 100644
index 00000000..a7945cec
--- /dev/null
+++ b/demo/NeMo/GPT3/lambada_dataset.py
@@ -0,0 +1,126 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import os
+import collections
+import json
+import requests
+import sys
+import torch
+from torch.nn.utils.rnn import pad_sequence
+
+# Add syspath for custom library
+if __name__ == "__main__":
+ filepath = os.path.dirname(os.path.abspath(__file__))
+ project_root = os.path.join(filepath, os.pardir)
+ sys.path.append(project_root)
+
+from nemo_export import create_dir_if_not_exist
+
+__all__ = ['Lambada']
+
+
+class Lambada():
+
+ def __init__(self, base_dir, tokens_to_generate, padding = -1, max_length = 2048):
+ assert tokens_to_generate >= 1
+ assert padding == -1 or tokens_to_generate == 1
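+        # Fixed padding is only supported when a single token is generated (the LAMBADA last-word setting).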
+ self.base_dir = base_dir
+ self.tokens_to_generate = tokens_to_generate
+ self.padding = padding
+ self.max_length = max_length
+ self.download()
+
+ def get_data_file_path(self):
+ path = os.path.join(self.base_dir, "lambada")
+ path = os.path.join(path, "lambada_test.jsonl")
+ create_dir_if_not_exist(path)
+ return path
+
+ def download(self):
+ path = self.get_data_file_path()
+ if not os.path.exists(path):
+ url = "https://github.com/cybertronai/bflm/raw/master/lambada_test.jsonl"
+ with requests.get(url) as r, open(path, 'wb') as fh:
+ fh.write(r.content)
+
+ def load(self):
+ path = self.get_data_file_path()
+ with open(path) as fh:
+ for line in fh:
+ yield json.loads(line)
+
+ def _preprocess(self, text):
+ text = text.replace("“", '"')
+ text = text.replace("”", '"')
+ text = text.replace("’", "'")
+ text = text.replace("‘", "'")
+ return text
+
+ def doc_to_text(self, doc):
+ return "\n" + self._preprocess(doc["text"].rsplit(" ", 1)[0]).strip()
+
+ def doc_to_target(self, doc):
+ split_text = doc["text"].rsplit(" ", 1)
+ if len(split_text) <= 1:
+ raise ValueError(f"Input doc '{doc}' does not have target.")
+ return " " + self._preprocess(split_text[1])
+
+ def preprocess_input(self, tokenizer, docs):
+ _Input = collections.namedtuple("_DS_Input", ["inputs", "inp_enc", "lens", "lens_pad", "conti_len"])
+ batch_size = len(docs)
+ tokens = []
+ conti_lens = []
+ lens = []
+ inp_encs = []
+ for doc in docs:
+ # Handle padded text
+ if not doc["text"]:
+ inp_enc = [0]
+ conti_len = 0
+ else:
+ text = self.doc_to_text(doc)
+ target = self.doc_to_target(doc)
+
+ context_enc = tokenizer.text_to_ids(text)
+ continuation_enc = tokenizer.text_to_ids(target)
+
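+                # Keep only the trailing (max_length + 1) tokens of context + continuation so the
+                # shifted labels still fit; conti_len records how many continuation tokens are scored.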
+ inp_enc = (context_enc + continuation_enc)[-(self.max_length + 1) :]
+ conti_len = len(continuation_enc)
+
+ inp_encs.append(inp_enc)
+ conti_lens.append(conti_len)
+ tokens.append(torch.tensor(inp_enc))
+ lens.append(len(inp_enc) - 1)
+ max_lens = max(lens)
+
+ tokens_pad = pad_sequence(tokens, batch_first=False, padding_value=tokenizer.eos_id)
+ if self.padding != -1 and max_lens % self.padding != 0:
+            # We need to align the context length to a multiple of 8 for FP8 runs using the NeMo framework.
+ extra_pad_len = self.padding - (max_lens % self.padding)
+
+ extra_pad = torch.ones(extra_pad_len, batch_size) * tokenizer.eos_id
+ extra_pad = extra_pad.type_as(tokens_pad)
+ inp_enc_pad = torch.vstack((tokens_pad, extra_pad)).T
+
+ lens_pad = max_lens + extra_pad_len
+ else:
+ inp_enc_pad = tokens_pad.T
+ lens_pad = max_lens + 1 - self.tokens_to_generate
+
+ inputs = (torch.tensor(inp_enc_pad).cuda(), (torch.ones(batch_size, dtype=torch.int32) * lens_pad).cuda())
+ return _Input(inputs=inputs, inp_enc=inp_encs, lens=lens, lens_pad=lens_pad, conti_len=conti_lens)
+
diff --git a/demo/NeMo/GPT3/nemo_utils.py b/demo/NeMo/GPT3/nemo_utils.py
new file mode 100644
index 00000000..f6d5bca7
--- /dev/null
+++ b/demo/NeMo/GPT3/nemo_utils.py
@@ -0,0 +1,161 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import gc
+import os
+import sys
+
+# Only print out error messages from NeMo
+from nemo.utils.nemo_logging import Logger as NG_LOGGER
+nemo_logger = NG_LOGGER(False)
+nemo_logger.setLevel(nemo_logger.ERROR)
+
+from nemo.utils.app_state import AppState
+from nemo.utils.model_utils import inject_model_parallel_rank
+from nemo.collections.nlp.modules.common.megatron.megatron_init import fake_initialize_model_parallel
+from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
+from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy, NLPSaveRestoreConnector
+from omegaconf import OmegaConf, open_dict
+from pytorch_lightning.trainer.trainer import Trainer
+import torch
+
+sys.path.append('../../HuggingFace') # Include HuggingFace directory.
+from NNDF.logger import G_LOGGER
+
+
+def get_computeprob_response(tokenizer, response, inputs):
+ """
+ This function is a modified version from:
+ https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/modules/common/text_generation_utils.py#L139
+
+ So parallel state does not need to be initialized before calling this function.
+ """
+ compute_prob_response = {}
+ new_token_ids = []
+ new_tokens = []
+ new_texts = []
+ log_probs = []
+ full_logprobs = []
+ offsets = []
+ for batch_id in range(len(response['tokens'])):
+ if isinstance(inputs, (list, tuple)):
+ if isinstance(inputs[0], str):
+ new_token_id = tokenizer.text_to_ids(inputs[batch_id])
+ new_text = inputs[batch_id]
+ token_len = len(new_token_id)
+ elif isinstance(inputs[0], torch.Tensor):
+ token_len = int(inputs[1][batch_id].item())
+ new_token_id = inputs[0][batch_id][:token_len].tolist()
+ new_text = tokenizer.ids_to_text(new_token_id)
+ new_token_ids.append(new_token_id)
+ new_tokens.append(response['tokens'][batch_id][:token_len])
+ new_texts.append(new_text)
+ log_probs.append(response['logprob'][batch_id][:token_len])
+ full_logprobs.append(response['full_logprob'][batch_id][:token_len])
+ offsets.append(response['offsets'][batch_id][:-1])
+ compute_prob_response['sentences'] = new_texts
+ compute_prob_response['tokens'] = new_tokens
+ compute_prob_response['token_ids'] = new_token_ids
+ compute_prob_response['logprob'] = log_probs
+ compute_prob_response['full_logprob'] = full_logprobs
+ compute_prob_response['offsets'] = offsets
+ return compute_prob_response
+
+
+def load_nemo_model(cfg, model_class=MegatronGPTModel):
+ # Trainer is required for restoring model parallel models
+ trainer = Trainer(strategy=NLPDDPStrategy(), **cfg.trainer)
+
+ if cfg.gpt_model_file and cfg.checkpoint_dir:
+        raise ValueError("NeMo model and checkpoint cannot both be set.")
+
+ if cfg.gpt_model_file:
+ save_restore_connector = NLPSaveRestoreConnector()
+ if os.path.isdir(cfg.gpt_model_file):
+ save_restore_connector.model_extracted_dir = cfg.gpt_model_file
+
+ pretrained_cfg = MegatronGPTModel.restore_from(
+ restore_path=cfg.gpt_model_file,
+ trainer=trainer,
+ return_config=True,
+ save_restore_connector=save_restore_connector,
+ )
+ OmegaConf.set_struct(pretrained_cfg, True)
+ with open_dict(pretrained_cfg):
+ pretrained_cfg.sequence_parallel = False
+ pretrained_cfg.activations_checkpoint_granularity = None
+ pretrained_cfg.activations_checkpoint_method = None
+ pretrained_cfg.precision = trainer.precision
+ if trainer.precision == "16":
+ pretrained_cfg.megatron_amp_O2 = False
+ model = model_class.restore_from(
+ restore_path=cfg.gpt_model_file,
+ trainer=trainer,
+ override_config_path=pretrained_cfg,
+ save_restore_connector=save_restore_connector,
+ )
+ G_LOGGER.info(f"{type(model)} has been successfully restored from {cfg.gpt_model_file}")
+ elif cfg.checkpoint_dir:
+        checkpoint_file = os.path.join(cfg.checkpoint_dir, cfg.checkpoint_name)
+ if not os.path.exists(checkpoint_file):
+ raise ValueError(f"File {checkpoint_file} does not exist.")
+
+ app_state = AppState()
+ if cfg.tensor_model_parallel_size > 1 or cfg.pipeline_model_parallel_size > 1:
+ app_state.model_parallel_size = cfg.tensor_model_parallel_size * cfg.pipeline_model_parallel_size
+ app_state.tensor_model_parallel_size = cfg.tensor_model_parallel_size
+ app_state.pipeline_model_parallel_size = cfg.pipeline_model_parallel_size
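+            # fake_initialize_model_parallel computes the parallel ranks without creating real
+            # process groups, so the matching sharded checkpoint can be located from a single process.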
+ (
+ app_state.tensor_model_parallel_rank,
+ app_state.pipeline_model_parallel_rank,
+ app_state.model_parallel_size,
+ app_state.data_parallel_size,
+ app_state.pipeline_model_parallel_split_rank,
+ app_state.virtual_pipeline_model_parallel_rank,
+ ) = fake_initialize_model_parallel(
+ world_size=app_state.model_parallel_size,
+ rank=trainer.global_rank,
+ tensor_model_parallel_size_=cfg.tensor_model_parallel_size,
+ pipeline_model_parallel_size_=cfg.pipeline_model_parallel_size,
+ pipeline_model_parallel_split_rank_=cfg.pipeline_model_parallel_split_rank,
+ )
+ checkpoint_path = inject_model_parallel_rank(checkpoint_file)
+ model = model_class.load_from_checkpoint(checkpoint_path, hparams_file=cfg.hparams_file, trainer=trainer)
+ G_LOGGER.info(f"{type(model)} has been successfully restored from checkpoint {checkpoint_path}")
+ else:
+ raise ValueError("Need to provide a nemo gpt model through config file.")
+
+ model.freeze()
+
+ # Have to turn off activations_checkpoint_method for inference
+ try:
+ model.model.language_model.encoder.activations_checkpoint_method = None
+ except AttributeError:
+ pass
+
+ model.eval()
+ G_LOGGER.debug(f"Model configuration: {model.cfg}")
+ G_LOGGER.debug(f"Vocabulary size: {model.tokenizer.vocab_size}")
+ return model.cuda()
+
+def release_nemo_model(model):
+    print("Releasing NeMo model.")
+ model.model.cpu()
+ del model.model
+ gc.collect()
+ torch.cuda.empty_cache()
+ model.model = None
diff --git a/demo/NeMo/GPT3/onnxrt.py b/demo/NeMo/GPT3/onnxrt.py
new file mode 100644
index 00000000..78bd0aca
--- /dev/null
+++ b/demo/NeMo/GPT3/onnxrt.py
@@ -0,0 +1,112 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import sys
+
+import onnxruntime as ort
+import onnx
+import omegaconf
+from nemo.collections.nlp.modules.common.tokenizer_utils import get_tokenizer
+from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
+
+# Add syspath for custom library
+if __name__ == "__main__":
+ filepath = os.path.dirname(os.path.abspath(__file__))
+ project_root = os.path.join(filepath, os.pardir)
+ sys.path.append(project_root)
+
+from interface import NeMoCommand, BaseModel
+from nemo_export import NeMoConverter
+from GPT3.GPT3ModelConfig import GPT3ModelTRTConfig
+
+sys.path.append('../../HuggingFace') # Include HuggingFace
+from NNDF.interface import FRAMEWORK_ONNXRT
+from NNDF.logger import G_LOGGER
+from NNDF.networks import (
+ NetworkModel,
+ NetworkModels,
+)
+
+class GPT3NeMoOnnxRT(NeMoCommand):
+ def __init__(
+ self,
+ nemo_cfg,
+ config_class=GPT3ModelTRTConfig,
+ description="Runs ONNX Runtime results for GPT3 model.",
+ **kwargs
+ ):
+ super().__init__(nemo_cfg, config_class, description, model_classes=None, **kwargs)
+ self.framework_name = FRAMEWORK_ONNXRT
+
+
+ def load_onnx_model(self):
+ G_LOGGER.info(f'Loading ONNX model from {self.nemo_cfg.onnx_model_file}')
+
+ def get_opset_version(name : str) -> int:
+            """Returns the opset version of the ONNX model at `name`.
+
+            `model` here is local in scope, so Python's garbage collector
+            reclaims it without manual memory management via `del`.
+ """
+ model = onnx.load(name, load_external_data=False)
+ return model.opset_import[0].version
+
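+        # The NeMo exporter used in this demo produces opset 17 ONNX models; fail early on anything else.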
+ assert get_opset_version(self.nemo_cfg.onnx_model_file) == 17
+ return ort.InferenceSession(self.nemo_cfg.onnx_model_file)
+
+
+ def setup_tokenizer_and_model(self):
+ self.nemo_cfg.runtime = 'onnx'
+ self.model = BaseModel()
+ self.model.cfg = self.nemo_cfg.model
+ self.model.tokenizer = get_tokenizer(tokenizer_name='megatron-gpt-345m', vocab_file=None, merges_file=None)
+
+ if not self.nemo_cfg.onnx_model_file:
+ self.nemo_cfg.onnx_model_file = os.path.join(
+ self.workspace.dpath,
+ f"onnx/model-{self.nemo_cfg.trainer.precision}.onnx",
+ )
+
+ converter = NeMoConverter(self.nemo_cfg, MegatronGPTModel)
+ if not os.path.isfile(self.nemo_cfg.onnx_model_file):
+ # Convert NeMo model to ONNX model
+ onnx_name = converter.nemo_to_onnx()
+ self.nemo_cfg.onnx_model_file = onnx_name
+
+ # The ONNX model is in opset17 by default.
+ self.model.onnxrt = self.load_onnx_model()
+ self.tokenizer = self.model.tokenizer
+ onnx_models = [
+ NetworkModel(
+ name=GPT3ModelTRTConfig.NETWORK_FULL_NAME, fpath=self.nemo_cfg.onnx_model_file,
+ )
+ ]
+ return NetworkModels(torch=None, onnx=onnx_models, trt=None)
+
+# Entry point
+def getGPT3NeMoOnnxRT():
+ config_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "../config.yaml")
+ nemo_cfg = omegaconf.OmegaConf.load(config_path)
+ return GPT3NeMoOnnxRT(nemo_cfg)
+
+# Entry point
+RUN_CMD = getGPT3NeMoOnnxRT()
+
+if __name__ == "__main__":
+ result = RUN_CMD()
+ print("Results: {}".format(result))
diff --git a/demo/NeMo/GPT3/sequence_perplexity.py b/demo/NeMo/GPT3/sequence_perplexity.py
new file mode 100644
index 00000000..9fc9ef29
--- /dev/null
+++ b/demo/NeMo/GPT3/sequence_perplexity.py
@@ -0,0 +1,76 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import math
+import numpy as np
+import torch
+
+__all__ = ['SequencePerplexity']
+
+class SequencePerplexity():
+ def __init__(self, topN):
+ super().__init__()
+ self.ppls = []
+ self.sequence_ppls = []
+ self.topN_equals = [0] * len(topN)
+ self.topN = topN
+
+ def update(self, ds_input, response, tokenizer):
+ for batch, tokens in enumerate(response['token_ids']):
+ inp_len = ds_input.lens[batch]
+ if inp_len == 0:
+ continue
+
+ conti_len = ds_input.conti_len[batch]
+
+ response_token_ids = tokens[:inp_len]
+ assert response_token_ids == ds_input.inp_enc[batch][:-1], f"Mismatch in input tokens."
+ full_log_probs = response['full_logprob'][batch][:inp_len]
+
+ # calculate ppl with whole sequence.
+ label = torch.tensor([ds_input.inp_enc[batch][1:]]).cuda()
+ log_probs = full_log_probs.unsqueeze(0).permute((0, 2, 1))
+ ppl = torch.nn.CrossEntropyLoss()(log_probs, label)
+ self.sequence_ppls.append(ppl.cpu())
+
+ # calculate topN.
+ log_probs = full_log_probs[-conti_len:]
+ conti_token_ids = ds_input.inp_enc[batch][-conti_len:]
+ conti_tokens = tokenizer.ids_to_tokens(conti_token_ids)
+
+ for index, topN in enumerate(self.topN):
+ if conti_token_ids[0] in log_probs.topk(topN, dim=-1).indices:
+ self.topN_equals[index] += 1
+
+ # calculate ppl with last token.
+ log_probs = log_probs.cpu().to(torch.float32)
+ conti_enc = torch.tensor(tokenizer.tokens_to_ids(conti_tokens))
+ conti_probs = torch.gather(log_probs, 1, conti_enc.unsqueeze(-1)).squeeze(-1)
+
+ ppl = float(conti_probs.sum())
+ self.ppls.append(ppl)
+
+ def compute(self):
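+        # self.ppls holds the summed log-probabilities of each continuation, so exp(-mean) is the
+        # aggregate last-token perplexity; self.sequence_ppls holds per-sequence cross-entropy,
+        # so exp(mean) is the full-sequence perplexity.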
+ ppls = math.exp(-np.mean(np.array(self.ppls)))
+ sequence_ppls = math.exp(np.mean(np.array(self.sequence_ppls)))
+ acc = [equals / len(self.ppls) for equals in self.topN_equals]
+ txt = []
+ for i, j in zip(self.topN, acc):
+ txt.append("acc(top{}): {:.4f}".format(i, j))
+ acc_text = ", ".join(txt)
+ return ppls, sequence_ppls, acc, acc_text
+
diff --git a/demo/NeMo/GPT3/trt.py b/demo/NeMo/GPT3/trt.py
new file mode 100644
index 00000000..189c1ba3
--- /dev/null
+++ b/demo/NeMo/GPT3/trt.py
@@ -0,0 +1,236 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import sys
+
+import omegaconf
+from nemo.collections.nlp.modules.common.tokenizer_utils import get_tokenizer
+from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
+
+# Add syspath for custom library
+if __name__ == "__main__":
+ filepath = os.path.dirname(os.path.abspath(__file__))
+ project_root = os.path.join(filepath, os.pardir)
+ sys.path.append(project_root)
+
+from nemo_export import NeMoConverter, create_dir_if_not_exist
+from GPT3.GPT3ModelConfig import GPT3ModelTRTConfig
+from GPT3.trt_utils import load_trt_model
+from interface import NeMoCommand, BaseModel
+import onnx
+
+sys.path.append('../../HuggingFace') # Include HuggingFace
+from NNDF.interface import FRAMEWORK_TENSORRT
+from NNDF.logger import G_LOGGER
+from NNDF.models import _log_fake_perf_metrics
+from NNDF.networks import (
+ NetworkModel,
+ NetworkModels,
+)
+
+class GPT3NeMoTRT(NeMoCommand):
+ def __init__(
+ self,
+ nemo_cfg,
+ config_class=GPT3ModelTRTConfig,
+ description="Runs TensorRT results for GPT3 model.",
+ **kwargs
+ ):
+ super().__init__(nemo_cfg, config_class, description, model_classes=None, **kwargs)
+ self.framework_name = FRAMEWORK_TENSORRT
+
+
+ def setup_tokenizer_and_model(self):
+ self.nemo_cfg.runtime = 'trt'
+ self.model = BaseModel()
+ self.model.cfg = self.nemo_cfg.model
+ self.model.tokenizer = get_tokenizer(tokenizer_name='megatron-gpt-345m', vocab_file=None, merges_file=None)
+
+ # Path to write new onnx models if need arises. Prevents overwrite of
+ # user-provided onnx files in case opset_version needs to be upgraded
+ # to 19 or onnx files with kv-cache needs to be written.
+ onnx_workpath = os.path.join(
+ self.workspace.dpath,
+ "onnx",
+ )
+ if self.nemo_cfg.onnx_model_file:
+ # Input by user, can be a read-only location.
+ onnx_name = self.nemo_cfg.onnx_model_file
+ else:
+ onnx_name = os.path.join(
+ onnx_workpath,
+ f"model-{self.nemo_cfg.trainer.precision}.onnx",
+ )
+ self.nemo_cfg.onnx_model_file = onnx_name
+ self.nemo_cfg.trt_export_options.timing_cache = self.timing_cache
+
+ converter = NeMoConverter(self.nemo_cfg, MegatronGPTModel)
+ if not os.path.isfile(onnx_name):
+ # Convert NeMo model to ONNX model
+ onnx_name = converter.nemo_to_onnx()
+
+ def get_opset_version(name : str) -> int:
+            """Returns the opset version of the ONNX model at `name`.
+
+            `model` here is local in scope, so Python's garbage collector
+            reclaims it without manual memory management via `del`.
+ """
+ model = onnx.load(name, load_external_data=False)
+ return model.opset_import[0].version
+
+ opset_version = get_opset_version(onnx_name)
+ if opset_version < 19:
+ opset19_onnx_name = NeMoConverter.get_opset19_onnx_fpath(
+ onnx_name, onnx_workpath
+ )
+ if not os.path.isfile(opset19_onnx_name):
+ opset19_onnx_name = NeMoConverter.onnx_to_opset19(
+ onnx_name, onnx_workpath
+ )
+
+            if opset19_onnx_name is not None:
+ onnx_name = opset19_onnx_name
+
+ # Add KV cache to ONNX model
+ kv_output_policy = "kv_new"
+
+ converter = NeMoConverter(self.nemo_cfg)
+
+ def has_kv_cache_support(
+ model_name: str, match_names=("key", "value", "kv")
+ ) -> bool:
+ """To detect onnx models with kv_cache exported, input node names
+ contain match_names.
+ """
+ model = onnx.load(model_name, load_external_data=False)
+
+ # Get network inputs.
+ input_all = [node.name for node in model.graph.input]
+ input_initializer = [node.name for node in model.graph.initializer]
+ net_input_names = list(set(input_all) - set(input_initializer))
+
+ kv_nodes = filter(
+ lambda name: any(map(lambda match: match in name, match_names)),
+ net_input_names,
+ )
+ return any(kv_nodes) and len(net_input_names) > 2
+
+ if (not self.nemo_cfg.use_cache) and (has_kv_cache_support(onnx_name)):
+ raise RuntimeError(
+ "ONNX model has been exported with kv-cache enabled, but "
+ "runtime configuration has kv-cache disabled. Consider "
+ "enabling kv-cache support via the `use-cache` option."
+ )
+
+ if self.nemo_cfg.use_cache and (not has_kv_cache_support(onnx_name)):
+ G_LOGGER.info(f"Converting {onnx_name} with KV-cache support")
+ new_dir = onnx_workpath + f"_{kv_output_policy}"
+ if self.nemo_cfg.onnx_export_options.use_fp8_storage:
+ new_dir += f"_fp8_storage"
+ onnx_output_fpath = os.path.join(new_dir, onnx_name.split("/")[-1])
+
+ if not os.path.isfile(onnx_output_fpath):
+ create_dir_if_not_exist(onnx_output_fpath)
+ converter.create_onnx(onnx_name, onnx_output_fpath, kv_output_policy)
+ onnx_name = onnx_output_fpath
+
+ if self.nemo_cfg.onnx_export_options.prune:
+ onnx_name = converter.prune_onnx(onnx_name)
+
+ # Convert ONNX model to TRT engine
+ self.nemo_cfg.trt_export_options.use_strongly_typed = self.use_strongly_typed
+ self.nemo_cfg.trt_export_options.timing_cache = self.timing_cache
+ self.nemo_cfg.trt_export_options.opt_seq_len = self.opt_seq_len
+
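+        # Compose the engine file name from the build-affecting options, e.g.
+        # "trt-bs1-opt128-kv.plan" for batch size 1, opt_seq_len 128 and kv-cache enabled.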
+ suffixes = []
+ suffixes.append("bs" + str(self.nemo_cfg.batch_size))
+ if self.nemo_cfg.trt_export_options.opt_seq_len != None:
+ suffixes.append("opt" + str(self.nemo_cfg.trt_export_options.opt_seq_len))
+ if self.nemo_cfg.use_cache:
+ suffixes.append("kv")
+ if self.nemo_cfg.onnx_export_options.use_fp8_storage:
+ suffixes.append("fp8_storage")
+ if self.nemo_cfg.trt_export_options.sparse:
+ suffixes.append("sp")
+ if not self.nemo_cfg.trt_export_options.use_strongly_typed:
+ suffixes.append("no_strongly_typed")
+ suffix = "-".join(suffixes)
+ trt_fpath = os.path.join(self.workspace.dpath, f"trt-{suffix}.plan")
+
+ if os.path.isfile(trt_fpath):
+ G_LOGGER.debug(f"TRT Engine plan exists at location {trt_fpath}.")
+ _log_fake_perf_metrics()
+ else:
+ converter.onnx_to_trt(onnx_name, trt_fpath)
+
+ self.nemo_cfg.trt_engine_file = trt_fpath
+ self.model.trt = load_trt_model(self.nemo_cfg)
+ self.tokenizer = self.model.tokenizer
+ onnx_models = [
+ NetworkModel(
+ name=GPT3ModelTRTConfig.NETWORK_FULL_NAME, fpath=self.nemo_cfg.onnx_model_file,
+ )
+ ]
+ return NetworkModels(torch=None, onnx=onnx_models, trt=None)
+
+ def add_args(self):
+ super().add_args()
+ engine_group = self._parser.add_argument_group("trt engine")
+ engine_group.add_argument(
+ "--opt-seq-len",
+ default=None,
+ help="Set optimized input sequence length to be used in engine building",
+ type=int,
+ )
+ engine_group.add_argument(
+ "--no-timing-cache",
+ default=False,
+            help="Disable the timing cache used to speed up engine building",
+ action="store_true",
+ )
+ engine_group.add_argument(
+ "--no-strongly-typed",
+ default=False,
+ help="Disable strongly typed mode in engine building",
+ action="store_true",
+ )
+
+ def process_framework_specific_arguments(
+ self,
+ opt_seq_len: int = None,
+ no_timing_cache: bool = False,
+ no_strongly_typed: bool = False,
+ **kwargs
+ ):
+ self.opt_seq_len = opt_seq_len
+ self.use_timing_cache = not no_timing_cache
+ self.use_strongly_typed = not no_strongly_typed
+ self.timing_cache = self.workspace.get_timing_cache() if self.use_timing_cache else None
+
+# Entry point
+def getGPT3NeMoTRT():
+ config_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "../config.yaml")
+ nemo_cfg = omegaconf.OmegaConf.load(config_path)
+ return GPT3NeMoTRT(nemo_cfg)
+
+# Entry point
+RUN_CMD = getGPT3NeMoTRT()
+
+if __name__ == "__main__":
+ result = RUN_CMD()
+ print("Results: {}".format(result))
diff --git a/demo/NeMo/GPT3/trt_utils.py b/demo/NeMo/GPT3/trt_utils.py
new file mode 100644
index 00000000..a146cf7e
--- /dev/null
+++ b/demo/NeMo/GPT3/trt_utils.py
@@ -0,0 +1,231 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import sys
+
+import numpy as np
+import tensorrt as trt
+import torch
+from transformers.configuration_utils import PretrainedConfig
+
+sys.path.append('../../HuggingFace') # Include HuggingFace directory
+from NNDF.models import TRTEngineFile
+from NNDF.networks import NetworkMetadata
+from NNDF.tensorrt_utils import TRTNativeRunner
+from NNDF.logger import G_LOGGER
+from Seq2Seq.export import DecoderTRTEngine
+
+from HuggingFace.NNDF.tensorrt_utils import TRTNativeRunner, CUASSERT
+from cuda import cudart
+
+
+class GPTTRTDecoder(TRTNativeRunner):
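+    """TensorRT runner for the NeMo GPT decoder.
+
+    Wraps a built TRT engine and, when kv-cache is enabled, also manages the
+    per-layer key/value buffers and a separate execution context for the
+    context (first decoding) phase.
+    """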
+
+ INPUT_IDS_INDEX = 0
+ POSITION_IDS_INDEX = 1
+ ATTENTION_MASK_INDEX = 2
+
+ def __init__(
+ self,
+ trt_engine_file: TRTEngineFile,
+ use_cache: bool,
+ use_fp8_storage: bool,
+ cfg,
+ network_metadata: NetworkMetadata = None,
+ hf_config: PretrainedConfig = None,
+ ):
+ super().__init__(trt_engine_file, network_metadata, hf_config)
+ self.use_cache = use_cache
+ self.use_fp8_storage = use_fp8_storage
+ if self.use_cache:
+ self._set_context_mode_trt_context()
+ self.io_names = set()
+ self.input_tensor_names = set()
+ for i in range(self.trt_engine.num_io_tensors):
+ tensor_name = self.trt_engine.get_tensor_name(i)
+ self.io_names.add(tensor_name)
+ if self.trt_engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
+ self.input_tensor_names.add(tensor_name)
+
+ self.cfg = cfg
+ logits_size = self.cfg.batch_size * self.cfg.model.max_seq_len * self.cfg.model.vocab_size
+
+ self.batch_size = self.cfg.batch_size
+ self.max_seq_len = self.cfg.model.max_seq_len
+ self.num_layers = self.cfg.model.num_layers
+ self.nb_heads = self.cfg.model.nb_heads
+ self.head_size = self.cfg.model.head_size
+
+ dtype = self.get_torch_type(self.get_output_name())
+ self.logits = torch.zeros(logits_size, dtype=dtype).contiguous().cuda()
+
+
+ self.init_kv_cache()
+ self.past_decoder_length = 0
+
+        # Set the next input shape while the GPU kernel is executing.
+        # Use a dict to record which input shapes have changed.
+ self.input_shape_change_record = dict()
+
+ def init_kv_cache(self):
+ # kv cache buffer
+ self.attention_kv_cache_buffer = dict()
+ cache_dtype = torch.float16
+ if self.use_fp8_storage:
+ cache_dtype = torch.uint8
+ for i in range(self.num_layers):
+ for code in ["key", "value"]:
+ attention_kv_cache_name = self.make_kv_cache_name(i, code)
+ self.attention_kv_cache_buffer[attention_kv_cache_name] = torch.empty(
+ self.max_seq_len,
+ self.batch_size,
+ self.nb_heads,
+ self.head_size,
+ dtype=cache_dtype,
+ device=torch.cuda.current_device(),
+ ).contiguous().cuda()
+
+
+ def make_kv_cache_name(self, layer, code):
+ return f"key_values.{layer}.decoder.{code}"
+
+ def _set_context_mode_trt_context(self):
+ # Create TRT context for context mode (1st decoder run) with optimization profile index = 1
+ self.context_trt_context = self.trt_engine.create_execution_context()
+ self.context_trt_context.set_optimization_profile_async(1, self.stream)
+
+ def get_torch_type(self, name):
+ trt_type = self.trt_engine.get_tensor_dtype(name)
+ mapping = {
+ trt.float32: torch.float32,
+ trt.float16: torch.float16,
+ trt.int8: torch.int8,
+ trt.int32: torch.int32,
+ trt.int64: torch.int64,
+ trt.bool: torch.bool,
+ trt.uint8: torch.uint8,
+ trt.bfloat16: torch.bfloat16,
+ }
+ if trt_type in mapping:
+ return mapping[trt_type]
+ raise ValueError(f"Got unexpected tensorrt dtype {trt_type} in get_torch_type().")
+
+ def get_input_ids_name(self):
+ return self.trt_engine.get_tensor_name(self.INPUT_IDS_INDEX)
+
+ def has_position_ids(self):
+        # If the input at POSITION_IDS_INDEX is 2-dimensional, assume it is position_ids.
+ return len(self.trt_engine.get_tensor_shape(self.trt_engine.get_tensor_name(self.POSITION_IDS_INDEX))) == 2
+
+ def get_position_ids_name(self):
+ if self.has_position_ids():
+ return self.trt_engine.get_tensor_name(self.POSITION_IDS_INDEX)
+ else:
+ return None
+
+ def get_output_name(self):
+ return "logits"
+
+ def has_attention_mask(self):
+ if self.ATTENTION_MASK_INDEX < self.trt_engine.num_io_tensors:
+ return self.trt_engine.get_tensor_name(self.ATTENTION_MASK_INDEX) == "attention_mask"
+ return False
+
+ def get_attention_mask_name(self):
+ if self.has_attention_mask():
+ return self.trt_engine.get_tensor_name(self.ATTENTION_MASK_INDEX)
+ return None
+
+ def run(self, output_name, io_descs, seq_len, context_mode=False):
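+        """Execute one engine run.
+
+        `io_descs` maps tensor names to (device address, shape) pairs; when
+        kv-cache is enabled, the past/new kv-cache bindings are added to it
+        here before execution. Returns the logits tensor viewed with the
+        engine's output shape.
+        """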
+ torch.cuda.nvtx.range_push("TRT Setup")
+ if self.use_cache:
+ if context_mode:
+ self.past_decoder_length = 0
+ else:
+ # When kv-cache is used, seq_len is always 1 in Generation phase.
+ seq_len = 1
+ cur_shape = (self.past_decoder_length, self.batch_size, self.nb_heads, self.head_size)
+ new_shape = (seq_len, self.batch_size, self.nb_heads, self.head_size)
+ assert self.past_decoder_length + seq_len < self.max_seq_len
+ offset = self.batch_size*self.nb_heads*self.head_size*self.past_decoder_length
+ for i in range(self.num_layers):
+ for code in ["key", "value"]:
+ attention_kv_cache_name = self.make_kv_cache_name(i, code)
+ cur_address = self.attention_kv_cache_buffer[attention_kv_cache_name].data_ptr()
+                    # The new kv address starts at the end of the past kv-cache data.
+ io_descs[f"past_{attention_kv_cache_name}"] = (cur_address, cur_shape)
+ new_address = cur_address + offset*self.attention_kv_cache_buffer[attention_kv_cache_name].element_size()
+ modifier = ""
+ if self.use_fp8_storage:
+ modifier = "_qfp8"
+ new_kv_name = f"new_{attention_kv_cache_name}{modifier}"
+ io_descs[new_kv_name] = (new_address, new_shape)
+ self.past_decoder_length += seq_len
+ else:
+ self.past_decoder_length = 0
+ # Set active optimization profile and active execution context.
+ self.trt_context.set_optimization_profile_async(self.profile_idx, self.stream)
+ active_context = self.trt_context
+ if context_mode and self.use_cache:
+ active_context = self.context_trt_context
+
+ # Set up input bindings.
+ for name, tensor_shape in io_descs.items():
+ active_context.set_tensor_address(name, tensor_shape[0])
+ if name in self.input_tensor_names:
+ if name in self.input_shape_change_record and \
+ self.input_shape_change_record[name][0] == active_context and \
+ self.input_shape_change_record[name][1] == tensor_shape[1]:
+ continue
+ else:
+ active_context.set_input_shape(name, tensor_shape[1])
+ elif self.use_cache:
+ pass
+ else:
+ assert False, "All tensors must be inputs for non-KV mode"
+ assert active_context.all_shape_inputs_specified
+
+ # Set up output bindings.
+ assert output_name == self.get_output_name()
+ engine_out_torch_type = self.get_torch_type(output_name)
+ if self.logits.dtype != engine_out_torch_type:
+ raise ValueError(f"Output data type does not match, {self.logits.dtype} vs. {engine_out_torch_type}.")
+ shape = active_context.get_tensor_shape(output_name)
+ active_context.set_tensor_address(output_name, self.logits.data_ptr())
+
+
+ # Execute inference.
+ torch.cuda.nvtx.range_pop() # "TRT Setup"
+ active_context.execute_async_v3(self.stream)
+ if not context_mode and self.use_cache:
+ self.input_shape_change_record.clear()
+ for i in range(self.num_layers):
+ for code in ["key", "value"]:
+ next_past_shape = (self.past_decoder_length, self.batch_size, self.nb_heads, self.head_size)
+ attention_kv_cache_name = self.make_kv_cache_name(i, code)
+                    # Set the next iteration's input shape while the CPU is idle.
+ active_context.set_input_shape(f"past_{attention_kv_cache_name}", next_past_shape)
+ self.input_shape_change_record[f"past_{attention_kv_cache_name}"] = [active_context, next_past_shape]
+ CUASSERT(cudart.cudaStreamSynchronize(self.stream))
+ if len(shape) != 3:
+ raise ValueError("Output must have a dimension of 3.")
+ output = self.logits[:shape[0] * shape[1] * shape[2]].view(tuple(shape))
+ return output
+
+def load_trt_model(cfg):
+ G_LOGGER.info(f'Loading TensorRT engine from {cfg.trt_engine_file} with use_cache={cfg.use_cache}, use_fp8_storage={cfg.onnx_export_options.use_fp8_storage} ')
+ trt_engine_file = DecoderTRTEngine(cfg.trt_engine_file)
+ return GPTTRTDecoder(trt_engine_file, cfg.use_cache, cfg.onnx_export_options.use_fp8_storage, cfg)
diff --git a/demo/NeMo/README.md b/demo/NeMo/README.md
new file mode 100644
index 00000000..44f183dd
--- /dev/null
+++ b/demo/NeMo/README.md
@@ -0,0 +1,156 @@
+# TensorRT FP8 Inference for NeMo models
+**Deprecation:** For all users using TensorRT to accelerate Large Language Model inference, please use [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/). TensorRT-LLM covers the full model range and functionality of the HuggingFace and NeMo demos, and adds further optimizations and features (e.g. model quantization, in-flight batching), multi-GPU support, broader model coverage, and much better inference performance. The HuggingFace and NeMo demos will no longer be maintained, and they will be removed from OSS in the TRT 10.0 release.
+
+This repository demonstrates TensorRT inference with NeMo Megatron models in FP8/FP16/BF16 precision.
+
+Currently, this repository supports [NeMo GPT](https://huggingface.co/nvidia/nemo-megatron-gpt-5B/tree/fp8) models only.
+
+# Environment Setup
+It's recommended to run inside a container to avoid conflicts when installing dependencies. Please check out [`NGC TensorRT`](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt/tags) and find a container with TensorRT 9.0 or above. A GPU with compute capability 8.9 or above is required to run the demo with FP8 precision.
+
+```
+# Run inside a TensorRT container
+sh install.sh [--deps <directory>] [-j <jobs>] [--ninja]
+```
+
+All arguments are optional. `--deps` indicates the relative dependency download directory, `-j` indicates the number of parallel jobs for building, and `--ninja` installs the `ninja` build system, which can speed up installation. See `sh install.sh --help` for more details on the arguments.
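+
+For example, a typical invocation inside the container might look like the following (the dependency directory and job count below are just illustrative values):
+
+```
+sh install.sh --deps temp -j 8 --ninja
+```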
+
+> The script will install required dependencies and it can take around 30 minutes or more.
+
+**Please note that the [HuggingFace demo directory](demo/HuggingFace) needs to be visible when running this demo, so utility functions can be correctly imported.**
+
+# File Structure
+This demo follows a similar structure and command-line interface to the [HuggingFace demo](/demo/HuggingFace).
+```
+.
+├── GPT3 # GPT3 directory
+│ ├── GPT3ModelConfig.py # model configuration and variant-specific parameters
+│ ├── frameworks.py # NeMo PyTorch inference script
+│ ├── onnxrt.py # OnnxRT inference script
+│ ├── trt.py # TensorRT inference script
+│ ├── decoding.py # main inference logic for all runtimes
+│ └── ... # files with utility functions for model export and inference
+├── config.yaml # full configuration for model export and inference
+├── interface.py # definitions of setup functions
+├── nemo_export.py # export functions for NeMo model -> ONNX model -> TRT engine
+└── run.py # main entry script
+```
+
+# Overview
+
+This demo contains two scripts `run.py` and `nemo_export.py`. Script `run.py` accepts a NeMo model or an ONNX model as input, and performs end-to-end inference with various actions specified by the user. Script `nemo_export.py` accepts a NeMo model or an ONNX model as input, and exports the input to an ONNX model or a TensorRT engine.
+
+# How to run inference
+The `run` action will run end-to-end inference on sentences specified in [config.yaml](/demo/NeMo/config.yaml). A model, a variant, and precision are required to run this command.
+```
+python3 run.py run GPT3 --variant gpt-5b --working-dir $(pwd)/temp --fp8 --bf16 --nemo-model=
+```
+
+Expected output for the second sentence:
+```
+Batch 1: {'sentences': ['TensorRT is a Deep Learning compiler used for deep learning. It is a compiler for TensorFlow, CNTK, and Torch. It is a compiler for the TensorFlow, CNTK,'],
+ 'tokens': [['<|endoftext|>', 'T', 'ensor', 'RT', ' is', ' a', ' Deep', ' Learning', ' compiler', ' used', ' for', ' deep', ' learning', '.', ' It', ' is', ' a', ' compiler', ' for', ' T', 'ensor', 'Flow', ',', ' C', 'NT', 'K', ',', ' and', ' Torch', '.', ' It', ' is', ' a', ' compiler', ' for', ' the', ' T', 'ensor', 'Flow', ',', ' C', 'NT', 'K', ',']],
+ 'logprob': tensor([[-4.6415e+00, -6.9270e+00, -7.4458e+00, -1.9856e+00, -5.9787e-01,
+ -8.1058e+00, -7.9629e-02, -5.8013e+00, -5.5222e+00, -1.4401e+00,
+ -5.5644e+00, -3.3747e-01, -3.3463e+00, -1.1306e+00, -1.3685e+00,
+ -1.7793e+00, -2.8960e+00, -1.4127e+00, -2.3209e+00, -7.3454e-04,
+ -9.8682e-02, -1.3268e+00, -2.1373e+00, -3.9281e-01, -6.5222e-04,
+ -2.9425e-01, -1.4167e+00, -1.8416e+00, -9.2462e-01, -1.4805e+00,
+ -1.4299e+00, -2.0632e+00, -2.9947e+00, -9.1487e-01, -2.6651e+00,
+ -2.2772e+00, -4.7057e-03, -2.2852e-01, -2.4777e+00, -2.4731e-01,
+ -7.0602e-03, -4.7339e-04, -1.1645e-01]], device='cuda:0'),
+ 'full_logprob': None,
+ 'token_ids': [[50256, 51, 22854, 14181, 318, 257, 10766, 18252, 17050, 973, 329, 2769, 4673, 13, 632, 318, 257, 17050, 329, 309, 22854, 37535, 11, 327, 11251, 42, 11, 290, 34868, 13, 632, 318, 257, 17050, 329, 262, 309, 22854, 37535, 11, 327, 11251, 42, 11]],
+ 'offsets': [[0, 0, 1, 6, 8, 11, 13, 18, 27, 36, 41, 45, 50, 59, 60, 63, 66, 68, 77, 81, 83, 88, 92, 93, 95, 97, 98, 99, 103, 109, 110, 113, 116, 118, 127, 131, 135, 137, 142, 146, 147, 149, 151, 152]]}
+```
+
+# How to run with various configurations
+- FP8, FP16, and BF16 precisions are supported, and they can be set through `--fp8`, `--fp16`, and `--bf16` respectively. Currently, the script has constraints on how precisions are specified, and supported combinations are:
+ 1. Pure FP16: `--fp16` (default)
+ 2. Pure BF16: `--bf16`
+ 3. FP8-FP16: `--fp8 --fp16`
+ 4. FP8-BF16: `--fp8 --bf16`
+
+- `--nemo-model=` or `--nemo-checkpoint=` can be used to load a NeMo model or checkpoint from a specified path, respectively. If these arguments are not provided, a NeMo model will be downloaded (and cached/re-used for subsequent runs) in the working directory.
+
+- K-V cache can be enabled through `--use-cache`
+
+- Batch size can be changed through `--batch-size=`
+
+- The default max sequence length is `256` and can be changed through `--max-seq-len=`; a combined example is shown below.
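+
+For example, a run combining several of the options above might look like this (the `--nemo-model` path is a placeholder):
+
+```
+python3 run.py run GPT3 --variant gpt-5b --working-dir $(pwd)/temp --fp8 --bf16 --use-cache --batch-size=2 --max-seq-len=512 --nemo-model=<path/to/model.nemo>
+```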
+
+# How to run performance benchmark
+The `benchmark` action will run inference with specified input and output sequence lengths multiple times.
+```
+python3 run.py benchmark GPT3 --variant gpt-5b --working-dir $(pwd)/temp --fp8 --bf16 --nemo-model= --batch-size=16 --input-seq-len=128 --output-seq-len=20 --use-cache --warmup=10 --iterations=100
+```
+
+Expected output for `trt`:
+```
+***************************
+Running 100 iterations with batch size: 16, input sequence length: 128 and output sequence length: 20
+[E2E inference] Total Time: 11.55453 s, Average Time: 0.11555 s, 95th Percentile Time: 0.11581 s, 99th Percentile Time: 0.11587 s, Throughput: 2769.48 tokens/s
+[Without tokenizer] Total Time: 10.44539 s, Average Time: 0.10445 s, 95th Percentile Time: 0.10459 s, 99th Percentile Time: 0.10465 s, Throughput: 3063.55 tokens/s
+***************************
+```
+
+Expected output for `frameworks`:
+```
+***************************
+Running 100 iterations with batch size: 16, input sequence length: 128 and output sequence length: 20
+[E2E inference] Total Time: 55.23503 s, Average Time: 0.55235 s, 95th Percentile Time: 0.55525 s, 99th Percentile Time: 0.56992 s, Throughput: 579.34 tokens/s
+[Without tokenizer] Total Time: 54.06591 s, Average Time: 0.54066 s, 95th Percentile Time: 0.54369 s, 99th Percentile Time: 0.55839 s, Throughput: 591.87 tokens/s
+***************************
+```
+
+# How to run accuracy check
+The `accuracy` action will run an accuracy check on a dataset. The default is the [LAMBADA](https://paperswithcode.com/dataset/lambada) dataset.
+```
+python3 run.py accuracy GPT3 --variant gpt-5b --working-dir $(pwd)/temp --fp8 --bf16 --nemo-model= --use-cache
+```
+
+Expected output for `trt`:
+```
+***************************
+Lambada ppl(last token): 4.4756, ppl(sequence): 18.3254, acc(top1): 0.6722, acc(top3): 0.8597, acc(top5): 0.9076
+***************************
+```
+
+Expected output for `frameworks`:
+```
+***************************
+Lambada ppl(last token): 4.4669, ppl(sequence): 18.3161, acc(top1): 0.6765, acc(top3): 0.8612, acc(top5): 0.9082
+***************************
+```
+
+# How to export a NeMo model to ONNX
+NeMo to ONNX conversion consists of 3 steps:
+1. Export ONNX from NeMo.
+2. NeMo uses TransformerEngine to export FP8 models to ONNX (step 1) and the exported ONNX has custom TensorRT Q/DQ nodes. Script `convert_te_onnx_to_trt_onnx.py` can be used to convert the custom operators into standard opset19 ONNX Q/DQ nodes.
+3. Add KV-cache inputs and outputs to the exported ONNX, so it is faster when performing inference on the model.
+
+`nemo_export.py` has the `--opset19` and `--use-cache` options to decide whether to perform steps 2 and 3, respectively:
+```
+python3 nemo_export.py --nemo-model=model.nemo --onnx=onnx/model.onnx --opset19 --use-cache
+```
+`--extra-configs` can be used to specify configs that are defined in `config.yaml` but not exposed through the existing command-line interface.
+Please specify `--help` to see more options.
+
+
+# How to run sparsity for benchmark
+
+*Note: this is for performance analysis only. The pruned model should not be used for accuracy purposes unless it was fine-tuned for sparsity. Pruning may take minutes or hours depending on the model size.*
+
+
+1. Enable sparsity knobs in `config.yaml` (a reference snippet is shown at the end of this section):
+ * Set `onnx_export_options.prune` to `True` to enable pruning of the ONNX model.
+ * Set `trt_export_options.sparse` to `True` to enable sparse tactics profiling in TensorRT.
+2. Run the scripts. You should see logs like the ones below.
+
+```
+[2023-07-28 00:15:03,015][OSS][INFO] Prune ONNX model with: polygraphy surgeon prune ${OSS_ROOT}/demo/NeMo/temp/gpt-5b/GPT3-gpt-5b-fp8-fp16-ms256/onnx/model-16.opset19.onnx -o ${OSS_ROOT}/demo/NeMo/temp/gpt-5b/GPT3-gpt-5b-fp8-fp16-ms256/onnx/pruned.model-16.opset19.onnx --save-external-data ${OSS_ROOT}/demo/NeMo/temp/gpt-5b/GPT3-gpt-5b-fp8-fp16-ms256/onnx/pruned.model-16.opset19.onnx_data
+[2023-07-28 00:15:03,016][OSS][INFO] This may take a while...
+...
+
+[2023-07-28 03:36:52,307][OSS][DEBUG] trtexec --onnx=${OSS_ROOT}/demo/NeMo/temp/gpt-5b/GPT3-gpt-5b-fp8-fp16-ms256/onnx/pruned.model-16.opset19.onnx --minShapes=input_ids:1x1,position_ids:1x1 --optShapes=input_ids:1x128,position_ids:1x128 --maxShapes=input_ids:1x256,position_ids:1x256 --fp8 --fp16 --sparsity=enable --timingCacheFile=functional.cache
+```
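+
+For reference, a minimal sketch of the relevant `config.yaml` knobs with both options enabled looks like this (all other keys keep their defaults):
+
+```
+onnx_export_options:
+  prune: True   # prune the ONNX model for the Sparse Tensor Cores 2:4 pattern
+trt_export_options:
+  sparse: True  # enable sparse tactics profiling in TensorRT
+```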
diff --git a/demo/NeMo/apex.patch b/demo/NeMo/apex.patch
new file mode 100644
index 00000000..daa1b615
--- /dev/null
+++ b/demo/NeMo/apex.patch
@@ -0,0 +1,29 @@
+diff --git a/setup.py b/setup.py
+index cb1a790..949f877 100644
+--- a/setup.py
++++ b/setup.py
+@@ -29,15 +29,15 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
+ print("\nCompiling cuda extensions with")
+ print(raw_output + "from " + cuda_dir + "/bin\n")
+
+- if (bare_metal_version != torch_binary_version):
+- raise RuntimeError(
+- "Cuda extensions are being compiled with a version of Cuda that does "
+- "not match the version used to compile Pytorch binaries. "
+- "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda)
+- + "In some cases, a minor-version mismatch will not cause later errors: "
+- "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. "
+- "You can try commenting out this check (at your own risk)."
+- )
++ # if (bare_metal_version != torch_binary_version):
++ # raise RuntimeError(
++ # "Cuda extensions are being compiled with a version of Cuda that does "
++ # "not match the version used to compile Pytorch binaries. "
++ # "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda)
++ # + "In some cases, a minor-version mismatch will not cause later errors: "
++ # "https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. "
++ # "You can try commenting out this check (at your own risk)."
++ # )
+
+
+ def raise_if_cuda_home_none(global_option: str) -> None:
diff --git a/demo/NeMo/config.yaml b/demo/NeMo/config.yaml
new file mode 100644
index 00000000..2b1888bb
--- /dev/null
+++ b/demo/NeMo/config.yaml
@@ -0,0 +1,87 @@
+runtime: null
+gpt_model_file: null # GPT nemo file path
+onnx_model_file: null # ONNX file path
+trt_engine_file: null # TRT engine file path
+
+# Parameters for loading from a checkpoint
+checkpoint_dir: null # Path to a folder that contains a .ckpt file
+checkpoint_name: null # Name of the .ckpt file within the checkpoint_dir.
+hparams_file: null # Path to a .yaml file that contains the hyperparameters of the checkpoint.
+
+batch_size: 1
+use_cache: True
+use_one_input: False # export ONNX model with only one input
+prompts: # prompts for GPT inference
+ - "How are you?"
+ - "TensorRT is a Deep Learning compiler used for deep learning."
+
+mode: 'inference' # Could change to accuracy or benchmark
+
+inference:
+  greedy: True # Whether to use greedy decoding instead of sampling
+ top_k: 0 # The number of highest probability vocabulary tokens to keep for top-k-filtering.
+ top_p: 0.9 # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
+ temperature: 1.0 # sampling temperature
+  add_BOS: True # add the bos token at the beginning of the prompt
+ tokens_to_generate: 30 # The maximum length of the sequence to be generated.
+  all_probs: False # whether to return the log prob for all the tokens in the vocab
+ repetition_penalty: 1.2 # The parameter for repetition penalty. 1.0 means no penalty.
+ min_tokens_to_generate: 0 # The minimum length of the sequence to be generated.
+ compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False
+ seed: 1234
+
+accuracy:
+ dataset: Lambada
+ metric: Perplexity
+ top_n: 1,3,5
+ tokens_to_generate: 5
+
+benchmark:
+ input_seq_len: 20
+ output_seq_len: 20
+
+# for nemo to onnx export
+onnx_export_options:
+ runtime_check: False
+ verbose: False
+ onnx_opset: 17
+ do_constant_folding: True
+ cache_support: False
+ prune: False # Prune the ONNX model for Sparse Tensor Cores 2:4 pattern
+ device: 'cuda'
+ check_tolerance: 0.01
+ use_fp8_storage: False
+ quantize_bmms: False
+
+# for onnx to trt export
+trt_export_options:
+ opt_seq_len: 128 # define the optimized sequence length
+ use_tf32: True
+ use_fp16: False
+ use_fp8: False
+ use_bf16: False
+  use_strongly_typed: True # enabling strongly typed mode invalidates the `use_[fp8|fp16|bf16]` flags.
+  sparse: False # enable sparse tactics in the TRT engine builder
+ timing_cache: 'functional.cache'
+
+trainer:
+ devices: 1
+ num_nodes: 1
+ accelerator: gpu
+ logger: False # logger provided by exp_manager
+ precision: 32 # 16, 32, or bf16
+
+tensor_model_parallel_size: 1
+pipeline_model_parallel_size: 1
+pipeline_model_parallel_split_rank: 0 # used for encoder and decoder model (0 for others)
+
+# model architecture
+model:
+ max_seq_len: 256 # define the max sequence length for attention mask
+ encoder_seq_length: 2048
+ max_position_embeddings: ${.encoder_seq_length}
+ num_layers: 24
+ hidden_size: 4096
+ nb_heads: 32
+ head_size: 128
+ vocab_size: 50304
diff --git a/demo/NeMo/install.sh b/demo/NeMo/install.sh
new file mode 100644
index 00000000..277f250a
--- /dev/null
+++ b/demo/NeMo/install.sh
@@ -0,0 +1,485 @@
+#!/bin/sh
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Sourcing messes up the directory detection with readlink.
+if [ ! "${0##*/}" = "install.sh" ]; then
+ echo "Please run this install script, don't source it." >&2
+ echo "Use -h for usage and help." >&2
+ return 1
+fi
+
+NEMO_DIR=$(dirname "$(readlink -f "$0")")
+DEMO_DIR=$(dirname "${NEMO_DIR}")
+SCRIPT_DIR=$(dirname "${DEMO_DIR}")/scripts
+
+DEPENDENCIES_DIR="temp"
+BUILD_SRCLIBS=1
+BUILD_NINJA=0
+ARG_JOBS=1
+ARG_HELP=0
+
+install_essential_tools() {
+ pip_not_found=$(pip --version 2>&1 | grep -o "not found")
+ if [ "$pip_not_found" != "" ]; then
+ echo " > Installing pip..."
+ apt-get update
+ apt-get install -y python3-dev
+ cd "${1}" || exit
+ if [ ! -f "get-pip.py" ]; then
+ apt-get install -y wget
+ wget https://bootstrap.pypa.io/get-pip.py
+ fi
+ python3 get-pip.py
+ cd ..
+ fi
+
+ git_not_found=$(git --version 2>&1 | grep -o "not found")
+ if [ "$git_not_found" != "" ]; then
+ echo " > Installing git..."
+ apt-get update
+ apt-get install -y git
+ fi
+}
+
+install_ninja() {
+ if [ ! -d "ninja" ]; then
+ git clone https://github.com/ninja-build/ninja.git
+ fi
+ cd ninja || exit
+ git checkout v1.11.1
+
+ if [ ! -x "./ninja" ]; then
+ CMD="python3 configure.py --bootstrap"
+ echo " >> ${CMD}"
+ eval "${CMD}"
+ unset CMD
+ else
+ echo " > ninja already built!"
+ fi
+
+ PATH_WITH_NINJA="$(pwd):${PATH}"
+ # Path exported for the current program scope only.
+ export PATH="${PATH_WITH_NINJA}"
+ unset PATH_WITH_NINJA
+ cd ..
+}
+
+PACKAGE_NEEDS_REINSTALL=0
+
+check_if_managed_install() {
+ PACKAGE_NEEDS_REINSTALL=0
+ dist_path="${1}"
+ # https://packaging.python.org/en/latest/specifications/direct-url/
+ if [ ! -f "${dist_path}/direct_url.json" ]; then
+ PACKAGE_NEEDS_REINSTALL=1
+ return
+ fi
+ if [ "$(grep -c "${NEMO_DIR}" "${dist_path}/direct_url.json")" != "1" ]; then
+ PACKAGE_NEEDS_REINSTALL=1
+ fi
+}
+
+apex_install_logic() {
+ if [ ! -d "apex" ]; then
+ git clone https://github.com/NVIDIA/apex.git
+ fi
+
+ cd apex || exit
+ APEX_PATH="$(pwd)"
+ git config --global --add safe.directory "${APEX_PATH}"
+ unset APEX_PATH
+
+ git checkout 5b5d41034b506591a316c308c3d2cd14d5187e23
+ git apply "${NEMO_DIR}"/apex.patch # Bypass CUDA version check in apex
+
+ torchcppext=$(pip show torch | grep Location | cut -d' ' -f2)"/torch/utils/cpp_extension.py"
+ if [ ! -f "$torchcppext" ]; then
+ echo "Could not locate torch installation using pip"
+ exit 1
+ fi
+ sed -i 's/raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))/pass/' "$torchcppext" # Bypass CUDA version check in torch
+ unset torchcppext
+
+ CMD="MAX_JOBS=${ARG_JOBS} python3 setup.py bdist_wheel -v --cpp_ext --cuda_ext --fast_layer_norm --distributed_adam --deprecated_fused_adam"
+ echo " >> ${CMD}"
+ eval "${CMD}"
+ unset CMD
+
+ python3 -m pip install "$(find './dist' -name '*.whl' | head -n1)"
+ cd ../
+}
+
+check_if_apex_needs_reinstall() {
+ apex_loc="$(pip show apex | grep '^Location' | awk '{print $2}')"
+ apex_dist_loc="$(find "${apex_loc}" -depth -maxdepth 1 -name 'apex*dist-info' -type d | head -n1)"
+
+ check_if_managed_install "${apex_dist_loc}"
+ apex_needs_reinstall=${PACKAGE_NEEDS_REINSTALL}
+ echo "${apex_needs_reinstall}"
+
+ unset apex_dist_loc
+ unset apex_loc
+}
+
+install_apex() {
+ has_apex=$(pip list | grep "^apex " | grep "apex" -o | awk '{print $1}' | awk '{print length}')
+ apex_needs_reinstall=0
+
+ if [ "$has_apex" != "4" ]; then
+ apex_install_logic
+ else
+ check_if_apex_needs_reinstall
+ if [ "$apex_needs_reinstall" != "0" ]; then
+ echo " > Reinstalling Apex per demo version..."
+ python3 -m pip uninstall -y apex
+ apex_install_logic
+ else
+ echo " > Apex already installed!"
+ fi
+ fi
+ unset apex_needs_reinstall
+ unset has_apex
+}
+
+megatron_install_logic() {
+ if [ ! -d "Megatron-LM" ]; then
+ git clone -b main https://github.com/NVIDIA/Megatron-LM.git
+ fi
+
+ cd Megatron-LM || exit
+ MEGATRON_PATH="$(pwd)"
+ git config --global --add safe.directory "${MEGATRON_PATH}"
+ unset MEGATRON_PATH
+
+ git checkout 992da75a1fd90989eb1a97be8d9ff3eca993aa83
+ CMD="python3 -m pip install ./"
+ echo " >> ${CMD}"
+ eval "${CMD}"
+ unset CMD
+ cd ../
+}
+
+check_if_megatron_needs_reinstall() {
+ megatron_loc="$(pip show megatron-core | grep '^Location' | awk '{print $2}')"
+ megatron_dist_loc="$(find "${megatron_loc}" -depth -maxdepth 1 -name 'megatron*dist-info' -type d | head -n1)"
+
+ check_if_managed_install "${megatron_dist_loc}"
+ megatron_needs_reinstall=${PACKAGE_NEEDS_REINSTALL}
+
+ unset megatron_dist_loc
+ unset megatron_loc
+}
+
+install_megatron() {
+ has_megatron=$(pip list | grep "^megatron-core " | grep "megatron-core" -o | awk '{print $1}' | awk '{print length}')
+ megatron_needs_reinstall=0
+
+ if [ "$has_megatron" != "13" ]; then
+ megatron_install_logic
+ else
+ check_if_megatron_needs_reinstall
+ if [ "$megatron_needs_reinstall" != "0" ]; then
+ echo " > Reinstalling Megatron per demo version..."
+ python3 -m pip uninstall -y megatron-core
+ megatron_install_logic
+ else
+ echo " > Megatron already installed!"
+ fi
+ fi
+ unset megatron_needs_reinstall
+ unset has_megatron
+}
+
+flash_attention_install_logic() {
+ if [ ! -d "flash-attention" ]; then
+ git clone https://github.com/HazyResearch/flash-attention.git
+ fi
+
+ cd flash-attention || exit
+ FLASH_ATTENTION_PATH="$(pwd)"
+ git config --global --add safe.directory "${FLASH_ATTENTION_PATH}"
+ unset FLASH_ATTENTION_PATH
+
+ git checkout v1.0.6
+ CMD="MAX_JOBS=${ARG_JOBS} python3 setup.py bdist_wheel"
+ echo " >> ${CMD}"
+ eval "${CMD}"
+ unset CMD
+ python3 -m pip install "$(find './dist' -name '*.whl' | head -n1)"
+ cd ..
+}
+
+check_if_flash_attention_needs_reinstall() {
+ flash_attn_loc="$(pip show flash-attn | grep '^Location' | awk '{print $2}')"
+ flash_attn_dist_loc="$(find "${flash_attn_loc}" -depth -maxdepth 1 -name 'flash_attn*dist-info' -type d | head -n1)"
+
+ check_if_managed_install "${flash_attn_dist_loc}"
+ flash_attn_needs_reinstall=${PACKAGE_NEEDS_REINSTALL}
+
+ unset flash_attn_dist_loc
+ unset flash_attn_loc
+}
+
+install_flash_attention() {
+ has_flashattn=$(pip list | grep "^flash-attn " | grep "flash-attn" -o | awk '{print $1}' | awk '{print length}')
+ flash_attn_needs_reinstall=0
+
+ if [ "$has_flashattn" != "10" ]; then
+ flash_attention_install_logic
+ else
+ check_if_flash_attention_needs_reinstall
+ if [ "$flash_attn_needs_reinstall" != "0" ]; then
+ echo " > Reinstalling flash_attn per demo version..."
+ python3 -m pip uninstall -y flash-attn
+ flash_attention_install_logic
+ else
+ echo " > flash-attention already installed!"
+ fi
+ fi
+
+ unset flash_attn_needs_reinstall
+ unset has_flashattn
+}
+
+transformer_engine_install_logic() {
+ if [ ! -d "TransformerEngine" ]; then
+ git clone https://github.com/NVIDIA/TransformerEngine.git
+ fi
+
+ cd TransformerEngine || exit
+ TRANSFORMER_ENGINE_PATH="$(pwd)"
+ git config --global --add safe.directory "${TRANSFORMER_ENGINE_PATH}"
+ unset TRANSFORMER_ENGINE_PATH
+
+ git checkout 804f120322a13cd5f21ea8268860607dcecd055c
+ git submodule update --recursive --init
+ CMD="MAKEFLAGS=-j${ARG_JOBS} MAX_JOBS=${ARG_JOBS} python3 setup.py bdist_wheel --framework=pytorch"
+ echo " >> ${CMD}"
+ eval "${CMD}"
+ unset CMD
+ python3 -m pip install "$(find './dist' -name '*.whl' | head -n1)"
+ cd ..
+
+ # Check for common point of failure with TE.
+ has_te_loc=$(pip list | grep "^transformer-engine " | grep "transformer-engine" -o | awk '{print $1}' | awk '{print length}')
+ [ "$has_te_loc" != "18" ] && {
+ echo " > TransformerEngine install failed. Probable cause of failures:"
+ echo " - CUDNN location was not picked up. If your CUDNN include dir"
+ echo " is /path/to/cudnn/include and lib is /path/to/cudnn/lib, "
+ echo " Invoke the script as CUDNN_PATH=/path/to/cudnn sh install.sh ..."
+ exit 1
+ }
+ unset has_te_loc
+}
+
+check_if_transformer_engine_needs_reinstall() {
+ te_loc="$(pip show transformer-engine | grep '^Location' | awk '{print $2}')"
+ te_dist_loc="$(find "${te_loc}" -depth -maxdepth 1 -name 'transformer_engine*dist-info' -type d | head -n1)"
+
+ check_if_managed_install "${te_dist_loc}"
+ te_needs_reinstall=${PACKAGE_NEEDS_REINSTALL}
+
+ unset te_dist_loc
+ unset te_loc
+}
+
+install_transformer_engine() {
+ has_te=$(pip list | grep "^transformer-engine " | grep "transformer-engine" -o | awk '{print $1}' | awk '{print length}')
+ te_needs_reinstall=0
+
+ if [ "$has_te" != "18" ]; then
+ transformer_engine_install_logic
+ else
+ check_if_transformer_engine_needs_reinstall
+ if [ "$te_needs_reinstall" != "0" ]; then
+ echo " > Reinstalling TransformerEngine per demo version..."
+ python3 -m pip uninstall -y transformer-engine
+ transformer_engine_install_logic
+ else
+ echo " > TransformerEngine already installed!"
+ fi
+ fi
+
+ unset te_needs_reinstall
+ unset has_te
+
+ # Patch TE files.
+ sh "${NEMO_DIR}/patch_te.sh"
+}
+
+nemo_install_logic() {
+ if [ ! -d "NeMo" ]; then
+ git clone --branch main --single-branch https://github.com/NVIDIA/NeMo.git NeMo
+ fi
+
+ cd NeMo || exit
+ NeMo_PATH="$(pwd)"
+ git config --global --add safe.directory "${NeMo_PATH}"
+ unset NeMo_PATH
+
+ git checkout bf270794267e0240d8a8b2f2514c80c6929c76f1
+ bash reinstall.sh
+ cd ../
+}
+
+check_if_nemo_needs_reinstall() {
+ nemo_loc="$(pip show nemo-toolkit | grep '^Location' | awk '{print $2}')"
+ nemo_dist_loc="$(find "${nemo_loc}" -depth -maxdepth 1 -name 'nemo_toolkit*dist-info' -type d | head -n1)"
+
+ check_if_managed_install "${nemo_dist_loc}"
+ nemo_needs_reinstall=${PACKAGE_NEEDS_REINSTALL}
+
+ unset nemo_dist_loc
+ unset nemo_loc
+}
+
+install_nemo() {
+ has_nemo=$(pip list | grep "^nemo-toolkit " | grep "nemo-toolkit" -o | awk '{print $1}' | awk '{print length}')
+ nemo_needs_reinstall=0
+
+ if [ "$has_nemo" != "12" ]; then
+ nemo_install_logic
+ else
+ check_if_nemo_needs_reinstall
+ if [ "$nemo_needs_reinstall" != "0" ]; then
+ echo " > Reinstalling NeMo per demo version..."
+ python3 -m pip uninstall -y nemo-toolkit
+ nemo_install_logic
+ else
+ echo " > NeMo already installed!"
+ fi
+ fi
+}
+
+while [ "$#" -gt 0 ]; do
+ case $1 in
+ --deps)
+ DEPENDENCIES_DIR="$2"
+ shift
+ ;;
+ -j | --jobs)
+ ARG_JOBS="$2"
+ shift
+ ;;
+ --ninja) BUILD_NINJA=1 ;;
+ --skipsrc) BUILD_SRCLIBS=0 ;;
+ -h | --help) ARG_HELP=1 ;;
+ *)
+ echo "Unknown parameter passed: $1"
+ echo "For help type: $0 --help"
+ exit 1
+ ;;
+ esac
+ shift
+done
+
+if [ "$ARG_HELP" -eq "1" ]; then
+ echo "Usage: sh $0 [options]"
+ echo "All arguments are optional."
+ echo " --help or -h : Print this help menu."
+ echo " [--deps] {temp} : Path to download and build dependencies."
+ echo " [-j | --jobs] {1} : Number of jobs to use for building from source."
+ echo " [--ninja] : Flag to build ninja (if not present) to speed up installation."
+ # skipsrc is not documented to prevent users from invoking it directly.
+ exit
+fi
+
+DEPENDENCIES_DIR="${NEMO_DIR}/${DEPENDENCIES_DIR}"
+echo " > Using ${DEPENDENCIES_DIR}' to store dependencies."
+mkdir -p "${DEPENDENCIES_DIR}"
+install_essential_tools "${DEPENDENCIES_DIR}"
+
+echo " > Installing Requirements.txt..."
+pip install --upgrade pip
+pip install nvidia-pyindex || {
+ echo "Could not install nvidia-pyindex, stopping install"
+ exit 1
+}
+# One of the hidden dependencies requires Cython, but doesn't specify it.
+# https://github.com/VKCOM/YouTokenToMe/pull/108
+# WAR by installing Cython before requirements.
+pip install "Cython==0.29.36" || {
+ echo "Could not install Cython, stopping install"
+ exit 1
+}
+# PyYaml, Cython and pip don't play well together.
+# https://github.com/yaml/pyyaml/issues/601
+pip install "pyyaml==5.4.1" --no-build-isolation || {
+ echo "Could not install PyYaml, stopping install"
+ exit 1
+}
+# Install a specific version of opencc to WAR a GLIBC not found error.
+pip install "opencc==1.1.6" || {
+ echo "Could not install OpenCC, stopping install"
+ exit 1
+}
+pip install -r requirements.txt || {
+ echo "Could not install dependencies, stopping install"
+ exit 1
+}
+
+# Installation from source
+if [ "$BUILD_SRCLIBS" -eq "1" ]; then
+    ! command -v -- "ninja" >/dev/null 2>&1 && [ "$BUILD_NINJA" -eq "0" ] && echo " > Could not locate ninja, consider passing the --ninja flag to speed up dependency installation."
+fi
+
+cd "${DEPENDENCIES_DIR}" || exit
+if (! command -v -- "ninja" >/dev/null 2>&1) && [ "$BUILD_NINJA" -eq "1" ]; then
+ echo " > Building ninja..."
+ install_ninja
+fi
+
+if [ "$BUILD_SRCLIBS" -eq "1" ]; then
+ echo " > Installing Apex..."
+ install_apex
+fi
+
+echo " > Installing Megatron-LM..."
+install_megatron
+
+if [ "$BUILD_SRCLIBS" -eq "1" ]; then
+ echo " > Installing flash-attention..."
+ install_flash_attention
+fi
+
+if [ "$BUILD_SRCLIBS" -eq "1" ]; then
+ echo " > Installing TransformerEngine..."
+ install_transformer_engine
+fi
+
+echo " > Installing NeMo..."
+install_nemo
+
+if [ ! -f "${NEMO_DIR}/GPT3/convert_te_onnx_to_trt_onnx.py" ]; then
+ echo " > Copying opset19 conversion script..."
+ if [ ! -f "${SCRIPT_DIR}/convert_te_onnx_to_trt_onnx.py" ]; then
+ echo "Opset19 conversion script is not located at /scripts/convert_te_onnx_to_trt_onnx.py"
+        exit 1
+ fi
+ cp "${SCRIPT_DIR}/convert_te_onnx_to_trt_onnx.py" "${NEMO_DIR}/GPT3/convert_te_onnx_to_trt_onnx.py"
+fi
+
+cd ../
+
+unset ARG_HELP
+unset ARG_JOBS
+unset BUILD_NINJA
+unset DEPENDENCIES_DIR
+unset SCRIPT_DIR
+unset DEMO_DIR
+unset NEMO_DIR
diff --git a/demo/NeMo/interface.py b/demo/NeMo/interface.py
new file mode 100644
index 00000000..ec3dcbf7
--- /dev/null
+++ b/demo/NeMo/interface.py
@@ -0,0 +1,727 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from datetime import datetime
+import os
+import random
+import sys
+import time
+from typing import List, Union, Dict
+from copy import copy
+
+from cuda import cuda
+from tqdm import tqdm
+import numpy as np
+import torch
+
+from transformers import PretrainedConfig
+from omegaconf import OmegaConf, listconfig
+
+# Add syspath for custom library
+if __name__ == "__main__":
+ filepath = os.path.dirname(os.path.abspath(__file__))
+ project_root = os.path.join(filepath, os.pardir)
+ sys.path.append(project_root)
+
+from GPT3.decoding import full_inference, generate, process_output
+from GPT3.GPT3ModelConfig import GPT3ModelTRTConfig
+from GPT3.lambada_dataset import Lambada
+from GPT3.nemo_utils import get_computeprob_response
+from GPT3.sequence_perplexity import SequencePerplexity
+
+sys.path.append('../HuggingFace') # Include HuggingFace
+from NNDF.general_utils import NNFolderWorkspace
+from NNDF.logger import G_LOGGER
+from NNDF.networks import (
+ Precision,
+ NetworkMetadata,
+ TimingProfile,
+ BenchmarkingResult,
+ NetworkResult,
+ NetworkCheckpointResult,
+)
+from NNDF.interface import NetworkCommand
+
+# Manually set by referring to examples/nlp/language_modeling/conf/megatron_gpt_config.yaml
+# If a field cannot be found, set to None.
+DEFAULT_CONFIG = {
+ "is_encoder_decoder": False,
+ "is_decoder": True,
+ "architectures": [ "GPT3NeMoModel" ],
+}
+
+GPT3CONFIG_MAPPINGS = {
+ "gpt-126m": PretrainedConfig.from_dict(dict({"_name_or_path": "gpt-126m",
+ "num_heads": 12,
+ "num_layers": 12,
+ "hidden_size": 768,
+ "max_position_embeddings": 2048,
+ "min_seq_len": 0,
+ }, **DEFAULT_CONFIG)),
+ "gpt-1.3b": PretrainedConfig.from_dict(dict({"_name_or_path": "gpt-1.3b",
+ "num_heads": 16,
+ "num_layers": 24,
+ "hidden_size": 2048,
+ "max_position_embeddings": 2048,
+ "min_seq_len": 0,
+ }, **DEFAULT_CONFIG)),
+ "gpt-5b": PretrainedConfig.from_dict(dict({"_name_or_path": "gpt-5b",
+ "num_heads": 32,
+ "num_layers": 24,
+ "hidden_size": 4096,
+ "max_position_embeddings": 2048,
+ "min_seq_len": 16,
+ }, **DEFAULT_CONFIG)),
+}
+
+def _hf_hub_metadata(variant: str, fp8: bool) -> Dict[str, str]:
+ repo_mappings = {
+ "gpt-1.3b": "nvidia/nemo-megatron-gpt-1.3B",
+ "gpt-5b": "nvidia/nemo-megatron-gpt-5B",
+ }
+
+ try:
+ repo_id = repo_mappings[variant]
+ except KeyError:
+ raise RuntimeError(
+ "Variant should be one of {}, got {}".format(
+ list(repo_mappings.keys()), variant
+ )
+ )
+
+ file_key = (variant, "fp8" if fp8 else "fp16")
+ file_mappings = {
+ ("gpt-1.3b", "fp8"): ("nemo_gpt1.3B_fp16.nemo", None),
+ ("gpt-1.3b", "fp16"): ("nemo_gpt1.3B_fp16.nemo", None),
+ ("gpt-5b", "fp8"): ("nemo_gpt5B_fp8_bf16_tp1.nemo", "fp8"),
+ ("gpt-5b", "fp16"): ("nemo_gpt5B_fp16_tp1.nemo", None),
+ }
+
+ try:
+ filename, branch = file_mappings[file_key]
+ except KeyError:
+ raise RuntimeError(
+ "Downloading nemo file for variant : {}, precision : {} from huggingface hub is unsupported. Consider passing a nemo-model or onnx-model from the command line.".format(
+ file_key[0], file_key[1]
+ )
+ )
+
+ return {"repo_id": repo_id, "filename": filename, "revision": branch}
+
+
+def download_model(dst_dir: str, cache_dir: str, *args, **kwargs) -> str:
+ from huggingface_hub import hf_hub_download
+
+ os.makedirs(dst_dir, exist_ok=True)
+ os.makedirs(cache_dir, exist_ok=True)
+
+ model_metadata = _hf_hub_metadata(*args, **kwargs)
+ return hf_hub_download(
+ local_dir=str(dst_dir),
+ local_dir_use_symlinks="auto",
+ cache_dir=cache_dir,
+ **model_metadata,
+ )
+
+
+def load_dataset(dataset_name, base_dir, tokens_to_generate, padding):
+ ds_map = {"Lambada": Lambada(base_dir, tokens_to_generate, padding)}
+ return ds_map[dataset_name]
+
+def get_accuracy_metric(cfg):
+ topN = [int(i.strip()) for i in cfg.top_n.split(",")]
+ m_map = {"Perplexity": SequencePerplexity(topN)}
+ return m_map[cfg.metric]
+
+def remove_padded_prompts(output, nb_paddings):
+ if nb_paddings == 0:
+ return output
+ result = {}
+ for k, v in output.items():
+ if v != None and (type(v) is list or type(v) is torch.Tensor):
+ v = v[:-nb_paddings]
+ result[k] = v
+ return result
+
+def get_random_input(tokenizer, batch_size, in_seq_len, out_seq_len):
+ vocab_size = tokenizer.tokenizer.vocab_size
+ return (torch.randint(0, vocab_size, (batch_size, in_seq_len + out_seq_len), dtype=torch.int64).cuda(),
+ (torch.ones(batch_size, dtype=torch.int64) * in_seq_len).cuda())
+
+class BaseModel(torch.nn.Module):
+ def __init__(self):
+ super(BaseModel, self).__init__()
+ self.model = None
+ def forward(self, x):
+ raise Exception("BaseModel forward method is not intended to be called.")
+
+class NeMoCommand(NetworkCommand):
+ def __init__(
+ self,
+ nemo_cfg,
+ config_class,
+ description,
+ **kwargs
+ ):
+ self.nemo_cfg = nemo_cfg
+ super().__init__(config_class, description, **kwargs)
+
+ def validate_and_set_precision(self, fp8, fp16, bf16, use_fp8_storage, quantize_bmms):
+ if fp8:
+ if fp16:
+ G_LOGGER.info("Use FP8-FP16 precision.")
+ if bf16:
+ G_LOGGER.info("Use FP8-BF16 precision.")
+ elif fp16:
+ G_LOGGER.info("Use pure FP16 precision.")
+ elif bf16:
+ G_LOGGER.info("Use pure BF16 precision.")
+ else:
+ fp16 = True
+ G_LOGGER.warn("Precision is not specified. Use pure FP16 precision by default.")
+
+ self.fp8, self.fp16, self.bf16 = fp8, fp16, bf16
+ self.nemo_cfg.trt_export_options.use_fp8 = fp8
+ self.nemo_cfg.trt_export_options.use_fp16 = fp16
+ self.nemo_cfg.trt_export_options.use_bf16 = bf16
+ self.nemo_cfg.onnx_export_options.use_fp8_storage = use_fp8_storage
+ self.nemo_cfg.onnx_export_options.quantize_bmms = quantize_bmms
+
+ if fp16:
+ self.nemo_cfg.trainer.precision = "16"
+ elif bf16:
+ self.nemo_cfg.trainer.precision = "bf16"
+ else:
+ self.nemo_cfg.trainer.precision = "32"
+
+ def update_hyperparams(self, model_config):
+ self.nemo_cfg.model.num_layers = model_config.num_layers
+ self.nemo_cfg.model.nb_heads = model_config.num_heads
+ self.nemo_cfg.model.head_size = model_config.hidden_size // model_config.num_heads
+ self.nemo_cfg.model.hidden_size = model_config.hidden_size
+ self.nemo_cfg.model.encoder_seq_length = model_config.max_position_embeddings
+ self.nemo_cfg.model.max_position_embeddings = model_config.max_position_embeddings
+
+ def setup_environment(
+ self,
+ variant: str,
+ working_dir: str = "temp",
+ batch_size: int = 1,
+ num_beams: int = 1,
+ use_cache: bool = True,
+ verbose: bool = False,
+ info: bool = False,
+ iterations: int = None,
+ warmup: int = None,
+ number: int = None,
+ duration: int = None,
+ percentile: int = None,
+ cleanup: bool = False,
+ action: str = None,
+ max_seq_len: int = None,
+ fp8: bool = True,
+ fp16: bool = False,
+ bf16: bool = False,
+ use_fp8_storage: bool = False,
+ quantize_bmms: bool = False,
+ input_seq_len: int = None,
+ output_seq_len: int = None,
+ nemo_model: str = None,
+ nemo_checkpoint: str = None,
+ nemo_hparams: str = None,
+ onnx_model: str = None,
+ **kwargs,
+ ) -> None:
+ """
+        Use arguments from the command line or specified by the user to set up the config for the model.
+ """
+ self.validate_and_set_precision(fp8, fp16, bf16, use_fp8_storage, quantize_bmms)
+
+ if not torch.cuda.is_available():
+ raise EnvironmentError("GPU is required for NeMo demo.")
+
+ # Initialize CUDA Driver API
+ err, = cuda.cuInit(0)
+ if err != cuda.CUresult.CUDA_SUCCESS:
+ raise RuntimeError("Cuda initialization failed with error: {}".format(err))
+
+ # See https://pytorch.org/docs/stable/_modules/torch.html#set_float32_matmul_precision
+ torch.set_float32_matmul_precision('medium')
+
+ if max_seq_len != None:
+ self.nemo_cfg.model.max_seq_len = max_seq_len
+
+ assert action != None, "Action must be specified"
+ if action == "accuracy":
+ self.nemo_cfg.mode = "accuracy"
+ self.nemo_cfg.inference.compute_logprob = True
+ self.nemo_cfg.inference.all_probs = True
+ self.nemo_cfg.inference.greedy = True
+ self.nemo_cfg.inference.add_BOS = False
+ self.nemo_cfg.inference.tokens_to_generate = 1
+ self.nemo_cfg.inference.min_tokens_to_generate = 0
+ self.nemo_cfg.inference.temperature = 1.0
+ self.nemo_cfg.inference.top_k = 0
+ self.nemo_cfg.inference.top_p = 0.9
+ self.nemo_cfg.inference.repetition_penalty = 1.0
+ elif action == "benchmark":
+ self.nemo_cfg.mode = "benchmark"
+ if input_seq_len != None:
+ self.nemo_cfg.benchmark.input_seq_len = input_seq_len
+ if output_seq_len != None:
+ self.nemo_cfg.benchmark.output_seq_len = output_seq_len
+ self.nemo_cfg.inference.tokens_to_generate = self.nemo_cfg.benchmark.output_seq_len
+ self.nemo_cfg.inference.min_tokens_to_generate = self.nemo_cfg.benchmark.output_seq_len
+
+ if self.nemo_cfg.model.max_seq_len < (self.nemo_cfg.benchmark.input_seq_len + self.nemo_cfg.benchmark.output_seq_len):
+ raise ValueError(f"Max sequence length of the model needs to be greater than or equal to the sum of input sequence length and output sequence length. Got {self.nemo_cfg.model.max_seq_len} < {self.nemo_cfg.benchmark.input_seq_len} + {self.nemo_cfg.benchmark.output_seq_len}.")
+
+ if (nemo_model or nemo_checkpoint) and onnx_model:
+ raise RuntimeError(
+ "Both nemo-model and onnx-model cannot be specified together. Please specify either nemo-model or onnx-model."
+ )
+
+ assert variant in GPT3CONFIG_MAPPINGS
+ model_config = GPT3CONFIG_MAPPINGS[variant]
+
+ if self.nemo_cfg.model.max_seq_len > model_config.max_position_embeddings:
+ G_LOGGER.warn(
+ f"Updating max_position_embeddings to be the same as max_seq_len {self.nemo_cfg.model.max_seq_len}."
+ )
+ G_LOGGER.warn(
+                f"Outputs longer than {model_config.max_position_embeddings} might not be meaningful."
+ )
+ model_config.max_position_embeddings = self.nemo_cfg.model.max_seq_len
+
+ if self.nemo_cfg.model.max_seq_len < model_config.min_seq_len:
+ G_LOGGER.warn(
+ f"Force updating max_seq_len to minimum required length {model_config.min_seq_len}."
+ )
+ self.nemo_cfg.model.max_seq_len = model_config.min_seq_len
+
+ self.nemo_cfg.batch_size = batch_size
+ self.nemo_cfg.use_cache = use_cache
+
+ if nemo_checkpoint != None:
+ # Set NeMo checkpoint configs
+ self.nemo_cfg.checkpoint_dir = os.path.dirname(nemo_checkpoint)
+ if not self.nemo_cfg.checkpoint_dir:
+ raise ValueError(f"NeMo checkpoint needs to be provided with full path.")
+ self.nemo_cfg.checkpoint_name = os.path.basename(nemo_checkpoint)
+ self.nemo_cfg.hparams_file = nemo_hparams
+ else:
+ if onnx_model != None:
+ G_LOGGER.info(f"Using onnx model {onnx_model} for inference.")
+ if os.path.exists(onnx_model):
+ self.nemo_cfg.onnx_model_file = onnx_model
+ else:
+ raise IOError(
+ f"Could not find the specified onnx file {onnx_model}."
+ )
+ else:
+ if nemo_model != None:
+ if os.path.exists(nemo_model):
+ self.nemo_cfg.gpt_model_file = nemo_model
+ else:
+ raise IOError(
+ f"Could not find the specified nemo file {nemo_model}."
+ )
+ else:
+ G_LOGGER.info("Downloading nemo model from HuggingFace Hub")
+ # Download nemo model if it does not exist.
+                    # Set up temporary metadata and config to create a workspace
+                    # for the downloaded artefacts.
+ download_metadata = NetworkMetadata(
+ variant=variant,
+ precision=Precision(fp16=self.fp16),
+ use_cache=use_cache,
+ num_beams=num_beams,
+ batch_size=batch_size
+ )
+
+ download_config = self.config_class(metadata=download_metadata)
+ download_config.from_nemo_config(copy(self.nemo_cfg))
+ download_workspace = NNFolderWorkspace(download_config, working_dir)
+
+ self.nemo_cfg.gpt_model_file = download_model(
+ dst_dir=download_workspace.dpath + "/artefacts",
+ cache_dir=download_workspace.dpath + "/cache",
+ variant=variant,
+ fp8=fp8,
+ )
+
+ if self.nemo_cfg.gpt_model_file == None and self.nemo_cfg.checkpoint_dir == None and onnx_model == None:
+ G_LOGGER.error("No model exists based on specified configs and precisions.")
+ raise ValueError("Model not found.")
+
+ self.update_hyperparams(model_config)
+
+ # HuggingFace code
+ if verbose:
+ G_LOGGER.setLevel(level=G_LOGGER.DEBUG)
+ elif info:
+ G_LOGGER.setLevel(level=G_LOGGER.INFO)
+
+ if variant is None:
+ G_LOGGER.error("You need to specify --variant to run NeMo demo")
+ return
+
+ if self._args is not None:
+ G_LOGGER.info("Setting up environment with arguments: {}".format(self._args))
+ else:
+ G_LOGGER.info("User-customized API is called")
+
+ self.metadata = NetworkMetadata(
+ variant=variant,
+ precision=Precision(fp16=self.fp16),
+ use_cache=use_cache,
+ num_beams=num_beams,
+ batch_size=batch_size
+ )
+
+ self.config = self.config_class(
+ metadata = self.metadata
+ )
+
+ self.config.from_nemo_config(self.nemo_cfg)
+
+ self.workspace = NNFolderWorkspace(
+ self.config, working_dir
+ )
+
+ self.timing_profile = TimingProfile(
+ iterations=iterations,
+ number=number,
+ warmup=warmup,
+ duration=duration,
+ percentile=percentile,
+ )
+
+ self.keep_torch_model = not cleanup
+ self.keep_onnx_model = not cleanup
+ self.keep_trt_engine = not cleanup
+
+ self.process_framework_specific_arguments(onnx_model=onnx_model, **kwargs)
+
+ def process_framework_specific_arguments(self, **kwargs):
+ pass
+
+ def run(self) -> Union[List[NetworkResult], BenchmarkingResult]:
+ """
+        Main entry point which compiles the model and generates results for command-line mode.
+        The general process for all commands is the same:
+        (1) Download the model
+        (2) Run either checkpoint or benchmark
+        (3) Return the result
+ """
+ t0 = time.time()
+ self.models = self.setup_tokenizer_and_model()
+ t1 = time.time()
+ G_LOGGER.info("setup_tokenizer_and_model() takes {:.4f}s in total.".format(t1 - t0))
+
+ results = []
+ ppl = None
+ random.seed(self.nemo_cfg.inference.seed)
+ np.random.seed(self.nemo_cfg.inference.seed)
+ torch.manual_seed(self.nemo_cfg.inference.seed)
+ if self.nemo_cfg.mode == "accuracy":
+ G_LOGGER.debug("Run in accuracy mode.")
+ eval_ppl = get_accuracy_metric(self.nemo_cfg.accuracy)
+ has_align_requirement = self.nemo_cfg.runtime == 'nemo' and hasattr(self.model.cfg, "fp8") and self.model.cfg.fp8 == True
+ if has_align_requirement and self.nemo_cfg.accuracy.tokens_to_generate > 1:
+ self.nemo_cfg.accuracy.tokens_to_generate = 1
+ G_LOGGER.warn("Force set tokens_to_generate=1 for FP8 run in NeMo framework.")
+ dataset = load_dataset(self.nemo_cfg.accuracy.dataset, self.workspace.rootdir, self.nemo_cfg.accuracy.tokens_to_generate, 8 if has_align_requirement else -1)
+ tokenizer = self.tokenizer
+
+ def eval_ppl_with_batch_input(eval_ppl, batch_input):
+ ds_input = dataset.preprocess_input(tokenizer, batch_input)
+ self.nemo_cfg.inference.tokens_to_generate = self.nemo_cfg.accuracy.tokens_to_generate
+ self.nemo_cfg.inference.min_tokens_to_generate = self.nemo_cfg.accuracy.tokens_to_generate
+
+ inputs = ds_input.inputs
+ response = full_inference(
+ model=self.model,
+ inputs=inputs,
+ cfg=self.nemo_cfg,
+ )
+
+                # It is still a prediction task even when tokens_to_generate > 1, so we need to restore the context length.
+ batch_size = ds_input.inputs[0].shape[0]
+ real_ctx_length = ds_input.inputs[0].shape[1] - 1
+ inputs = (ds_input.inputs[0], torch.ones(batch_size, dtype=torch.int32) * real_ctx_length)
+
+ response = get_computeprob_response(tokenizer, response, inputs)
+ eval_ppl.update(ds_input=ds_input, response=response, tokenizer=tokenizer)
+
+ batch_input = []
+ for doc in tqdm(dataset.load()):
+ batch_input.append(doc)
+
+ if len(batch_input) == self.nemo_cfg.batch_size:
+ eval_ppl_with_batch_input(eval_ppl, batch_input)
+ batch_input.clear()
+
+ if len(batch_input):
+ # Pad empty text to batch size
+ while (len(batch_input) % self.nemo_cfg.batch_size) != 0:
+ batch_input.append({"text": ""})
+ eval_ppl_with_batch_input(eval_ppl, batch_input)
+
+ ppl, sequence_ppl, _, acc_text = eval_ppl.compute()
+ print("***************************")
+ print("{} ppl(last token): {:.4f}, ppl(sequence): {:.4f}, {}".format(self.nemo_cfg.accuracy.dataset, ppl, sequence_ppl, acc_text))
+ print("***************************")
+ elif self.nemo_cfg.mode == "benchmark":
+ G_LOGGER.debug("Run in benchmark mode.")
+ rand_input = get_random_input(self.model.tokenizer, self.nemo_cfg.batch_size, self.nemo_cfg.benchmark.input_seq_len, self.nemo_cfg.benchmark.output_seq_len)
+
+ for _ in range(self.timing_profile.warmup):
+ output = full_inference(self.model, rand_input, self.nemo_cfg)
+
+ class BenchmarkTimer:
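+                """Accumulates per-iteration wall-clock times and reports latency and throughput statistics."""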
+ def __init__(self, name):
+ self.name = name
+ self.started = False
+ self.start_time = None
+ self.times = []
+
+ def start(self):
+ assert not self.started
+ self.started = True
+ self.start_time = time.perf_counter()
+
+ def end(self):
+ assert self.started
+ self.started = False
+ self.times.append(time.perf_counter() - self.start_time)
+
+ def stats_str(self, num_tokens):
+ total_time = sum(self.times)
+ avg_time = total_time / float(len(self.times))
+ self.times.sort()
+ percentile95 = self.times[int(len(self.times) * 0.95)]
+ percentile99 = self.times[int(len(self.times) * 0.99)]
+ throughput = float(num_tokens) / avg_time
+ return("[{:10s}] Total Time: {:0.5f} s, Average Time: {:0.5f} s, 95th Percentile Time: {:0.5f} s, 99th Percentile Time: {:0.5f} s, Throughput: {:0.2f} tokens/s".format(self.name, total_time, avg_time, percentile95, percentile99, throughput))
+
+ G_LOGGER.info("Warm up finished. Start benchmarking...")
+ e2e_timer = BenchmarkTimer("E2E inference")
+ core_timer = BenchmarkTimer("Without tokenizer")
+ start_time = datetime.now()
+ iter_idx = 0
+ cur_duration = 0
+ while iter_idx < self.timing_profile.iterations or cur_duration < self.timing_profile.duration:
+ core_timer.start()
+ e2e_timer.start()
+ output = generate(self.model, rand_input, self.nemo_cfg)
+ core_timer.end()
+
+ output = process_output(self.model, output)
+ e2e_timer.end()
+
+ iter_idx += 1
+ cur_duration = (datetime.now() - start_time).total_seconds()
+
+ num_tokens = self.nemo_cfg.batch_size * self.nemo_cfg.benchmark.output_seq_len
+ print("***************************")
+ print(f"Running {iter_idx} iterations with duration: {cur_duration}s, batch size: {self.nemo_cfg.batch_size}, input sequence length: {self.nemo_cfg.benchmark.input_seq_len} and output sequence length: {self.nemo_cfg.benchmark.output_seq_len}")
+ print(f"{e2e_timer.stats_str(num_tokens)}")
+ print(f"{core_timer.stats_str(num_tokens)}")
+ print("***************************")
+ else:
+ G_LOGGER.debug("Run in inference mode.")
+ assert self.nemo_cfg.mode == "inference"
+ if self.nemo_cfg.runtime == 'nemo' and hasattr(self.model.cfg, "fp8") and self.model.cfg.fp8 == True and self.nemo_cfg.batch_size % 8 != 0:
+ new_batch_size = ((self.nemo_cfg.batch_size + 7) // 8) * 8
+ print("Update batch size from {} to {} for NeMo FP8 inference.".format(self.nemo_cfg.batch_size, new_batch_size))
+ self.nemo_cfg.batch_size = new_batch_size
+
+ nb_paddings = 0
+ while (len(self.nemo_cfg.prompts) % self.nemo_cfg.batch_size) != 0:
+ self.nemo_cfg.prompts.append(self.nemo_cfg.prompts[-1])
+ nb_paddings += 1
+
+ batch_idx = 0
+ start = 0
+ while True:
+ inputs = OmegaConf.to_container(listconfig.ListConfig(self.nemo_cfg.prompts[start:start+self.nemo_cfg.batch_size]))
+ output = full_inference(self.model, inputs, self.nemo_cfg)
+ output = remove_padded_prompts(output, nb_paddings)
+ print("***************************")
+ print("Batch {}: {}".format(batch_idx, output))
+ print("***************************")
+ batch_idx += 1
+ start += self.nemo_cfg.batch_size
+ if start >= len(self.nemo_cfg.prompts):
+ break
+
+ t2 = time.time()
+        G_LOGGER.info("Inference session takes {:.4f}s in total.".format(t2 - t1))
+
+ # Release runtime objects
+ if self.nemo_cfg.runtime == 'onnx':
+ del self.model.onnxrt
+ elif self.nemo_cfg.runtime == 'trt':
+ del self.model.trt
+
+ return results, ppl
+
+ def add_args(self) -> None:
+ general_group = self._parser.add_argument_group("general")
+ general_group.add_argument(
+ "--help",
+ "-h",
+ help="Shows help message for NeMo commands.",
+ action="store_true",
+ )
+ general_group.add_argument(
+ "--verbose", "-v",
+ help="Display verbose logs.",
+ action="store_true"
+ )
+ general_group.add_argument(
+ "--info", help="Display info logs.", action="store_true"
+ )
+ general_group.add_argument(
+ "--working-dir", "-wd",
+        help="Location where the model and other downloaded files are saved.",
+ required=True,
+ )
+
+ timing_group = self._parser.add_argument_group("inference measurement")
+ timing_group.add_argument(
+ "--duration",
+ type=int,
+        help="Minimum duration of inference iterations to measure, in seconds.",
+ default=NetworkCommand.DEFAULT_DURATION,
+ )
+ timing_group.add_argument(
+ "--iterations",
+ type=int,
+ help="Number of iterations to measure.",
+ default=NetworkCommand.DEFAULT_ITERATIONS,
+ )
+ timing_group.add_argument(
+ "--warmup",
+ type=int,
+ help="Number of warmup iterations before actual measurement occurs.",
+ default=NetworkCommand.DEFAULT_WARMUP,
+ )
+
+ model_config_group = self._parser.add_argument_group("model")
+ model_config_group.add_argument(
+ "--nemo-model",
+ help="Set a NeMo model to be used.",
+ type=str,
+ default=None
+ )
+ model_config_group.add_argument(
+ "--nemo-checkpoint",
+ help="Set a NeMo checkpoint to be used.",
+ type=str,
+ default=None
+ )
+ model_config_group.add_argument(
+ "--nemo-hparams",
+ help="Set a NeMo hparams.yaml to be used.",
+ type=str,
+ default=None
+ )
+ model_config_group.add_argument(
+ "--onnx-model",
+        help="Set an ONNX model (exported from a NeMo model) to be used. See `export_utils.py` in the model directory for exporting ONNX files.",
+ type=str,
+ default=None,
+ )
+ model_config_group.add_argument(
+ "--max-seq-len",
+        help="Set the maximum sequence length used for a GPT model.",
+ type=int,
+ default=None,
+ )
+ model_config_group.add_argument(
+ "--batch-size", "-b",
+ help="Set batch size for inference",
+ required=False,
+ type=int,
+ default=1
+ )
+ model_config_group.add_argument(
+ "--variant", "-m",
+        help="Model variant to run.",
+ required=True,
+ choices=GPT3ModelTRTConfig.TARGET_MODELS,
+ )
+ model_config_group.add_argument(
+ "--use-cache",
+ "-kv",
+ help="Enable KV cache",
+ action="store_true",
+ default=False,
+ )
+ model_config_group.add_argument(
+ "--fp8",
+ action="store_true",
+ help="Use FP8 precision.",
+ default=False
+ )
+ model_config_group.add_argument(
+ "--fp16",
+ action="store_true",
+ help="Use FP16 precision.",
+ default=False
+ )
+ model_config_group.add_argument(
+ "--bf16",
+ action="store_true",
+ help="Use BF16 precision.",
+ default=False
+ )
+ model_config_group.add_argument(
+ "--use-fp8-storage",
+ action="store_true",
+ help="Use FP8 storage precision.",
+ default=False
+ )
+ model_config_group.add_argument(
+ "--quantize-bmms",
+ help="Quantize attention BMMs",
+ action="store_true",
+ default=False,
+ )
+
+ def __call__(self):
+ t0 = time.time()
+ self.add_args()
+ self._args = self._parser.parse_args()
+ if "help" in self._args and self._args.help == True:
+ self._parser.print_help()
+ exit(0)
+
+ self.setup_environment(
+ **vars(self._args),
+ )
+ t1 = time.time()
+        G_LOGGER.info("Setting up the environment takes {:.4f}s.".format(t1 - t0))
+
+ network_results, ppl_results = self.run()
+ return NetworkCheckpointResult(
+ network_results=network_results,
+ accuracy=0,
+ perplexity=0,
+ )
diff --git a/demo/NeMo/nemo_export.py b/demo/NeMo/nemo_export.py
new file mode 100644
index 00000000..b9f5ad3a
--- /dev/null
+++ b/demo/NeMo/nemo_export.py
@@ -0,0 +1,922 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import argparse
+import subprocess as sp
+import shlex
+import omegaconf
+import os
+import sys
+import warnings
+from typing import Dict, List, Optional, Tuple
+import numpy as np
+
+# nemo
+from nemo.core import ModelPT
+from nemo.core.classes import Exportable
+from nemo.core.neural_types import ChannelType, NeuralType
+from nemo.utils.export_utils import augment_filename
+from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel, MegatronGPTExportableModel
+
+# onnx
+import onnx
+import onnx_graphsurgeon as gs
+
+# polygraphy
+from polygraphy.backend.trt import Profile, CreateConfig, engine_from_network, NetworkFromOnnxPath, save_engine
+from polygraphy.logger import G_LOGGER as PG_LOGGER
+
+import torch
+import transformer_engine
+
+if __name__ == "__main__":
+ filepath = os.path.dirname(os.path.abspath(__file__))
+ project_root = os.path.join(filepath, os.pardir, "HuggingFace")
+ sys.path.append(project_root)
+
+# Add syspath for custom library
+from GPT3.nemo_utils import load_nemo_model, release_nemo_model
+from GPT3.convert_te_onnx_to_trt_onnx import replace_customop_qdq_with_onnx_qdq
+
+# HuggingFace utils
+from NNDF.logger import G_LOGGER
+from NNDF.models import _calculate_polygraphy_verbosity
+
+# ONNX conversion script
+
+# Set polygraphy logging level here.
+PG_LOGGER.module_severity = PG_LOGGER.INFO
+
+class MegatronGPTSingleInputExportableModel(MegatronGPTExportableModel):
+ """
+ Wrapper for MegatronGPTExportableModel to export ONNX with a single input
+ """
+
+ def __init__(self, model, max_seq_len):
+ super().__init__(model)
+ self.cfg = model.cfg
+ self.max_seq_len = max_seq_len
+
+ def forward(self, tokens):
+ def model_forward(tokens):
+ position_ids, attention_mask = self.get_position_ids_and_mask(tokens, self.max_seq_len)
+ assert tokens.shape == position_ids.shape
+ assert attention_mask.shape[2] == attention_mask.shape[3] == tokens.shape[1] == position_ids.shape[1]
+ return self.model.forward(
+ tokens=tokens.cuda(),
+ text_position_ids=position_ids.cuda(),
+ attention_mask=attention_mask.cuda(),
+ labels=None,
+ )
+
+ with torch.no_grad(), torch.inference_mode(), torch.autocast(
+ 'cuda', dtype=self.dtype
+ ), warnings.catch_warnings():
+ warnings.filterwarnings(action='ignore', category=torch.jit.TracerWarning, module=r'.*')
+ if self.fp8_enabled:
+ with transformer_engine.pytorch.onnx_export(self.fp8_enabled), transformer_engine.pytorch.fp8_autocast(
+ enabled=self.fp8_enabled, fp8_recipe=self.fp8_recipe
+ ):
+ output_tensor = model_forward(tokens)
+ else:
+ output_tensor = model_forward(tokens)
+ return output_tensor
+
+ def get_position_ids_and_mask(self, data, max_seq_len):
+ seq_len = data.size()[1]
+ # Attention mask (lower triangular).
+ attention_mask = torch.tril(torch.ones(
+ (1, max_seq_len, max_seq_len), device=data.device)).view(
+ 1, 1, max_seq_len, max_seq_len)
+
+ # Position ids.
+ position_ids = torch.arange(max_seq_len, dtype=torch.long,
+ device=data.device)
+ position_ids = position_ids[:seq_len].unsqueeze(0).expand_as(data)
+
+ # Convert attention mask to binary:
+ attention_mask = (attention_mask < 0.5)
+
+ return position_ids, attention_mask[:1, :1, :seq_len, :seq_len]
+
+ def input_example(self):
+ ids = self.model.tokenizer.text_to_ids("how is the weather on Sunday morning?")
+ id_tensors = torch.unsqueeze(torch.LongTensor(ids), dim=0)
+ G_LOGGER.debug(f"Calling input_example shape {id_tensors.shape}")
+ return id_tensors, # return a tuple
+
+ @property
+ def input_types(self) -> Optional[Dict[str, NeuralType]]:
+ return {
+ "input_ids": NeuralType(('B', 'T'), ChannelType()),
+ }
+
+ @property
+ def input_names(self) -> List[str]:
+ return ['input_ids']
+
+def get_trtexec_cmd(onnx_fpath, cfg, bs):
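+    """
+    Build an equivalent trtexec command line (used only for debug logging) that mirrors the
+    optimization profiles and precision flags passed to the Polygraphy engine builder.
+    """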
+ max_seq_len = cfg.model.max_seq_len
+ opt_seq_len = cfg.trt_export_options.opt_seq_len if cfg.trt_export_options.opt_seq_len else (max_seq_len // 2)
+ trtexec_cmd = f"trtexec --onnx={onnx_fpath}"
+ min_shapes = f"--minShapes=input_ids:{bs}x1"
+ opt_shapes = f"--optShapes=input_ids:{bs}x{opt_seq_len}"
+ max_shapes = f"--maxShapes=input_ids:{bs}x{max_seq_len}"
+ if not cfg.use_one_input:
+ min_shapes += f",position_ids:{bs}x1"
+ opt_shapes += f",position_ids:{bs}x{opt_seq_len}"
+ max_shapes += f",position_ids:{bs}x{max_seq_len}"
+ if not cfg.trt_export_options.use_fp8:
+ min_shapes += ",attention_mask:1x1x1x1"
+ opt_shapes += f",attention_mask:1x1x{opt_seq_len}x{opt_seq_len}"
+ max_shapes += f",attention_mask:1x1x{max_seq_len}x{max_seq_len}"
+
+ if cfg.use_cache:
+ trtexec_cmd += " --profile=0"
+ nbheads, headsize = cfg.model.nb_heads, cfg.model.head_size
+ input_k = get_past_key_name('*')
+ input_v = get_past_value_name('*')
+ # ("sequence", "batch", nbheads, headsize)
+ min_shapes += f",{input_k}:0x{bs}x{nbheads}x{headsize},{input_v}:0x{bs}x{nbheads}x{headsize}"
+ opt_shapes += f",{input_k}:0x{bs}x{nbheads}x{headsize},{input_v}:0x{bs}x{nbheads}x{headsize}"
+ max_shapes += f",{input_k}:0x{bs}x{nbheads}x{headsize},{input_v}:0x{bs}x{nbheads}x{headsize}"
+ trtexec_cmd += f" {min_shapes} {opt_shapes} {max_shapes}"
+
+ if cfg.use_cache:
+ trtexec_cmd += " --profile=1"
+
+ min_shapes = f"--minShapes=input_ids:{bs}x1"
+ opt_shapes = f"--optShapes=input_ids:{bs}x1"
+ max_shapes = f"--maxShapes=input_ids:{bs}x1"
+ if not cfg.use_one_input:
+ min_shapes += f",position_ids:{bs}x1"
+ opt_shapes += f",position_ids:{bs}x1"
+ max_shapes += f",position_ids:{bs}x1"
+ if not cfg.trt_export_options.use_fp8:
+ min_shapes += ",attention_mask:1x1x1x1"
+ opt_shapes += f",attention_mask:1x1x{opt_seq_len}x{opt_seq_len}"
+ max_shapes += f",attention_mask:1x1x{max_seq_len}x{max_seq_len}"
+
+ nbheads, headsize = cfg.model.nb_heads, cfg.model.head_size
+ input_k = get_past_key_name('*')
+ input_v = get_past_value_name('*')
+ # ("sequence", "batch", nbheads, headsize)
+ min_shapes += f",{input_k}:1x{bs}x{nbheads}x{headsize},{input_v}:1x{bs}x{nbheads}x{headsize}"
+ opt_shapes += f",{input_k}:{opt_seq_len}x{bs}x{nbheads}x{headsize},{input_v}:{opt_seq_len}x{bs}x{nbheads}x{headsize}"
+ max_shapes += f",{input_k}:{max_seq_len - 1}x{bs}x{nbheads}x{headsize},{input_v}:{max_seq_len - 1}x{bs}x{nbheads}x{headsize}"
+ trtexec_cmd += f" {min_shapes} {opt_shapes} {max_shapes}"
+
+ use_tf32 = cfg.trt_export_options.use_tf32
+ use_fp8 = cfg.trt_export_options.use_fp8
+ use_fp16 = cfg.trt_export_options.use_fp16
+ use_bf16 = cfg.trt_export_options.use_bf16
+ use_strongly_typed = cfg.trt_export_options.use_strongly_typed
+ sparse = cfg.trt_export_options.sparse
+ trtexec_cmd += " --noTF32" if not use_tf32 else ""
+ trtexec_cmd += " --fp8" if (use_fp8 and not use_strongly_typed) else ""
+ trtexec_cmd += " --fp16" if (use_fp16 and not use_strongly_typed) else ""
+ trtexec_cmd += " --bf16" if (use_bf16 and not use_strongly_typed) else ""
+ trtexec_cmd += " --stronglyTyped" if use_strongly_typed else ""
+ trtexec_cmd += " --sparsity=enable" if sparse else ""
+ trtexec_cmd += " --timingCacheFile=functional.cache"
+ return trtexec_cmd
+
+
+def add_zero_point(g, base_name, dtype):
+ """Add Q/DQ zero-point constant"""
+ _zp_fp8_value = onnx.helper.make_tensor(base_name + "_zp_fp8_value", dtype, (1,), [0.0])
+ zero_point_fp8 = gs.Variable(base_name + "_zero_point", dtype=dtype, shape=(1,))
+ zero_point_const = gs.Node(op="Constant", name= base_name + "_zero_point_const", inputs=[], outputs=[zero_point_fp8], attrs={"value": _zp_fp8_value})
+ g.nodes.append(zero_point_const)
+ return zero_point_fp8
+
+
+def add_scale(g, base_name, dtype, value):
+ """Add Q/DQ scale constant"""
+ _scale_value = onnx.helper.make_tensor(base_name + "_scale_value", dtype, (1,), [value])
+ scale = gs.Variable(base_name + "_scale", dtype=dtype, shape=(1,))
+ scale_const = gs.Node(op="Constant", name=base_name + "_scale_const", inputs=[], outputs=[scale], attrs={"value": _scale_value})
+ g.nodes.append(scale_const)
+ return scale
+
+
+def add_cast(g, inp, outp_dtype, cast_name):
+ """Add Cast operator """
+ cast_outp = gs.Variable(cast_name+"_out", dtype=outp_dtype)
+ new_cast = gs.Node(
+ op="Cast",
+ name=cast_name,
+ inputs=[inp],
+ outputs=[cast_outp],
+ attrs={"to": outp_dtype}
+ )
+ g.nodes.append(new_cast)
+ return cast_outp
+
+
+def add_q(g, inp, hp_dtype, q_dtype, q_name=None):
+ """Add QuantizeLinear operator"""
+ scale_dtype = hp_dtype
+ q_name = q_name or f"{inp.name}_qfp8"
+ q_out = gs.Variable(q_name, dtype=q_dtype)
+ q = gs.Node(op="QuantizeLinear", name=q_name,
+ inputs=[
+ inp,
+ add_scale(g, inp.name, scale_dtype, 1.0),
+ add_zero_point(g, inp.name, q_dtype)
+ ],
+ outputs=[q_out])
+ g.nodes.append(q)
+ return q_out
+
+
+def add_dq(g, inp, hp_dtype, dq_dtype):
+ """Add DequantizeLinear operator"""
+ dq_name = f"{inp.name}_dqfp8"
+ scale_dtype = hp_dtype
+ dq_out = gs.Variable(dq_name, dtype=hp_dtype)
+ dq = gs.Node(op="DequantizeLinear", name=dq_name,
+ inputs=[
+ inp,
+ add_scale(g, inp.name, scale_dtype, 1.0),
+ add_zero_point(g, inp.name, dq_dtype)],
+ outputs=[dq_out])
+ g.nodes.append(dq)
+ return dq_out
+
+
+def quantize_all_bmms(g, dtype_high_prec, use_fp8_storage):
+ """Quantize the inputs of all batched matmul operators"""
+
+ def quantize_bmm(g, bmm, dtype_high_prec):
+ assert len(bmm.inputs) == 2
+ dq_outputs = []
+ for i in range(len(bmm.inputs)):
+ if i == 0 or not use_fp8_storage:
+ q_outp = add_q(g, bmm.inputs[i], dtype_high_prec, onnx.TensorProto.FLOAT8E4M3FN)
+ dq_out = add_dq(g, q_outp, dtype_high_prec, onnx.TensorProto.FLOAT8E4M3FN)
+ else:
+                # bmm.inputs[1] is the input from K or V, which we don't quantize if it is stored
+                # in the cache in a quantized type.
+ dq_out = add_dq(g, bmm.inputs[i], dtype_high_prec, onnx.TensorProto.FLOAT8E4M3FN)
+ dq_outputs.append(dq_out)
+ bmm.inputs = dq_outputs
+
+ bmm_nodes = [node for node in g.nodes if node.op == "MatMul"]
+ G_LOGGER.info("Quantizing attention BMMs")
+ G_LOGGER.info(f"Found {len(bmm_nodes)} MatMul operator nodes")
+ for bmm in bmm_nodes:
+        # Do not quantize the MatMul at the head of GPT3.
+ if bmm.name == "/model/module/MatMul":
+ continue
+ quantize_bmm(g, bmm, dtype_high_prec)
+
+
+# Use ONNX graphsurgeon to add KV-cache to ONNX file
+# Reusing the HF demo names.
+def get_past_key_name(layer_id):
+ past_key_name = f"past_key_values.{layer_id}.decoder.key"
+ return past_key_name
+
+def get_past_value_name(layer_id):
+ past_value_name = f"past_key_values.{layer_id}.decoder.value"
+ return past_value_name
+
+def get_past_shape(nbheads, headsize):
+ return ("sequence_past_decoder_length", "batch", nbheads, headsize)
+
+def get_present_key_name(layer_id: int):
+ present_key_name = f"present_key_values.{layer_id}.decoder.key"
+ return present_key_name
+
+def get_present_value_name(layer_id: int):
+ present_value_name = f"present_key_values.{layer_id}.decoder.value"
+ return present_value_name
+
+def get_present_shape(nbheads, headsize):
+ return ("sequence_present_decoder_length", "batch", nbheads, headsize)
+
+def get_new_key_name(layer_id: int):
+ new_key_name = f"new_key_values.{layer_id}.decoder.key"
+ return new_key_name
+
+def get_new_value_name(layer_id: int):
+ new_value_name = f"new_key_values.{layer_id}.decoder.value"
+ return new_value_name
+
+def get_new_shape(nbheads, headsize):
+ return ("sequence", "batch", nbheads, headsize)
+
+def quantize_new_k_v(g, key_new, value_new, hp_dtype):
+ key_new_q_outp = add_q(g, key_new, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN)
+ key_new_dq_out = add_dq(g, key_new_q_outp, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN)
+ value_new_q_outp = add_q(g, value_new, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN)
+ value_new_dq_out = add_dq(g, value_new_q_outp, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN)
+ return key_new_dq_out, value_new_dq_out
+
+def add_kvcache_for(
+ g, layer_id, qkv_split, nbheads, headsize, dtype, kv_output_policy, hp_dtype, use_fp8_storage, quantize_bmms):
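+    """
+    Rewire a single transformer layer's QKV split so that the new K/V tensors are concatenated with
+    past K/V graph inputs, and expose the cache tensors as graph outputs according to `kv_output_policy`.
+    """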
+ _, key_new, value_new = qkv_split.outputs
+ key_consumers = [c for c in key_new.outputs]
+ value_consumers = [c for c in value_new.outputs]
+
+ def add_graph_past_inputs(use_fp8_storage):
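+        # Register past K/V tensors as graph inputs; when they are stored in FP8 and the BMMs are not
+        # quantized, insert DQ nodes so downstream consumers see the high-precision dtype.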
+ past_key = gs.Variable(
+ name=get_past_key_name(layer_id),
+ dtype=dtype,
+ shape=get_past_shape(nbheads, headsize))
+ past_value = gs.Variable(
+ name=get_past_value_name(layer_id),
+ dtype=dtype,
+ shape=get_past_shape(nbheads, headsize))
+ g.inputs.append(past_key)
+ g.inputs.append(past_value)
+
+ if use_fp8_storage and not quantize_bmms:
+ past_key_dq = add_dq(g, past_key, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN)
+ past_value_dq = add_dq(g, past_value, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN)
+ return past_key_dq, past_value_dq
+
+ return past_key, past_value
+
+ def add_concat(concat_name, input0, input1, output_name):
+ concat_out = gs.Variable(
+ output_name,
+ dtype=dtype,
+ shape=get_present_shape(nbheads, headsize))
+
+ concat = gs.Node(op="Concat", name=concat_name,
+ inputs=[input0, input1], outputs=[concat_out],
+ attrs={"axis": 0})
+ g.nodes.append(concat)
+ return concat_out
+
+ def add_cache_outputs(kv_output_policy, use_fp8_storage, hp_dtype):
+ if kv_output_policy == "kv_cache_concat":
+ new_key_output, new_value_output = key_concat_out, value_concat_out
+ elif kv_output_policy == "kv_new":
+ key_new.dtype = dtype
+ key_new.shape = get_new_shape(nbheads, headsize)
+ key_new.name = get_new_key_name(layer_id)
+ value_new.dtype = dtype
+ value_new.shape = get_new_shape(nbheads, headsize)
+ value_new.name = get_new_value_name(layer_id)
+
+ if use_fp8_storage:
+ key_new_q = add_q(g, key_new, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN,
+ f"{key_new.name}_qfp8")
+ value_new_q = add_q(g, value_new, hp_dtype, onnx.TensorProto.FLOAT8E4M3FN,
+ f"{value_new.name}_qfp8")
+ new_key_output, new_value_output = key_new_q, value_new_q
+ else:
+ new_key_output, new_value_output = key_new, value_new
+ else:
+ raise ValueError(f"Unsupported kv_output_policy: {kv_output_policy}")
+ g.outputs.append(new_key_output)
+ g.outputs.append(new_value_output)
+ return new_key_output, new_value_output
+
+ past_key, past_value = add_graph_past_inputs(use_fp8_storage)
+ new_key_output, new_value_output = add_cache_outputs(kv_output_policy, use_fp8_storage, hp_dtype)
+
+ if quantize_bmms:
+ if use_fp8_storage:
+ key_new = new_key_output
+ value_new = new_value_output
+ else:
+ key_new, value_new = quantize_new_k_v(g, key_new, value_new, hp_dtype)
+ key_concat_out = add_concat(f"key.{layer_id}.concat",
+ past_key, key_new, get_present_key_name(layer_id))
+ value_concat_out = add_concat(f"value.{layer_id}.concat",
+ past_value, value_new, get_present_value_name(layer_id))
+
+ for c in key_consumers:
+ c.inputs[0] = key_concat_out
+ for c in value_consumers:
+ c.inputs[0] = value_concat_out
+
+
+def add_kvcache(g, nbheads, headsize, dtype, kv_output_policy, hp_dtype, use_fp8_storage, quantize_bmms):
+ """Add KV-cache to each Transformer layer's QKV split """
+ G_LOGGER.info("Adding KV-cache")
+ qkv_split_nodes = [node for node in g.nodes if node.op == "Split"]
+ G_LOGGER.debug(f"Found {len(qkv_split_nodes)} QKV-split nodes")
+
+ for layer_id, qkv_split in enumerate(qkv_split_nodes):
+ add_kvcache_for(
+ g, layer_id, qkv_split, nbheads, headsize, dtype, kv_output_policy, hp_dtype, use_fp8_storage, quantize_bmms)
+
+ G_LOGGER.debug("Done adding cache operations")
+ return len(qkv_split_nodes)
+
+
+def normalize_dyn_axes_to_hf_names(g, vocab_size):
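+    """Rename graph inputs/outputs and their dynamic axes to the names used by the HuggingFace demo."""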
+ g.inputs[0].name = "input_ids"
+ g.inputs[0].shape = ("batch", "sequence")
+ if len(g.inputs) > 1:
+ g.inputs[1].name = "position_ids"
+ g.inputs[1].shape = ("batch", "sequence")
+ g.outputs[0].name = "logits"
+ g.outputs[0].shape = ("batch", "sequence", vocab_size)
+ G_LOGGER.debug("Done normalizing dynamic axes names to HuggingFace demo names")
+
+
+def process_onnx(
+ kv_output_policy,
+ onnx_input_fpath,
+ onnx_output_fpath,
+ separate_param_files,
+ use_cache,
+ quantize_bmms,
+ nbheads, headsize, vocab_size, dtype, hp_dtype, use_fp8_storage):
+ """
+    Process an ONNX model: add KV-cache inputs and outputs, and save the resulting model to the specified path.
+ """
+ G_LOGGER.info(f"Importing {onnx_input_fpath}... this will take some time")
+ g = gs.import_onnx(onnx.load(onnx_input_fpath))
+ normalize_dyn_axes_to_hf_names(g, vocab_size)
+ num_layers = 0
+ if use_cache:
+ num_layers = add_kvcache(g, nbheads, headsize, dtype, kv_output_policy, hp_dtype, use_fp8_storage, quantize_bmms)
+ g.cleanup().toposort()
+
+ if quantize_bmms:
+ quantize_all_bmms(g, hp_dtype, use_fp8_storage)
+ g.cleanup().toposort()
+
+ G_LOGGER.info(f"Exporting {onnx_output_fpath}")
+ model = gs.export_onnx(g)
+ G_LOGGER.info(f"Saving {onnx_output_fpath}")
+ if separate_param_files:
+ onnx.save_model(model, onnx_output_fpath, save_as_external_data=True,
+ all_tensors_to_one_file = False, convert_attribute=False)
+ else:
+ onnx.save_model(model, onnx_output_fpath, save_as_external_data=False)
+ G_LOGGER.info(f"Done: {onnx_output_fpath}")
+ return num_layers
+
+
+def create_dir_if_not_exist(path):
+ dir = os.path.dirname(path)
+ if not os.path.exists(dir) and dir != "":
+ G_LOGGER.info(f"Making directory {dir}")
+ os.makedirs(dir)
+
+
+class NeMoConverter():
+ """
+ A class to convert a NeMo model to an ONNX file, and convert an ONNX file to a TensorRT engine.
+ """
+ def __init__(self, cfg, model_type=ModelPT):
+ self.model_type = model_type
+ self.cfg = cfg
+ self.model = None
+ self.export_envvars()
+
+ def export_envvars(self) -> None:
+ if self.cfg.trt_export_options.use_fp8:
+ G_LOGGER.info(
+ f"Setting max sequence length to {self.cfg.model.max_seq_len}"
+ )
+ os.environ["NVTE_ONNX_KVCACHE_MAX_SEQ_LEN"] = str(
+ self.cfg.model.max_seq_len
+ )
+
+ def nemo_to_onnx(self) -> str:
+ """
+ Convert a NeMo model to an ONNX model, return the file path to the ONNX model.
+ """
+ if self.model == None:
+ self.model = load_nemo_model(self.cfg, self.model_type)
+
+ if not isinstance(self.model, Exportable):
+ G_LOGGER.error("Your NeMo model class ({}) is not Exportable.".format(self.model.__class__.__name__))
+ sys.exit(1)
+
+ if hasattr(self.model.cfg, "fp8") and self.model.cfg.fp8 == True:
+ if self.cfg.trt_export_options.use_fp8 == False:
+ G_LOGGER.info("Turning on trt_export_options.use_fp8 because NeMo model is in FP8 precision.")
+ self.cfg.trt_export_options.use_fp8 = True
+ else:
+ if self.cfg.trt_export_options.use_fp8 == True:
+ G_LOGGER.info("Turning off trt_export_options.use_fp8 because NeMo model is not in FP8 precision.")
+ self.cfg.trt_export_options.use_fp8 = False
+
+ onnx_out = self.cfg.onnx_model_file
+ create_dir_if_not_exist(onnx_out)
+ check_trace = self.cfg.onnx_export_options.runtime_check
+ onnx_names = []
+
+ dynamic_axes={
+ 'input_ids': {0: "batch", 1: "sequence"},
+ 'position_ids': {0: "batch", 1: "sequence"},
+ 'logits': {0: "batch", 1: "sequence"},
+ }
+
+ if self.cfg.use_one_input:
+ # Use a wrapper class to get rid of inputs other than input_ids.
+ self.model = MegatronGPTSingleInputExportableModel(self.model, self.cfg.model.max_seq_len)
+ del dynamic_axes['position_ids']
+
+ try:
+ self.model.to(device=self.cfg.onnx_export_options.device).freeze()
+ self.model.eval()
+ if not self.cfg.trt_export_options.use_fp8:
+ G_LOGGER.info("Exporting ONNX with attention_mask")
+ dynamic_axes['attention_mask'] = {2: "sequence", 3: "sequence"}
+
+ self.model.export(
+ onnx_out,
+ onnx_opset_version=self.cfg.onnx_export_options.onnx_opset,
+ do_constant_folding=self.cfg.onnx_export_options.do_constant_folding,
+ dynamic_axes=dynamic_axes,
+ check_trace=check_trace,
+ check_tolerance=self.cfg.onnx_export_options.check_tolerance,
+ verbose=self.cfg.onnx_export_options.verbose,
+ )
+ onnx_names = [augment_filename(onnx_out, subnet_name) for subnet_name in self.model.list_export_subnets()]
+
+ except Exception as e:
+ G_LOGGER.error(
+ "Export failed. Please make sure your NeMo model class ({}) has working export() and that you have the latest NeMo package installed with [all] dependencies.".format(
+ self.model.__class__
+ )
+ )
+ raise e
+
+ release_nemo_model(self.model)
+ assert len(onnx_names) == 1
+ os.rename(onnx_names[0], onnx_out)
+ return onnx_out
+
+ def prune_onnx(self, input_path) -> str:
+ """
+        Prune the input ONNX model to a structured sparsity pattern using Polygraphy.
+ """
+ if not self.cfg.trt_export_options.sparse:
+ G_LOGGER.warning(f"Model pruning is enabled but sparsity is not enabled for TRT engine builder.")
+
+ ibname = os.path.basename(input_path)
+ obname = "pruned." + ibname
+ opath = os.path.join(os.path.dirname(input_path), obname)
+ o_data_real_path = opath + "_data"
+ if os.path.exists(opath) and os.path.exists(o_data_real_path):
+ return opath
+
+ o_data_bname = os.path.basename(o_data_real_path)
+ cmds = f"polygraphy surgeon prune {input_path} -o {opath} --save-external-data {o_data_bname}"
+ G_LOGGER.info(f"Prune ONNX model with: {cmds}")
+ G_LOGGER.info(f"This may take a while...")
+ sp.run(shlex.split(cmds), check=True, stdout=sp.PIPE, stderr=sp.STDOUT)
+ return opath
+
+
+ def create_onnx(self, onnx_input_fpath, onnx_output_fpath, kv_output_policy="kv_new"):
+ """
+ Create an ONNX model with modifications from `onnx_input_fpath`, save the ONNX model to `onnx_output_fpath`.
+        The ONNX model is modified to use a KV-cache and/or to quantize the attention batched matrix-multiplication ops.
+ No return value for this function.
+ """
+ assert os.path.splitext(onnx_input_fpath)[1] == ".onnx", "Input ONNX file must end with '.onnx'."
+ assert os.path.splitext(onnx_output_fpath)[1] == ".onnx", "Output ONNX file must end with '.onnx'."
+
+ quantize_bmms = self.cfg.onnx_export_options.quantize_bmms
+ use_cache = self.cfg.use_cache
+ nbheads, headsize = self.cfg.model.nb_heads, self.cfg.model.head_size
+ hp_dtype = onnx.TensorProto.BFLOAT16 if self.cfg.trt_export_options.use_bf16 else onnx.TensorProto.FLOAT16
+ dtype = hp_dtype
+ if self.cfg.onnx_export_options.use_fp8_storage:
+ dtype = onnx.TensorProto.FLOAT8E4M3FN
+ assert nbheads * headsize == self.cfg.model.hidden_size, "Model hidden size does not match."
+ num_qkvs = process_onnx(kv_output_policy,
+ onnx_input_fpath, onnx_output_fpath, separate_param_files=True,
+ use_cache=use_cache, quantize_bmms=quantize_bmms,
+ nbheads=nbheads, headsize=headsize, vocab_size=self.cfg.model.vocab_size, dtype=dtype, hp_dtype=hp_dtype, use_fp8_storage=self.cfg.onnx_export_options.use_fp8_storage)
+
+ G_LOGGER.info(f"Number of QKV subgraphs = {num_qkvs}, number of layers = {self.cfg.model.num_layers}")
+ if num_qkvs != self.cfg.model.num_layers:
+ raise ValueError("Number of QKV subgraphs must be the same as number of layers in the model.")
+ G_LOGGER.info(f"Saved KV-cache onnx to {onnx_output_fpath}")
+
+
+ # Reads an onnx file and creates a trt engine file
+ def onnx_to_trt(self, onnx_fpath, trt_fpath):
+ """
+ Convert an ONNX model from `onnx_fpath` to a TensorRT engine, and save the result to `trt_fpath`.
+ """
+ # Set up polygraphy config
+ use_tf32 = self.cfg.trt_export_options.use_tf32
+ use_fp16 = self.cfg.trt_export_options.use_fp16
+ use_fp8 = self.cfg.trt_export_options.use_fp8
+ use_bf16 = self.cfg.trt_export_options.use_bf16
+ strongly_typed = self.cfg.trt_export_options.use_strongly_typed
+ sparse = self.cfg.trt_export_options.sparse
+ if sparse and not self.cfg.onnx_export_options.prune:
+ G_LOGGER.warning("Sparsity for TRT engine builder is enabled, but model pruning is not.")
+
+ # Create optimization profiles
+ bs = self.cfg.batch_size
+ max_seq_len = self.cfg.model.max_seq_len
+ opt_seq_len = self.cfg.trt_export_options.opt_seq_len if self.cfg.trt_export_options.opt_seq_len else (max_seq_len // 2)
+ profile_non_kv = Profile()
+ profile_non_kv.add(name="input_ids", min=(bs, 1), opt=(bs, opt_seq_len), max=(bs, max_seq_len)) # (batch, sequence)
+ if not self.cfg.use_one_input:
+ profile_non_kv.add(name="position_ids", min=(bs, 1), opt=(bs, opt_seq_len), max=(bs, max_seq_len)) # (batch, sequence)
+ # For FP8 precision, attention mask is created inside transformer_engine.
+ if not self.cfg.trt_export_options.use_fp8:
+ profile_non_kv.add(name="attention_mask", min=(1, 1, 1, 1), opt=(1, 1, opt_seq_len, opt_seq_len), max=(1, 1, max_seq_len, max_seq_len)) # (1, 1, sequence, sequence)
+
+ num_layers, nbheads, headsize = self.cfg.model.num_layers, self.cfg.model.nb_heads, self.cfg.model.head_size
+ if self.cfg.use_cache:
+ for i in range(num_layers):
+ input_k = get_past_key_name(i)
+ input_v = get_past_value_name(i)
+ # (sequence, batch, nbheads, headsize)
+ profile_non_kv.add(name=input_k, min=(0, bs, nbheads, headsize), opt=(0, bs, nbheads, headsize), max=(0, bs, nbheads, headsize))
+ profile_non_kv.add(name=input_v, min=(0, bs, nbheads, headsize), opt=(0, bs, nbheads, headsize), max=(0, bs, nbheads, headsize))
+
+ profiles = [profile_non_kv]
+
+ # When enabling KV-cache, use first profile for context phase and second profile for generation phase
+ if self.cfg.use_cache:
+ profile_kv = Profile()
+ profile_kv.add(name="input_ids", min=(bs, 1), opt=(bs, 1), max=(bs, 1)) # (batch, sequence)
+ if not self.cfg.use_one_input:
+ profile_kv.add(name="position_ids", min=(bs, 1), opt=(bs, 1), max=(bs, 1)) # (batch, sequence)
+ # For FP8 precision, attention mask is created inside transformer_engine.
+ if not self.cfg.trt_export_options.use_fp8:
+ profile_kv.add(name="attention_mask", min=(1, 1, 1, 1), opt=(1, 1, opt_seq_len, opt_seq_len), max=(1, 1, max_seq_len, max_seq_len)) # (1, 1, sequence, sequence)
+
+ assert num_layers > 0
+ nbheads, headsize = self.cfg.model.nb_heads, self.cfg.model.head_size
+ for i in range(num_layers):
+ input_k = get_past_key_name(i)
+ input_v = get_past_value_name(i)
+ # (sequence, batch, nbheads, headsize)
+ profile_kv.add(name=input_k, min=(1, bs, nbheads, headsize), opt=(opt_seq_len, bs, nbheads, headsize), max=(max_seq_len-1, bs, nbheads, headsize))
+ profile_kv.add(name=input_v, min=(1, bs, nbheads, headsize), opt=(opt_seq_len, bs, nbheads, headsize), max=(max_seq_len-1, bs, nbheads, headsize))
+ profiles = [profile_kv, profile_non_kv]
+
+
+ # Read about these arguments here:
+ # https://github.com/NVIDIA/TensorRT/blob/main/tools/Polygraphy/polygraphy/backend/trt/config.py
+ # Note that the precision args below *enable*, not *require*, the specified precision
+ preview_features = []
+
+ trt_config = CreateConfig(
+ tf32= use_tf32,
+ fp16=False if strongly_typed else use_fp16,
+ bf16=False if strongly_typed else use_bf16,
+ sparse_weights=sparse,
+ profiles=profiles,
+ precision_constraints=None if strongly_typed else "obey",
+ preview_features=preview_features,
+ fp8=False if strongly_typed else use_fp8,
+ load_timing_cache=self.cfg.trt_export_options.timing_cache,
+ )
+
+ # Print out trtexec command for debugging
+ G_LOGGER.debug(" >>> trtexec command for debugging:")
+ G_LOGGER.debug(get_trtexec_cmd(onnx_fpath, self.cfg, bs))
+
+ with PG_LOGGER.verbosity(_calculate_polygraphy_verbosity()):
+ G_LOGGER.info(f"Reading ONNX file at {onnx_fpath}")
+ network = NetworkFromOnnxPath(onnx_fpath, strongly_typed=strongly_typed)
+ G_LOGGER.info("Building TRT engine")
+ engine = engine_from_network(network, config=trt_config)
+ G_LOGGER.info(f"Saving TRT engine to {trt_fpath}")
+ save_engine(engine, trt_fpath)
+
+ @staticmethod
+    def _resolve_opset19_paths(onnx_fpath, results_path: Optional[str] = None) -> Tuple[str, str]:
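+        """Return the output directory (`results_path` when given, otherwise the ONNX file's folder) and the filename of `onnx_fpath`."""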
+ foldername, filename = os.path.split(onnx_fpath)
+        return (foldername if not results_path else results_path), filename
+
+ @staticmethod
+ def get_opset19_onnx_fpath(onnx_fpath, results_path: Optional[str] = None) -> str:
+ suffix = ".opset19.onnx"
+ results_path, filename = NeMoConverter._resolve_opset19_paths(
+ onnx_fpath, results_path
+ )
+ return os.path.join(results_path, os.path.splitext(filename)[0] + suffix)
+
+
+ @staticmethod
+ def onnx_to_opset19(onnx_fpath, results_path: Optional[str] = None) -> str:
+ """
+        Convert an ONNX model `onnx_fpath` to use standard opset-19 Q/DQ nodes, and return the file path
+        of the resulting ONNX model if a conversion is performed, otherwise return `None`.
+ """
+ mappings = replace_customop_qdq_with_onnx_qdq(
+ [onnx_fpath],
+ NeMoConverter._resolve_opset19_paths(onnx_fpath, results_path)[0],
+ create_netron_compatible_model=False,
+ remove_cast_before_q=False,
+ remove_cast_after_dq=False,
+ change_qdq_scale_precision="",
+ )
+ if (
+ (not mappings)
+ or (onnx_fpath not in mappings)
+ or (mappings[onnx_fpath] == None)
+ ):
+ G_LOGGER.error(f"Opset19 onnx file conversion failed for {onnx_fpath}.")
+ assert False
+
+ G_LOGGER.info(f"Converted {onnx_fpath} to {mappings[onnx_fpath]} for opset19.")
+ return mappings[onnx_fpath]
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='NeMo export script arguments', add_help=True)
+ parser.add_argument(
+ "--nemo-model",
+ help="Set a NeMo model to be used.",
+ required=False,
+ default=None,
+ type=str,
+ )
+ parser.add_argument(
+ "--nemo-checkpoint",
+ help="Set a NeMo checkpoint to be used.",
+ required=False,
+ default=None,
+ type=str,
+ )
+ parser.add_argument(
+ "--onnx-model",
+ help="A path to load an ONNX model for conversion.",
+ required=False,
+ default=None,
+ type=str,
+ )
+ parser.add_argument(
+ "--save-onnx-dir",
+ help="A directory to save the generated ONNX model. Must be writable.",
+ required=True,
+ )
+ parser.add_argument(
+ "--opset19",
+ action="store_true",
+ help="If set, the ONNX will be converted to opset19.",
+ default=False
+ )
+ parser.add_argument(
+ "--use-cache",
+ action="store_true",
+ help="If set, the ONNX will have KV-cache inputs and outputs.",
+ default=False
+ )
+ parser.add_argument(
+ "--quantize-bmms",
+ help="Quantize attention BMMs",
+ action="store_true",
+ default=False,
+ )
+ parser.add_argument(
+ "--save-engine",
+ required=False,
+        help="If set to a path, a TensorRT engine will be built from ONNX and saved to the path.",
+ )
+ parser.add_argument(
+ "--fp8",
+ action="store_true",
+ help="Use FP8 precision during conversion.",
+ default=False
+ )
+ parser.add_argument(
+ "--fp16",
+ action="store_true",
+ help="Use FP16 precision during conversion.",
+ default=False
+ )
+ parser.add_argument(
+ "--bf16",
+ action="store_true",
+ help="Use BF16 precision during conversion.",
+ default=False
+ )
+ parser.add_argument(
+ "--extra-configs",
+ required=False,
+        help='Use this flag to set fields specified in config.yaml with a format of --extra-configs="key=value[ key=value]*". Values specified by this flag will not override any value set from other flags.',
+ default=None,
+ type=str,
+ )
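+    # Illustrative example (the keys must already exist in config.yaml; the values shown are placeholders):
+    #   --extra-configs="trt_export_options.opt_seq_len=128 onnx_export_options.prune=true"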
+ args = parser.parse_args()
+ return args
+
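+# Illustrative invocation (paths are placeholders; all flags correspond to parse_args() above):
+#   python3 nemo_export.py --nemo-model /path/to/model.nemo --save-onnx-dir ./onnx_out \
+#       --fp16 --use-cache --save-engine ./gpt.engine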
+def main():
+ G_LOGGER.setLevel(level=G_LOGGER.INFO)
+
+ config_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "config.yaml")
+ cfg = omegaconf.OmegaConf.load(config_path)
+ G_LOGGER.info(f"Loaded configs = {cfg}")
+
+ args = parse_args()
+ if (args.nemo_model != None or args.nemo_checkpoint != None) and args.onnx_model != None:
+ G_LOGGER.error("NeMo model and ONNX model cannot be both set.")
+ exit(1)
+
+ if args.nemo_model == None and args.nemo_checkpoint == None and args.onnx_model == None:
+ G_LOGGER.error("Either one of --nemo-model, --nemo-checkpoint, or --onnx-model needs to be set.")
+ exit(1)
+
+ if args.extra_configs != None:
+ kwargs = args.extra_configs.split(" ")
+ for kwarg in kwargs:
+ kw = kwarg.split("=")
+ if len(kw) != 2:
+                raise ValueError(f'Arg {kwarg} is not in the format "key=value"')
+ def nested_set(dic, keys, value):
+ for i in range(len(keys)):
+ if not hasattr(dic, keys[i]):
+ raise ValueError(f"Cannot find key {keys[:i+1]} in the config.")
+ if i == len(keys) - 1:
+ dic[keys[i]] = value
+ else:
+ dic = dic[keys[i]]
+
+ G_LOGGER.info(f"Setting {kw[0]} to {kw[1]}")
+ nested_set(cfg, kw[0].split("."), kw[1])
+ G_LOGGER.info(f"Modified Configs = {cfg}")
+
+ # Set precision for conversion
+ if args.fp16:
+ cfg.trainer.precision = "16"
+ cfg.trt_export_options.use_fp16 = True
+ elif args.bf16:
+ cfg.trainer.precision = "bf16"
+ cfg.trt_export_options.use_bf16 = True
+ else:
+ cfg.trainer.precision = "32"
+
+ if args.fp8:
+ cfg.trt_export_options.use_fp8 = True
+
+ if args.quantize_bmms:
+ cfg.onnx_export_options.quantize_bmms = True
+
+ if os.path.exists(args.save_onnx_dir) and not os.path.isdir(args.save_onnx_dir):
+ raise ValueError(f"{args.save_onnx_dir} is not a directory.")
+
+ cfg.onnx_model_file = os.path.join(args.save_onnx_dir, "model.onnx")
+ create_dir_if_not_exist(cfg.onnx_model_file)
+
+ # Convert NeMo model to ONNX model
+ converter = None
+ if args.nemo_model or args.nemo_checkpoint:
+ cfg.gpt_model_file = args.nemo_model
+ if args.nemo_checkpoint:
+ cfg.checkpoint_dir = os.path.dirname(args.nemo_checkpoint)
+ cfg.checkpoint_name = os.path.basename(args.nemo_checkpoint)
+ converter = NeMoConverter(cfg, MegatronGPTModel)
+ onnx_name = converter.nemo_to_onnx()
+ G_LOGGER.info(f"ONNX exported from NeMo {onnx_name}")
+ elif args.onnx_model:
+ onnx_name = args.onnx_model
+
+ # Convert Q/DQ nodes to use standard opset19 operators
+ if args.opset19:
+ op19_onnx = NeMoConverter.onnx_to_opset19(onnx_name, args.save_onnx_dir)
+ if op19_onnx != None:
+ G_LOGGER.info(f"Get opset19 onnx file {op19_onnx}")
+ onnx_name = op19_onnx
+
+ # Add KV cache to ONNX model
+ if cfg.use_cache:
+ G_LOGGER.info(f"Converting {onnx_name} with KV-cache support")
+ kv_output_policy = "kv_new"
+ new_dir = os.path.join(args.save_onnx_dir, f"{kv_output_policy}")
+ onnx_output_fpath = os.path.join(new_dir, onnx_name.split("/")[-1])
+ create_dir_if_not_exist(onnx_output_fpath)
+ if not converter:
+ converter = NeMoConverter(cfg, MegatronGPTModel)
+ converter.create_onnx(onnx_name, onnx_output_fpath, kv_output_policy)
+ onnx_name = onnx_output_fpath
+
+ if cfg.onnx_export_options.prune:
+ onnx_name = converter.prune_onnx(onnx_name)
+
+ # Convert ONNX model to TRT engine
+ if args.save_engine:
+ create_dir_if_not_exist(args.save_engine)
+ if not converter:
+ converter = NeMoConverter(cfg, MegatronGPTModel)
+ converter.onnx_to_trt(onnx_name, args.save_engine)
+
+if __name__ == '__main__':
+ main()
diff --git a/demo/NeMo/patch_te.sh b/demo/NeMo/patch_te.sh
new file mode 100644
index 00000000..4f060dd8
--- /dev/null
+++ b/demo/NeMo/patch_te.sh
@@ -0,0 +1,41 @@
+#!/bin/sh
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Sourcing messes up the directory detection with readlink.
+if [ ! "${0##*/}" = "patch_te.sh" ]; then
+ echo "Please run this patch script, don't source it." >&2
+ return 1
+fi
+
+NEMO_DIR=$(dirname "$(readlink -f "$0")")
+
+te_loc="$(pip show transformer_engine | grep '^Location' | awk '{print $2}')"
+cd "${te_loc}/transformer_engine" || {
+ echo "Could not locate transformer-engine python package. Please check if installation proceeded correctly."
+ exit 1
+}
+# Use sys.executable when calling pip within subprocess to recognize virtualenv.
+# If patch is already applied, skip it and proceed with the rest of the script, quit otherwise.
+# NOTE: patch needs to be updated to track the commit of TE in install.sh.
+OUT="$(patch --forward common/__init__.py <"${NEMO_DIR}"/transformer_engine.patch)" || echo "${OUT}" | grep "Skipping patch" -q || {
+ echo "Could not patch transformer engine because ${OUT}"
+ exit 1
+}
+unset OUT
+cd - || exit
+unset te_loc
diff --git a/demo/NeMo/requirements.txt b/demo/NeMo/requirements.txt
new file mode 100644
index 00000000..c715ed76
--- /dev/null
+++ b/demo/NeMo/requirements.txt
@@ -0,0 +1,13 @@
+nemo-toolkit[nlp]==1.17.0
+onnx==1.14.0
+protobuf==3.20.3
+onnxruntime==1.13.1
+transformers==4.27.0
+cuda-python==12.1.0
+setuptools==65.5.1
+tqdm
+--pre --extra-index-url https://download.pytorch.org/whl/cu121
+torch==2.1.0
+torchaudio==2.1.0
+torchvision==0.16.0
+onnx-graphsurgeon==0.3.27
diff --git a/demo/NeMo/run.py b/demo/NeMo/run.py
new file mode 100644
index 00000000..5ba00b5a
--- /dev/null
+++ b/demo/NeMo/run.py
@@ -0,0 +1,200 @@
+#
+# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Demonstrates TensorRT capabilities with networks trained by NeMo.
+Requires Python 3.7+ (Python 3.8.10 or later is recommended).
+"""
+
+import argparse
+import os
+import sys
+from typing import List, Tuple
+
+ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
+sys.path.append(ROOT_DIR)
+
+sys.path.append('../') # Include the directory one level up to reuse HuggingFace utils.
+from HuggingFace.run import (
+ Action,
+ NetworkScriptAction,
+ WRAPPER_LIST_ACTION,
+)
+from HuggingFace.NNDF.logger import G_LOGGER
+from HuggingFace.NNDF.general_utils import register_network_folders
+from HuggingFace.NNDF.cuda_bootstrapper import bootstrap_ld_library_path
+
+WRAPPER_RUN_ACTION = "run"
+WRAPPER_ACCURACY_ACTION = "accuracy"
+WRAPPER_BENCHMARK_ACTION = "benchmark"
+WRAPPER_ACTIONS = [WRAPPER_LIST_ACTION, WRAPPER_RUN_ACTION, WRAPPER_ACCURACY_ACTION, WRAPPER_BENCHMARK_ACTION]
+
+class ListAction(Action):
+ def __init__(self, networks: List[str], parser: argparse.ArgumentParser):
+ super().__init__(networks, parser)
+ self.networks = networks
+
+ def execute(self, args: argparse.Namespace):
+ print("Networks that are supported by NeMo Demo:")
+ [print(n) for n in self.networks]
+ return 0
+
+class RunAction(NetworkScriptAction):
+ def execute(self, args: argparse.Namespace):
+ module = self.load_script(args.script, args)
+ module.RUN_CMD._parser = self.parser
+
+ old_path = os.getcwd()
+ # Execute script in each relevant folder
+ try:
+ os.chdir(args.network)
+ _ = module.RUN_CMD()
+ finally:
+ os.chdir(old_path)
+
+ return 0
+
+ def add_args(self, parser: argparse.ArgumentParser):
+ super().add_args(parser)
+ run_group = parser.add_argument_group("run args")
+ run_group.add_argument("script", choices=self.PER_NETWORK_SCRIPTS)
+
+class BenchmarkAction(NetworkScriptAction):
+ def execute(self, args: argparse.Namespace):
+ module = self.load_script(args.script, args)
+ module.RUN_CMD._parser = self.parser
+
+ old_path = os.getcwd()
+ # Execute script in each relevant folder
+ try:
+ os.chdir(args.network)
+ _ = module.RUN_CMD()
+ finally:
+ os.chdir(old_path)
+
+ return 0
+
+ def add_args(self, parser: argparse.ArgumentParser):
+ super().add_args(parser)
+ benchmarking_group = parser.add_argument_group("benchmark args")
+ benchmarking_group.add_argument("script", choices=self.PER_NETWORK_SCRIPTS)
+ benchmarking_group.add_argument(
+ "--input-seq-len",
+ type=int,
+ help="Specify fixed input sequence length for perf benchmarking. Required for benchmark except when both input_profile_max and output_profile_max are provided for trt",
+ )
+ benchmarking_group.add_argument(
+ "--output-seq-len",
+ type=int,
+ help="Specify fixed output sequence length for perf benchmarking. Required for benchmark except when both input_profile_max and output_profile_max are provided for trt",
+ )
+
+class AccuracyAction(NetworkScriptAction):
+ def execute(self, args: argparse.Namespace):
+ module = self.load_script(args.script, args)
+ module.RUN_CMD._parser = self.parser
+
+ old_path = os.getcwd()
+ # Execute script in each relevant folder
+ try:
+ os.chdir(args.network)
+ _ = module.RUN_CMD()
+ finally:
+ os.chdir(old_path)
+
+ return 0
+
+ def add_args(self, parser: argparse.ArgumentParser):
+ super().add_args(parser)
+ accuracy_group = parser.add_argument_group("accuracy args")
+ accuracy_group.add_argument("script", choices=self.PER_NETWORK_SCRIPTS)
+ accuracy_group.add_argument(
+ "--task",
+ type=str,
+ default="lambada",
+ choices=["lambada"],
+ help="Specify which task to be used for accuracy check.",
+ )
+
+def get_action(
+ action_name: str, networks: List[str], parser: argparse.ArgumentParser
+) -> Action:
+ return {
+ WRAPPER_LIST_ACTION: ListAction,
+ WRAPPER_RUN_ACTION: RunAction,
+ WRAPPER_BENCHMARK_ACTION: BenchmarkAction,
+ WRAPPER_ACCURACY_ACTION: AccuracyAction,
+ }[action_name](networks, parser)
+
+def verify_python_version():
+ if sys.version_info.major < 3 or sys.version_info.minor <= 6:
+ raise RuntimeError("NeMo OSS Demo does not support Python <= 3.6 due to end-of-life.")
+ if sys.version_info.major < 3 or sys.version_info.minor < 8 or (sys.version_info.minor == 8 and sys.version_info.micro < 10):
+ G_LOGGER.warn("NeMo OSS Demo is not tested for Python < 3.8.10")
+
+def get_default_parser(
+ description: str = "", add_default_help=False
+) -> Tuple[argparse.ArgumentParser, bool]:
+ """
+    Returns the argparser used by main(). Allows toggling the default help message with a custom help flag
+    so that argparse does not throw SystemExit when --help is passed in. Useful for custom --help functionality.
+
+ Returns:
+ (argparse.ArgumentParser): argparser used by main()
+ """
+ # This variable is set so that usage errors don't show up in wrapper
+ parser = argparse.ArgumentParser(
+ conflict_handler="resolve",
+ description=description,
+ add_help=add_default_help,
+ prog="run.py",
+ )
+
+ required_group = parser.add_argument_group("required wrapper arguments")
+ required_group.add_argument("action", choices=WRAPPER_ACTIONS)
+ return parser
+
+def main() -> None:
+ """
+    Parses network folders and is responsible for passing --help flags to subcommands if --network is provided.
+ """
+ # Verify python version support
+ verify_python_version()
+
+ # Get all available network scripts
+ networks = register_network_folders(os.getcwd())
+
+ # Add network folder for entry point
+    description = "Runs TensorRT networks that are based off of NeMo variants."
+ parser = get_default_parser(description)
+
+ # Get the general network wrapper help
+ known_args, _ = parser.parse_known_args()
+
+ # Delegate parser to action specifics
+ action = get_action(known_args.action, networks, parser)
+ known_args, _ = parser.parse_known_args()
+
+ # If bootstrap occurs, then the spawned process completes the rest of demo.
+ # We can exit safely. We spawn after parsing basic args to reduce loading churn on rudimentary help commands.
+ if bootstrap_ld_library_path():
+ sys.exit(0)
+
+ return action.execute(known_args)
+
+if __name__ == "__main__":
+ main()
diff --git a/demo/NeMo/transformer_engine.patch b/demo/NeMo/transformer_engine.patch
new file mode 100644
index 00000000..c4c96dea
--- /dev/null
+++ b/demo/NeMo/transformer_engine.patch
@@ -0,0 +1,17 @@
+--- common/__init__.py 2023-06-22 17:22:59.046208583 +0000
++++ common/backup.py 2023-06-22 20:53:01.154819280 +0000
+@@ -7,12 +7,13 @@
+ import os
+ import platform
+ import subprocess
++import sys
+
+
+ def get_te_path():
+ """Find Transformer Engine install path using pip"""
+
+- command = ["pip", "show", "transformer_engine"]
++ command = [sys.executable, "-m", "pip", "show", "transformer_engine"]
+ result = subprocess.run(command, capture_output=True, check=True, text=True)
+ result = result.stdout.replace("\n", ":").split(":")
+ return result[result.index("Location")+1].strip()
diff --git a/demo/Tacotron2/README.md b/demo/Tacotron2/README.md
index db6cbb73..c687c5ee 100644
--- a/demo/Tacotron2/README.md
+++ b/demo/Tacotron2/README.md
@@ -9,11 +9,11 @@ NVIDIA TensorRT is a platform for high-performance deep learning inference. It i
|Software|Version|
|--------|-------|
-|Python|3.6.9|
-|CUDA|11.4.2|
+|Python|3.8.10|
+|CUDA|12.2|
|Apex|0.1|
-|TensorRT|8.2.0.6|
-|PyTorch|1.9.1|
+|TensorRT|9.0|
+|PyTorch|2.0.1|
## Quick Start Guide
@@ -56,7 +56,7 @@ NVIDIA TensorRT is a platform for high-performance deep learning inference. It i
```
The above commands store the generated ONNX files under the `./output/` directory:
- `encoder.onnx`, `decoder_iter.onnx`, `postnet.onnx`, `waveglow.onnx`, and `decoder.onnx` (on TensorRT 8.0+ if `--no-loop` option is not specified).
+ `encoder.onnx`, `decoder_iter.onnx`, `postnet.onnx`, `waveglow.onnx`, `loop_body_fp16.onnx`, and `decoder.onnx` (on TensorRT 8.0+ if `--no-loop` option is not specified).
6. Export the ONNX IRs to TensorRT engines with fp16 mode enabled:
diff --git a/demo/Tacotron2/common/audio_processing.py b/demo/Tacotron2/common/audio_processing.py
index 090581d5..7b261cec 100644
--- a/demo/Tacotron2/common/audio_processing.py
+++ b/demo/Tacotron2/common/audio_processing.py
@@ -64,7 +64,7 @@ def window_sumsquare(window, n_frames, hop_length=200, win_length=800,
# Compute the squared window at the desired length
win_sq = get_window(window, win_length, fftbins=True)
win_sq = librosa_util.normalize(win_sq, norm=norm)**2
- win_sq = librosa_util.pad_center(win_sq, n_fft)
+ win_sq = librosa_util.pad_center(win_sq, size=n_fft)
# Fill the envelope
for i in range(n_frames):
diff --git a/demo/Tacotron2/common/stft.py b/demo/Tacotron2/common/stft.py
index 59700e99..0341d60e 100644
--- a/demo/Tacotron2/common/stft.py
+++ b/demo/Tacotron2/common/stft.py
@@ -81,7 +81,7 @@ def __init__(self, filter_length=800, hop_length=200, win_length=800,
assert(filter_length >= win_length)
# get window and zero center pad it to filter_length
fft_window = get_window(window, win_length, fftbins=True)
- fft_window = pad_center(fft_window, filter_length)
+ fft_window = pad_center(fft_window, size=filter_length)
fft_window = torch.from_numpy(fft_window).float()
# window the bases
diff --git a/demo/Tacotron2/requirements.txt b/demo/Tacotron2/requirements.txt
index 922bb825..b6eb26de 100644
--- a/demo/Tacotron2/requirements.txt
+++ b/demo/Tacotron2/requirements.txt
@@ -1,8 +1,10 @@
-pycuda
+numba>=0.48
+resampy>=0.3.1
+torch==2.0.1
matplotlib
numpy
inflect
-librosa
+librosa>=0.10.0
scipy
Unidecode
git+https://github.com/NVIDIA/dllogger#egg=dllogger
diff --git a/demo/Tacotron2/run_latency_tests.sh b/demo/Tacotron2/run_latency_tests.sh
index a05ef258..85e5f0f8 100644
--- a/demo/Tacotron2/run_latency_tests.sh
+++ b/demo/Tacotron2/run_latency_tests.sh
@@ -1,5 +1,5 @@
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/demo/Tacotron2/scripts/download_checkpoints.sh b/demo/Tacotron2/scripts/download_checkpoints.sh
index a7ce499d..0d23f2d3 100755
--- a/demo/Tacotron2/scripts/download_checkpoints.sh
+++ b/demo/Tacotron2/scripts/download_checkpoints.sh
@@ -1,6 +1,6 @@
#!/bin/bash
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/demo/Tacotron2/scripts/inference_benchmark.sh b/demo/Tacotron2/scripts/inference_benchmark.sh
index 2e0279e4..86200557 100755
--- a/demo/Tacotron2/scripts/inference_benchmark.sh
+++ b/demo/Tacotron2/scripts/inference_benchmark.sh
@@ -1,6 +1,6 @@
#!/bin/bash
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,14 +15,7 @@
# limitations under the License.
#
-pip3 install --force-reinstall torch==1.9.1+cu111 torchvision==0.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
-
echo "TensorRT BS=1, S=128"
bash test_infer.sh --test tensorrt/test_infer_trt.py -bs 1 -il 128 --fp16 --num-iters 103 --encoder ./output/encoder_fp16.engine --decoder ./output/decoder_with_outer_loop_fp16.engine --postnet ./output/postnet_fp16.engine --waveglow ./output/waveglow_fp16.engine --wn-channels 256
echo "PyTorch (GPU) BS=1, S=128"
bash test_infer.sh -bs 1 -il 128 --fp16 --num-iters 103 --tacotron2 ./checkpoints/tacotron2_pyt_ckpt_amp_v19.09.0/nvidia_tacotron2pyt_fp16_20190427 --waveglow ./checkpoints/waveglow_ckpt_amp_256_v19.10.0/nvidia_waveglow256pyt_fp16 --wn-channels 256
-
-pip3 install torch==1.9.1+cpu torchvision==0.10.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
-
-echo "PyTorch (CPU) BS=1, S=128"
-bash test_infer.sh -bs 1 -il 128 --fp16 --num-iters 5 --tacotron2 ./checkpoints/tacotron2_pyt_ckpt_amp_v19.09.0/nvidia_tacotron2pyt_fp16_20190427 --waveglow ./checkpoints/waveglow_ckpt_amp_256_v19.10.0/nvidia_waveglow256pyt_fp16 --wn-channels 256 --cpu
diff --git a/demo/Tacotron2/scripts/install_prerequisites.sh b/demo/Tacotron2/scripts/install_prerequisites.sh
index 5e5e1f97..5a16d392 100755
--- a/demo/Tacotron2/scripts/install_prerequisites.sh
+++ b/demo/Tacotron2/scripts/install_prerequisites.sh
@@ -1,6 +1,6 @@
#!/bin/bash
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,12 +15,11 @@
# limitations under the License.
#
-pip3 install numba==0.48 resampy==0.3.1 torch==1.9.1
pip3 install -r requirements.txt
echo "nvidia" | sudo -S apt-get install -y libsndfile1
pushd /tmp
git clone https://github.com/NVIDIA/apex
cd apex
-pip3 install -v --no-cache-dir ./
+pip3 install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./
popd
diff --git a/demo/Tacotron2/scripts/prepare_dataset.sh b/demo/Tacotron2/scripts/prepare_dataset.sh
index 7d3acb9b..d38be817 100755
--- a/demo/Tacotron2/scripts/prepare_dataset.sh
+++ b/demo/Tacotron2/scripts/prepare_dataset.sh
@@ -1,6 +1,6 @@
#!/usr/bin/env bash
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/demo/Tacotron2/scripts/prepare_mels.sh b/demo/Tacotron2/scripts/prepare_mels.sh
index cb02f775..b3843a26 100644
--- a/demo/Tacotron2/scripts/prepare_mels.sh
+++ b/demo/Tacotron2/scripts/prepare_mels.sh
@@ -1,6 +1,6 @@
#!/usr/bin/env bash
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/demo/Tacotron2/tensorrt/convert_onnx2trt.py b/demo/Tacotron2/tensorrt/convert_onnx2trt.py
index ec43cb05..dd24c801 100644
--- a/demo/Tacotron2/tensorrt/convert_onnx2trt.py
+++ b/demo/Tacotron2/tensorrt/convert_onnx2trt.py
@@ -16,9 +16,6 @@
#
import argparse
-import onnx
-import pycuda.autoinit
-import pycuda.driver as cuda
import sys
import tensorrt as trt
from os.path import join
@@ -62,7 +59,6 @@ def parse_args(parser):
parser.add_argument("-tcf", "--timing-cache-file", default=None, type=str,
help="Path to tensorrt build timeing cache file, only available for tensorrt 8.0 and later. The cache file is assumed to be used exclusively. It's the users' responsibility to create file lock to prevent accessing conflict.",
required=False)
- parser.add_argument("--disable-preview-dynamic-shapes", action="store_true", help="Disable dynamic shape preview feature.")
parser.set_defaults(loop=int(trt.__version__[0]) >= 8)
return parser
@@ -89,10 +85,10 @@ def main():
{"name": "sequence_lengths", "min": (bs_min,), "opt": (bs_opt,), "max": (bs_max,)}]
if args.encoder != "":
print("Building Encoder ...")
- encoder_engine = build_engine(args.encoder, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file, disable_preview_dynamic_shapes=args.disable_preview_dynamic_shapes)
+ encoder_engine = build_engine(args.encoder, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file)
if encoder_engine is not None:
with open(encoder_path, 'wb') as f:
- f.write(encoder_engine.serialize())
+ f.write(encoder_engine)
else:
print("Failed to build engine from", args.encoder)
sys.exit(1)
@@ -112,10 +108,10 @@ def main():
{"name": "mask", "min": (bs_min,4), "opt": (bs_opt,128), "max": (bs_max,256)}]
if args.decoder != "":
print("Building Decoder with loop...")
- decoder_engine = build_engine(args.decoder, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file, disable_preview_dynamic_shapes=args.disable_preview_dynamic_shapes)
+ decoder_engine = build_engine(args.decoder, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file)
if decoder_engine is not None:
with open(decoder_path, 'wb') as f:
- f.write(decoder_engine.serialize())
+ f.write(decoder_engine)
else:
print("Failed to build engine from", args.decoder)
sys.exit(1)
@@ -134,10 +130,10 @@ def main():
{"name": "mask", "min": (bs_min,4), "opt": (bs_opt,128), "max": (bs_max,256)}]
if args.decoder != "":
print("Building Decoder ...")
- decoder_iter_engine = build_engine(args.decoder, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file, disable_preview_dynamic_shapes=args.disable_preview_dynamic_shapes)
+ decoder_iter_engine = build_engine(args.decoder, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file)
if decoder_iter_engine is not None:
with open(decoder_path, 'wb') as f:
- f.write(decoder_iter_engine.serialize())
+ f.write(decoder_iter_engine)
else:
print("Failed to build engine from", args.decoder)
sys.exit(1)
@@ -146,10 +142,10 @@ def main():
shapes=[{"name": "mel_outputs", "min": (bs_min,80,32), "opt": (bs_opt,80,768), "max": (bs_max,80,1664)}]
if args.postnet != "":
print("Building Postnet ...")
- postnet_engine = build_engine(args.postnet, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file, disable_preview_dynamic_shapes=args.disable_preview_dynamic_shapes)
+ postnet_engine = build_engine(args.postnet, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file)
if postnet_engine is not None:
with open(postnet_path, 'wb') as f:
- f.write(postnet_engine.serialize())
+ f.write(postnet_engine)
else:
print("Failed to build engine from", args.postnet)
sys.exit(1)
@@ -159,10 +155,10 @@ def main():
{"name": "z", "min": (bs_min,8,z_min,1), "opt": (bs_opt,8,z_opt,1), "max": (bs_max,8,z_max,1)}]
if args.waveglow != "":
print("Building WaveGlow ...")
- waveglow_engine = build_engine(args.waveglow, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file, disable_preview_dynamic_shapes=args.disable_preview_dynamic_shapes)
+ waveglow_engine = build_engine(args.waveglow, shapes=shapes, fp16=args.fp16, timing_cache=args.timing_cache_file)
if waveglow_engine is not None:
with open(waveglow_path, 'wb') as f:
- f.write(waveglow_engine.serialize())
+ f.write(waveglow_engine)
else:
print("Failed to build engine from", args.waveglow)
sys.exit(1)
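
With the updated `build_engine` helper (see `trt_utils.py` later in this diff), the builder returns a serialized plan, so the hunks above write it to disk directly instead of calling `.serialize()`. A hedged sketch of the matching load path, assuming a TensorRT 10 Python install; the file name below is illustrative:

```python
# Sketch only: loading a plan written by the updated convert_onnx2trt.py.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with open("output/encoder_fp16.engine", "rb") as f:
    plan = f.read()                                # bytes written via f.write(engine)

runtime = trt.Runtime(TRT_LOGGER)
engine = runtime.deserialize_cuda_engine(plan)     # ICudaEngine
context = engine.create_execution_context()

# TRT 10 engines are inspected by tensor name rather than binding index.
names = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]
print("I/O tensors:", names)
```
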
diff --git a/demo/Tacotron2/tensorrt/inference_trt.py b/demo/Tacotron2/tensorrt/inference_trt.py
index 4f5f76d3..d1a6dabd 100644
--- a/demo/Tacotron2/tensorrt/inference_trt.py
+++ b/demo/Tacotron2/tensorrt/inference_trt.py
@@ -437,7 +437,8 @@ def main():
measurements = {}
sequences, sequence_lengths = prepare_input_sequence(texts)
- sequences = sequences.to(torch.int32)
+ dt = encoder.get_tensor_dtype("sequences")
+ sequences = sequences.to(torch.int64 if dt == trt.DataType.INT64 else torch.int32)
sequence_lengths = sequence_lengths.to(torch.int32)
with MeasureTime(measurements, "latency"):
diff --git a/demo/Tacotron2/tensorrt/run_latency_tests_trt.sh b/demo/Tacotron2/tensorrt/run_latency_tests_trt.sh
index 07dfd704..a289cf63 100644
--- a/demo/Tacotron2/tensorrt/run_latency_tests_trt.sh
+++ b/demo/Tacotron2/tensorrt/run_latency_tests_trt.sh
@@ -1,5 +1,5 @@
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/demo/Tacotron2/tensorrt/trt_utils.py b/demo/Tacotron2/tensorrt/trt_utils.py
index 3e1d534a..e150983f 100644
--- a/demo/Tacotron2/tensorrt/trt_utils.py
+++ b/demo/Tacotron2/tensorrt/trt_utils.py
@@ -45,18 +45,18 @@ def is_shape_dynamic(shape):
def run_trt_engine(context, engine, tensors):
- bindings = [None]*engine.num_bindings
- for name,tensor in tensors['inputs'].items():
- idx = engine.get_binding_index(name)
- bindings[idx] = tensor.data_ptr()
- if engine.is_shape_binding(idx) and is_shape_dynamic(context.get_shape(idx)):
- context.set_shape_input(idx, tensor)
- elif is_shape_dynamic(engine.get_binding_shape(idx)):
- context.set_binding_shape(idx, tensor.shape)
-
- for name,tensor in tensors['outputs'].items():
- idx = engine.get_binding_index(name)
- bindings[idx] = tensor.data_ptr()
+ bindings = [0] * engine.num_io_tensors
+
+ for i in range(engine.num_io_tensors):
+ tensor_name = engine.get_tensor_name(i)
+ if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
+ tensor = tensors['inputs'][tensor_name]
+ bindings[i] = tensor.data_ptr()
+ if is_shape_dynamic(engine.get_tensor_shape(tensor_name)):
+ context.set_input_shape(tensor_name, tensor.shape)
+ elif engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.OUTPUT:
+ tensor = tensors['outputs'][tensor_name]
+ bindings[i] = tensor.data_ptr()
context.execute_v2(bindings=bindings)
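
The rewrite above enumerates I/O tensors by name instead of binding index and still dispatches through `execute_v2`. TensorRT 10 also offers an address-based path via `set_tensor_address` and `execute_async_v3`; a sketch of that alternative (not what this patch uses), assuming torch CUDA tensors keyed by tensor name and a stream supplied by the caller:

```python
# Alternative sketch using the address-based execution API (not used by this patch).
# Assumes a TensorRT 10 engine/context, tensors = {'inputs': {...}, 'outputs': {...}}
# holding torch CUDA tensors keyed by tensor name, and a torch.cuda.Stream().
import tensorrt as trt
import torch

def run_trt_engine_async(context, engine, tensors, stream):
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        if engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
            tensor = tensors['inputs'][name]
            if -1 in tuple(engine.get_tensor_shape(name)):
                context.set_input_shape(name, tensor.shape)  # resolve dynamic dims
        else:
            tensor = tensors['outputs'][name]
        context.set_tensor_address(name, tensor.data_ptr())   # bind by name, not index
    context.execute_async_v3(stream.cuda_stream)
```
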
@@ -84,22 +84,20 @@ def engine_info(engine_filepath):
"DataType.BOOL" : "TYPE_BOOL"}
print("engine name", engine.name)
- print("has_implicit_batch_dimension", engine.has_implicit_batch_dimension)
- start_dim = 0 if engine.has_implicit_batch_dimension else 1
+ start_dim = 1
print("num_optimization_profiles", engine.num_optimization_profiles)
- print("max_batch_size:", engine.max_batch_size)
print("device_memory_size:", engine.device_memory_size)
- print("max_workspace_size:", engine.max_workspace_size)
+ print("max_workspace_size:", engine.get_memory_pool_limit(trt.MemoryPoolType.WORKSPACE))
print("num_layers:", engine.num_layers)
- for i in range(engine.num_bindings):
- btype = "input" if engine.binding_is_input(i) else "output"
- bname = engine.get_binding_name(i)
- dtype = engine.get_binding_dtype(i)
- bdims = engine.get_binding_shape(i)
+ for i in range(engine.num_io_tensors):
+ tensor_name = engine.get_tensor_name(i)
+ btype = "input" if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT else "output"
+ dtype = engine.get_tensor_dtype(tensor_name)
+ bdims = engine.get_tensor_shape(tensor_name)
config_values = {
"btype": btype,
- "bname": bname,
+ "bname": tensor_name,
"dtype": type_mapping[str(dtype)],
"dims": list(bdims[start_dim:])
}
@@ -107,19 +105,15 @@ def engine_info(engine_filepath):
print(final_binding_str)
-def build_engine(model_file, shapes, max_ws=512*1024*1024, fp16=False, timing_cache=None, disable_preview_dynamic_shapes=False):
- if not disable_preview_dynamic_shapes and float(trt.__version__[:3]) < 8.5:
- print("Faster dynamic shapes preview feature is only supported on TRT 8.5+")
- sys.exit(1)
+def build_engine(model_file, shapes, max_ws=512*1024*1024, fp16=False, timing_cache=None):
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()
- config.max_workspace_size = max_ws
+ config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, max_ws)
if fp16:
config.flags |= 1 << int(trt.BuilderFlag.FP16)
- config.set_preview_feature(trt.PreviewFeature.FASTER_DYNAMIC_SHAPES_0805, not disable_preview_dynamic_shapes)
profile = builder.create_optimization_profile()
for s in shapes:
profile.set_shape(s['name'], min=s['min'], opt=s['opt'], max=s['max'])
@@ -136,15 +130,17 @@ def build_engine(model_file, shapes, max_ws=512*1024*1024, fp16=False, timing_ca
cache = config.create_timing_cache(b"")
config.set_timing_cache(cache, ignore_mismatch = False)
- explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
- network = builder.create_network(explicit_batch)
+ network_creation_flag = 0
+ if "EXPLICIT_BATCH" in trt.NetworkDefinitionCreationFlag.__members__.keys():
+ network_creation_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
+ network = builder.create_network(network_creation_flag)
with trt.OnnxParser(network, TRT_LOGGER) as parser:
with open(model_file, 'rb') as model:
parsed = parser.parse(model.read())
for i in range(parser.num_errors):
print("TensorRT ONNX parser error:", parser.get_error(i))
- engine = builder.build_engine(network, config=config)
+ engine = builder.build_serialized_network(network, config=config)
# save global timing cache
if timing_cache_available:
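
For reference, a usage sketch of the updated `build_engine` signature, with illustrative paths and profile shapes (the real shapes come from `convert_onnx2trt.py` above); it assumes this patched `trt_utils.py` is on the import path:

```python
# Usage sketch for the updated helper (paths and shapes are illustrative).
from trt_utils import build_engine

shapes = [
    {"name": "sequences",        "min": (1, 4), "opt": (1, 128), "max": (1, 256)},
    {"name": "sequence_lengths", "min": (1,),   "opt": (1,),     "max": (1,)},
]

plan = build_engine("output/encoder.onnx", shapes=shapes, fp16=True)
if plan is None:
    raise RuntimeError("engine build failed")

# build_serialized_network() returns an already-serialized plan (IHostMemory),
# so it is written as-is, with no .serialize() call.
with open("output/encoder_fp16.engine", "wb") as f:
    f.write(plan)
```
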
diff --git a/demo/Tacotron2/test_infer.py b/demo/Tacotron2/test_infer.py
index 81254d37..23816da9 100644
--- a/demo/Tacotron2/test_infer.py
+++ b/demo/Tacotron2/test_infer.py
@@ -15,23 +15,16 @@
# limitations under the License.
#
-from tacotron2.text import text_to_sequence
-import models
import torch
import argparse
import numpy as np
from scipy.io.wavfile import write
-import sys
+from inference import MeasureTime, prepare_input_sequence, load_and_setup_model
-from inference import checkpoint_from_distributed, unwrap_distributed, MeasureTime, prepare_input_sequence, load_and_setup_model
-
-import time
import dllogger as DLLogger
from dllogger import StdOutBackend, JSONStreamBackend, Verbosity
-from apex import amp
-
from waveglow.denoiser import Denoiser
def parse_args(parser):
diff --git a/demo/Tacotron2/test_infer.sh b/demo/Tacotron2/test_infer.sh
index fd0e7ecb..103fb941 100644
--- a/demo/Tacotron2/test_infer.sh
+++ b/demo/Tacotron2/test_infer.sh
@@ -1,6 +1,6 @@
#!/bin/bash
#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
diff --git a/demo/experimental/HuggingFace-Diffusers/README.md b/demo/experimental/HuggingFace-Diffusers/README.md
index 9cf26cfe..d0e4e563 100644
--- a/demo/experimental/HuggingFace-Diffusers/README.md
+++ b/demo/experimental/HuggingFace-Diffusers/README.md
@@ -7,7 +7,7 @@ This demo notebook showcases the acceleration of Stable Diffusion pipeline using
### Clone the TensorRT OSS repository
```bash
-git clone git@github.com:NVIDIA/TensorRT.git -b release/8.6 --single-branch
+git clone git@github.com:NVIDIA/TensorRT.git -b release/9.3 --single-branch
cd TensorRT/demo/experimental/HuggingFace-Diffusers
```
diff --git a/demo/experimental/HuggingFace-Diffusers/TensorRT-diffusers-txt2img.ipynb b/demo/experimental/HuggingFace-Diffusers/TensorRT-diffusers-txt2img.ipynb
index 395a8f27..23eb1492 100644
--- a/demo/experimental/HuggingFace-Diffusers/TensorRT-diffusers-txt2img.ipynb
+++ b/demo/experimental/HuggingFace-Diffusers/TensorRT-diffusers-txt2img.ipynb
@@ -160,7 +160,7 @@
"source": [
"### Install NVIDIA TensorRT\n",
"\n",
- "TensorRT 8.6 includes Stable Diffusion model optimizations out of the box."
+ "TensorRT 8.6+ includes Stable Diffusion model optimizations out of the box."
]
},
{
diff --git a/docker/build.sh b/docker/build.sh
index 6b28fd09..b24029ae 100755
--- a/docker/build.sh
+++ b/docker/build.sh
@@ -42,7 +42,7 @@ then
echo "--cuda not specified, so not passing in --build-arg CUDA_VERSION to Dockerfile"
docker_args="-f $arg_dockerfile --build-arg uid=$(id -u) --build-arg gid=$(id -g) --tag=$arg_imagename ."
else
- docker_args="-f $arg_dockerfile --build-arg CUDA_VERSION=$arg_cudaversion --build-arg uid=$(id -u) --build-arg gid=$(id -g) --tag=$arg_imagename ."
+ docker_args="-f $arg_dockerfile --build-arg CUDA_VERSION=$arg_cudaversion --build-arg CUDA_VERSION_MAJOR_MINOR=${arg_cudaversion:0:4} --build-arg uid=$(id -u) --build-arg gid=$(id -g) --tag=$arg_imagename ."
fi
echo "Building container:"
diff --git a/docker/centos-7.Dockerfile b/docker/centos-7.Dockerfile
deleted file mode 100644
index ff27d6d2..00000000
--- a/docker/centos-7.Dockerfile
+++ /dev/null
@@ -1,105 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-ARG CUDA_VERSION=12.0.1
-
-FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-centos7
-LABEL maintainer="NVIDIA CORPORATION"
-
-ENV TRT_VERSION 8.6.1.6
-SHELL ["/bin/bash", "-c"]
-
-# Setup user account
-ARG uid=1000
-ARG gid=1000
-RUN groupadd -r -f -g ${gid} trtuser && useradd -o -r -l -u ${uid} -g ${gid} -ms /bin/bash trtuser
-RUN usermod -aG wheel trtuser
-RUN echo 'trtuser:nvidia' | chpasswd
-RUN mkdir -p /workspace && chown trtuser /workspace
-
-# Install requried packages
-RUN yum -y groupinstall "Development Tools"
-RUN yum -y install \
- openssl-devel \
- bzip2-devel \
- libffi-devel \
- wget \
- perl-core \
- git \
- pkg-config \
- unzip \
- sudo
-
-# Install python3
-RUN yum install -y python36 python3-devel
-
-# yum needs to use python2
-RUN sed -i "1s/python/python2/" /usr/bin/yum
-
-# Install TensorRT
-RUN if [ "${CUDA_VERSION}" = "10.2" ] ; then \
- v="${TRT_VERSION%.*}-1.cuda${CUDA_VERSION}" &&\
- yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo &&\
- yum -y install libnvinfer8-${v} libnvparsers8-${v} libnvonnxparsers8-${v} libnvinfer-plugin8-${v} \
- libnvinfer-devel-${v} libnvparsers-devel-${v} libnvonnxparsers-devel-${v} libnvinfer-plugin-devel-${v} \
- python3-libnvinfer-=${v} libnvinfer-dispatch8-=${v} libnvinfer-dispatch-devel-=${v} libnvinfer-lean8-=${v} \
- libnvinfer-lean-devel-=${v} libnvinfer-vc-plugin8-=${v} libnvinfer-vc-plugin-devel-=${v} \
- libnvinfer-headers-devel-=${v} libnvinfer-headers-plugin-devel-=${v}; \
-else \
- v="${TRT_VERSION}-1.cuda${CUDA_VERSION%.*}" &&\
- yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo &&\
- yum -y install libnvinfer8-${v} libnvparsers8-${v} libnvonnxparsers8-${v} libnvinfer-plugin8-${v} \
- libnvinfer-devel-${v} libnvparsers-devel-${v} libnvonnxparsers-devel-${v} libnvinfer-plugin-devel-${v} \
- python3-libnvinfer-=${v} libnvinfer-dispatch8-=${v} libnvinfer-dispatch-devel-=${v} libnvinfer-lean8-=${v} \
- libnvinfer-lean-devel-=${v} libnvinfer-vc-plugin8-=${v} libnvinfer-vc-plugin-devel-=${v} \
- libnvinfer-headers-devel-=${v} libnvinfer-headers-plugin-devel-=${v}; \
-fi
-
-# Install dev-toolset-8 for g++ version that supports c++14
-RUN yum -y install centos-release-scl
-RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
-RUN yum -y install devtoolset-8
-
-# Install PyPI packages
-RUN pip3 install --upgrade pip
-RUN pip3 install setuptools>=41.0.0
-RUN pip3 install numpy
-RUN pip3 install jupyter jupyterlab
-
-# Install Cmake
-RUN cd /tmp && \
- wget https://github.com/Kitware/CMake/releases/download/v3.14.4/cmake-3.14.4-Linux-x86_64.sh && \
- chmod +x cmake-3.14.4-Linux-x86_64.sh && \
- ./cmake-3.14.4-Linux-x86_64.sh --prefix=/usr/local --exclude-subdir --skip-license && \
- rm ./cmake-3.14.4-Linux-x86_64.sh
-
-# Download NGC client
-RUN cd /usr/local/bin && wget https://ngc.nvidia.com/downloads/ngccli_cat_linux.zip && unzip ngccli_cat_linux.zip && chmod u+x ngc-cli/ngc && rm ngccli_cat_linux.zip ngc-cli.md5 && echo "no-apikey\nascii\n" | ngc-cli/ngc config set
-
-RUN rm /usr/bin/python && ln -s /usr/bin/python3 /usr/bin/python
-
-# Set environment and working directory
-ENV TRT_LIBPATH /usr/lib/x86_64-linux-gnu
-ENV TRT_OSSPATH /workspace/TensorRT
-ENV PATH="${PATH}:/usr/local/bin/ngc-cli"
-ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${TRT_OSSPATH}/build/out:${TRT_LIBPATH}"
-# Use devtoolset-8 as default compiler
-ENV PATH="/opt/rh/devtoolset-8/root/bin:${PATH}"
-WORKDIR /workspace
-
-USER trtuser
-RUN ["/bin/bash"]
diff --git a/docker/ubuntu-20.04-aarch64.Dockerfile b/docker/ubuntu-20.04-aarch64.Dockerfile
deleted file mode 100644
index 540943cd..00000000
--- a/docker/ubuntu-20.04-aarch64.Dockerfile
+++ /dev/null
@@ -1,108 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-ARG CUDA_VERSION=12.0.1
-
-# Multi-arch container support available in non-cudnn containers.
-FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04
-
-ENV TRT_VERSION 8.6.1.6
-SHELL ["/bin/bash", "-c"]
-
-# Setup user account
-ARG uid=1000
-ARG gid=1000
-RUN groupadd -r -f -g ${gid} trtuser && useradd -o -r -l -u ${uid} -g ${gid} -ms /bin/bash trtuser
-RUN usermod -aG sudo trtuser
-RUN echo 'trtuser:nvidia' | chpasswd
-RUN mkdir -p /workspace && chown trtuser /workspace
-
-# Required to build Ubuntu 20.04 without user prompts with DLFW container
-ENV DEBIAN_FRONTEND=noninteractive
-
-# Update CUDA signing key
-RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/sbsa/3bf863cc.pub
-
-# Install requried libraries
-RUN apt-get update && apt-get install -y software-properties-common
-RUN add-apt-repository ppa:ubuntu-toolchain-r/test
-RUN apt-get update && apt-get install -y --no-install-recommends \
- libcurl4-openssl-dev \
- wget \
- git \
- pkg-config \
- sudo \
- ssh \
- libssl-dev \
- pbzip2 \
- pv \
- bzip2 \
- unzip \
- devscripts \
- lintian \
- fakeroot \
- dh-make \
- build-essential
-
-# Install python3
-RUN apt-get install -y --no-install-recommends \
- python3 \
- python3-pip \
- python3-dev \
- python3-wheel &&\
- cd /usr/local/bin &&\
- ln -s /usr/bin/python3 python &&\
- ln -s /usr/bin/pip3 pip;
-
-# Install TensorRT. This will also pull in CUDNN
-RUN v="${TRT_VERSION}-1+cuda${CUDA_VERSION%.*}" &&\
- apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/sbsa/3bf863cc.pub &&\
- apt-get update &&\
- sudo apt-get -y install libnvinfer8=${v} libnvonnxparsers8=${v} libnvparsers8=${v} libnvinfer-plugin8=${v} \
- libnvinfer-dev=${v} libnvonnxparsers-dev=${v} libnvparsers-dev=${v} libnvinfer-plugin-dev=${v} \
- python3-libnvinfer=${v} libnvinfer-dispatch8=${v} libnvinfer-dispatch-dev=${v} libnvinfer-lean8=${v} \
- libnvinfer-lean-dev=${v} libnvinfer-vc-plugin8=${v} libnvinfer-vc-plugin-dev=${v} \
- libnvinfer-headers-dev=${v} libnvinfer-headers-plugin-dev=${v};
-
-# Install Cmake
-RUN cd /tmp && \
- wget https://github.com/Kitware/CMake/releases/download/v3.21.4/cmake-3.21.4-linux-aarch64.sh && \
- chmod +x cmake-3.21.4-linux-aarch64.sh && \
- ./cmake-3.21.4-linux-aarch64.sh --prefix=/usr/local --exclude-subdir --skip-license && \
- rm ./cmake-3.21.4-linux-aarch64.sh
-
-# Install PyPI packages
-RUN pip3 install --upgrade pip
-RUN pip3 install setuptools>=41.0.0
-COPY requirements.txt /tmp/requirements.txt
-RUN pip3 install -r /tmp/requirements.txt
-RUN pip3 install jupyter jupyterlab
-# Workaround to remove numpy installed with tensorflow
-RUN pip3 install --upgrade numpy
-
-# Download NGC client
-RUN cd /usr/local/bin && wget https://ngc.nvidia.com/downloads/ngccli_arm64.zip && unzip ngccli_arm64.zip && chmod u+x ngc-cli/ngc && rm ngccli_arm64.zip ngc-cli.md5 && echo "no-apikey\nascii\n" | ngc-cli/ngc config set
-
-# Set environment and working directory
-ENV TRT_LIBPATH /usr/lib/aarch64-linux-gnu/
-ENV TRT_OSSPATH /workspace/TensorRT
-ENV PATH="${PATH}:/usr/local/bin/ngc-cli"
-ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${TRT_OSSPATH}/build/out:${TRT_LIBPATH}"
-WORKDIR /workspace
-
-USER trtuser
-RUN ["/bin/bash"]
diff --git a/docker/ubuntu-20.04.Dockerfile b/docker/ubuntu-20.04.Dockerfile
index 65605b47..0049d4c2 100644
--- a/docker/ubuntu-20.04.Dockerfile
+++ b/docker/ubuntu-20.04.Dockerfile
@@ -1,5 +1,5 @@
#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -15,14 +15,28 @@
# limitations under the License.
#
-ARG CUDA_VERSION=12.0.1
+ARG CUDA_VERSION=12.3.2
-FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu20.04
+FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04
LABEL maintainer="NVIDIA CORPORATION"
-ENV TRT_VERSION 8.6.1.6
+ENV NV_CUDNN_VERSION 8.9.6.50
+ENV NV_CUDNN_PACKAGE_NAME "libcudnn8"
+
+ENV CUDA_VERSION_MAJOR_MINOR=12.2
+
+ENV NV_CUDNN_PACKAGE "libcudnn8=$NV_CUDNN_VERSION-1+cuda${CUDA_VERSION_MAJOR_MINOR}"
+ENV NV_CUDNN_PACKAGE_DEV "libcudnn8-dev=$NV_CUDNN_VERSION-1+cuda${CUDA_VERSION_MAJOR_MINOR}"
+
+ENV TRT_VERSION 10.0.0.6
SHELL ["/bin/bash", "-c"]
+RUN apt-get update && apt-get install -y --no-install-recommends \
+ ${NV_CUDNN_PACKAGE} \
+ ${NV_CUDNN_PACKAGE_DEV} \
+ && apt-mark hold ${NV_CUDNN_PACKAGE_NAME} \
+ && rm -rf /var/lib/apt/lists/*
+
# Setup user account
ARG uid=1000
ARG gid=1000
@@ -69,24 +83,19 @@ RUN apt-get install -y --no-install-recommends \
ln -s /usr/bin/pip3 pip;
# Install TensorRT
-RUN if [ "${CUDA_VERSION}" = "10.2" ] ; then \
- v="${TRT_VERSION%.*}-1+cuda${CUDA_VERSION}" &&\
- apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub &&\
- apt-get update &&\
- sudo apt-get install libnvinfer8=${v} libnvonnxparsers8=${v} libnvparsers8=${v} libnvinfer-plugin8=${v} \
- libnvinfer-dev=${v} libnvonnxparsers-dev=${v} libnvparsers-dev=${v} libnvinfer-plugin-dev=${v} \
- python3-libnvinfer=${v} libnvinfer-dispatch8=${v} libnvinfer-dispatch-dev=${v} libnvinfer-lean8=${v} \
- libnvinfer-lean-dev=${v} libnvinfer-vc-plugin8=${v} libnvinfer-vc-plugin-dev=${v} \
- libnvinfer-headers-dev=${v} libnvinfer-headers-plugin-dev=${v}; \
+RUN if [ "${CUDA_VERSION:0:2}" = "11" ]; then \
+ wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/tars/TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-11.8.tar.gz \
+ && tar -xf TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-11.8.tar.gz \
+ && cp -a TensorRT-10.0.0.6/lib/*.so* /usr/lib/x86_64-linux-gnu \
+ && pip install TensorRT-10.0.0.6/python/tensorrt-10.0.0b6-cp38-none-linux_x86_64.whl ;\
+elif [ "${CUDA_VERSION:0:2}" = "12" ]; then \
+ wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/tars/TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-12.4.tar.gz \
+ && tar -xf TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-12.4.tar.gz \
+ && cp -a TensorRT-10.0.0.6/lib/*.so* /usr/lib/x86_64-linux-gnu \
+ && pip install TensorRT-10.0.0.6/python/tensorrt-10.0.0b6-cp38-none-linux_x86_64.whl ;\
else \
- v="${TRT_VERSION}-1+cuda${CUDA_VERSION%.*}" &&\
- apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub &&\
- apt-get update &&\
- sudo apt-get -y install libnvinfer8=${v} libnvonnxparsers8=${v} libnvparsers8=${v} libnvinfer-plugin8=${v} \
- libnvinfer-dev=${v} libnvonnxparsers-dev=${v} libnvparsers-dev=${v} libnvinfer-plugin-dev=${v} \
- python3-libnvinfer=${v} libnvinfer-dispatch8=${v} libnvinfer-dispatch-dev=${v} libnvinfer-lean8=${v} \
- libnvinfer-lean-dev=${v} libnvinfer-vc-plugin8=${v} libnvinfer-vc-plugin-dev=${v} \
- libnvinfer-headers-dev=${v} libnvinfer-headers-plugin-dev=${v}; \
+ echo "Invalid CUDA_VERSION"; \
+ exit 1; \
fi
# Install PyPI packages
diff --git a/docker/ubuntu-18.04.Dockerfile b/docker/ubuntu-22.04.Dockerfile
similarity index 58%
rename from docker/ubuntu-18.04.Dockerfile
rename to docker/ubuntu-22.04.Dockerfile
index 8c246126..ebe90f71 100644
--- a/docker/ubuntu-18.04.Dockerfile
+++ b/docker/ubuntu-22.04.Dockerfile
@@ -1,5 +1,5 @@
#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -15,14 +15,28 @@
# limitations under the License.
#
-ARG CUDA_VERSION=12.0.1
+ARG CUDA_VERSION=12.3.2
-FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-devel-ubuntu18.04
+FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04
LABEL maintainer="NVIDIA CORPORATION"
-ENV TRT_VERSION 8.6.1.6
+ENV NV_CUDNN_VERSION 8.9.6.50
+ENV NV_CUDNN_PACKAGE_NAME "libcudnn8"
+
+ENV CUDA_VERSION_MAJOR_MINOR=12.2
+
+ENV NV_CUDNN_PACKAGE "libcudnn8=$NV_CUDNN_VERSION-1+cuda${CUDA_VERSION_MAJOR_MINOR}"
+ENV NV_CUDNN_PACKAGE_DEV "libcudnn8-dev=$NV_CUDNN_VERSION-1+cuda${CUDA_VERSION_MAJOR_MINOR}"
+
+ENV TRT_VERSION 10.0.0.6
SHELL ["/bin/bash", "-c"]
+RUN apt-get update && apt-get install -y --no-install-recommends \
+ ${NV_CUDNN_PACKAGE} \
+ ${NV_CUDNN_PACKAGE_DEV} \
+ && apt-mark hold ${NV_CUDNN_PACKAGE_NAME} \
+ && rm -rf /var/lib/apt/lists/*
+
# Setup user account
ARG uid=1000
ARG gid=1000
@@ -31,6 +45,12 @@ RUN usermod -aG sudo trtuser
RUN echo 'trtuser:nvidia' | chpasswd
RUN mkdir -p /workspace && chown trtuser /workspace
+# Required to build Ubuntu 22.04 without user prompts with DLFW container
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Update CUDA signing key
+RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
+
# Install required libraries
RUN apt-get update && apt-get install -y software-properties-common
RUN add-apt-repository ppa:ubuntu-toolchain-r/test
@@ -63,29 +83,26 @@ RUN apt-get install -y --no-install-recommends \
ln -s /usr/bin/pip3 pip;
# Install TensorRT
-RUN if [ "${CUDA_VERSION}" = "10.2" ] ; then \
- v="${TRT_VERSION%.*}-1+cuda${CUDA_VERSION}" &&\
- apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub &&\
- apt-get update &&\
- sudo apt-get install libnvinfer8=${v} libnvonnxparsers8=${v} libnvparsers8=${v} libnvinfer-plugin8=${v} \
- libnvinfer-dev=${v} libnvonnxparsers-dev=${v} libnvparsers-dev=${v} libnvinfer-plugin-dev=${v} \
- python3-libnvinfer=${v} libnvinfer-dispatch8=${v} libnvinfer-dispatch-dev=${v} libnvinfer-lean8=${v} \
- libnvinfer-lean-dev=${v} libnvinfer-vc-plugin8=${v} libnvinfer-vc-plugin-dev=${v} \
- libnvinfer-headers-dev=${v} libnvinfer-headers-plugin-dev=${v}; \
+RUN if [ "${CUDA_VERSION:0:2}" = "11" ]; then \
+ wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/tars/TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-11.8.tar.gz \
+ && tar -xf TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-11.8.tar.gz \
+ && cp -a TensorRT-10.0.0.6/lib/*.so* /usr/lib/x86_64-linux-gnu \
+ && pip install TensorRT-10.0.0.6/python/tensorrt-10.0.0b6-cp310-none-linux_x86_64.whl ;\
+elif [ "${CUDA_VERSION:0:2}" = "12" ]; then \
+ wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/tars/TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-12.4.tar.gz \
+ && tar -xf TensorRT-10.0.0.6.Linux.x86_64-gnu.cuda-12.4.tar.gz \
+ && cp -a TensorRT-10.0.0.6/lib/*.so* /usr/lib/x86_64-linux-gnu \
+ && pip install TensorRT-10.0.0.6/python/tensorrt-10.0.0b6-cp310-none-linux_x86_64.whl ;\
else \
- v="${TRT_VERSION}-1+cuda${CUDA_VERSION%.*}" &&\
- apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub &&\
- apt-get update &&\
- sudo apt-get -y install libnvinfer8=${v} libnvonnxparsers8=${v} libnvparsers8=${v} libnvinfer-plugin8=${v} \
- libnvinfer-dev=${v} libnvonnxparsers-dev=${v} libnvparsers-dev=${v} libnvinfer-plugin-dev=${v} \
- python3-libnvinfer=${v} libnvinfer-dispatch8=${v} libnvinfer-dispatch-dev=${v} libnvinfer-lean8=${v} \
- libnvinfer-lean-dev=${v} libnvinfer-vc-plugin8=${v} libnvinfer-vc-plugin-dev=${v} \
- libnvinfer-headers-dev=${v} libnvinfer-headers-plugin-dev=${v}; \
+ echo "Invalid CUDA_VERSION"; \
+ exit 1; \
fi
# Install PyPI packages
RUN pip3 install --upgrade pip
RUN pip3 install setuptools>=41.0.0
+COPY requirements.txt /tmp/requirements.txt
+RUN pip3 install -r /tmp/requirements.txt
RUN pip3 install jupyter jupyterlab
# Workaround to remove numpy installed with tensorflow
RUN pip3 install --upgrade numpy
diff --git a/docker/ubuntu-cross-aarch64.Dockerfile b/docker/ubuntu-cross-aarch64.Dockerfile
deleted file mode 100644
index cf5f31d9..00000000
--- a/docker/ubuntu-cross-aarch64.Dockerfile
+++ /dev/null
@@ -1,134 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-ARG CUDA_VERSION=11.4.1
-
-# Multi-arch container support available in non-cudnn containers.
-FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04
-LABEL maintainer="NVIDIA CORPORATION"
-
-ENV TRT_VERSION 8.5.2
-ENV DEBIAN_FRONTEND=noninteractive
-
-ARG uid=1000
-ARG gid=1000
-RUN groupadd -r -f -g ${gid} trtuser && useradd -o -r -l -u ${uid} -g ${gid} -ms /bin/bash trtuser
-RUN usermod -aG sudo trtuser
-RUN echo 'trtuser:nvidia' | chpasswd
-RUN mkdir -p /workspace && chown trtuser /workspace
-
-# Install requried libraries
-RUN apt-get update && apt-get install -y software-properties-common
-RUN add-apt-repository ppa:ubuntu-toolchain-r/test
-RUN apt-get update && apt-get install -y --no-install-recommends \
- libcurl4-openssl-dev \
- wget \
- git \
- pkg-config \
- python3 \
- python3-pip \
- python3-dev \
- python3-wheel \
- sudo \
- ssh \
- pbzip2 \
- pv \
- bzip2 \
- unzip \
- build-essential
-
-RUN cd /usr/local/bin &&\
- ln -s /usr/bin/python3 python &&\
- ln -s /usr/bin/pip3 pip
-RUN pip3 install --upgrade pip
-RUN pip3 install setuptools>=41.0.0
-
-# Install Cmake
-RUN cd /tmp && \
- wget https://github.com/Kitware/CMake/releases/download/v3.14.4/cmake-3.14.4-Linux-x86_64.sh && \
- chmod +x cmake-3.14.4-Linux-x86_64.sh && \
- ./cmake-3.14.4-Linux-x86_64.sh --prefix=/usr/local --exclude-subdir --skip-license && \
- rm ./cmake-3.14.4-Linux-x86_64.sh
-
-# Skip installing PyPI packages and NGC client on cross-build container
-
-COPY docker/jetpack_files /pdk_files
-COPY scripts/stubify.sh /pdk_files
-
-# Update CUDA signing keys
-RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
-RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
-
-# Install CUDA cross compile toolchain
-RUN dpkg -i /pdk_files/cuda-repo-cross-aarch64*.deb /pdk_files/cuda-repo-ubuntu*_amd64.deb \
- && cp /var/cuda-repo-cross*/cuda-*-keyring.gpg /usr/share/keyrings/ \
- && cp /var/cuda-repo-ubuntu*/cuda-*-keyring.gpg /usr/share/keyrings/ \
- && apt-get update \
- && apt-get install -y cuda-cross-aarch64 \
- && rm -rf /var/lib/apt/lists/*
-
-# Unpack cudnn
-RUN dpkg -x /pdk_files/cudnn-local-tegra-repo*.deb /pdk_files/cudnn_extract \
- && dpkg -x /pdk_files/cudnn_extract/var/cudnn-local-tegra-repo*/libcudnn[7-8]_*-1+cuda11.[0-9]_arm64.deb /pdk_files/cudnn \
- && dpkg -x /pdk_files/cudnn_extract/var/cudnn-local-tegra-repo*/libcudnn[7-8]-dev_*-1+cuda11.[0-9]_arm64.deb /pdk_files/cudnn \
- && cd /pdk_files/cudnn/usr/lib/aarch64-linux-gnu \
- && cd /pdk_files/cudnn \
- && ln -s usr/include/aarch64-linux-gnu include \
- && ln -s usr/lib/aarch64-linux-gnu lib \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_adv_infer_v[7-9].h /usr/include/cudnn_adv_infer.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_adv_train_v[7-9].h /usr/include/cudnn_adv_train.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_backend_v[7-9].h /usr/include/cudnn_backend.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_cnn_infer_v[7-9].h /usr/include/cudnn_cnn_infer.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_cnn_train_v[7-9].h /usr/include/cudnn_cnn_train.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_ops_infer_v[7-9].h /usr/include/cudnn_ops_infer.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_ops_train_v[7-9].h /usr/include/cudnn_ops_train.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_v[7-9].h /usr/include/cudnn.h \
- && ln -s /pdk_files/cudnn/usr/include/aarch64-linux-gnu/cudnn_version_v[7-9].h /usr/include/cudnn_version.h
-
-# Unpack libnvinfer
-RUN dpkg -x /pdk_files/nv-tensorrt-local-repo-l4t-[0-8].[0-9].[0-9]-cuda-11.[0-9]_*_arm64.deb /pdk_files/tensorrt
-RUN mv /pdk_files/tensorrt/var/nv-tensorrt-local-repo-l4t-[0-8].[0-9].[0-9]-cuda-11.[0-9]/*.deb /pdk_files
-RUN dpkg -x /pdk_files/libnvinfer[0-8]_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvinfer-dev_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvparsers[6-8]_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvparsers-dev_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvinfer-plugin[6-8]_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvinfer-plugin-dev_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvonnxparsers[6-8]_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt \
- && dpkg -x /pdk_files/libnvonnxparsers-dev_*-1+cuda11.[0-9]_arm64.deb /pdk_files/tensorrt
-
-# Clean up debs
-RUN rm -rf /pdk_files/*.deb
-
-# create stub libraries
-RUN cd /pdk_files/tensorrt \
- && ln -s usr/include/aarch64-linux-gnu include \
- && ln -s usr/lib/aarch64-linux-gnu lib \
- && cd lib \
- && mkdir stubs \
- && for x in nvinfer nvparsers nvinfer_plugin nvonnxparser; \
- do \
- CC=aarch64-linux-gnu-gcc /pdk_files/stubify.sh lib${x}.so stubs/lib${x}.so \
- ; done
-
-# Set environment and working directory
-ENV TRT_LIBPATH /pdk_files/tensorrt/lib
-ENV TRT_OSSPATH /workspace/TensorRT
-WORKDIR /workspace
-
-USER trtuser
-RUN ["/bin/bash"]
diff --git a/include/NvCaffeParser.h b/include/NvCaffeParser.h
deleted file mode 100644
index fc91e9b4..00000000
--- a/include/NvCaffeParser.h
+++ /dev/null
@@ -1,263 +0,0 @@
-/*
- * SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: Apache-2.0
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#ifndef NV_CAFFE_PARSER_H
-#define NV_CAFFE_PARSER_H
-
-#include "NvInfer.h"
-
-//!
-//! \file NvCaffeParser.h
-//!
-//! This is the API for the Caffe Parser
-//!
-
-//!
-//! \namespace nvcaffeparser1
-//!
-//! \brief The TensorRT Caffe parser API namespace.
-//!
-namespace nvcaffeparser1
-{
-
-//!
-//! \class IBlobNameToTensor
-//!
-//! \brief Object used to store and query Tensors after they have been extracted from a Caffe model using the ICaffeParser.
-//!
-//! \note The lifetime of IBlobNameToTensor is the same as the lifetime of its parent ICaffeParser.
-//!
-//! \see nvcaffeparser1::ICaffeParser
-//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-//!
-class IBlobNameToTensor
-{
-public:
- //! \brief Given a blob name, returns a pointer to a ITensor object.
- //!
- //! \param name Caffe blob name for which the user wants the corresponding ITensor.
- //!
- //! \return ITensor* corresponding to the queried name. If no such ITensor exists, then nullptr is returned.
- //!
- virtual nvinfer1::ITensor* find(char const* name) const noexcept = 0;
-
-protected:
- virtual ~IBlobNameToTensor() {}
-};
-
-//!
-//! \class IBinaryProtoBlob
-//!
-//! \brief Object used to store and query data extracted from a binaryproto file using the ICaffeParser.
-//!
-//! \see nvcaffeparser1::ICaffeParser
-//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-//!
-class IBinaryProtoBlob
-{
-public:
- virtual void const* getData() noexcept = 0;
- virtual nvinfer1::Dims4 getDimensions() noexcept = 0;
- virtual nvinfer1::DataType getDataType() noexcept = 0;
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED virtual void destroy() noexcept = 0;
- virtual ~IBinaryProtoBlob() noexcept = default;
-};
-
-//!
-//! \class IPluginFactoryV2
-//!
-//! \brief Plugin factory used to configure plugins.
-//!
-class IPluginFactoryV2
-{
-public:
- //!
- //! \brief A user implemented function that determines if a layer configuration is provided by an IPluginV2.
- //!
- //! \param layerName Name of the layer which the user wishes to validate.
- //!
- virtual bool isPluginV2(char const* layerName) noexcept = 0;
-
- //!
- //! \brief Creates a plugin.
- //!
- //! \param layerName Name of layer associated with the plugin.
- //! \param weights Weights used for the layer.
- //! \param nbWeights Number of weights.
- //! \param libNamespace Library Namespace associated with the plugin object
- //!
- virtual nvinfer1::IPluginV2* createPlugin(char const* layerName, nvinfer1::Weights const* weights,
- int32_t nbWeights, char const* libNamespace = "") noexcept = 0;
-
- virtual ~IPluginFactoryV2() noexcept = default;
-};
-//!
-//! \class ICaffeParser
-//!
-//! \brief Class used for parsing Caffe models.
-//!
-//! Allows users to export models trained using Caffe to TRT.
-//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-//!
-class ICaffeParser
-{
-public:
- //!
- //! \brief Parse a prototxt file and a binaryproto Caffe model to extract
- //! network definition and weights associated with the network, respectively.
- //!
- //! \param deploy The plain text, prototxt file used to define the network definition.
- //! \param model The binaryproto Caffe model that contains the weights associated with the network.
- //! \param network Network in which the CaffeParser will fill the layers.
- //! \param weightType The type to which the weights will transformed.
- //!
- //! \return A pointer to an IBlobNameToTensor object that contains the extracted data.
- //!
- //! \see nvcaffeparser1::IBlobNameToTensor
- //!
- virtual IBlobNameToTensor const* parse(char const* deploy, char const* model, nvinfer1::INetworkDefinition& network,
- nvinfer1::DataType weightType) noexcept = 0;
-
- //!
- //! \brief Parse a deploy prototxt and a binaryproto Caffe model from memory buffers to extract
- //! network definition and weights associated with the network, respectively.
- //!
- //! \param deployBuffer The plain text deploy prototxt used to define the network definition.
- //! \param deployLength The length of the deploy buffer.
- //! \param modelBuffer The binaryproto Caffe memory buffer that contains the weights associated with the network.
- //! \param modelLength The length of the model buffer.
- //! \param network Network in which the CaffeParser will fill the layers.
- //! \param weightType The type to which the weights will transformed.
- //!
- //! \return A pointer to an IBlobNameToTensor object that contains the extracted data.
- //!
- //! \see nvcaffeparser1::IBlobNameToTensor
- //!
- virtual IBlobNameToTensor const* parseBuffers(uint8_t const* deployBuffer, std::size_t deployLength,
- uint8_t const* modelBuffer, std::size_t modelLength, nvinfer1::INetworkDefinition& network,
- nvinfer1::DataType weightType) noexcept = 0;
-
- //!
- //! \brief Parse and extract data stored in binaryproto file.
- //!
- //! The binaryproto file contains data stored in a binary blob. parseBinaryProto() converts it
- //! to an IBinaryProtoBlob object which gives the user access to the data and meta-data about data.
- //!
- //! \param fileName Path to file containing binary proto.
- //!
- //! \return A pointer to an IBinaryProtoBlob object that contains the extracted data.
- //!
- //! \see nvcaffeparser1::IBinaryProtoBlob
- //!
- virtual IBinaryProtoBlob* parseBinaryProto(char const* fileName) noexcept = 0;
-
- //!
- //! \brief Set buffer size for the parsing and storage of the learned model.
- //!
- //! \param size The size of the buffer specified as the number of bytes.
- //!
- //! \note Default size is 2^30 bytes.
- //!
- virtual void setProtobufBufferSize(size_t size) noexcept = 0;
-
- //!
- //! \brief Destroy this ICaffeParser object.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED virtual void destroy() noexcept = 0;
-
- //!
- //! \brief Set the IPluginFactoryV2 used to create the user defined pluginV2 objects.
- //!
- //! \param factory Pointer to an instance of the user implementation of IPluginFactoryV2.
- //!
- virtual void setPluginFactoryV2(IPluginFactoryV2* factory) noexcept = 0;
-
- //!
- //! \brief Set the namespace used to lookup and create plugins in the network.
- //!
- virtual void setPluginNamespace(char const* libNamespace) noexcept = 0;
-
- virtual ~ICaffeParser() noexcept = default;
-
-public:
- //!
- //! \brief Set the ErrorRecorder for this interface
- //!
- //! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
- //! This function will call incRefCount of the registered ErrorRecorder at least once. Setting
- //! recorder to nullptr unregisters the recorder with the interface, resulting in a call to decRefCount if
- //! a recorder has been registered.
- //!
- //! If an error recorder is not set, messages will be sent to the global log stream.
- //!
- //! \param recorder The error recorder to register with this interface.
- //!
- //! \see getErrorRecorder()
- //!
- virtual void setErrorRecorder(nvinfer1::IErrorRecorder* recorder) noexcept = 0;
-
- //!
- //! \brief get the ErrorRecorder assigned to this interface.
- //!
- //! Retrieves the assigned error recorder object for the given class. A
- //! nullptr will be returned if setErrorRecorder has not been called.
- //!
- //! \return A pointer to the IErrorRecorder object that has been registered.
- //!
- //! \see setErrorRecorder()
- //!
- virtual nvinfer1::IErrorRecorder* getErrorRecorder() const noexcept = 0;
-};
-
-//!
-//! \brief Creates a ICaffeParser object.
-//!
-//! \return A pointer to the ICaffeParser object is returned.
-//!
-//! \see nvcaffeparser1::ICaffeParser
-//!
-//! \deprecated ICaffeParser will be removed in TensorRT 9.0. Plan to migrate your workflow to
-//! use nvonnxparser::IParser for deployment.
-//!
-TENSORRTAPI ICaffeParser* createCaffeParser() noexcept;
-
-//!
-//! \brief Shuts down protocol buffers library.
-//!
-//! \note No part of the protocol buffers library can be used after this function is called.
-//!
-TENSORRTAPI void shutdownProtobufLibrary() noexcept;
-} // namespace nvcaffeparser1
-
-//!
-//! Internal C entry point for creating ICaffeParser.
-//! @private
-//!
-extern "C" TENSORRTAPI void* createNvCaffeParser_INTERNAL() noexcept;
-#endif
diff --git a/include/NvInfer.h b/include/NvInfer.h
index 63c0b7f8..7fff86b1 100644
--- a/include/NvInfer.h
+++ b/include/NvInfer.h
@@ -57,7 +57,7 @@ namespace nvinfer1
enum class LayerType : int32_t
{
kCONVOLUTION = 0, //!< Convolution layer.
- kFULLY_CONNECTED = 1, //!< Fully connected layer.
+    kCAST = 1,             //!< Cast layer.
kACTIVATION = 2, //!< Activation layer.
kPOOLING = 3, //!< Pooling layer.
kLRN = 4, //!< LRN layer.
@@ -76,34 +76,33 @@ enum class LayerType : int32_t
kMATRIX_MULTIPLY = 17, //!< Matrix multiply layer.
kRAGGED_SOFTMAX = 18, //!< Ragged softmax layer.
kCONSTANT = 19, //!< Constant layer.
- kRNN_V2 = 20, //!< RNNv2 layer.
- kIDENTITY = 21, //!< Identity layer.
- kPLUGIN_V2 = 22, //!< PluginV2 layer.
- kSLICE = 23, //!< Slice layer.
- kSHAPE = 24, //!< Shape layer.
- kPARAMETRIC_RELU = 25, //!< Parametric ReLU layer.
- kRESIZE = 26, //!< Resize Layer.
- kTRIP_LIMIT = 27, //!< Loop Trip limit layer
- kRECURRENCE = 28, //!< Loop Recurrence layer
- kITERATOR = 29, //!< Loop Iterator layer
- kLOOP_OUTPUT = 30, //!< Loop output layer
- kSELECT = 31, //!< Select layer.
- kFILL = 32, //!< Fill layer
- kQUANTIZE = 33, //!< Quantize layer
- kDEQUANTIZE = 34, //!< Dequantize layer
- kCONDITION = 35, //!< Condition layer
- kCONDITIONAL_INPUT = 36, //!< Conditional Input layer
- kCONDITIONAL_OUTPUT = 37, //!< Conditional Output layer
- kSCATTER = 38, //!< Scatter layer
- kEINSUM = 39, //!< Einsum layer
- kASSERTION = 40, //!< Assertion layer
- kONE_HOT = 41, //!< OneHot layer
- kNON_ZERO = 42, //!< NonZero layer
- kGRID_SAMPLE = 43, //!< Grid sample layer
- kNMS = 44, //!< NMS layer
- kREVERSE_SEQUENCE = 45, //!< Reverse sequence layer
- kNORMALIZATION = 46, //!< Normalization layer
- kCAST = 47, //!< Cast layer
+ kIDENTITY = 20, //!< Identity layer.
+ kPLUGIN_V2 = 21, //!< PluginV2 layer.
+ kSLICE = 22, //!< Slice layer.
+ kSHAPE = 23, //!< Shape layer.
+ kPARAMETRIC_RELU = 24, //!< Parametric ReLU layer.
+ kRESIZE = 25, //!< Resize Layer.
+ kTRIP_LIMIT = 26, //!< Loop Trip limit layer
+ kRECURRENCE = 27, //!< Loop Recurrence layer
+ kITERATOR = 28, //!< Loop Iterator layer
+ kLOOP_OUTPUT = 29, //!< Loop output layer
+ kSELECT = 30, //!< Select layer.
+ kFILL = 31, //!< Fill layer
+ kQUANTIZE = 32, //!< Quantize layer
+ kDEQUANTIZE = 33, //!< Dequantize layer
+ kCONDITION = 34, //!< Condition layer
+ kCONDITIONAL_INPUT = 35, //!< Conditional Input layer
+ kCONDITIONAL_OUTPUT = 36, //!< Conditional Output layer
+ kSCATTER = 37, //!< Scatter layer
+ kEINSUM = 38, //!< Einsum layer
+ kASSERTION = 39, //!< Assertion layer
+ kONE_HOT = 40, //!< OneHot layer
+ kNON_ZERO = 41, //!< NonZero layer
+ kGRID_SAMPLE = 42, //!< Grid sample layer
+ kNMS = 43, //!< NMS layer
+ kREVERSE_SEQUENCE = 44, //!< Reverse sequence layer
+ kNORMALIZATION = 45, //!< Normalization layer
+ kPLUGIN_V3 = 46 //!< PluginV3 layer.
};
//!
@@ -114,7 +113,7 @@ enum class LayerType : int32_t
template <>
constexpr inline int32_t EnumMax() noexcept
{
- return 48;
+ return 47;
}
//!
@@ -132,18 +131,20 @@ using TensorFormats = uint32_t;
//!
enum class ActivationType : int32_t
{
- kRELU = 0, //!< Rectified linear activation.
- kSIGMOID = 1, //!< Sigmoid activation.
- kTANH = 2, //!< TanH activation.
- kLEAKY_RELU = 3, //!< LeakyRelu activation: x>=0 ? x : alpha * x.
- kELU = 4, //!< Elu activation: x>=0 ? x : alpha * (exp(x) - 1).
- kSELU = 5, //!< Selu activation: x>0 ? beta * x : beta * (alpha*exp(x) - alpha)
- kSOFTSIGN = 6, //!< Softsign activation: x / (1+|x|)
- kSOFTPLUS = 7, //!< Parametric softplus activation: alpha*log(exp(beta*x)+1)
- kCLIP = 8, //!< Clip activation: max(alpha, min(beta, x))
- kHARD_SIGMOID = 9, //!< Hard sigmoid activation: max(0, min(1, alpha*x+beta))
- kSCALED_TANH = 10, //!< Scaled tanh activation: alpha*tanh(beta*x)
- kTHRESHOLDED_RELU = 11 //!< Thresholded ReLU activation: x>alpha ? x : 0
+ kRELU = 0, //!< Rectified linear activation.
+ kSIGMOID = 1, //!< Sigmoid activation.
+ kTANH = 2, //!< TanH activation.
+ kLEAKY_RELU = 3, //!< LeakyRelu activation: x>=0 ? x : alpha * x.
+ kELU = 4, //!< Elu activation: x>=0 ? x : alpha * (exp(x) - 1).
+ kSELU = 5, //!< Selu activation: x>0 ? beta * x : beta * (alpha*exp(x) - alpha)
+ kSOFTSIGN = 6, //!< Softsign activation: x / (1+|x|)
+ kSOFTPLUS = 7, //!< Parametric softplus activation: alpha*log(exp(beta*x)+1)
+ kCLIP = 8, //!< Clip activation: max(alpha, min(beta, x))
+ kHARD_SIGMOID = 9, //!< Hard sigmoid activation: max(0, min(1, alpha*x+beta))
+ kSCALED_TANH = 10, //!< Scaled tanh activation: alpha*tanh(beta*x)
+ kTHRESHOLDED_RELU = 11, //!< Thresholded ReLU activation: x>alpha ? x : 0
+ kGELU_ERF = 12, //!< GELU erf activation: 0.5 * x * (1 + erf(sqrt(0.5) * x))
+ kGELU_TANH = 13 //!< GELU tanh activation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (0.044715F * pow(x, 3) + x)))
};
namespace impl
@@ -156,7 +157,7 @@ namespace impl
template <>
struct EnumMaxImpl
{
- static constexpr int32_t kVALUE = 12;
+ static constexpr int32_t kVALUE = 14;
};
} // namespace impl
@@ -224,7 +225,7 @@ class ITensor : public INoCopy
//!
//! \see getDimensions()
//!
- void setDimensions(Dims dimensions) noexcept
+ void setDimensions(Dims const& dimensions) noexcept
{
mImpl->setDimensions(dimensions);
}
@@ -235,6 +236,7 @@ class ITensor : public INoCopy
//! \return The dimensions of the tensor.
//!
//! \warning getDimensions() returns a -1 for dimensions that are derived from a wildcard dimension.
+ //!
//! \see setDimensions()
//!
Dims getDimensions() const noexcept
@@ -301,46 +303,41 @@ class ITensor : public INoCopy
}
//!
- //! \brief Set whether to enable broadcast of tensor across the batch.
- //!
- //! When a tensor is broadcast across a batch, it has the same value for every member in the batch.
- //! Memory is only allocated once for the single member.
- //!
- //! This method is only valid for network input tensors, since the flags of layer output tensors are inferred based
- //! on layer inputs and parameters.
- //! If this state is modified for a tensor in the network, the states of all dependent tensors will be recomputed.
- //! If the tensor is for an explicit batch network, then this function does nothing.
+ //! \brief Set whether to enable broadcast of tensor across the implicit batch dimension.
//!
- //! \warning The broadcast flag is ignored when using explicit batch network mode.
+ //! \warning This method has no effect other than issuing a warning.
//!
- //! \param broadcastAcrossBatch Whether to enable broadcast of tensor across the batch.
+ //! \param broadcastAcrossBatch Whether to broadcast the tensor across the implicit
+ //! batch dimension that was a feature of TensorRT 9.x and prior.
//!
//! \see getBroadcastAcrossBatch()
//!
- void setBroadcastAcrossBatch(bool broadcastAcrossBatch) noexcept
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch is not supported since TensorRT 10.0.
+ //!
+ TRT_DEPRECATED void setBroadcastAcrossBatch(bool broadcastAcrossBatch) noexcept
{
mImpl->setBroadcastAcrossBatch(broadcastAcrossBatch);
}
//!
- //! \brief Check if tensor is broadcast across the batch.
- //!
- //! When a tensor is broadcast across a batch, it has the same value for every member in the batch.
- //! Memory is only allocated once for the single member. If the network is in explicit batch mode,
- //! this function returns true if the leading dimension is 1.
+ //! \brief Check if tensor is broadcast across the implicit batch dimension.
//!
- //! \return True if tensor is broadcast across the batch, false otherwise.
+ //! \return Always false since TensorRT 10.0 does not support an implicit batch dimension.
//!
//! \see setBroadcastAcrossBatch()
//!
- bool getBroadcastAcrossBatch() const noexcept
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch is not supported since TensorRT 10.0.
+ //!
+ TRT_DEPRECATED bool getBroadcastAcrossBatch() const noexcept
{
return mImpl->getBroadcastAcrossBatch();
}
//!
//! \brief Get the storage location of a tensor.
+ //!
//! \return The location of tensor data.
+ //!
//! \see setLocation()
//!
TensorLocation getLocation() const noexcept
@@ -350,6 +347,7 @@ class ITensor : public INoCopy
//!
//! \brief Set the storage location of a tensor
+ //!
//! \param location the location of tensor data
//!
//! Only network input tensors for storing sequence lengths for RNNv2 are supported.
@@ -358,7 +356,10 @@ class ITensor : public INoCopy
//!
//! \see getLocation()
//!
- void setLocation(TensorLocation location) noexcept
+ //! \deprecated Deprecated in TensorRT 10.0. RNNv2 is not supported and the location must
+ //! always be TensorLocation::kDEVICE since TensorRT 10.0.
+ //!
+ TRT_DEPRECATED void setLocation(TensorLocation location) noexcept
{
mImpl->setLocation(location);
}
@@ -403,7 +404,7 @@ class ITensor : public INoCopy
//!
//! \brief Set allowed formats for this tensor. By default all formats are allowed.
- //! Shape tensors (for which isShapeTensor() returns true) may only have row major linear format.
+ //! Shape tensors (for which isShapeTensor() returns true) may only have row-major linear format.
//!
//! When running network on DLA and the build option kGPU_FALLBACK is not specified, if DLA format(kCHW4 with Int8,
//! kCHW4 with FP16, kCHW16 with FP16, kCHW32 with Int8) is set, the input format is treated as native DLA format with
@@ -413,6 +414,7 @@ class ITensor : public INoCopy
//! \param formats A bitmask of TensorFormat values that are supported for this tensor.
//!
//! \see ITensor::getAllowedFormats()
+ //!
//! \see TensorFormats
//!
void setAllowedFormats(TensorFormats formats) noexcept
@@ -422,7 +424,7 @@ class ITensor : public INoCopy
//!
//! \brief Get a bitmask of TensorFormat values that the tensor supports.
- //! For a shape tensor, only row major linear format is allowed.
+ //! For a shape tensor, only row-major linear format is allowed.
//!
//! \return The value specified by setAllowedFormats or all possible formats.
//!
@@ -437,7 +439,7 @@ class ITensor : public INoCopy
//! \brief Whether the tensor is a shape tensor.
//!
//! A shape tensor is a tensor that is related to shape calculations.
- //! It must have type Int32, Bool, or Float, and its shape must be determinable at build time.
+ //! It must have type Int32, Int64, Bool, or Float, and its shape must be determinable at build time.
//! Furthermore, it must be needed as a shape tensor, either marked as a network shape
//! output via markOutputForShapes(), or as a layer input that is required to be a shape
//! tensor, such as the second input to IShuffleLayer. Some layers are "polymorphic" in
@@ -453,15 +455,11 @@ class ITensor : public INoCopy
//! cause all three tensors to be shape tensors, because IShuffleLayer requires that its
//! second optional input be a shape tensor, and IElementWiseLayer is "polymorphic".
//!
- //! If a tensor is a shape tensor and becomes an engine input or output,
- //! then ICudaEngine::isShapeBinding will be true for that tensor.
- //! Such a shape tensor must have type Int32.
- //!
//! It is possible for a tensor to be both a shape tensor and an execution tensor.
//!
//! \return True if tensor is a shape tensor, false otherwise.
//!
- //! \see INetworkDefinition::markOutputForShapes(), ICudaEngine::isShapeBinding()
+ //! \see INetworkDefinition::markOutputForShapes()
//!
bool isShapeTensor() const noexcept
{
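As a sketch of the IShuffleLayer case described above (assuming `data` and `newShape` are `ITensor*` already in the network, with `newShape` a 1-D Int32/Int64 tensor computable at build time):

    // Supplying a tensor as the second (reshape-dimensions) input of a shuffle layer makes it a
    // shape tensor, so isShapeTensor() will return true for newShape and tensors it depends on.
    nvinfer1::IShuffleLayer* shuffle = network->addShuffle(*data);
    shuffle->setInput(1, *newShape);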
@@ -478,8 +476,6 @@ class ITensor : public INoCopy
//! For example, if a partially built network has no path from a tensor to a network output,
//! isExecutionTensor() returns false. Completing the path would cause it to become true.
//!
- //! If a tensor is an execution tensor and becomes an engine input or output,
- //! then ICudaEngine::isExecutionBinding will be true for that tensor.
//!
//! A tensor with isShapeTensor() == false and isExecutionTensor() == false
//! can still show up as an input to the engine if its dimensions are required.
@@ -595,7 +591,7 @@ class ILayer : public INoCopy
//! \param index The index of the input tensor.
//!
//! \return The input tensor, or nullptr if the index is out of range or the tensor is optional
- //! (\ref ISliceLayer and \ref IRNNv2Layer).
+ //! (\ref ISliceLayer).
//!
ITensor* getInput(int32_t index) const noexcept
{
@@ -613,8 +609,7 @@ class ILayer : public INoCopy
//!
//! \brief Get the layer output corresponding to the given index.
//!
- //! \return The indexed output tensor, or nullptr if the index is out of range or the tensor is optional
- //! (\ref IRNNv2Layer).
+ //! \return The indexed output tensor, or nullptr if the index is out of range or the tensor is optional.
//!
ITensor* getOutput(int32_t index) const noexcept
{
@@ -639,9 +634,9 @@ class ILayer : public INoCopy
}
//!
- //! \brief Set the computational precision of this layer
+ //! \brief Set the preferred or required computational precision of this layer in a weakly-typed network.
//!
- //! Setting the precision allows TensorRT to choose an implementation which run at this computational precision.
+ //! Setting the precision directs TensorRT to choose an implementation that runs at this computational precision.
//! TensorRT could still choose a non-conforming fastest implementation that ignores the requested precision.
//! To force choosing an implementation with the requested precision, set exactly one of the following flags,
//! which differ in what happens if no such implementation exists:
@@ -657,6 +652,10 @@ class ILayer : public INoCopy
    //! For an IIdentityLayer: If it casts to/from float/half/int8/uint8, the precision must be one of those types;
    //! otherwise it must be either the input or output type.
//!
+ //! Strongly-typed networks reject calls to method setPrecision. In strongly-typed networks, the computation
+ //! precision is typically controlled by casting the input tensors to the desired type. The exception is
+ //! INormalizationLayer, which has a method setComputePrecision().
+ //!
//! \param dataType the computational precision.
//!
//! \see getPrecision() precisionIsSet() resetPrecision()
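For example, a minimal weakly-typed-network sketch (assuming `layer` is an `ILayer*` and `config` is an `IBuilderConfig*`):

    // Request FP16 computation for this layer and make the constraint binding at build time.
    layer->setPrecision(nvinfer1::DataType::kHALF);
    config->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);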
@@ -701,12 +700,13 @@ class ILayer : public INoCopy
}
//!
- //! \brief Set the output type of this layer
+ //! \brief Set the output type of this layer in a weakly-typed network.
//!
//! Setting the output type constrains TensorRT to choose implementations which generate output data with the
//! given type. If it is not set, TensorRT will select output type based on layer computational precision. TensorRT
//! could still choose non-conforming output type based on fastest implementation. To force choosing the requested
- //! output type, set exactly one of the following flags, which differ in what happens if no such implementation exists:
+ //! output type, set exactly one of the following flags, which differ in what happens if no such implementation
+ //! exists:
//!
//! * BuilderFlag::kOBEY_PRECISION_CONSTRAINTS - build fails with an error message.
//!
@@ -728,6 +728,14 @@ class ILayer : public INoCopy
//! is marked as a network output, since only setType() [but not setOutputType()] will affect the data
//! representation in the corresponding output binding.
//!
+ //! Strongly-typed networks reject calls to method setOutputType. Instead, the output type can be set
+ //! only for layers that define method setToType(). Those layers are:
+ //!
+ //! * ICastLayer
+ //! * IDequantizeLayer
+ //! * IFillLayer
+ //! * IQuantizeLayer
+ //!
//! \param index the index of the output to set
//! \param dataType the type of the output
//!
@@ -742,6 +750,7 @@ class ILayer : public INoCopy
//! \brief get the output type of this layer
//!
//! \param index the index of the output
+ //!
//! \return the output precision. If no precision has been set, DataType::kFLOAT will be returned,
//! unless the output type is inherently DataType::kINT32.
//!
@@ -756,6 +765,7 @@ class ILayer : public INoCopy
//! \brief whether the output type has been set for this layer
//!
//! \param index the index of the output
+ //!
//! \return whether the output type has been explicitly set
//!
//! \see setOutputType() getOutputType() resetOutputType()
@@ -819,8 +829,8 @@ class ILayer : public INoCopy
//! \brief Enumerates the modes of padding to perform in convolution, deconvolution and pooling layers;
//! the padding mode takes precedence if setPaddingMode() and setPrePadding() are also used.
//!
-//! There are three padding styles, EXPLICIT, SAME, and CAFFE, with each style having two variants.
-//! The EXPLICIT and CAFFE styles determine if the final sampling location is used or not.
+//! There are two padding styles, EXPLICIT and SAME, with each style having two variants.
+//! The EXPLICIT style determines if the final sampling location is used or not.
//! The SAME style determines if the asymmetry in the padding is on the pre or post padding.
//!
//! \code
@@ -842,18 +852,10 @@ class ILayer : public INoCopy
//! \code
//! O = floor((M - DK) / S) + 1
//! \endcode
-//! - CAFFE_ROUND_DOWN:
-//! \code
-//! O = floor((I + B * 2 - DK) / S) + 1
-//! \endcode
//! - EXPLICIT_ROUND_UP:
//! \code
//! O = ceil((M - DK) / S) + 1
//! \endcode
-//! - CAFFE_ROUND_UP:
-//! \code
-//! O = ceil((I + B * 2 - DK) / S) + 1
-//! \endcode
//! - SAME_UPPER:
//! \code
//! O = ceil(I / S)
@@ -871,9 +873,7 @@ class ILayer : public INoCopy
//!
//! Formulas for Deconvolution:
//! - EXPLICIT_ROUND_DOWN:
-//! - CAFFE_ROUND_DOWN:
//! - EXPLICIT_ROUND_UP:
-//! - CAFFE_ROUND_UP:
//! \code
//! O = (I - 1) * S + DK - (B + A)
//! \endcode
@@ -915,14 +915,6 @@ class ILayer : public INoCopy
//! A = floor(P / 2)
//! B = P - A
//! \endcode
-//! - CAFFE_ROUND_DOWN:
-//! \code
-//! EXPLICIT_ROUND_DOWN - ((EXPLICIT_ROUND_DOWN - 1) * S >= I + B)
-//! \endcode
-//! - CAFFE_ROUND_UP:
-//! \code
-//! EXPLICIT_ROUND_UP - ((EXPLICIT_ROUND_UP - 1) * S >= I + B)
-//! \endcode
//!
//! Pooling Example 1:
//! \code
@@ -987,62 +979,12 @@ class ILayer : public INoCopy
//! Given I = {6, 6}, B = {3, 3}, A = {3, 3}, S = {2, 2}, F = {3, 3}. What is O?
//! \endcode
//!
-//! - CAFFE_ROUND_DOWN:
-//! \code
-//! Computation:
-//! M = {6, 6} + {3, 3} + {3, 3} ==> {12, 12}
-//! EXPLICIT_ROUND_DOWN ==> floor((M - F) / S) + 1
-//! ==> floor(({12, 12} - {3, 3}) / {2, 2}) + {1, 1}
-//! ==> {5, 5}
-//! DIFF = (((EXPLICIT_ROUND_DOWN - 1) * S >= I + B) ? {1, 1} : {0, 0})
-//! ==> ({5, 5} - {1, 1}) * {2, 2} >= {6, 6} + {3, 3} ? {1, 1} : {0,0}
-//! ==> {0, 0}
-//! O ==> EXPLICIT_ROUND_DOWN - DIFF
-//! ==> {5, 5} - {0, 0}
-//! ==> {5, 5}
-//! \endcode
-//! - CAFFE_ROUND_UP:
-//! \code
-//! Computation:
-//! M = {6, 6} + {3, 3} + {3, 3} ==> {12, 12}
-//! EXPLICIT_ROUND_UP ==> ceil((M - F) / S) + 1
-//! ==> ceil(({12, 12} - {3, 3}) / {2, 2}) + {1, 1}
-//! ==> {6, 6}
-//! DIFF = (((EXPLICIT_ROUND_UP - 1) * S >= I + B) ? {1, 1} : {0, 0})
-//! ==> ({6, 6} - {1, 1}) * {2, 2} >= {6, 6} + {3, 3} ? {1, 1} : {0,0}
-//! ==> {1, 1}
-//! O ==> EXPLICIT_ROUND_UP - DIFF
-//! ==> {6, 6} - {1, 1}
-//! ==> {5, 5}
-//! \endcode
-//!
-//! The sample points are {0, 2, 4, 6, 8} in each dimension.
-//! CAFFE_ROUND_DOWN and CAFFE_ROUND_UP have two restrictions each on usage with pooling operations.
-//! This will cause getDimensions to return an empty dimension and also to reject the network
-//! at validation time.
-//! For more information on original reference code, see
-//! https://github.com/BVLC/caffe/blob/master/src/caffe/layers/pooling_layer.cpp
-//!
-//! - Restriction 1:
-//! \code
-//! CAFFE_ROUND_DOWN: B >= F is an error if (B - S) < F
-//! CAFFE_ROUND_UP: (B + S) >= (F + 1) is an error if B < (F + 1)
-//! \endcode
-//!
-//! - Restriction 2:
-//! \code
-//! CAFFE_ROUND_DOWN: (B - S) >= F is an error if B >= F
-//! CAFFE_ROUND_UP: B >= (F + 1) is an error if (B + S) >= (F + 1)
-//! \endcode
-//!
enum class PaddingMode : int32_t
{
kEXPLICIT_ROUND_DOWN = 0, //!< Use explicit padding, rounding output size down.
kEXPLICIT_ROUND_UP = 1, //!< Use explicit padding, rounding output size up.
kSAME_UPPER = 2, //!< Use SAME padding, with prePadding <= postPadding.
kSAME_LOWER = 3, //!< Use SAME padding, with prePadding >= postPadding.
- kCAFFE_ROUND_DOWN = 4, //!< Use CAFFE padding, rounding output size down, uses prePadding value.
- kCAFFE_ROUND_UP = 5 //!< Use CAFFE padding, rounding output size up, uses prePadding value.
};
namespace impl
@@ -1055,7 +997,7 @@ namespace impl
template <>
struct EnumMaxImpl
{
- static constexpr int32_t kVALUE = 6;
+ static constexpr int32_t kVALUE = 4;
};
} // namespace impl
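For instance, a minimal sketch of selecting a padding mode (assuming `conv` is an `IConvolutionLayer*`):

    // SAME_UPPER: output spatial size is ceil(I / S) regardless of kernel size,
    // e.g. I = 6, S = 2 gives O = 3; any asymmetric padding goes to the post-padding side.
    conv->setPaddingMode(nvinfer1::PaddingMode::kSAME_UPPER);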
@@ -1074,32 +1016,6 @@ struct EnumMaxImpl
class IConvolutionLayer : public ILayer
{
public:
- //!
- //! \brief Set the HW kernel size of the convolution.
- //!
- //! If executing this layer on DLA, both height and width of kernel size must be in the range [1,32].
- //!
- //! \see getKernelSize()
- //!
- //! \deprecated Superseded by setKernelSizeNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setKernelSize(DimsHW kernelSize) noexcept
- {
- mImpl->setKernelSize(kernelSize);
- }
-
- //!
- //! \brief Get the HW kernel size of the convolution.
- //!
- //! \see setKernelSize()
- //!
- //! \deprecated Superseded by getKernelSizeNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getKernelSize() const noexcept
- {
- return mImpl->getKernelSize();
- }
-
//!
//! \brief Set the number of output maps for the convolution.
//!
@@ -1107,7 +1023,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getNbOutputMaps()
//!
- void setNbOutputMaps(int32_t nbOutputMaps) noexcept
+ void setNbOutputMaps(int64_t nbOutputMaps) noexcept
{
mImpl->setNbOutputMaps(nbOutputMaps);
}
@@ -1117,69 +1033,11 @@ class IConvolutionLayer : public ILayer
//!
//! \see setNbOutputMaps()
//!
- int32_t getNbOutputMaps() const noexcept
+ int64_t getNbOutputMaps() const noexcept
{
return mImpl->getNbOutputMaps();
}
- //!
- //! \brief Get the stride of the convolution.
- //!
- //! Default: (1,1)
- //!
- //! If executing this layer on DLA, both height and width of stride must be in the range [1,8].
- //!
- //! \see getStride()
- //!
- //! \deprecated Superseded by setStrideNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setStride(DimsHW stride) noexcept
- {
- mImpl->setStride(stride);
- }
-
- //!
- //! \brief Get the stride of the convolution.
- //!
- //! \deprecated Superseded by getStrideNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getStride() const noexcept
- {
- return mImpl->getStride();
- }
-
- //!
- //! \brief Set the padding of the convolution.
- //!
- //! The input will be zero-padded by this number of elements in the height and width directions.
- //! Padding is symmetric.
- //!
- //! Default: (0,0)
- //!
- //! If executing this layer on DLA, both height and width of padding must be in the range [0,31],
- //! and the padding size must be less than the kernel size.
- //!
- //! \see getPadding()
- //!
- //! \deprecated Superseded by setPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setPadding(DimsHW padding) noexcept
- {
- return mImpl->setPadding(padding);
- }
-
- //!
- //! \brief Get the padding of the convolution. If the padding is asymmetric, the pre-padding is returned.
- //!
- //! \see setPadding()
- //!
- //! \deprecated Superseded by getPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getPadding() const noexcept
- {
- return mImpl->getPadding();
- }
-
//!
//! \brief Set the number of groups for a convolution.
//!
@@ -1195,7 +1053,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getNbGroups()
//!
- void setNbGroups(int32_t nbGroups) noexcept
+ void setNbGroups(int64_t nbGroups) noexcept
{
mImpl->setNbGroups(nbGroups);
}
@@ -1205,7 +1063,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see setNbGroups()
//!
- int32_t getNbGroups() const noexcept
+ int64_t getNbGroups() const noexcept
{
return mImpl->getNbGroups();
}
@@ -1259,34 +1117,6 @@ class IConvolutionLayer : public ILayer
return mImpl->getBiasWeights();
}
- //!
- //! \brief Set the dilation for a convolution.
- //!
- //! Default: (1,1)
- //!
- //! If executing this layer on DLA, both height and width must be in the range [1,32].
- //!
- //! \see getDilation()
- //!
- //! \deprecated Superseded by setDilationNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setDilation(DimsHW dilation) noexcept
- {
- return mImpl->setDilation(dilation);
- }
-
- //!
- //! \brief Get the dilation for a convolution.
- //!
- //! \see setDilation()
- //!
- //! \deprecated Superseded by getDilationNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getDilation() const noexcept
- {
- return mImpl->getDilation();
- }
-
//!
//! \brief Set the multi-dimension pre-padding of the convolution.
//!
@@ -1299,7 +1129,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getPrePadding()
//!
- void setPrePadding(Dims padding) noexcept
+ void setPrePadding(Dims const& padding) noexcept
{
mImpl->setPrePadding(padding);
}
@@ -1326,7 +1156,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getPostPadding()
//!
- void setPostPadding(Dims padding) noexcept
+ void setPostPadding(Dims const& padding) noexcept
{
mImpl->setPostPadding(padding);
}
@@ -1375,7 +1205,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getKernelSizeNd()
//!
- void setKernelSizeNd(Dims kernelSize) noexcept
+ void setKernelSizeNd(Dims const& kernelSize) noexcept
{
mImpl->setKernelSizeNd(kernelSize);
}
@@ -1398,9 +1228,9 @@ class IConvolutionLayer : public ILayer
//! If executing this layer on DLA, only support 2D stride, both height and width of stride must be in the range
//! [1,8].
//!
- //! \see getStrideNd() setStride() getStride()
+ //! \see getStrideNd()
//!
- void setStrideNd(Dims stride) noexcept
+ void setStrideNd(Dims const& stride) noexcept
{
mImpl->setStrideNd(stride);
}
@@ -1428,7 +1258,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getPaddingNd() setPadding() getPadding()
//!
- void setPaddingNd(Dims padding) noexcept
+ void setPaddingNd(Dims const& padding) noexcept
{
mImpl->setPaddingNd(padding);
}
@@ -1454,7 +1284,7 @@ class IConvolutionLayer : public ILayer
//!
//! \see getDilation()
//!
- void setDilationNd(Dims dilation) noexcept
+ void setDilationNd(Dims const& dilation) noexcept
{
mImpl->setDilationNd(dilation);
}
@@ -1480,6 +1310,7 @@ class IConvolutionLayer : public ILayer
//! Input 0 is the input activation tensor.
//! Input 1 is the kernel tensor. If used, the kernel weights parameter must be set to empty weights.
//! Input 2 is the bias tensor. If used, the bias parameter must be set to empty weights.
+ //!
//! \see getKernelWeights(), setKernelWeights(), getBiasWeights(), setBiasWeights()
//!
using ILayer::setInput;
@@ -1489,132 +1320,6 @@ class IConvolutionLayer : public ILayer
apiv::VConvolutionLayer* mImpl;
};
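A sketch of building a convolution with the remaining Nd setters (assuming `network` is an `INetworkDefinition*`, `x` an `ITensor*`, and `kernelWeights`/`biasWeights` are prepared `Weights`):

    // 3x3 convolution with 64 output maps; the removed HW setters are replaced by the Nd variants.
    nvinfer1::IConvolutionLayer* conv = network->addConvolutionNd(
        *x, 64, nvinfer1::Dims{2, {3, 3}}, kernelWeights, biasWeights);
    conv->setStrideNd(nvinfer1::Dims{2, {1, 1}});
    conv->setPaddingNd(nvinfer1::Dims{2, {1, 1}});
    conv->setDilationNd(nvinfer1::Dims{2, {1, 1}});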
-//! \class IFullyConnectedLayer
-//!
-//! \brief A fully connected layer in a network definition.
-//! This layer expects an input tensor of three or more non-batch dimensions. The input is automatically
-//! reshaped into an `MxV` tensor `X`, where `V` is a product of the last three dimensions and `M`
-//! is a product of the remaining dimensions (where the product over 0 dimensions is defined as 1). For example:
-//!
-//! - If the input tensor has shape `{C, H, W}`, then the tensor is reshaped into `{1, C*H*W}`.
-//! - If the input tensor has shape `{P, C, H, W}`, then the tensor is reshaped into `{P, C*H*W}`.
-//!
-//! The layer then performs the following operation:
-//!
-//! ~~~
-//! Y := matmul(X, W^T) + bias
-//! ~~~
-//!
-//! Where `X` is the `MxV` tensor defined above, `W` is the `KxV` weight tensor
-//! of the layer, and `bias` is a row vector size `K` that is broadcasted to
-//! `MxK`. `K` is the number of output channels, and configurable via
-//! setNbOutputChannels(). If `bias` is not specified, it is implicitly `0`.
-//!
-//! The `MxK` result `Y` is then reshaped such that the last three dimensions are `{K, 1, 1}` and
-//! the remaining dimensions match the dimensions of the input tensor. For example:
-//!
-//! - If the input tensor has shape `{C, H, W}`, then the output tensor will have shape `{K, 1, 1}`.
-//! - If the input tensor has shape `{P, C, H, W}`, then the output tensor will have shape `{P, K, 1, 1}`.
-//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-//!
-//! \deprecated Deprecated in TensorRT 8.4. Superseded by IMatrixMultiplyLayer.
-//!
-class TRT_DEPRECATED IFullyConnectedLayer : public ILayer
-{
-public:
- //!
- //! \brief Set the number of output channels `K` from the fully connected layer.
- //!
- //! If executing this layer on DLA, number of output channels must in the range [1,8192].
- //!
- //! \see getNbOutputChannels()
- //!
- void setNbOutputChannels(int32_t nbOutputs) noexcept
- {
- mImpl->setNbOutputChannels(nbOutputs);
- }
-
- //!
- //! \brief Get the number of output channels `K` from the fully connected layer.
- //!
- //! \see setNbOutputChannels()
- //!
- int32_t getNbOutputChannels() const noexcept
- {
- return mImpl->getNbOutputChannels();
- }
-
- //!
- //! \brief Set the kernel weights, given as a `KxC` matrix in row-major order.
- //!
- //! \see getKernelWeights()
- //!
- void setKernelWeights(Weights weights) noexcept
- {
- mImpl->setKernelWeights(weights);
- }
-
- //!
- //! \brief Get the kernel weights.
- //!
- //! \see setKernelWeights()
- //!
- Weights getKernelWeights() const noexcept
- {
- return mImpl->getKernelWeights();
- }
-
- //!
- //! \brief Set the bias weights.
- //!
- //! Bias is optional. To omit bias, set the count value in the weights structure to zero.
- //!
- //! \see getBiasWeightsWeights()
- //!
- void setBiasWeights(Weights weights) noexcept
- {
- mImpl->setBiasWeights(weights);
- }
-
- //!
- //! \brief Get the bias weights.
- //!
- //! \see setBiasWeightsWeights()
- //!
- Weights getBiasWeights() const noexcept
- {
- return mImpl->getBiasWeights();
- }
-
- //!
- //! \brief Append or replace an input of this layer with a specific tensor
- //!
- //! \param index the index of the input to modify.
- //! \param tensor the new input tensor
- //!
- //! Only index 0 (data input) is valid, unless explicit-quantization mode is enabled.
- //! In explicit-quantization mode, input with index 1 is the kernel-weights tensor, if present.
- //! The kernel-weights tensor must be a build-time constant (computable at build-time via constant-folding)
- //! and an output of a dequantize layer.
- //! If input index 1 is used then the kernel-weights parameter must be set to empty Weights.
- //!
- //! \see getKernelWeights(), setKernelWeights()
- //!
- //! The indices are as follows:
- //!
- //! - 0: The input activation tensor.
- //! - 1: The kernel weights tensor (a constant tensor).
- //!
- //! If this function is called with the value 1, then the function getNbInputs() changes
- //! from returning 1 to 2.
- using ILayer::setInput;
-
-protected:
- virtual ~IFullyConnectedLayer() noexcept = default;
- apiv::VFullyConnectedLayer* mImpl;
-};
-
//!
//! \class IActivationLayer
//!
@@ -1712,9 +1417,9 @@ class IActivationLayer : public ILayer
//!
enum class PoolingType : int32_t
{
- kMAX = 0, // Maximum over elements
- kAVERAGE = 1, // Average over elements. If the tensor is padded, the count includes the padding
- kMAX_AVERAGE_BLEND = 2 // Blending between max and average pooling: (1-blendFactor)*maxPool + blendFactor*avgPool
+ kMAX = 0, //!< Maximum over elements
+ kAVERAGE = 1, //!< Average over elements. If the tensor is padded, the count includes the padding
+ kMAX_AVERAGE_BLEND = 2 //!< Blending between max and average pooling: (1-blendFactor)*maxPool + blendFactor*avgPool
};
namespace impl
@@ -1767,90 +1472,6 @@ class IPoolingLayer : public ILayer
return mImpl->getPoolingType();
}
- //!
- //! \brief Set the window size for pooling.
- //!
- //! If executing this layer on DLA, both height and width of window size must be in the range [1,8].
- //!
- //! \see getWindowSize()
- //!
- //! \deprecated Superseded by setWindowSizeNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setWindowSize(DimsHW windowSize) noexcept
- {
- mImpl->setWindowSize(windowSize);
- }
-
- //!
- //! \brief Get the window size for pooling.
- //!
- //! \see setWindowSize()
- //!
- //! \deprecated Superseded by getWindowSizeNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getWindowSize() const noexcept
- {
- return mImpl->getWindowSize();
- }
-
- //!
- //! \brief Set the stride for pooling.
- //!
- //! Default: 1
- //!
- //! If executing this layer on DLA, both height and width of stride must be in the range [1,16].
- //!
- //! \see getStride()
- //!
- //! \deprecated Superseded by setStrideNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setStride(DimsHW stride) noexcept
- {
- mImpl->setStride(stride);
- }
-
- //!
- //! \brief Get the stride for pooling.
- //!
- //! \see setStride()
- //!
- //! \deprecated Superseded by getStrideNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getStride() const noexcept
- {
- return mImpl->getStride();
- }
-
- //!
- //! \brief Set the padding for pooling.
- //!
- //! Default: 0
- //!
- //! If executing this layer on DLA, both height and width of padding must be in the range [0,7].
- //!
- //! \see getPadding()
- //!
- //! \deprecated Superseded by setPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setPadding(DimsHW padding) noexcept
- {
- mImpl->setPadding(padding);
- }
-
- //!
- //! \brief Get the padding for pooling.
- //!
- //! Default: 0
- //!
- //! \see setPadding()
- //!
- //! \deprecated Superseded by getPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getPadding() const noexcept
- {
- return mImpl->getPadding();
- }
-
//!
//! \brief Set the blending factor for the max_average_blend mode:
//! max_average_blendPool = (1-blendFactor)*maxPool + blendFactor*avgPool
@@ -1886,9 +1507,6 @@ class IPoolingLayer : public ILayer
//!
//! Default: true
//!
- //! \note On Xavier, DLA supports only inclusive padding and this must be explicitly
- //! set to false.
- //!
//! \see getAverageCountExcludesPadding()
//!
void setAverageCountExcludesPadding(bool exclusive) noexcept
@@ -1920,7 +1538,7 @@ class IPoolingLayer : public ILayer
//!
//! \see getPrePadding()
//!
- void setPrePadding(Dims padding) noexcept
+ void setPrePadding(Dims const& padding) noexcept
{
mImpl->setPrePadding(padding);
}
@@ -1948,7 +1566,7 @@ class IPoolingLayer : public ILayer
//!
//! \see getPostPadding()
//!
- void setPostPadding(Dims padding) noexcept
+ void setPostPadding(Dims const& padding) noexcept
{
mImpl->setPostPadding(padding);
}
@@ -1995,7 +1613,7 @@ class IPoolingLayer : public ILayer
//!
//! \see getWindowSizeNd() setWindowSize() getWindowSize()
//!
- void setWindowSizeNd(Dims windowSize) noexcept
+ void setWindowSizeNd(Dims const& windowSize) noexcept
{
mImpl->setWindowSizeNd(windowSize);
}
@@ -2018,9 +1636,9 @@ class IPoolingLayer : public ILayer
//! If executing this layer on DLA, only support 2D stride, both height and width of stride must be in the range
//! [1,16].
//!
- //! \see getStrideNd() setStride() getStride()
+ //! \see getStrideNd()
//!
- void setStrideNd(Dims stride) noexcept
+ void setStrideNd(Dims const& stride) noexcept
{
mImpl->setStrideNd(stride);
}
@@ -2049,7 +1667,7 @@ class IPoolingLayer : public ILayer
//!
//! \see getPaddingNd() setPadding() getPadding()
//!
- void setPaddingNd(Dims padding) noexcept
+ void setPaddingNd(Dims const& padding) noexcept
{
mImpl->setPaddingNd(padding);
}
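For illustration, a minimal pooling sketch using the Nd setters (assuming `network` is an `INetworkDefinition*` and `x` an `ITensor*`):

    // 2x2 max-average-blend pooling with stride 2.
    nvinfer1::IPoolingLayer* pool = network->addPoolingNd(
        *x, nvinfer1::PoolingType::kMAX_AVERAGE_BLEND, nvinfer1::Dims{2, {2, 2}});
    pool->setStrideNd(nvinfer1::Dims{2, {2, 2}});
    pool->setBlendFactor(0.5F); // 0.5 * maxPool + 0.5 * avgPool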
@@ -2092,7 +1710,7 @@ class ILRNLayer : public ILayer
//!
//! \see setWindowStride()
//!
- void setWindowSize(int32_t windowSize) noexcept
+ void setWindowSize(int64_t windowSize) noexcept
{
mImpl->setWindowSize(windowSize);
}
@@ -2102,7 +1720,7 @@ class ILRNLayer : public ILayer
//!
//! \see getWindowStride()
//!
- int32_t getWindowSize() const noexcept
+ int64_t getWindowSize() const noexcept
{
return mImpl->getWindowSize();
}
@@ -2111,6 +1729,7 @@ class ILRNLayer : public ILayer
//! \brief Set the LRN alpha value.
//!
//! The valid range is [-1e20, 1e20].
+ //!
//! \see getAlpha()
//!
void setAlpha(float alpha) noexcept
@@ -2132,6 +1751,7 @@ class ILRNLayer : public ILayer
//! \brief Set the LRN beta value.
//!
//! The valid range is [0.01, 1e5f].
+ //!
//! \see getBeta()
//!
void setBeta(float beta) noexcept
@@ -2153,6 +1773,7 @@ class ILRNLayer : public ILayer
//! \brief Set the LRN K value.
//!
//! The valid range is [1e-5, 1e10].
+ //!
//! \see getK()
//!
void setK(float k) noexcept
@@ -2214,8 +1835,7 @@ constexpr inline int32_t EnumMax() noexcept
//!
//! The output size is the same as the input size.
//!
-//! \note The input tensor for this layer is required to have a minimum of 3 dimensions in implicit batch mode
-//! and a minimum of 4 dimensions in explicit batch mode.
+//! \note The input tensor is required to have at least 4 dimensions.
//!
//! A scale layer may be used as an INT8 quantization node in a graph, if the output is constrained to INT8 and
//! the input to FP32. Quantization rounds ties to even, and clamps to [-128, 127].
@@ -2357,8 +1977,7 @@ class IScaleLayer : public ILayer
//!
//! The output size is the same as the input size.
//!
-//! On Xavier, this layer is not supported on DLA.
-//! Otherwise, the following constraints must be satisfied to execute this layer on DLA:
+//! The following constraints must be satisfied to execute this layer on DLA:
//! * Axis must be one of the channel or spatial dimensions.
//! * There are two classes of supported input sizes:
//! 1. Non-axis, non-batch dimensions are all 1 and the axis dimension is at most 8192.
@@ -2376,17 +1995,8 @@ class ISoftMaxLayer : public ILayer
//! \brief Set the axis along which softmax is computed. Currently, only one axis can be set.
//!
//! The axis is specified by setting the bit corresponding to the axis to 1.
- //! For example, consider an NCHW tensor as input (three non-batch dimensions).
- //!
- //! In implicit mode :
- //! Bit 0 corresponds to the C dimension boolean.
- //! Bit 1 corresponds to the H dimension boolean.
- //! Bit 2 corresponds to the W dimension boolean.
- //! By default, softmax is performed on the axis which is the number of axes minus three. It is 0 if
- //! there are fewer than 3 non-batch axes. For example, if the input is NCHW, the default axis is C. If the input
- //! is NHW, then the default axis is H.
+ //! For example, consider an NCHW tensor as input.
//!
- //! In explicit mode :
//! Bit 0 corresponds to the N dimension boolean.
//! Bit 1 corresponds to the C dimension boolean.
//! Bit 2 corresponds to the H dimension boolean.
@@ -2395,8 +2005,7 @@ class ISoftMaxLayer : public ILayer
//! there are fewer than 3 axes. For example, if the input is NCHW, the default axis is C. If the input
//! is NHW, then the default axis is N.
//!
- //! For example, to perform softmax on axis R of a NPQRCHW input, set bit 2 with implicit batch mode,
- //! set bit 3 with explicit batch mode.
+ //! For example, to perform softmax on axis R of a NPQRCHW input, set bit 3.
//!
//! \param axes The axis along which softmax is computed.
//! Here axes is a bitmap. For example, when doing softmax along axis 0, bit 0 is set to 1, axes = 1 << axis
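A short sketch of the bitmask convention (assuming `network` is an `INetworkDefinition*` and `x` is an NCHW `ITensor*`):

    // Softmax over the C dimension of an NCHW input: C is axis 1, so set bit 1.
    nvinfer1::ISoftMaxLayer* sm = network->addSoftMax(*x);
    sm->setAxes(1U << 1);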
@@ -2442,7 +2051,6 @@ class IConcatenationLayer : public ILayer
//!
//! The default axis is the number of tensor dimensions minus three, or zero if the tensor has fewer than three
//! dimensions. For example, for a tensor with dimensions NCHW, it is C.
- //! For implicit batch mode, the number of tensor dimensions does NOT include the implicit batch dimension.
//!
//! When running this layer on the DLA, the concatenation axis must be the third to last axis, e.g. C if tensor
//! dimensions are NCHW.
@@ -2480,41 +2088,13 @@ class IDeconvolutionLayer : public ILayer
{
public:
//!
- //! \brief Set the HW kernel size of the convolution.
- //!
- //! If executing this layer on DLA, both height and width of kernel size must be in the range [1,32], or the
- //! combinations of [64, 96, 128] in one dimension and 1 in the other dimensions, i.e. [1x64] or [64x1] are valid,
- //! but not [64x64].
- //!
- //! \see getKernelSize()
- //!
- //! \deprecated Superseded by setKernelSizeNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setKernelSize(DimsHW kernelSize) noexcept
- {
- mImpl->setKernelSize(kernelSize);
- }
-
- //!
- //! \brief Get the HW kernel size of the deconvolution.
- //!
- //! \see setKernelSize()
- //!
- //! \deprecated Superseded by getKernelSizeNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getKernelSize() const noexcept
- {
- return mImpl->getKernelSize();
- }
-
- //!
- //! \brief Set the number of output feature maps for the deconvolution.
+ //! \brief Set the number of output feature maps for the deconvolution.
//!
//! If executing this layer on DLA, the number of output maps must be in the range [1,8192].
//!
//! \see getNbOutputMaps()
//!
- void setNbOutputMaps(int32_t nbOutputMaps) noexcept
+ void setNbOutputMaps(int64_t nbOutputMaps) noexcept
{
mImpl->setNbOutputMaps(nbOutputMaps);
}
@@ -2524,73 +2104,11 @@ class IDeconvolutionLayer : public ILayer
//!
//! \see setNbOutputMaps()
//!
- int32_t getNbOutputMaps() const noexcept
+ int64_t getNbOutputMaps() const noexcept
{
return mImpl->getNbOutputMaps();
}
- //!
- //! \brief Set the stride of the deconvolution.
- //!
- //! If executing this layer on DLA, there is one restriction:
- //! 1) Stride height and width must be in the range [1,32] or the combinations of [64, 96, 128] in one
- //! dimension and 1 in the other dimensions, i.e. [1x64] or [64x1] are valid, but not [64x64].
- //!
- //! \see getStride()
- //!
- //! \deprecated Superseded by setStrideNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setStride(DimsHW stride) noexcept
- {
- mImpl->setStride(stride);
- }
-
- //!
- //! \brief Get the stride of the deconvolution.
- //!
- //! Default: (1,1)
- //!
- //! \deprecated Superseded by getStrideNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getStride() const noexcept
- {
- return mImpl->getStride();
- }
-
- //!
- //! \brief Set the padding of the deconvolution.
- //!
- //! The output will be trimmed by this number of elements on each side in the height and width directions.
- //! In other words, it resembles the inverse of a convolution layer with this padding size.
- //! Padding is symmetric, and negative padding is not supported.
- //!
- //! Default: (0,0)
- //!
- //! If executing this layer on DLA, both height and width of padding must be 0.
- //!
- //! \see getPadding()
- //!
- //! \deprecated Superseded by setPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setPadding(DimsHW padding) noexcept
- {
- mImpl->setPadding(padding);
- }
-
- //!
- //! \brief Get the padding of the deconvolution.
- //!
- //! Default: (0, 0)
- //!
- //! \see setPadding()
- //!
- //! \deprecated Superseded by getPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getPadding() const noexcept
- {
- return mImpl->getPadding();
- }
-
//!
//! \brief Set the number of groups for a deconvolution.
//!
@@ -2606,7 +2124,7 @@ class IDeconvolutionLayer : public ILayer
//!
//! \see getNbGroups()
//!
- void setNbGroups(int32_t nbGroups) noexcept
+ void setNbGroups(int64_t nbGroups) noexcept
{
mImpl->setNbGroups(nbGroups);
}
@@ -2616,7 +2134,7 @@ class IDeconvolutionLayer : public ILayer
//!
//! \see setNbGroups()
//!
- int32_t getNbGroups() const noexcept
+ int64_t getNbGroups() const noexcept
{
return mImpl->getNbGroups();
}
@@ -2683,7 +2201,7 @@ class IDeconvolutionLayer : public ILayer
//!
//! \see getPrePadding()
//!
- void setPrePadding(Dims padding) noexcept
+ void setPrePadding(Dims const& padding) noexcept
{
mImpl->setPrePadding(padding);
}
@@ -2711,7 +2229,7 @@ class IDeconvolutionLayer : public ILayer
//!
//! \see getPostPadding()
//!
- void setPostPadding(Dims padding) noexcept
+ void setPostPadding(Dims const& padding) noexcept
{
mImpl->setPostPadding(padding);
}
@@ -2760,9 +2278,9 @@ class IDeconvolutionLayer : public ILayer
//! 2) Kernel height and width must be in the range [1,32] or the combinations of [64, 96, 128] in one
//! dimension and 1 in the other dimensions, i.e. [1x64] or [64x1] are valid, but not [64x64].
//!
- //! \see getKernelSizeNd() setKernelSize() getKernelSize()
+ //! \see getKernelSizeNd()
//!
- void setKernelSizeNd(Dims kernelSize) noexcept
+ void setKernelSizeNd(Dims const& kernelSize) noexcept
{
mImpl->setKernelSizeNd(kernelSize);
}
@@ -2787,9 +2305,9 @@ class IDeconvolutionLayer : public ILayer
//! 2) Stride height and width must be in the range [1,32] or the combinations of [64, 96, 128] in one
//! dimension and 1 in the other dimensions, i.e. [1x64] or [64x1] are valid, but not [64x64].
//!
- //! \see getStrideNd() setStride() getStride()
+ //! \see getStrideNd()
//!
- void setStrideNd(Dims stride) noexcept
+ void setStrideNd(Dims const& stride) noexcept
{
mImpl->setStrideNd(stride);
}
@@ -2817,7 +2335,7 @@ class IDeconvolutionLayer : public ILayer
//!
//! \see getPaddingNd() setPadding() getPadding()
//!
- void setPaddingNd(Dims padding) noexcept
+ void setPaddingNd(Dims const& padding) noexcept
{
mImpl->setPaddingNd(padding);
}
@@ -2843,17 +2361,19 @@ class IDeconvolutionLayer : public ILayer
//! Input 0 is the input activation tensor.
//! Input 1 is the kernel tensor. If used, the kernel weights parameter must be set to empty weights.
//! Input 2 is the bias tensor. If used, the bias parameter must be set to empty weights.
+ //!
//! \see getKernelWeights(), setKernelWeights(), getBiasWeights(), setBiasWeights()
//!
using ILayer::setInput;
+ //!
//! \brief Set the multi-dimension dilation of the deconvolution.
//!
//! Default: (1, 1, ..., 1)
//!
//! \see getDilationNd()
//!
- void setDilationNd(Dims dilation) noexcept
+ void setDilationNd(Dims const& dilation) noexcept
{
mImpl->setDilationNd(dilation);
}
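A sketch of a transposed convolution with the Nd setters (assuming `network`, `x`, `kernelWeights`, and `biasWeights` as in the convolution sketch earlier):

    // 4x4 deconvolution with 32 output maps, stride 2, padding 1.
    nvinfer1::IDeconvolutionLayer* deconv = network->addDeconvolutionNd(
        *x, 32, nvinfer1::Dims{2, {4, 4}}, kernelWeights, biasWeights);
    deconv->setStrideNd(nvinfer1::Dims{2, {2, 2}});
    deconv->setPaddingNd(nvinfer1::Dims{2, {1, 1}});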
@@ -2880,9 +2400,10 @@ class IDeconvolutionLayer : public ILayer
//!
//! Operations kAND, kOR, and kXOR must have inputs of DataType::kBOOL.
//!
-//! Operation kPOW must have inputs of DataType::kFLOAT, DataType::kHALF, or DataType::kINT8.
+//! Operation kPOW must have inputs of floating-point type or DataType::kINT8.
//!
-//! All other operations must have inputs of DataType::kFLOAT, DataType::kHALF, DataType::kINT8, or DataType::kINT32.
+//! All other operations must have inputs of floating-point type, DataType::kINT8, DataType::kINT32, or
+//! DataType::kINT64.
//!
//! \see IElementWiseLayer
//!
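For example, a minimal sketch (assuming `network` is an `INetworkDefinition*` and `a`, `b` are `ITensor*` of a floating-point type):

    // kPOW requires floating-point or INT8 inputs per the constraints above.
    nvinfer1::IElementWiseLayer* pw = network->addElementWise(*a, *b, nvinfer1::ElementWiseOperation::kPOW);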
@@ -3035,7 +2556,7 @@ constexpr inline int32_t EnumMax() noexcept
//! GatherMode::kELEMENT:
//! The output dimensions match the dimensions of the indices tensor.
//!
-//! The types of Data and Output must be the same, and Indices shall be DataType::kINT32.
+//! The types of Data and Output must be the same, and Indices shall be DataType::kINT32 or DataType::kINT64.
//!
//! How the elements of Data are gathered depends on the mode:
//!
@@ -3065,7 +2586,6 @@ constexpr inline int32_t EnumMax() noexcept
//! Notes:
//! * For modes GatherMode::kND and GatherMode::kELEMENT, the first nbElementWiseDims dimensions of data and index must
//! be equal. If not, an error will be reported at build time or run time.
-//! * Only mode GatherMode::kDEFAULT supports an implicit batch dimensions or broadcast on the elementwise dimensions.
//! * If an axis of Data has dynamic length, using a negative index for it has undefined behavior.
//! * No DLA support
//! * Zero will be stored for OOB access
@@ -3091,6 +2611,7 @@ class IGatherLayer : public ILayer
//!
//! \brief Get the axis to gather on.
+ //!
//! \warning Undefined behavior when used with GatherMode::kND.
//!
//! \see setGatherAxis()
@@ -3100,17 +2621,19 @@ class IGatherLayer : public ILayer
return mImpl->getGatherAxis();
}
+ //!
//! \brief Set the number of leading dimensions of indices tensor to be handled elementwise.
+ //!
//! The gathering of indexing starts from the dimension of data[NbElementWiseDims:].
//! The NbElementWiseDims must be less than the Rank of the data input.
+ //!
//! \param elementWiseDims number of dims to be handled as elementwise.
//!
//! Default: 0
//!
//! The value of nbElementWiseDims and GatherMode are checked during network validation:
//!
- //! GatherMode::kDEFAULT: nbElementWiseDims must be 0 if there is an implicit batch dimension. It can be 0 or 1 if
- //! there is not an implicit batch dimension.
+ //! GatherMode::kDEFAULT: nbElementWiseDims can be 0 or 1.
//! GatherMode::kND: nbElementWiseDims can be between 0 and one less than rank(data).
//! GatherMode::kELEMENT: nbElementWiseDims must be 0
//!
@@ -3157,506 +2680,57 @@ class IGatherLayer : public ILayer
};
//!
-//! \enum RNNOperation
-//!
-//! \brief Enumerates the RNN operations that may be performed by an RNN layer.
-//!
-//! __Equation definitions__
-//!
-//! The equations below have the following naming convention:
-//!
-//! ~~~
-//! t := current time step
-//!
-//! i := input gate
-//! o := output gate
-//! f := forget gate
-//! z := update gate
-//! r := reset gate
-//! c := cell gate
-//! h := hidden gate
-//!
-//! g[t] denotes the output of gate g at timestep t, e.g.
-//! f[t] is the output of the forget gate f.
-//!
-//! X[t] := input tensor for timestep t
-//! C[t] := cell state for timestep t
-//! H[t] := hidden state for timestep t
-//!
-//! W[g] := W (input) parameter weight matrix for gate g
-//! R[g] := U (recurrent) parameter weight matrix for gate g
-//! Wb[g] := W (input) parameter bias vector for gate g
-//! Rb[g] := U (recurrent) parameter bias vector for gate g
-//!
-//! Unless otherwise specified, all operations apply pointwise
-//! to elements of each operand tensor.
-//!
-//! ReLU(X) := max(X, 0)
-//! tanh(X) := hyperbolic tangent of X
-//! sigmoid(X) := 1 / (1 + exp(-X))
-//! exp(X) := e^X
-//!
-//! A.B denotes matrix multiplication of A and B.
-//! A*B denotes pointwise multiplication of A and B.
-//! ~~~
-//!
-//! __Equations__
-//!
-//! Depending on the value of RNNOperation chosen, each sub-layer of the RNN
-//! layer will perform one of the following operations:
-//!
-//! ~~~
-//! ::kRELU
-//!
-//! H[t] := ReLU(W[i].X[t] + R[i].H[t-1] + Wb[i] + Rb[i])
-//!
-//! ::kTANH
-//!
-//! H[t] := tanh(W[i].X[t] + R[i].H[t-1] + Wb[i] + Rb[i])
-//!
-//! ::kLSTM
-//!
-//! i[t] := sigmoid(W[i].X[t] + R[i].H[t-1] + Wb[i] + Rb[i])
-//! f[t] := sigmoid(W[f].X[t] + R[f].H[t-1] + Wb[f] + Rb[f])
-//! o[t] := sigmoid(W[o].X[t] + R[o].H[t-1] + Wb[o] + Rb[o])
-//! c[t] := tanh(W[c].X[t] + R[c].H[t-1] + Wb[c] + Rb[c])
-//!
-//! C[t] := f[t]*C[t-1] + i[t]*c[t]
-//! H[t] := o[t]*tanh(C[t])
-//!
-//! ::kGRU
-//!
-//! z[t] := sigmoid(W[z].X[t] + R[z].H[t-1] + Wb[z] + Rb[z])
-//! r[t] := sigmoid(W[r].X[t] + R[r].H[t-1] + Wb[r] + Rb[r])
-//! h[t] := tanh(W[h].X[t] + r[t]*(R[h].H[t-1] + Rb[h]) + Wb[h])
-//!
-//! H[t] := (1 - z[t])*h[t] + z[t]*H[t-1]
-//! ~~~
-//!
-//! \see IRNNv2Layer
-//!
-enum class RNNOperation : int32_t
-{
- kRELU = 0, //!< Single gate RNN w/ ReLU activation function.
- kTANH = 1, //!< Single gate RNN w/ TANH activation function.
- kLSTM = 2, //!< Four-gate LSTM network w/o peephole connections.
- kGRU = 3 //!< Three-gate network consisting of Gated Recurrent Units.
-};
-
-//!
-//! Maximum number of elements in RNNOperation enum.
-//!
-//! \see RNNOperation
-//!
-template <>
-constexpr inline int32_t EnumMax() noexcept
-{
- return 4;
-}
-
-//!
-//! \enum RNNDirection
-//!
-//! \brief Enumerates the RNN direction that may be performed by an RNN layer.
-//!
-//! \see IRNNv2Layer
-//!
-enum class RNNDirection : int32_t
-{
- kUNIDIRECTION = 0, //!< Network iterations from first input to last input.
- kBIDIRECTION = 1 //!< Network iterates from first to last and vice versa and outputs concatenated.
-};
-
-//!
-//! Maximum number of elements in RNNDirection enum.
-//!
-//! \see RNNDirection
-//!
-template <>
-constexpr inline int32_t EnumMax() noexcept
-{
- return 2;
-}
-
-//!
-//! \enum RNNInputMode
-//!
-//! \brief Enumerates the RNN input modes that may occur with an RNN layer.
-//!
-//! If the RNN is configured with RNNInputMode::kLINEAR, then for each gate `g` in the first layer of the RNN,
-//! the input vector `X[t]` (length `E`) is left-multiplied by the gate's corresponding weight matrix `W[g]`
-//! (dimensions `HxE`) as usual, before being used to compute the gate output as described by \ref RNNOperation.
-//!
-//! If the RNN is configured with RNNInputMode::kSKIP, then this initial matrix multiplication is "skipped"
-//! and `W[g]` is conceptually an identity matrix. In this case, the input vector `X[t]` must have length `H`
-//! (the size of the hidden state).
-//!
-//! \see IRNNv2Layer
-//!
-enum class RNNInputMode : int32_t
-{
- kLINEAR = 0, //!< Perform the normal matrix multiplication in the first recurrent layer.
- kSKIP = 1 //!< No operation is performed on the first recurrent layer.
-};
-
-//!
-//! Maximum number of elements in RNNInputMode enum.
-//!
-//! \see RNNInputMode
-//!
-template <>
-constexpr inline int32_t EnumMax() noexcept
-{
- return 2;
-}
-
-//!
-//! \enum RNNGateType
-//!
-//! \brief Identifies an individual gate within an RNN cell.
-//!
-//! \see RNNOperation
-//!
-enum class RNNGateType : int32_t
-{
- kINPUT = 0, //!< Input gate (i).
- kOUTPUT = 1, //!< Output gate (o).
- kFORGET = 2, //!< Forget gate (f).
- kUPDATE = 3, //!< Update gate (z).
- kRESET = 4, //!< Reset gate (r).
- kCELL = 5, //!< Cell gate (c).
- kHIDDEN = 6 //!< Hidden gate (h).
-};
-
-//!
-//! Maximum number of elements in RNNGateType enum.
-//!
-//! \see RNNGateType
-//!
-template <>
-constexpr inline int32_t EnumMax() noexcept
-{
- return 7;
-}
-
-//!
-//! \class IRNNv2Layer
-//!
-//! \brief An RNN layer in a network definition, version 2.
+//! \class IPluginV2Layer
//!
-//! This layer supersedes IRNNLayer.
+//! \brief Layer type for pluginV2
//!
-//! \deprecated Deprecated prior to TensorRT 8.0 and will be removed in 9.0. Superseded by
-//! INetworkDefinition::addLoop().
+//! \see IPluginV2
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
-class TRT_DEPRECATED IRNNv2Layer : public ILayer
+class IPluginV2Layer : public ILayer
{
public:
- int32_t getLayerCount() const noexcept
- {
- return mImpl->getLayerCount();
- } //!< Get the layer count of the RNN.
- int32_t getHiddenSize() const noexcept
- {
- return mImpl->getHiddenSize();
- } //!< Get the hidden size of the RNN.
- int32_t getMaxSeqLength() const noexcept
- {
- return mImpl->getMaxSeqLength();
- } //!< Get the maximum sequence length of the RNN.
- int32_t getDataLength() const noexcept
- {
- return mImpl->getDataLength();
- } //!< Get the embedding length of the RNN.
-
- //!
- //! \brief Specify individual sequence lengths in the batch with the ITensor pointed to by
- //! \p seqLengths.
- //!
- //! The \p seqLengths ITensor should be a {N1, ..., Np} tensor, where N1..Np are the index dimensions
- //! of the input tensor to the RNN.
- //!
- //! If this is not specified, then the RNN layer assumes all sequences are size getMaxSeqLength().
- //!
- //! All sequence lengths in \p seqLengths should be in the range [1, getMaxSeqLength()]. Zero-length
- //! sequences are not supported.
- //!
- //! This tensor must be of type DataType::kINT32.
- //!
- void setSequenceLengths(ITensor& seqLengths) noexcept
- {
- return mImpl->setSequenceLengths(seqLengths);
- }
-
- //!
- //! \brief Get the sequence lengths specified for the RNN.
- //!
- //! \return nullptr if no sequence lengths were specified, the sequence length data otherwise.
- //!
- //! \see setSequenceLengths()
- //!
- ITensor* getSequenceLengths() const noexcept
- {
- return mImpl->getSequenceLengths();
- }
-
- //!
- //! \brief Set the operation of the RNN layer.
- //!
- //! \see getOperation(), RNNOperation
- //!
- void setOperation(RNNOperation op) noexcept
- {
- mImpl->setOperation(op);
- }
-
- //!
- //! \brief Get the operation of the RNN layer.
- //!
- //! \see setOperation(), RNNOperation
- //!
- RNNOperation getOperation() const noexcept
- {
- return mImpl->getOperation();
- }
-
- //!
- //! \brief Set the input mode of the RNN layer.
- //!
- //! \see getInputMode(), RNNInputMode
- //!
- void setInputMode(RNNInputMode op) noexcept
- {
- mImpl->setInputMode(op);
- }
-
- //!
- //! \brief Get the input mode of the RNN layer.
- //!
- //! \see setInputMode(), RNNInputMode
- //!
- RNNInputMode getInputMode() const noexcept
- {
- return mImpl->getInputMode();
- }
-
- //!
- //! \brief Set the direction of the RNN layer.
- //!
- //! The direction determines if the RNN is run as a unidirectional(left to right) or
- //! bidirectional(left to right and right to left).
- //! In the RNNDirection::kBIDIRECTION case the output is concatenated together, resulting
- //! in output size of 2x getHiddenSize().
//!
- //! \see getDirection(), RNNDirection
- //!
- void setDirection(RNNDirection op) noexcept
- {
- mImpl->setDirection(op);
- }
-
- //!
- //! \brief Get the direction of the RNN layer.
- //!
- //! \see setDirection(), RNNDirection
- //!
- RNNDirection getDirection() const noexcept
- {
- return mImpl->getDirection();
- }
-
- //!
- //! \brief Set the weight parameters for an individual gate in the RNN.
- //!
- //! The DataType for this structure must be DataType::kFLOAT or DataType::kHALF, and must be the same
- //! datatype as the input tensor.
- //!
- //! Each parameter matrix is row-major in memory, and has the following dimensions:
- //!
- //! ~~~
- //! Let K := { ::kUNIDIRECTION => 1
- //! { ::kBIDIRECTION => 2
- //! l := layer index (as described above)
- //! H := getHiddenSize()
- //! E := getDataLength() (the embedding length)
- //! isW := true if the matrix is an input (W) matrix, and false if
- //! the matrix is a recurrent input (R) matrix.
- //!
- //! if isW:
- //! if l < K and ::kSKIP:
- //! (numRows, numCols) := (0, 0) # input matrix is skipped
- //! elif l < K and ::kLINEAR:
- //! (numRows, numCols) := (H, E) # input matrix acts on input data size E
- //! elif l >= K:
- //! (numRows, numCols) := (H, K * H) # input matrix acts on previous hidden state
- //! else: # not isW
- //! (numRows, numCols) := (H, H)
- //! ~~~
- //!
- //! In other words, the input weights of the first layer of the RNN (if
- //! not skipped) transform a `getDataLength()`-size column
- //! vector into a `getHiddenSize()`-size column vector. The input
- //! weights of subsequent layers transform a `K*getHiddenSize()`-size
- //! column vector into a `getHiddenSize()`-size column vector. `K=2` in
- //! the bidirectional case to account for the full hidden state being
- //! the concatenation of the forward and backward RNN hidden states.
- //!
- //! The recurrent weight matrices for all layers all have shape `(H, H)`,
- //! both in the unidirectional and bidirectional cases. (In the
- //! bidirectional case, each recurrent weight matrix for the (forward or
- //! backward) RNN cell operates on the previous (forward or
- //! backward) RNN cell's hidden state, which is size `H`).
- //!
- //! \param layerIndex The index of the layer that contains this gate.
- //! \param gate The name of the gate within the RNN layer. The gate name must correspond
- //! to one of the gates used by this layer's #RNNOperation.
- //! \param isW True if the weight parameters are for the input matrix W[g]
- //! and false if they are for the recurrent input matrix R[g]. See
- //! #RNNOperation for equations showing how these matrices are used
- //! in the RNN gate.
- //! \param weights The weight structure holding the weight parameters, which are stored
- //! as a row-major 2D matrix. See See \ref setWeightsForGate() for documentation on the expected
- //! dimensions of this matrix.
- //!
- void setWeightsForGate(int32_t layerIndex, RNNGateType gate, bool isW, Weights weights) noexcept
- {
- mImpl->setWeightsForGate(layerIndex, gate, isW, weights);
- }
-
- //!
- //! \brief Get the weight parameters for an individual gate in the RNN.
- //!
- //! \see setWeightsForGate()
- //!
- Weights getWeightsForGate(int32_t layerIndex, RNNGateType gate, bool isW) const noexcept
- {
- return mImpl->getWeightsForGate(layerIndex, gate, isW);
- }
-
- //!
- //! \brief Set the bias parameters for an individual gate in the RNN.
- //!
- //! The DataType for this structure must be DataType::kFLOAT or DataType::kHALF, and must be the same
- //! datatype as the input tensor.
- //!
- //! Each bias vector has a fixed size, getHiddenSize().
- //!
- //! \param layerIndex The index of the layer that contains this gate. See \ref setWeightsForGate()
- //! for a description of the layer index.
- //! \param gate The name of the gate within the RNN layer. The gate name must correspond
- //! to one of the gates used by this layer's #RNNOperation.
- //! \param isW True if the bias parameters are for the input bias Wb[g]
- //! and false if they are for the recurrent input bias Rb[g]. See
- //! #RNNOperation for equations showing how these bias vectors are used
- //! in the RNN gate.
- //! \param bias The weight structure holding the bias parameters, which should be an
- //! array of size getHiddenSize().
- //!
- void setBiasForGate(int32_t layerIndex, RNNGateType gate, bool isW, Weights bias) noexcept
- {
- mImpl->setBiasForGate(layerIndex, gate, isW, bias);
- }
-
- //!
- //! \brief Get the bias parameters for an individual gate in the RNN.
- //!
- //! \see setBiasForGate()
- //!
- Weights getBiasForGate(int32_t layerIndex, RNNGateType gate, bool isW) const noexcept
- {
- return mImpl->getBiasForGate(layerIndex, gate, isW);
- }
-
- //!
- //! \brief Set the initial hidden state of the RNN with the provided \p hidden ITensor.
- //!
- //! The \p hidden ITensor should have the dimensions `{N1, ..., Np, L, H}`, where:
- //!
- //! - `N1..Np` are the index dimensions specified by the input tensor
- //! - `L` is the number of layers in the RNN, equal to getLayerCount() if getDirection is
- //! RNNDirection::kUNIDIRECTION,
- //! and 2x getLayerCount() if getDirection is RNNDirection::kBIDIRECTION. In the bi-directional
- //! case, layer `l`'s final forward hidden state is stored in `L = 2*l`, and
- //! final backward hidden state is stored in `L= 2*l + 1`.
- //! - `H` is the hidden state for each layer, equal to getHiddenSize().
- //!
- void setHiddenState(ITensor& hidden) noexcept
- {
- mImpl->setHiddenState(hidden);
- }
-
- //!
- //! \brief Get the initial hidden state of the RNN.
- //!
- //! \see setHiddenState()
- //!
- ITensor* getHiddenState() const noexcept
- {
- return mImpl->getHiddenState();
- }
-
- //!
- //! \brief Set the initial cell state of the LSTM with the provided \p cell ITensor.
- //!
- //! The \p cell ITensor should have the dimensions `{N1, ..., Np, L, H}`, where:
- //!
- //! - `N1..Np` are the index dimensions specified by the input tensor
- //! - `L` is the number of layers in the RNN, equal to getLayerCount() if getDirection is
- //! RNNDirection::kUNIDIRECTION,
- //! and 2x getLayerCount() if getDirection is RNNDirection::kBIDIRECTION. In the bi-directional
- //! case, layer `l`'s final forward hidden state is stored in `L = 2*l`, and
- //! final backward hidden state is stored in `L= 2*l + 1`.
- //! - `H` is the hidden state for each layer, equal to getHiddenSize().
- //!
- //! It is an error to call setCellState() on an RNN layer that is not configured with RNNOperation::kLSTM.
- //!
- void setCellState(ITensor& cell) noexcept
- {
- mImpl->setCellState(cell);
- }
-
- //!
- //! \brief Get the initial cell state of the RNN.
+ //! \brief Get the plugin for the layer.
//!
- //! \see setCellState()
+ //! \see IPluginV2
//!
- ITensor* getCellState() const noexcept
+ IPluginV2& getPlugin() noexcept
{
- return mImpl->getCellState();
+ return mImpl->getPlugin();
}
protected:
- apiv::VRNNv2Layer* mImpl;
- virtual ~IRNNv2Layer() noexcept = default;
+ apiv::VPluginV2Layer* mImpl;
+ virtual ~IPluginV2Layer() noexcept = default;
};
//!
-//! \class IPluginV2Layer
+//! \class IPluginV3Layer
//!
-//! \brief Layer type for pluginV2
+//! \brief Layer type for V3 plugins
//!
-//! \see IPluginV2
+//! \see IPluginV3
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
-class IPluginV2Layer : public ILayer
+class IPluginV3Layer : public ILayer
{
public:
//!
//! \brief Get the plugin for the layer.
//!
- //! \see IPluginV2
+ //! \see IPluginV3
//!
- IPluginV2& getPlugin() noexcept
+ IPluginV3& getPlugin() noexcept
{
return mImpl->getPlugin();
}
protected:
- apiv::VPluginV2Layer* mImpl;
- virtual ~IPluginV2Layer() noexcept = default;
+ apiv::VPluginV3Layer* mImpl;
+ virtual ~IPluginV3Layer() noexcept = default;
};
//!
@@ -3666,13 +2740,12 @@ class IPluginV2Layer : public ILayer
//!
 //! Operation kNOT must have inputs of DataType::kBOOL.
//!
-//! Operation kSIGN must have inputs of DataType::kFLOAT, DataType::kHALF, DataType::kINT8, or DataType::kINT32.
-//!
-//! Operation kISINF must have inputs of DataType::kFLOAT or DataType::kHALF.
+//! Operations kSIGN and kABS must have inputs of floating-point type, DataType::kINT8, DataType::kINT32, or
+//! DataType::kINT64.
//!
-//! All other operations must have inputs of DataType::kFLOAT, DataType::kHALF, or DataType::kINT8.
+//! Operation kISINF must have inputs of floating-point type.
//!
-//! Operations kSIGN and kROUND are not supported in implicit batch mode.
+//! All other operations must have inputs of floating-point type.
//!
//! \see IUnaryLayer
//!
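// A minimal usage sketch of a unary layer, assuming `network` (INetworkDefinition) and `input` (ITensor)
// already exist and NvInfer.h is included; error handling is omitted for brevity.
inline nvinfer1::ITensor* addAbs(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input)
{
    // kABS accepts floating-point, kINT8, kINT32, or kINT64 inputs, as described above.
    nvinfer1::IUnaryLayer* abs = network.addUnary(input, nvinfer1::UnaryOperation::kABS);
    return abs->getOutput(0);
}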
@@ -3878,58 +2951,6 @@ class IReduceLayer : public ILayer
class IPaddingLayer : public ILayer
{
public:
- //!
- //! \brief Set the padding that is applied at the start of the tensor.
- //!
- //! Negative padding results in trimming the edge by the specified amount
- //!
- //! \see getPrePadding
- //!
- //! \deprecated Superseded by setPrePaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setPrePadding(DimsHW padding) noexcept
- {
- mImpl->setPrePadding(padding);
- }
-
- //!
- //! \brief Get the padding that is applied at the start of the tensor.
- //!
- //! \see setPrePadding
- //!
- //! \deprecated Superseded by getPrePaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getPrePadding() const noexcept
- {
- return mImpl->getPrePadding();
- }
-
- //!
- //! \brief Set the padding that is applied at the end of the tensor.
- //!
- //! Negative padding results in trimming the edge by the specified amount
- //!
- //! \see getPostPadding
- //!
- //! \deprecated Superseded by setPostPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED void setPostPadding(DimsHW padding) noexcept
- {
- mImpl->setPostPadding(padding);
- }
-
- //!
- //! \brief Get the padding that is applied at the end of the tensor.
- //!
- //! \see setPostPadding
- //!
- //! \deprecated Superseded by getPostPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED DimsHW getPostPadding() const noexcept
- {
- return mImpl->getPostPadding();
- }
-
//!
//! \brief Set the padding that is applied at the start of the tensor.
//!
@@ -3939,7 +2960,7 @@ class IPaddingLayer : public ILayer
//!
//! \see getPrePaddingNd
//!
- void setPrePaddingNd(Dims padding) noexcept
+ void setPrePaddingNd(Dims const& padding) noexcept
{
mImpl->setPrePaddingNd(padding);
}
@@ -3965,7 +2986,7 @@ class IPaddingLayer : public ILayer
//!
//! \see getPostPaddingNd
//!
- void setPostPaddingNd(Dims padding) noexcept
+ void setPostPaddingNd(Dims const& padding) noexcept
{
mImpl->setPostPaddingNd(padding);
}
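// A minimal usage sketch of N-dimensional padding via addPaddingNd, assuming `network` and `input`
// already exist; negative values would trim the corresponding edge instead of padding it.
inline nvinfer1::ITensor* padHW(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input)
{
    // Pad one element before and after each of the two innermost (H, W) dimensions.
    nvinfer1::IPaddingLayer* pad = network.addPaddingNd(input, nvinfer1::Dims2{1, 1}, nvinfer1::Dims2{1, 1});
    return pad->getOutput(0);
}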
@@ -3987,6 +3008,11 @@ class IPaddingLayer : public ILayer
virtual ~IPaddingLayer() noexcept = default;
};
+//!
+//! \struct Permutation
+//!
+//! \brief Represents a permutation of dimensions.
+//!
struct Permutation
{
//!
@@ -4059,7 +3085,7 @@ class IShuffleLayer : public ILayer
//!
//! If a second input had been used to create this layer, that input is reset to null by this method.
//!
- void setReshapeDimensions(Dims dimensions) noexcept
+ void setReshapeDimensions(Dims const& dimensions) noexcept
{
mImpl->setReshapeDimensions(dimensions);
}
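// A minimal usage sketch of an IShuffleLayer that applies a Permutation and then a reshape,
// assuming `network` and a 3D `input` already exist elsewhere.
inline nvinfer1::ITensor* transposeAndFlatten(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input)
{
    nvinfer1::IShuffleLayer* shuffle = network.addShuffle(input);
    nvinfer1::Permutation perm{{0, 2, 1}}; // swap dimensions 1 and 2 of the 3D input
    shuffle->setFirstTranspose(perm);
    // 0 copies the corresponding input dimension, -1 infers the remaining extent.
    shuffle->setReshapeDimensions(nvinfer1::Dims2{0, -1});
    return shuffle->getOutput(0);
}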
@@ -4178,7 +3204,6 @@ class IShuffleLayer : public ILayer
enum class SampleMode : int32_t
{
kSTRICT_BOUNDS = 0, //!< Fail with error when the coordinates are out of bounds.
- kDEFAULT TRT_DEPRECATED_ENUM = kSTRICT_BOUNDS, //! \deprecated Use kSTRICT_BOUNDS.
kWRAP = 1, //!< Coordinates wrap around periodically.
kCLAMP = 2, //!< Out of bounds indices are clamped to bounds.
kFILL = 3, //!< Use fill input value when coordinates are out of bounds.
@@ -4187,9 +3212,6 @@ enum class SampleMode : int32_t
//!< pixel and throws error for zero pixels.
};
-//! \deprecated Deprecated in TensorRT 8.5. Superseded by SampleMode.
-using SliceMode = SampleMode;
-
//!
//! Maximum number of elements in SampleMode enum.
//!
@@ -4224,7 +3246,7 @@ constexpr inline int32_t EnumMax() noexcept
//! stride = {1, 2}
//! output = {{1, 5}}
//!
-//! When the sliceMode is kCLAMP or kREFLECT, for each input dimension, if its size is 0 then the corresponding output
+//! When the sampleMode is kCLAMP or kREFLECT, for each input dimension, if its size is 0 then the corresponding output
//! dimension must be 0 too.
//!
//! A slice layer can produce a shape tensor if the following conditions are met:
@@ -4236,7 +3258,7 @@ constexpr inline int32_t EnumMax() noexcept
//!
//! The following constraints must be satisfied to execute this layer on DLA:
//! * start, size, and stride are build time constants, either as static Dims or as constant input tensors.
-//! * sliceMode is kDEFAULT.
+//! * sampleMode is kSTRICT_BOUNDS.
//! * Strides are 1 for all dimensions.
//! * Slicing is not performed on the first dimension
//! * The input tensor has four dimensions
@@ -4255,7 +3277,7 @@ class ISliceLayer : public ILayer
//!
//! \see getStart
//!
- void setStart(Dims start) noexcept
+ void setStart(Dims const& start) noexcept
{
mImpl->setStart(start);
}
@@ -4284,7 +3306,7 @@ class ISliceLayer : public ILayer
//!
//! \see getSize
//!
- void setSize(Dims size) noexcept
+ void setSize(Dims const& size) noexcept
{
return mImpl->setSize(size);
}
@@ -4313,7 +3335,7 @@ class ISliceLayer : public ILayer
//!
//! \see getStride
//!
- void setStride(Dims stride) noexcept
+ void setStride(Dims const& stride) noexcept
{
mImpl->setStride(stride);
}
@@ -4338,7 +3360,7 @@ class ISliceLayer : public ILayer
//!
//! \see getMode()
//!
- void setMode(SliceMode mode) noexcept
+ void setMode(SampleMode mode) noexcept
{
mImpl->setMode(mode);
}
@@ -4348,7 +3370,7 @@ class ISliceLayer : public ILayer
//!
//! \see setMode()
//!
- SliceMode getMode() const noexcept
+ SampleMode getMode() const noexcept
{
return mImpl->getMode();
}
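// A minimal usage sketch of the slice example documented above (start = {1,0}, size = {1,2},
// stride = {1,2}), assuming `network` and a 2x3 `input` already exist; out-of-bounds handling
// is left at the default kSTRICT_BOUNDS.
inline nvinfer1::ITensor* sliceExample(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input)
{
    nvinfer1::ISliceLayer* slice = network.addSlice(
        input, nvinfer1::Dims2{1, 0}, nvinfer1::Dims2{1, 2}, nvinfer1::Dims2{1, 2});
    slice->setMode(nvinfer1::SampleMode::kSTRICT_BOUNDS);
    return slice->getOutput(0);
}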
@@ -4387,10 +3409,10 @@ class ISliceLayer : public ILayer
//!
//! \brief Layer type for getting shape of a tensor.
//!
-//! This layer sets the output to a 1D tensor of type Int32 with the dimensions of the input tensor.
+//! This layer sets the output to a 1D tensor of type Int64 with the dimensions of the input tensor.
//!
//! For example, if the input is a four-dimensional tensor (of any type) with
-//! dimensions [2,3,5,7], the output tensor is a one-dimensional Int32 tensor
+//! dimensions [2,3,5,7], the output tensor is a one-dimensional Int64 tensor
//! of length 4 containing the sequence 2, 3, 5, 7.
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
@@ -4538,10 +3560,10 @@ enum class MatrixOperation : int32_t
//! Treat x as a matrix if it has two dimensions, or as a collection of
//! matrices if x has more than two dimensions, where the last two dimensions
//! are the matrix dimensions. x must have at least two dimensions.
- kNONE,
+ kNONE = 0,
//! Like kNONE, but transpose the matrix dimensions.
- kTRANSPOSE,
+ kTRANSPOSE = 1,
//! Treat x as a vector if it has one dimension, or as a collection of
//! vectors if x has more than one dimension. x must have at least one dimension.
@@ -4553,7 +3575,7 @@ enum class MatrixOperation : int32_t
//! The second input tensor with dimensions [M,K] used with MatrixOperation::kVECTOR is equivalent to a tensor
//! with dimensions [M, K, 1] with MatrixOperation::kNONE, i.e. is treated as M column vectors of length K,
//! or dimensions [M, 1, K] with MatrixOperation::kTRANSPOSE.
- kVECTOR
+ kVECTOR = 2,
};
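// A minimal usage sketch of MatrixOperation, assuming `network`, `a` and `b` already exist:
// computes C = A * B^T by transposing the second operand.
inline nvinfer1::ITensor* matMulTransposeB(
    nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& a, nvinfer1::ITensor& b)
{
    nvinfer1::IMatrixMultiplyLayer* mm = network.addMatrixMultiply(
        a, nvinfer1::MatrixOperation::kNONE, b, nvinfer1::MatrixOperation::kTRANSPOSE);
    return mm->getOutput(0);
}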
//!
@@ -4597,8 +3619,10 @@ class IMatrixMultiplyLayer : public ILayer
public:
//!
//! \brief Set the operation for an input tensor.
+ //!
//! \param index Input tensor number (0 or 1).
//! \param op New operation.
+ //!
//! \see getOperation()
//!
void setOperation(int32_t index, MatrixOperation op) noexcept
@@ -4718,6 +3742,10 @@ class ICastLayer : public ILayer
//!
//! \brief Set cast layer output type.
//!
+ //! \param toType The DataType of the output tensor.
+ //!
+ //! Set the output type of the cast layer.
+ //!
void setToType(DataType toType) noexcept
{
mImpl->setToType(toType);
@@ -4726,6 +3754,9 @@ class ICastLayer : public ILayer
//!
//! \brief Return cast layer output type.
//!
+ //! \return toType parameter set during layer creation or by setToType().
+ //! The return value is the output type of the cast layer.
+ //!
DataType getToType() const noexcept
{
return mImpl->getToType();
@@ -4750,9 +3781,8 @@ class IConstantLayer : public ILayer
//!
//! \brief Set the weights for the layer.
//!
- //! If weights.type is DataType::kINT32, the output is a tensor of 32-bit indices.
- //! Otherwise the output is a tensor of real values and the output type will be
- //! follow TensorRT's normal precision rules.
+ //! The output type is weights.type. If the network is weakly typed and the weights have a real type,
+ //! the output type might be different per TensorRT's type conversion rules.
//!
//! \see getWeights()
//!
@@ -4778,7 +3808,7 @@ class IConstantLayer : public ILayer
//!
//! \see setDimensions
//!
- void setDimensions(Dims dimensions) noexcept
+ void setDimensions(Dims const& dimensions) noexcept
{
mImpl->setDimensions(dimensions);
}
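// A minimal usage sketch of a constant layer, assuming `network` already exists. The Weights
// memory must remain valid until the engine has been built.
inline nvinfer1::ITensor* addBiasConstant(nvinfer1::INetworkDefinition& network)
{
    static float const values[3] = {0.1F, 0.2F, 0.3F};
    nvinfer1::Weights weights{nvinfer1::DataType::kFLOAT, values, 3};
    nvinfer1::IConstantLayer* constant = network.addConstant(nvinfer1::Dims2{1, 3}, weights);
    return constant->getOutput(0);
}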
@@ -4828,9 +3858,6 @@ enum class InterpolationMode : int32_t
kCUBIC = 2 //!< Supports bicubic (2D) interpolation
};
-//! \deprecated Deprecated in TensorRT 8.5. Superseded by InterpolationMode.
-using ResizeMode = InterpolationMode;
-
namespace impl
{
//!
@@ -4972,13 +3999,13 @@ struct EnumMaxImpl
//! Resize layer can be used for resizing a N-D tensor.
//!
//! Resize layer currently supports the following configurations:
-//! - ResizeMode::kNEAREST - resizes innermost `m` dimensions of N-D, where 0 < m <= min(8, N) and N > 0
-//! - ResizeMode::kLINEAR - resizes innermost `m` dimensions of N-D, where 0 < m <= min(3, N) and N > 0
+//! - InterpolationMode::kNEAREST - resizes innermost `m` dimensions of N-D, where 0 < m <= min(8, N) and N > 0
+//! - InterpolationMode::kLINEAR - resizes innermost `m` dimensions of N-D, where 0 < m <= min(3, N) and N > 0
//!
-//! Default resize mode is ResizeMode::kNEAREST.
+//! Default resize mode is InterpolationMode::kNEAREST.
//!
//! The coordinates in the output tensor are mapped to coordinates in the input tensor using a function set by calling
-//! setCoordinateTransformation(). The default for all ResizeMode settings (nearest, linear, bilinear, etc.) is
+//! setCoordinateTransformation(). The default for all InterpolationMode settings (nearest, linear, bilinear, etc.) is
//! ResizeCoordinateTransformation::kASYMMETRIC.
//!
//! The resize layer provides two ways to resize tensor dimensions.
@@ -5022,7 +4049,7 @@ class IResizeLayer : public ILayer
//! \see setScales
//! \see getOutputDimensions
//!
- void setOutputDimensions(Dims dimensions) noexcept
+ void setOutputDimensions(Dims const& dimensions) noexcept
{
return mImpl->setOutputDimensions(dimensions);
}
@@ -5091,11 +4118,11 @@ class IResizeLayer : public ILayer
//!
//! Supported resize modes are Nearest Neighbor and Linear.
//!
- //! \see ResizeMode
+ //! \see InterpolationMode
//!
- void setResizeMode(ResizeMode resizeMode) noexcept
+ void setResizeMode(InterpolationMode interpolationMode) noexcept
{
- mImpl->setResizeMode(resizeMode);
+ mImpl->setResizeMode(interpolationMode);
}
//!
@@ -5103,39 +4130,11 @@ class IResizeLayer : public ILayer
//!
//! \return The resize mode.
//!
- ResizeMode getResizeMode() const noexcept
+ InterpolationMode getResizeMode() const noexcept
{
return mImpl->getResizeMode();
}
- //!
- //! \brief Set whether to align corners while resizing.
- //!
- //! If true, the centers of the 4 corner pixels of both input and output
- //! tensors are aligned i.e. preserves the values of corner
- //! pixels.
- //!
- //! Default: false.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by IResizeLayer::setCoordinateTransformation().
- //!
- TRT_DEPRECATED void setAlignCorners(bool alignCorners) noexcept
- {
- mImpl->setAlignCorners(alignCorners);
- }
-
- //!
- //! \brief True if align corners has been set.
- //!
- //! \return True if align corners has been set, false otherwise.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by IResizeLayer::getCoordinateTransformation().
- //!
- TRT_DEPRECATED bool getAlignCorners() const noexcept
- {
- return mImpl->getAlignCorners();
- }
-
//!
//! \brief Append or replace an input of this layer with a specific tensor
//!
@@ -5290,7 +4289,9 @@ class IResizeLayer : public ILayer
apiv::VResizeLayer* mImpl;
};
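// A minimal usage sketch of a linear resize to a fixed output shape, assuming `network` and an
// NCHW `input` already exist; the coordinate transformation is set explicitly rather than via
// the removed align-corners flag.
inline nvinfer1::ITensor* resizeTo224(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input)
{
    nvinfer1::IResizeLayer* resize = network.addResize(input);
    resize->setOutputDimensions(nvinfer1::Dims4{1, 3, 224, 224});
    resize->setResizeMode(nvinfer1::InterpolationMode::kLINEAR);
    resize->setCoordinateTransformation(nvinfer1::ResizeCoordinateTransformation::kHALF_PIXEL);
    return resize->getOutput(0);
}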
-//! Enum that describes kinds of loop outputs.
+//!
+//! \enum LoopOutput
+//!
+//! \brief Enum that describes kinds of loop outputs.
+//!
enum class LoopOutput : int32_t
{
//! Output value is value of tensor for last iteration.
@@ -5314,11 +4315,13 @@ constexpr inline int32_t EnumMax() noexcept
return 3;
}
-//! Enum that describes kinds of trip limits.
+//!
+//! \enum TripLimit
+//!
+//! \brief Enum that describes kinds of trip limits.
+//!
enum class TripLimit : int32_t
{
- kCOUNT = 0, //!< Tensor is scalar of type kINT32 that contains the trip count.
+ kCOUNT = 0, //!< Tensor is a scalar of type kINT32 or kINT64 that contains the trip count.
kWHILE = 1 //!< Tensor is a scalar of type kBOOL. Loop terminates when value is false.
};
@@ -5335,10 +4338,17 @@ constexpr inline int32_t EnumMax() noexcept
class ILoop;
+//!
+//! \class ILoopBoundaryLayer
+//!
+//! \brief This is a base class for Loop boundary layers.
+//!
class ILoopBoundaryLayer : public ILayer
{
public:
- //! Return pointer to ILoop associated with this boundary layer.
+ //!
+ //! \brief Get a pointer to ILoop associated with this boundary layer.
+ //!
ILoop* getLoop() const noexcept
{
return mBoundary->getLoop();
@@ -5350,14 +4360,18 @@ class ILoopBoundaryLayer : public ILayer
};
//!
-//! This is a base class for Conditional boundary layers.
+//! \class IIfConditionalBoundaryLayer
+//!
+//! \brief This is a base class for Conditional boundary layers.
//!
//! Boundary layers are used to demarcate the boundaries of Conditionals.
//!
class IIfConditionalBoundaryLayer : public ILayer
{
public:
- //! Return pointer to the IIfConditional associated with this boundary layer.
+ //!
+ //! \brief Get a pointer to the IIfConditional associated with this boundary layer.
+ //!
IIfConditional* getConditional() const noexcept
{
return mBoundary->getConditional();
@@ -5369,7 +4383,9 @@ class IIfConditionalBoundaryLayer : public ILayer
};
//!
-//! This layer represents a condition input to an IIfConditional.
+//! \class IConditionLayer
+//!
+//! \brief This layer represents a condition input to an IIfConditional.
//!
class IConditionLayer : public IIfConditionalBoundaryLayer
{
@@ -5380,7 +4396,9 @@ class IConditionLayer : public IIfConditionalBoundaryLayer
};
//!
-//! This layer represents an output of an IIfConditional.
+//! \class IIfConditionalOutputLayer
+//!
+//! \brief This layer represents an output of an IIfConditional.
//!
//! An IIfConditionalOutputLayer has exactly one output.
//!
@@ -5393,7 +4411,9 @@ class IIfConditionalOutputLayer : public IIfConditionalBoundaryLayer
};
//!
-//! This layer represents an input to an IIfConditional.
+//! \class IIfConditionalInputLayer
+//!
+//! \brief This layer represents an input to an IIfConditional.
//!
class IIfConditionalInputLayer : public IIfConditionalBoundaryLayer
{
@@ -5404,7 +4424,9 @@ class IIfConditionalInputLayer : public IIfConditionalBoundaryLayer
};
//!
-//! Helper for constructing conditionally-executed subgraphs.
+//! \class IIfConditional
+//!
+//! \brief Helper for constructing conditionally-executed subgraphs.
//!
//! An If-conditional conditionally executes part of the network according
//! to the following pseudo-code:
@@ -5416,13 +4438,13 @@ class IIfConditionalInputLayer : public IIfConditionalBoundaryLayer
//! Emit output
//!
//! Condition is a 0D boolean tensor (representing a scalar).
-//! trueSubgraph represents a network subgraph that is executed when condition is evaluated to True.
-//! falseSubgraph represents a network subgraph that is executed when condition is evaluated to False.
+//! trueSubgraph represents a network subgraph that is executed when condition evaluates to True.
+//! falseSubgraph represents a network subgraph that is executed when condition evaluates to False.
//!
//! The following constraints apply to If-conditionals:
//! - Both the trueSubgraph and falseSubgraph must be defined.
//! - The number of output tensors in both subgraphs is the same.
-//! - The type and shape of each output tensor from true/false subgraphs are the same.
+//! - Corresponding output tensors from the true/false subgraphs have the same type and shape.
//!
class IIfConditional : public INoCopy
{
@@ -5499,7 +4521,11 @@ class IIfConditional : public INoCopy
apiv::VIfConditional* mImpl;
};
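// A minimal usage sketch of an If-conditional with a single output, assuming `network`, a 0D
// kBOOL `condition` tensor, and `x` already exist; the true branch here is a ReLU and the false
// branch is an identity-style shuffle, both drawn from the same conditional input.
inline nvinfer1::ITensor* addReluOrIdentity(
    nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& condition, nvinfer1::ITensor& x)
{
    nvinfer1::IIfConditional* conditional = network.addIfConditional();
    conditional->setCondition(condition);
    nvinfer1::ITensor* xInside = conditional->addInput(x)->getOutput(0);

    nvinfer1::ITensor* trueOut =
        network.addActivation(*xInside, nvinfer1::ActivationType::kRELU)->getOutput(0);
    nvinfer1::ITensor* falseOut = network.addShuffle(*xInside)->getOutput(0);

    return conditional->addOutput(*trueOut, *falseOut)->getOutput(0);
}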
-
+//!
+//! \class IRecurrenceLayer
+//!
+//! \brief A recurrence layer in a network definition.
+//!
class IRecurrenceLayer : public ILoopBoundaryLayer
{
public:
@@ -5529,7 +4555,9 @@ class IRecurrenceLayer : public ILoopBoundaryLayer
};
//!
-//! An ILoopOutputLayer is the sole way to get output from a loop.
+//! \class ILoopOutputLayer
+//!
+//! \brief An ILoopOutputLayer is the sole way to get output from a loop.
//!
//! The first input tensor must be defined inside the loop; the output tensor is outside the loop.
//! The second input tensor, if present, must be defined outside the loop.
@@ -5548,6 +4576,9 @@ class IRecurrenceLayer : public ILoopBoundaryLayer
class ILoopOutputLayer : public ILoopBoundaryLayer
{
public:
+ //!
+    //! \brief Get the kind of loop output this layer produces.
+ //!
LoopOutput getLoopOutput() const noexcept
{
return mImpl->getLoopOutput();
@@ -5570,7 +4601,9 @@ class ILoopOutputLayer : public ILoopBoundaryLayer
mImpl->setAxis(axis);
}
- //! Get axis being concatenated over.
+ //!
+ //! \brief Get axis being concatenated over.
+ //!
int32_t getAxis() const noexcept
{
return mImpl->getAxis();
@@ -5591,7 +4624,7 @@ class ILoopOutputLayer : public ILoopBoundaryLayer
//! The indices in the kCONCATENATE or kREVERSE cases are as follows:
//!
//! - 0: Contribution to the output tensor. The contribution must come from inside the loop.
- //! - 1: The concatenation length scalar value, must come from outside the loop, as a 0D Int32 shape tensor.
+ //! - 1: The concatenation length scalar value, must come from outside the loop, as a 0D Int32 or Int64 shape tensor.
//!
//! If this function is called with the value 1, then the function getNbInputs() changes
//! from returning 1 to 2.
@@ -5603,9 +4636,17 @@ class ILoopOutputLayer : public ILoopBoundaryLayer
apiv::VLoopOutputLayer* mImpl;
};
+//!
+//! \class ITripLimitLayer
+//!
+//! \brief A layer that represents a trip-count limiter.
+//!
class ITripLimitLayer : public ILoopBoundaryLayer
{
public:
+ //!
+    //! \brief Get the kind of trip limit used by this layer.
+ //!
TripLimit getTripLimit() const noexcept
{
return mImpl->getTripLimit();
@@ -5616,32 +4657,49 @@ class ITripLimitLayer : public ILoopBoundaryLayer
apiv::VTripLimitLayer* mImpl;
};
+//!
+//! \class IIteratorLayer
+//!
+//! \brief A layer that iterates over a tensor inside a loop, producing one slice per iteration.
+//!
class IIteratorLayer : public ILoopBoundaryLayer
{
public:
- //! Set axis to iterate over.
+ //!
+ //! \brief Set axis to iterate over.
+ //!
void setAxis(int32_t axis) noexcept
{
mImpl->setAxis(axis);
}
- //! Get axis being iterated over.
+ //!
+ //! \brief Get axis being iterated over.
+ //!
int32_t getAxis() const noexcept
{
return mImpl->getAxis();
}
+ //!
+ //! \brief Set iteration order to be reverse.
+ //!
//! For reverse=false, the layer is equivalent to addGather(tensor, I, 0) where I is a
//! scalar tensor containing the loop iteration number.
//! For reverse=true, the layer is equivalent to addGather(tensor, M-1-I, 0) where M is the trip count
//! computed from TripLimits of kind kCOUNT.
//! The default is reverse=false.
+ //!
void setReverse(bool reverse) noexcept
{
mImpl->setReverse(reverse);
}
- //! True if and only if reversing input.
+ //!
+ //! \brief Check if the iteration order is reverse.
+ //!
+ //! \return True if and only if reversing input.
+ //!
bool getReverse() const noexcept
{
return mImpl->getReverse();
@@ -5653,9 +4711,9 @@ class IIteratorLayer : public ILoopBoundaryLayer
};
//!
-//! Helper for creating a recurrent subgraph.
+//! \class ILoop
//!
-//! An ILoop cannot be added to an INetworkDefinition where hasImplicitBatchDimensions() returns true.
+//! \brief Helper for creating a recurrent subgraph.
//!
class ILoop : public INoCopy
{
@@ -5705,6 +4763,7 @@ class ILoop : public INoCopy
return mImpl->addIterator(tensor, axis, reverse);
}
+ //!
//! \brief Make an output for this loop, based on the given tensor.
//!
//! axis is the axis for concatenation (if using outputKind of kCONCATENATE or kREVERSE).
@@ -5747,6 +4806,10 @@ class ILoop : public INoCopy
apiv::VLoop* mImpl;
};
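// A minimal usage sketch of a counted loop that sums the slices of `input` along axis 0,
// assuming `network` and `input` already exist; `tripCount` is a 0D kINT32 tensor holding the
// number of iterations and `zero` is a tensor shaped like one slice, used to seed the recurrence.
inline nvinfer1::ITensor* sumOverAxis0(nvinfer1::INetworkDefinition& network,
    nvinfer1::ITensor& input, nvinfer1::ITensor& tripCount, nvinfer1::ITensor& zero)
{
    nvinfer1::ILoop* loop = network.addLoop();
    loop->addTripLimit(tripCount, nvinfer1::TripLimit::kCOUNT);

    nvinfer1::ITensor* slice = loop->addIterator(input)->getOutput(0); // one slice of axis 0 per iteration
    nvinfer1::IRecurrenceLayer* acc = loop->addRecurrence(zero);
    nvinfer1::ITensor* sum = network
        .addElementWise(*acc->getOutput(0), *slice, nvinfer1::ElementWiseOperation::kSUM)
        ->getOutput(0);
    acc->setInput(1, *sum); // value carried into the next iteration

    // kLAST_VALUE emits the accumulated value from the final iteration.
    return loop->addLoopOutput(*sum, nvinfer1::LoopOutput::kLAST_VALUE)->getOutput(0);
}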
+//!
+//! \class ISelectLayer
+//!
+//! \brief A select layer in a network definition.
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
@@ -5757,6 +4820,7 @@ class ISelectLayer : public ILayer
apiv::VSelectLayer* mImpl;
};
+//!
//! \class IAssertionLayer
//!
//! \brief An assertion layer in a network
@@ -5812,9 +4876,28 @@ class IAssertionLayer : public ILayer
//!
enum class FillOperation : int32_t
{
- kLINSPACE = 0, //!< Generate evenly spaced numbers over a specified interval.
- kRANDOM_UNIFORM = 1, //!< Generate a tensor with random values drawn from a uniform distribution.
- kRANDOM_NORMAL = 2 //!< Generate a tensor with random values drawn from a normal distribution.
+ //! Compute each value via an affine function of its indices.
+ //! For example, suppose the parameters for the IFillLayer are:
+ //!
+ //! * Dimensions = [3,4]
+ //! * Alpha = 1
+ //! * Beta = [100,10]
+ //!
+ //! Element [i,j] of the output is Alpha + Beta[0]*i + Beta[1]*j.
+ //! Thus the output matrix is:
+ //!
+ //! 1 11 21 31
+ //! 101 111 121 131
+ //! 201 211 221 231
+ //!
+ //! A static beta b is implicitly a 1D tensor, i.e. Beta = [b].
+ kLINSPACE = 0,
+
+ //! Randomly draw values from a uniform distribution.
+ kRANDOM_UNIFORM = 1,
+
+ //! Randomly draw values from a normal distribution.
+ kRANDOM_NORMAL = 2
};
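// A minimal usage sketch of a static kLINSPACE fill producing 0, 1, ..., 9 as a 1D tensor,
// assuming `network` already exists and using the two-argument addFill overload; a dynamic
// shape would instead be supplied via setInput(0, ...).
inline nvinfer1::ITensor* addIota10(nvinfer1::INetworkDefinition& network)
{
    nvinfer1::Dims dims{};
    dims.nbDims = 1;
    dims.d[0] = 10;
    nvinfer1::IFillLayer* fill = network.addFill(dims, nvinfer1::FillOperation::kLINSPACE);
    fill->setAlpha(0.0); // start value
    fill->setBeta(1.0);  // delta; a static beta b is implicitly Beta = [b]
    return fill->getOutput(0);
}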
//!
@@ -5829,30 +4912,40 @@ constexpr inline int32_t EnumMax() noexcept
}
//!
-//! \brief Generate an output tensor with specified mode.
+//! \class IFillLayer
+//!
+//! \brief Generate a tensor according to a specified mode.
+//!
+//! The fill layer generates a tensor with values that are drawn from a random distribution
+//! or an affine function of their indices, as specified by the FillOperation.
//!
-//! The fill layer has two variants, static and dynamic. Static fill specifies its parameters
-//! at layer creation time via Dims and the get/set accessor functions of the IFillLayer.
-//! Dynamic fill specifies one or more of its parameters as ITensors, by using ILayer::setInput to add
-//! a corresponding input. The corresponding static parameter is used if an input is missing or null.
+//! When an IFillLayer is initially added to a network, all of its parameters are static.
+//! Each parameter may be changed to dynamic by setting a corresponding input.
+//! A parameter is considered dynamic even if that input is the output of an IConstantLayer.
+//! The inputs for each parameter are:
//!
-//! The shape of the output is specified by the parameter \p Dimension, or if non-null and present,
-//! the first input, which must be a 1D Int32 shape tensor. Thus an application can determine if the
-//! IFillLayer has a dynamic output shape based on whether it has a non-null first input.
+//! - 0: Dimensions
+//! - 1: Alpha
+//! - 2: Beta
//!
-//! Alpha and Beta are treated differently based on the Fill Operation specified. See details in
-//! IFillLayer::setAlpha(), IFillLayer::setBeta(), and IFillLayer::setInput().
+//! The parameter Dimensions describes the shape of the output. If the Dimensions input is provided,
+//! it must be a 1D tensor of type Int32 or Int64 whose length is computable by constant folding.
//!
-//! A fill layer can produce a shape tensor if the following restrictions are met:
+//! The meanings of Alpha and Beta depend on the mode, as described in IFillLayer::setAlpha(),
+//! IFillLayer::setBeta(), and IFillLayer::setInput(). Parameters Alpha and Beta must both be static
+//! or both be dynamic.
+//!
+//! An IFillLayer can produce a shape tensor if the following restrictions are met:
//!
//! * The FillOperation is kLINSPACE.
-//! * The output is an Int32 or Float tensor within the volume limit of a shape tensor.
-//! * There is at most one input, and if so, that input is input 0.
-//! * If input 0 exists, the length of the output tensor must be computable by constant folding.
+//! * The output has type Int32, Int64, or Float.
+//! * The volume of the output is within the volume limit imposed on shape tensors.
+//! * If input 0 exists, the values of input 0 must be computable by constant folding.
//!
//! \see FillOperation
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
+//!
class IFillLayer : public ILayer
{
public:
@@ -5865,7 +4958,7 @@ class IFillLayer : public ILayer
//!
//! \see getDimensions
//
- void setDimensions(Dims dimensions) noexcept
+ void setDimensions(Dims const& dimensions) noexcept
{
mImpl->setDimensions(dimensions);
}
@@ -5915,9 +5008,9 @@ class IFillLayer : public ILayer
//! kRANDOM_UNIFORM | the minimum value, defaults to 0.0;
//! kRANDOM_NORMAL | the mean of the normal distribution, default is 0.0;
//!
- //! If a second input had been used to create this layer, that input is reset to null by this method.
+ //! If input 1 exists, it is reset to null by this method.
//!
- //! \see getAlpha
+ //! \see getAlpha, setAlphaInt64
//
void setAlpha(double alpha) noexcept
{
@@ -5949,7 +5042,7 @@ class IFillLayer : public ILayer
//! kRANDOM_UNIFORM | the maximal value, defaults to 1.0;
//! kRANDOM_NORMAL | the standard deviation of the normal distribution, default is 1.0;
//!
- //! If a third input had been used to create this layer, that input is reset to null by this method.
+ //! If input 2 exists, it is reset to null by this method.
//!
//! \see getBeta
//!
@@ -5966,7 +5059,7 @@ class IFillLayer : public ILayer
//! If the third input is present and non-null,
//! this function returns -1.0.
//!
- //! \see setBeta
+ //! \see setBeta, setBetaInt64
//!
double getBeta() const noexcept
{
@@ -5974,32 +5067,40 @@ class IFillLayer : public ILayer
}
//!
- //! \brief replace an input of this layer with a specific tensor.
+ //! \brief Replace an input of this layer with a specific tensor.
//!
//! \param index the index of the input to set.
//! \param tensor the new input tensor
//!
- //! Indices for kLINSPACE are described as:
+ //! The three inputs correspond to these setters of IFillLayer:
+ //!
+ //! - 0: setDimensions
+ //! - 1: setAlpha
+ //! - 2: setBeta
+ //!
+ //! The following descriptions give more intuitive names for the inputs.
+ //!
+ //! Indices for kLINSPACE are:
//!
- //! - 0: Shape tensor, represents the output tensor's dimensions.
- //! - 1: Start, a scalar, represents the start value.
- //! - 2: Delta, a 1D tensor, length equals to shape tensor's nbDims, represents the delta value for each dimension.
+ //! - 0: Shape, a 1D shape tensor, specifies the output tensor's dimensions.
+ //! - 1: Start, a scalar, specifies the start value.
+ //! - 2: Delta, a 1D tensor, specifies the delta value for each dimension.
//!
- //! Indices for kRANDOM_UNIFORM are described as:
+ //! Indices for kRANDOM_UNIFORM are:
//!
- //! - 0: Shape tensor, represents the output tensor's dimensions.
- //! - 1: Minimum, a scalar, represents the minimum random value.
- //! - 2: Maximum, a scalar, represents the maximal random value.
+ //! - 0: Shape, a 1D shape tensor, specifies the output tensor's dimensions.
+ //! - 1: Minimum, a scalar, specifies the minimum random value.
+ //! - 2: Maximum, a scalar, specifies the maximal random value.
//!
- //! Indices for kRANDOM_NORMAL are described as:
+ //! Indices for kRANDOM_NORMAL are:
//!
- //! - 0: Shape tensor, represents the output tensor's dimensions.
- //! - 1: Mean, a scalar, represents the mean of the normal distribution,.
- //! - 2: Scale, a scalar, represents the standard deviation of the normal distribution.
+ //! - 0: Shape, a 1D shape tensor, specifies the output tensor's dimensions.
+    //! - 1: Mean, a scalar, specifies the mean of the normal distribution.
+ //! - 2: Scale, a scalar, specifies the standard deviation of the normal distribution.
//!
//! Using the corresponding setter resets the input to null.
//!
- //! If either inputs 1 or 2, is non-null, then both must be non-null and have the same data type.
+    //! If either input 1 or 2 is non-null, then both must be non-null and have the same data type.
//!
//! If this function is called for an index greater or equal to getNbInputs(),
//! then afterwards getNbInputs() returns index + 1, and any missing intervening
@@ -6007,6 +5108,111 @@ class IFillLayer : public ILayer
//!
using ILayer::setInput;
+ //!
+ //! \brief Set the alpha parameter with int64 datatype.
+ //!
+ //! \param alpha has different meanings for each operator:
+ //!
+ //! Operation | Usage
+ //! kLINSPACE | the start value, defaults to 0;
+ //! kRANDOM_UNIFORM | the minimum value, defaults to 0;
+ //! kRANDOM_NORMAL | the mean of the normal distribution, default is 0;
+ //!
+    //! If input 1 exists, it is reset to null by this method.
+ //!
+ //! \see getAlphaInt64
+ //
+ void setAlphaInt64(int64_t alpha) noexcept
+ {
+ mImpl->setAlphaInt64(alpha);
+ }
+
+ //!
+ //! \brief Get the value of alpha parameter with int64 datatype.
+ //!
+    //! \return An int64 value of alpha.
+ //!
+ //! If the second input is present and non-null,
+ //! this function returns -1.
+ //!
+ //! \see setAlphaInt64
+ //!
+ int64_t getAlphaInt64() const noexcept
+ {
+ return mImpl->getAlphaInt64();
+ }
+
+ //!
+ //! \brief Set the beta parameter with int64 datatype.
+ //!
+ //! \param beta has different meanings for each operator:
+ //!
+ //! Operation | Usage
+ //! kLINSPACE | the delta value, defaults to 1;
+ //! kRANDOM_UNIFORM | the maximal value, defaults to 1;
+ //! kRANDOM_NORMAL | the standard deviation of the normal distribution, default is 1;
+ //!
+    //! If input 2 exists, it is reset to null by this method.
+ //!
+ //! \see getBetaInt64
+ //!
+ void setBetaInt64(int64_t beta) noexcept
+ {
+ mImpl->setBetaInt64(beta);
+ }
+
+ //!
+ //! \brief Get the value of beta parameter with int64 datatype.
+ //!
+    //! \return An int64 value of beta.
+ //!
+ //! If the third input is present and non-null,
+    //! this function returns -1.
+ //!
+ //! \see setBetaInt64
+ //!
+ int64_t getBetaInt64() const noexcept
+ {
+ return mImpl->getBetaInt64();
+ }
+
+ //!
+ //! \brief Return true if alpha/beta have type int64, false if they have type double.
+ //!
+ bool isAlphaBetaInt64() const noexcept
+ {
+ return mImpl->isAlphaBetaInt64();
+ }
+
+ //!
+ //! \brief Set the fill layer output type.
+ //!
+ //! \param toType The DataType of the output tensor.
+ //!
+ //! Set the output type of the fill layer. Valid values are DataType::kFLOAT, DataType::kINT32,
+ //! and DataType::kINT64.
+ //! If the network is strongly typed, setToType must be used to set the output type, and use of setOutputType
+ //! is an error. Otherwise, types passed to setOutputType and setToType must be the same.
+ //!
+ //! \see NetworkDefinitionCreationFlag::kSTRONGLY_TYPED
+ //!
+ void setToType(DataType toType) noexcept
+ {
+ mImpl->setToType(toType);
+ }
+
+ //!
+ //! \brief Get the fill layer output type.
+ //!
+ //! \return toType parameter set during layer creation or by setToType().
+ //! The return value is the output type of the fill layer.
+ //! The default value is DataType::kFLOAT.
+ //!
+ DataType getToType() const noexcept
+ {
+ return mImpl->getToType();
+ }
+
protected:
virtual ~IFillLayer() noexcept = default;
apiv::VFillLayer* mImpl;
@@ -6018,32 +5224,39 @@ class IFillLayer : public ILayer
//! \brief A Quantize layer in a network definition.
//!
//! This layer accepts a floating-point data input tensor, and uses the scale and zeroPt inputs to
-//! quantize the data to an 8-bit signed integer according to:
+//! quantize the data according to:
//! \p output = clamp(round(\p input / \p scale) + \p zeroPt)
//!
//! Rounding type is rounding-to-nearest ties-to-even (https://en.wikipedia.org/wiki/Rounding#Round_half_to_even).
-//! Clamping is in the range [-128, 127].
+//! Clamping range according to data type:
+//! - FP8: [-448, 448]
+//! - INT4: [-8, 7]
+//! - INT8: [-128, 127]
//!
//! The first input (index 0) is the tensor to be quantized.
//! The second (index 1) and third (index 2) are the scale and zero point respectively.
-//! Each of \p scale and \p zeroPt must be either a scalar, or a 1D tensor.
+//! \p scale and \p zeroPt should have identical dimensions, with a rank no greater than 2.
//!
-//! The \p zeroPt tensor is optional, and if not set, will be assumed to be zero. Its data type must be
-//! DataType::kINT8. \p zeroPt must only contain zero-valued coefficients, because only symmetric quantization is
+//! The \p zeroPt tensor is optional, and if not set, will be assumed to be zero. Its data type must match the
+//! output data type. \p zeroPt must only contain zero-valued coefficients, because only symmetric quantization is
//! supported.
-//! The \p scale value must be either a scalar for per-tensor quantization, or a 1D tensor for per-channel
-//! quantization. All \p scale coefficients must have positive values. The size of the 1-D \p scale tensor must match
-//! the size of the quantization axis. The size of the \p scale must match the size of the \p zeroPt.
+//! The \p scale value must be a scalar for per-tensor quantization, a 1-D tensor for per-channel quantization, or a
+//! 2-D tensor for block quantization (supported for DataType::kINT4 only). All \p scale coefficients must have
+//! positive values. The size of the 1-D \p scale tensor must match the size of the quantization axis. For block
+//! quantization, the shape of \p scale tensor must match the shape of the input, except for one dimension in which
+//! blocking occurs. The size of \p zeroPt must match the size of \p scale.
//!
-//! The subgraph which terminates with the \p scale tensor must be a build-time constant. The same restrictions apply
+//! The subgraph which terminates with the \p scale tensor must be a build-time constant. The same restrictions apply
//! to the \p zeroPt.
-//! The output type, if constrained, must be constrained to DataType::kINT8. The input type, if constrained, must be
-//! constrained to DataType::kFLOAT or DataType::kHALF.
-//! The output size is the same as the input size. The quantization axis is in reference to the input tensor's
-//! dimensions.
+//! The output type, if constrained, must be constrained to DataType::kINT8, DataType::kFP8 or DataType::kINT4. The
+//! input type, if constrained, must be constrained to DataType::kFLOAT, DataType::kHALF, or DataType::kBF16. The
+//! output size is the same as the input size. The quantization axis is in reference to the input tensor's dimensions.
+//!
+//! IQuantizeLayer supports DataType::kFLOAT, DataType::kHALF, or DataType::kBF16 precision and will default to
+//! DataType::kFLOAT precision during instantiation. For strongly typed networks, \p input data type must match the
+//! \p scale data type.
//!
-//! IQuantizeLayer only supports DataType::kFLOAT precision and will default to this precision during instantiation.
-//! IQuantizeLayer only supports DataType::kINT8 output.
+//! IQuantizeLayer supports DataType::kINT8, DataType::kFP8, or DataType::kINT4 output.
//!
//! As an example of the operation of this layer, imagine a 4D NCHW activation input which can be quantized using a
//! single scale coefficient (referred to as per-tensor quantization):
@@ -6062,11 +5275,20 @@ class IFillLayer : public ILayer
//! For each s in S:
//! output[k,c,r,s] = clamp(round(\p input[k,c,r,s] / \p scale[k]) + \p zeroPt[k])
//!
+//! Block quantization is supported only for 2-D weight inputs of DataType::kINT4. As an example of blocked
+//! operation, imagine a 2-D RS weights input with R (dimension 0) as the blocking axis and B as the block size.
+//! The scale is a 2D array of coefficients, with dimensions (R//B, S).
+//! For each r in R:
+//! For each s in S:
+//! output[r,s] = clamp(round(\p input[r,s] / \p scale[r//B, s]) + \p zeroPt[r//B, s])
+//!
//! \note Only symmetric quantization is supported.
//! \note Currently the only allowed build-time constant \p scale and \p zeroPt subgraphs are:
//! 1. Constant -> Quantize
//! 2. Constant -> Cast -> Quantize
//!
+//! \note The input tensor for this layer must not be a scalar.
+//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
class IQuantizeLayer : public ILayer
@@ -6096,6 +5318,34 @@ class IQuantizeLayer : public ILayer
mImpl->setAxis(axis);
}
+ //!
+ //! \brief Set the Quantize layer output type.
+ //!
+ //! \param toType The DataType of the output tensor.
+ //!
+ //! Set the output type of the quantize layer. Valid values are DataType::kINT8 and DataType::kFP8.
+ //! If the network is strongly typed, setToType must be used to set the output type, and use of setOutputType
+ //! is an error. Otherwise, types passed to setOutputType and setToType must be the same.
+ //!
+ //! \see NetworkDefinitionCreationFlag::kSTRONGLY_TYPED
+ //!
+ void setToType(DataType toType) noexcept
+ {
+ mImpl->setToType(toType);
+ }
+
+ //!
+ //! \brief Return the Quantize layer output type.
+ //!
+ //! \return toType parameter set during layer creation or by setToType().
+ //! The return value is the output type of the quantize layer.
+ //! The default value is DataType::kINT8.
+ //!
+ DataType getToType() const noexcept
+ {
+ return mImpl->getToType();
+ }
+
protected:
virtual ~IQuantizeLayer() noexcept = default;
apiv::VQuantizeLayer* mImpl;
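// A minimal usage sketch of per-channel INT8 quantization over the channel axis of an NCHW
// tensor, assuming `network`, `input`, and `scale` (a 1D kFLOAT tensor rooted at an
// IConstantLayer, one coefficient per channel) already exist, and assuming the two-argument
// addQuantize overload.
inline nvinfer1::ITensor* quantizePerChannel(
    nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input, nvinfer1::ITensor& scale)
{
    nvinfer1::IQuantizeLayer* quantize = network.addQuantize(input, scale);
    quantize->setAxis(1); // axis 1 is the channel dimension of an NCHW tensor
    return quantize->getOutput(0);
}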
@@ -6106,29 +5356,35 @@ class IQuantizeLayer : public ILayer
//!
//! \brief A Dequantize layer in a network definition.
//!
-//! This layer accepts a signed 8-bit integer input tensor, and uses the configured scale and zeroPt inputs to
+//! This layer accepts a quantized type input tensor, and uses the configured scale and zeroPt inputs to
//! dequantize the input according to:
//! \p output = (\p input - \p zeroPt) * \p scale
//!
 //! The first input (index 0) is the tensor to be dequantized.
//! The second (index 1) and third (index 2) are the scale and zero point respectively.
-//! Each of \p scale and \p zeroPt must be either a scalar, or a 1D tensor.
+//! \p scale and \p zeroPt should have identical dimensions, with a rank no greater than 2.
//!
-//! The \p zeroPt tensor is optional, and if not set, will be assumed to be zero. Its data type must be
-//! DataType::kINT8. \p zeroPt must only contain zero-valued coefficients, because only symmetric quantization is
+//! The \p zeroPt tensor is optional, and if not set, will be assumed to be zero. Its data type must be identical to
+//! the input's data type. \p zeroPt must only contain zero-valued coefficients, because only symmetric quantization is
//! supported.
-//! The \p scale value must be either a scalar for per-tensor quantization, or a 1D tensor for per-channel
-//! quantization. All \p scale coefficients must have positive values. The size of the 1-D \p scale tensor must match
-//! the size of the quantization axis. The size of the \p scale must match the size of the \p zeroPt.
+//! The \p scale value must be either a scalar for per-tensor quantization, a 1-D tensor for per-channel quantization,
+//! or a 2-D tensor for block quantization (supported for DataType::kINT4 only). All \p scale coefficients must have
+//! positive values. The size of the 1-D \p scale tensor must match the size of the quantization axis. For block
+//! quantization, the shape of \p scale tensor must match the shape of the input, except for one dimension in which
+//! blocking occurs. The size of \p zeroPt must match the size of \p scale.
//!
//! The subgraph which terminates with the \p scale tensor must be a build-time constant. The same restrictions apply
//! to the \p zeroPt.
-//! The output type, if constrained, must be constrained to DataType::kFLOAT or DataType::kHALF. The input type, if
-//! constrained, must be constrained to DataType::kINT8. The output size is the same as the input size. The quantization
-//! axis is in reference to the input tensor's dimensions.
+//! The output type, if constrained, must be constrained to DataType::kFLOAT, DataType::kHALF, or DataType::kBF16. The
+//! input type, if constrained, must be constrained to DataType::kINT8, DataType::kFP8 or DataType::kINT4. The output
+//! size is the same as the input size. The quantization axis is in reference to the input tensor's dimensions.
//!
-//! IDequantizeLayer only supports DataType::kINT8 precision and will default to this precision during instantiation.
-//! IDequantizeLayer only supports DataType::kFLOAT or DataType::kHALF output.
+//! IDequantizeLayer supports DataType::kINT8, DataType::kFP8 or DataType::kINT4 precision and will default to
+//! DataType::kINT8 precision during instantiation. For strongly typed networks, \p input data type must be same as
+//! \p zeroPt data type.
+//!
+//! IDequantizeLayer supports DataType::kFLOAT, DataType::kHALF, or DataType::kBF16 output. For strongly typed
+//! networks, \p output data type is inferred from \p scale data type.
//!
//! As an example of the operation of this layer, imagine a 4D NCHW activation input which can be quantized using a
//! single scale coefficient (referred to as per-tensor quantization):
@@ -6148,11 +5404,21 @@ class IQuantizeLayer : public ILayer
//! For each s in S:
//! output[k,c,r,s] = (\p input[k,c,r,s] - \p zeroPt[k]) * \p scale[k]
//!
+//! Block dequantization is supported only for 2-D input tensors with DataType::kINT4 that are rooted at an
+//! IConstantLayer (i.e. weights). As an example of blocked operation, imagine a 2-D RS weights input with R
+//! (dimension 0) as the blocking axis and B as the block size. The scale is a 2-D array of coefficients, with
+//! dimensions (R//B, S).
+//! For each r in R:
+//! For each s in S:
+//! output[r,s] = (\p input[r,s] - \p zeroPt[r//B, s]) * \p scale[r//B, s]
+//!
//! \note Only symmetric quantization is supported.
//! \note Currently the only allowed build-time constant \p scale and \p zeroPt subgraphs are:
//! 1. Constant -> Quantize
//! 2. Constant -> Cast -> Quantize
//!
+//! \note The input tensor for this layer must not be a scalar.
+//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
class IDequantizeLayer : public ILayer
@@ -6179,7 +5445,35 @@ class IDequantizeLayer : public ILayer
//!
void setAxis(int32_t axis) noexcept
{
- mImpl->setAxis(axis);
+ mImpl->setAxis(axis);
+ }
+
+ //!
+ //! \brief Set the Dequantize layer output type.
+ //!
+ //! \param toType The DataType of the output tensor.
+ //!
+ //! Set the output type of the dequantize layer. Valid values are DataType::kFLOAT and DataType::kHALF.
+ //! If the network is strongly typed, setToType must be used to set the output type, and use of setOutputType
+ //! is an error. Otherwise, types passed to setOutputType and setToType must be the same.
+ //!
+ //! \see NetworkDefinitionCreationFlag::kSTRONGLY_TYPED
+ //!
+ void setToType(DataType toType) noexcept
+ {
+ mImpl->setToType(toType);
+ }
+
+ //!
+ //! \brief Return the Dequantize layer output type.
+ //!
+ //! \return toType parameter set during layer creation or by setToType().
+    //! The return value is the output type of the dequantize layer.
+ //! The default value is DataType::kFLOAT.
+ //!
+ DataType getToType() const noexcept
+ {
+ return mImpl->getToType();
}
protected:
@@ -6187,6 +5481,7 @@ class IDequantizeLayer : public ILayer
apiv::VDequantizeLayer* mImpl;
};
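// A minimal usage sketch of the Constant -> Dequantize pattern used for quantized weights with a
// per-tensor scale, assuming `network`, `quantizedWeights` (an INT8 constant), and `scale` (a
// scalar kFLOAT constant) already exist, and assuming the two-argument addDequantize overload.
inline nvinfer1::ITensor* dequantizeWeights(
    nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& quantizedWeights, nvinfer1::ITensor& scale)
{
    nvinfer1::IDequantizeLayer* dequantize = network.addDequantize(quantizedWeights, scale);
    return dequantize->getOutput(0);
}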
+//!
//! \class IEinsumLayer
//!
//! \brief An Einsum layer in a network
@@ -6203,9 +5498,9 @@ class IDequantizeLayer : public ILayer
//! means that those axes will be multiplied. Omitting a label from the output means values along those axes will be
//! summed. In implicit mode, the indices which appear once in the expression will be part of the output in increasing
//! alphabetical order. In explicit mode, the output can be controlled by specifying output subscript labels by adding
-//! an arrow (‘->’) followed by subscripts for the output.
-//! For example, “ij,jk->ik” is equivalent to “ij,jk”.
-//! Ellipsis (‘...’) can be used in place of subscripts to broadcast the dimensions.
+//! an arrow ('->') followed by subscripts for the output.
+//! For example, "ij,jk->ik" is equivalent to "ij,jk".
+//! Ellipsis ('...') can be used in place of subscripts to broadcast the dimensions.
//! See the TensorRT Developer Guide for more details on equation syntax.
//!
//! Many common operations can be expressed using the Einsum equation.
@@ -6254,6 +5549,8 @@ class IEinsumLayer : public ILayer
apiv::VEinsumLayer* mImpl;
};
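// A minimal usage sketch of an explicit Einsum equation expressing batched matrix multiplication,
// assuming `network`, `a`, and `b` already exist elsewhere.
inline nvinfer1::ITensor* einsumBatchedMatMul(
    nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& a, nvinfer1::ITensor& b)
{
    nvinfer1::ITensor* inputs[] = {&a, &b};
    nvinfer1::IEinsumLayer* einsum = network.addEinsum(inputs, 2, "bij,bjk->bik");
    return einsum->getOutput(0);
}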
+//!
+//! \enum ScatterMode
//!
//! \brief Control form of IScatterLayer
//!
@@ -6295,7 +5592,7 @@ constexpr inline int32_t EnumMax() noexcept
 //! ScatterMode::kELEMENT: s = q = r
//! * Output is a tensor with the same dimensions as Data that stores the resulting values of the
//! transformation. It must not be a shape tensor.
-//! The types of Data, Update, and Output shall be the same, and Indices shall be DataType::kINT32.
+//! The types of Data, Update, and Output shall be the same, and Indices shall be DataType::kINT32 or DataType::kINT64.
//!
//! The output is computed by copying the data, and then updating elements of it based on indices.
//! How Indices are interpreted depends upon the ScatterMode.
@@ -6326,7 +5623,7 @@ constexpr inline int32_t EnumMax() noexcept
//! for c in [0,n)
//! for h in [0,n)
//! for w in [0,n)
-//! output[n,c,indices[n,c,h,w],w] = updates[n,c,h,w]]
+//! output[n,c,indices[n,c,h,w],w] = updates[n,c,h,w]
//!
//! Writes to the same output element cause undefined behavior.
//!
@@ -6391,8 +5688,7 @@ class IScatterLayer : public ILayer
//! The depth tensor must be a build-time constant, and its value should be positive.
//! * Output is a tensor with rank = rank(indices)+1, where the added dimension contains the one-hot encoding.
//! The data types of Output is equal to the Values data type.
-//! * Axis is a scaler specifying to which dimension of the output one-hot encoding is added.
-//! Axis defaults to -1, that is the new dimension in the output is its final dimension.
+//! * Axis is a scalar specifying to which dimension of the output one-hot encoding is added.
//! Valid range for axis is -rank(indices)-1 <= axis <= rank(indices).
//!
//! The output is computed by copying off_values to all output elements, then setting on_value on the indices
@@ -6430,6 +5726,7 @@ class IOneHotLayer : public ILayer
apiv::VOneHotLayer* mImpl;
};
+//!
//! \class IGridSampleLayer
//!
//! \brief A GridSample layer in a network definition.
@@ -6516,6 +5813,8 @@ class IGridSampleLayer : public ILayer
virtual ~IGridSampleLayer() noexcept = default;
}; // class IGridSampleLayer
+//!
+//! \enum BoundingBoxFormat
//!
//! \brief Representation of bounding box data used for the Boxes input tensor in INMSLayer
//!
@@ -6550,7 +5849,10 @@ constexpr inline int32_t EnumMax() noexcept
//! intersection-over-union (IoU) with previously selected boxes is less than or equal to a given threshold.
//! This layer implements NMS per batch item and per class.
//!
-//! For each batch item, the ordering of candidate bounding boxes with the same score is unspecified.
+//! Per batch item, boxes are initially sorted by their scores without regard to class. Only the highest-scoring boxes, up to the TopK limit per batch item, are considered for selection.
+//! During selection, only overlapping boxes of the same class are compared, so that overlapping boxes of different classes do not suppress each other.
+//!
+//! For each batch item, the ordering of candidate bounding boxes with the same score is unspecified, but the ordering will be consistent across different runs for the same inputs.
//!
//! The layer has the following inputs, in order of input index:
//!
@@ -6661,6 +5963,7 @@ class INMSLayer : public ILayer
virtual ~INMSLayer() noexcept = default;
}; // class INMSLayer
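// A minimal usage sketch of non-maximum suppression over boxes given in corner-pair format,
// assuming `network`, `boxes`, `scores`, and `maxOutputBoxesPerClass` (a 0D kINT32 tensor)
// already exist elsewhere.
inline nvinfer1::INMSLayer* addCornerPairNMS(nvinfer1::INetworkDefinition& network,
    nvinfer1::ITensor& boxes, nvinfer1::ITensor& scores, nvinfer1::ITensor& maxOutputBoxesPerClass)
{
    nvinfer1::INMSLayer* nms = network.addNMS(boxes, scores, maxOutputBoxesPerClass);
    nms->setBoundingBoxFormat(nvinfer1::BoundingBoxFormat::kCORNER_PAIRS);
    return nms;
}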
+//!
//! \class IReverseSequenceLayer
//!
//! \brief A ReverseSequence layer in a network definition.
@@ -6672,7 +5975,7 @@ class INMSLayer : public ILayer
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
-class IReverseSequenceLayer: public ILayer
+class IReverseSequenceLayer : public ILayer
{
public:
//!
@@ -6726,6 +6029,7 @@ class IReverseSequenceLayer: public ILayer
virtual ~IReverseSequenceLayer() noexcept = default;
}; // class IReverseSequenceLayer
+//!
//! \class INormalizationLayer
//!
//! \brief A normalization layer in a network definition.
@@ -6742,10 +6046,11 @@ class IReverseSequenceLayer: public ILayer
//! Where Mean(X, axes) is a reduction over a set of axes, and Variance(X) = Mean((X - Mean(X, axes)) ^ 2, axes).
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-
+//!
class INormalizationLayer : public ILayer
{
public:
+ //!
//! \brief Set the epsilon value used for the normalization calculation.
//!
//! The default value of \p eps is 1e-5F.
@@ -6757,6 +6062,7 @@ class INormalizationLayer : public ILayer
return mImpl->setEpsilon(eps);
}
+ //!
//! \brief Get the epsilon value used for the normalization calculation.
//!
//! \return The epsilon value used for the normalization calculation.
@@ -6766,6 +6072,7 @@ class INormalizationLayer : public ILayer
return mImpl->getEpsilon();
}
+ //!
//! \brief Set the reduction axes for the normalization calculation.
//!
//! \param axesMask The axes used for the normalization calculation.
@@ -6775,6 +6082,7 @@ class INormalizationLayer : public ILayer
return mImpl->setAxes(axesMask);
}
+ //!
//! \brief Get the axes value used for the normalization calculation.
//!
//! \return The axes used for the normalization calculation.
@@ -6784,6 +6092,7 @@ class INormalizationLayer : public ILayer
return mImpl->getAxes();
}
+ //!
//! \brief Set the number of groups used to split the channels in the normalization calculation.
//!
//! The input tensor channels are divided into \p nbGroups groups, and normalization is performed per group.
@@ -6799,30 +6108,38 @@ class INormalizationLayer : public ILayer
//!
//! \param nbGroups The number of groups to split the channels into for the normalization calculation.
//!
- void setNbGroups(int32_t nbGroups) noexcept
+ void setNbGroups(int64_t nbGroups) noexcept
{
return mImpl->setNbGroups(nbGroups);
}
+ //!
//! \brief Get the number of groups used to split the channels for the normalization calculation.
//!
//! \return The number of groups used to split the channel used for the normalization calculation.
//!
- int32_t getNbGroups() const noexcept
+ int64_t getNbGroups() const noexcept
{
return mImpl->getNbGroups();
}
+ //!
//! \brief Set the compute precision of this layer.
//!
//! \param type The datatype used for the compute precision of this layer.
//!
- //! By default TensorRT will run the normalization computation in DataType::kFLOAT32 even in mixed precision
- //! mode regardless of any set builder flags to avoid overflow errors. To override this default,
- //! use this function to set the desired compute precision.
+    //! By default, to avoid overflow errors, TensorRT will run the normalization computation in DataType::kFLOAT
+ //! even in mixed precision mode regardless of builder flags. To override this default, use this method
+ //! to set the desired compute precision.
+ //!
+ //! For a weakly typed network:
//!
- //! setPrecision() and setOutputPrecision() functions can still be called to control the input and output data types
- //! to this layer.
+ //! * Method setOutputType() can still be called to control the output data type.
+ //!
+ //! * Method setPrecision() can still be called. The input data is cast to that precision before
+ //! being cast to the compute precision.
+ //!
+    //! Neither of these two methods is allowed for a strongly typed network.
//!
     //! Only DataType::kFLOAT and DataType::kHALF are valid types for \p type.
//!
@@ -6831,6 +6148,7 @@ class INormalizationLayer : public ILayer
return mImpl->setComputePrecision(type);
}
+ //!
//! \brief Get the compute precision of this layer.
//!
//! \return The datatype used for the compute precision of this layer.
@@ -6851,10 +6169,8 @@ class INormalizationLayer : public ILayer
//! \brief A network definition for input to the builder.
//!
//! A network definition defines the structure of the network, and combined with a IBuilderConfig, is built
-//! into an engine using an IBuilder. An INetworkDefinition can either have an implicit batch dimensions, specified
-//! at runtime, or all dimensions explicit, full dims mode, in the network definition. The former mode, i.e. the
-//! implicit batch size mode, has been deprecated. The function hasImplicitBatchDimension() can be used to query the
-//! mode of the network.
+//! into an engine using an IBuilder. An INetworkDefinition has all dimensions explicit (full dims mode) in the
+//! network definition. The implicit batch size mode has been deprecated.
//!
//! A network with implicit batch dimensions returns the dimensions of a layer without the implicit dimension,
//! and instead the batch is specified at execute/enqueue time. If the network has all dimensions specified, then
@@ -6875,13 +6191,12 @@ class INetworkDefinition : public INoCopy
//! The name of the input tensor is used to find the index into the buffer array for an engine built from
//! the network. The volume must be less than 2^31 elements.
//!
- //! For networks with an implicit batch dimension, this volume includes the batch dimension with its length set
- //! to the maximum batch size. For networks with all explicit dimensions and with wildcard dimensions, the volume
+ //! For networks with wildcard dimensions, the volume
     //! is based on the maxima specified by an IOptimizationProfile. Dimensions are normally non-negative integers. The
//! exception is that in networks with all explicit dimensions, -1 can be used as a wildcard for a dimension to
//! be specified at runtime. Input tensors with such a wildcard must have a corresponding entry in the
//! IOptimizationProfiles indicating the permitted extrema, and the input dimensions must be set by
- //! IExecutionContext::setBindingDimensions. Different IExecutionContext instances can have different dimensions.
+ //! IExecutionContext::setInputShape. Different IExecutionContext instances can have different dimensions.
//! Wildcard dimensions are only supported for EngineCapability::kSTANDARD. They are not
//! supported in safety contexts. DLA does not support Wildcard dimensions.
//!
@@ -6906,7 +6221,7 @@ class INetworkDefinition : public INoCopy
//!
//! \return The new tensor or nullptr if there is an error.
//!
- ITensor* addInput(char const* name, DataType type, Dims dimensions) noexcept
+ ITensor* addInput(char const* name, DataType type, Dims const& dimensions) noexcept
{
return mImpl->addInput(name, type, dimensions);
}
@@ -6926,50 +6241,47 @@ class INetworkDefinition : public INoCopy
}
//!
- //! \brief Add a convolution layer to the network.
- //!
- //! \param input The input tensor to the convolution.
- //! \param nbOutputMaps The number of output feature maps for the convolution.
- //! \param kernelSize The HW-dimensions of the convolution kernel.
- //! \param kernelWeights The kernel weights for the convolution.
- //! \param biasWeights The bias weights for the convolution. Weights{} represents no bias.
+ //! \brief Mark a tensor as a debug tensor.
//!
- //! \see IConvolutionLayer
+ //! A debug tensor can be optionally emitted at runtime.
+ //! Note that tensor names are required to specify debug
+ //! tensors at runtime.
//!
- //! \warning It is an error to specify a wildcard value for the 'C' dimension of the input tensor.
- //! \warning Int32 tensors are not valid input tensors.
+ //! \param tensor Tensor to be marked as debug
//!
- //! \return The new convolution layer, or nullptr if it could not be created.
+ //! \return True if tensor successfully marked (or was already marked), false otherwise.
//!
- //! \deprecated Superseded by addConvolutionNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
+ //! \see unmarkDebug(), IExecutionContext::setDebugListener(), ITensor::setName()
//!
- TRT_DEPRECATED IConvolutionLayer* addConvolution(
- ITensor& input, int32_t nbOutputMaps, DimsHW kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
+ bool markDebug(ITensor& tensor) noexcept
{
- return mImpl->addConvolution(input, nbOutputMaps, kernelSize, kernelWeights, biasWeights);
+ return mImpl->markDebug(tensor);
}
//!
- //! \brief Add a fully connected layer to the network.
+ //! \brief Unmark a tensor as a debug tensor.
//!
- //! \param input The input tensor to the layer.
- //! \param nbOutputs The number of outputs of the layer.
- //! \param kernelWeights The kernel weights for the fully connected layer.
- //! \param biasWeights The bias weights for the fully connected layer. Weights{} represents no bias.
+ //! Remove the marking of a tensor as a debug tensor.
//!
- //! \see IFullyConnectedLayer
+ //! \param tensor Tensor to be unmarked as debug.
//!
- //! \warning It is an error to specify a wildcard value for the 'C' dimension of the input tensor.
- //! \warning Int32 tensors are not valid input tensors.
+ //! \return True if tensor successfully unmarked (or was already unmarked), false otherwise.
+ //!
+ //! \see markDebug(), IExecutionContext::setDebugListener()
+ //!
+ bool unmarkDebug(ITensor& tensor) noexcept
+ {
+ return mImpl->unmarkDebug(tensor);
+ }
+
//!
- //! \return The new fully connected layer, or nullptr if it could not be created.
+ //! \brief Check if a tensor is marked as debug tensor.
//!
- //! \deprecated Deprecated in TensorRT 8.4. Superseded by addMatrixMultiply().
+ //! \return true if tensor is marked as debug tensor, false otherwise.
//!
- TRT_DEPRECATED IFullyConnectedLayer* addFullyConnected(
- ITensor& input, int32_t nbOutputs, Weights kernelWeights, Weights biasWeights) noexcept
+ bool isDebugTensor(nvinfer1::ITensor const& tensor) const noexcept
{
- return mImpl->addFullyConnected(input, nbOutputs, kernelWeights, biasWeights);
+ return mImpl->isDebugTensor(tensor);
}
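    // --- Editorial usage sketch (not part of the header): marking a debug tensor. ---
    // A minimal sketch assuming <NvInfer.h> is included and that `network` and `layer` already exist;
    // the tensor name "debug_out" is illustrative only.
    inline void markLayerOutputForDebug(nvinfer1::INetworkDefinition& network, nvinfer1::ILayer& layer)
    {
        nvinfer1::ITensor* t = layer.getOutput(0);
        t->setName("debug_out");                    // a name is required to select the debug tensor at runtime
        bool const marked = network.markDebug(*t);  // true if the tensor is (now) marked as a debug tensor
        if (marked && network.isDebugTensor(*t))
        {
            network.unmarkDebug(*t);                // the marking can be removed again before building
        }
    }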
//!
@@ -6982,7 +6294,8 @@ class INetworkDefinition : public INoCopy
//! output for activations that require these parameters.
//!
//! \see IActivationLayer ActivationType
- //! \warning Int32 tensors are not valid input tensors.
+ //!
+ //! \warning Int32 and Int64 are valid only for activation type kRELU.
//!
//! \return The new activation layer, or nullptr if it could not be created.
//!
@@ -6991,25 +6304,6 @@ class INetworkDefinition : public INoCopy
return mImpl->addActivation(input, type);
}
- //!
- //! \brief Add a pooling layer to the network.
- //!
- //! \param input The input tensor to the layer.
- //! \param type The type of pooling to apply.
- //! \param windowSize The size of the pooling window.
- //!
- //! \see IPoolingLayer PoolingType
- //! \warning Int32 tensors are not valid input tensors.
- //!
- //! \return The new pooling layer, or nullptr if it could not be created.
- //!
- //! \deprecated Superseded by addPoolingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED IPoolingLayer* addPooling(ITensor& input, PoolingType type, DimsHW windowSize) noexcept
- {
- return mImpl->addPooling(input, type, windowSize);
- }
-
//!
//! \brief Add a LRN layer to the network.
//!
@@ -7024,7 +6318,7 @@ class INetworkDefinition : public INoCopy
//!
//! \return The new LRN layer, or nullptr if it could not be created.
//!
- ILRNLayer* addLRN(ITensor& input, int32_t window, float alpha, float beta, float k) noexcept
+ ILRNLayer* addLRN(ITensor& input, int64_t window, float alpha, float beta, float k) noexcept
{
return mImpl->addLRN(input, window, alpha, beta, k);
}
@@ -7033,8 +6327,7 @@ class INetworkDefinition : public INoCopy
//! \brief Add a Scale layer to the network.
//!
//! \param input The input tensor to the layer.
- //! This tensor is required to have a minimum of 3 dimensions in implicit batch mode
- //! and a minimum of 4 dimensions in explicit batch mode.
+ //! This tensor must have at least 4 dimensions.
//! \param mode The scaling mode.
//! \param shift The shift value.
//! \param scale The scale value.
@@ -7086,30 +6379,6 @@ class INetworkDefinition : public INoCopy
return mImpl->addConcatenation(inputs, nbInputs);
}
- //!
- //! \brief Add a deconvolution layer to the network.
- //!
- //! \param input The input tensor to the layer.
- //! \param nbOutputMaps The number of output feature maps.
- //! \param kernelSize The HW-dimensions of the deconvolution kernel.
- //! \param kernelWeights The kernel weights for the deconvolution.
- //! \param biasWeights The bias weights for the deconvolution. Weights{} represents no bias.
- //!
- //! \see IDeconvolutionLayer
- //!
- //! \warning It is an error to specify a wildcard value for the 'C' dimension of the input tensor.
- //! \warning Int32 tensors are not valid input tensors.
- //!
- //! \return The new deconvolution layer, or nullptr if it could not be created.
- //!
- //! \deprecated Superseded by addDeconvolutionNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED IDeconvolutionLayer* addDeconvolution(
- ITensor& input, int32_t nbOutputMaps, DimsHW kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
- {
- return mImpl->addDeconvolution(input, nbOutputMaps, kernelSize, kernelWeights, biasWeights);
- }
-
//!
//! \brief Add an elementwise layer to the network.
//!
@@ -7159,23 +6428,6 @@ class INetworkDefinition : public INoCopy
return mImpl->addUnary(input, operation);
}
- //! \brief Add a padding layer to the network.
- //!
- //! \param input The input tensor to the layer.
- //! \param prePadding The padding to apply to the start of the tensor.
- //! \param postPadding The padding to apply to the end of the tensor.
- //!
- //! \see IPaddingLayer
- //!
- //! \return The new padding layer, or nullptr if it could not be created.
- //!
- //! \deprecated Superseded by addPaddingNd. Deprecated prior to TensorRT 8.0 and will be removed in 9.0
- //!
- TRT_DEPRECATED IPaddingLayer* addPadding(ITensor& input, DimsHW prePadding, DimsHW postPadding) noexcept
- {
- return mImpl->addPadding(input, prePadding, postPadding);
- }
-
//!
//! \brief Add a shuffle layer to the network.
//!
@@ -7195,7 +6447,7 @@ class INetworkDefinition : public INoCopy
//!
//! \param indices - tensor containing indices where on_value should be set.
//! \param values - a 2-element tensor, consisting of [off_value, on_value].
- //! \param depth - tensor containing the width of the added one-hot dimension.
+ //! \param depth - a shape tensor containing the width of the added one-hot dimension.
//! \param axis - the axis to add the one-hot encoding to.
//!
//! \see IOneHotLayer
@@ -7291,18 +6543,6 @@ class INetworkDefinition : public INoCopy
return mImpl->getOutput(index);
}
- //!
- //! \brief Destroy this INetworkDefinition object.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
-
//!
//! \brief Add a reduce layer to the network.
//!
@@ -7312,7 +6552,6 @@ class INetworkDefinition : public INoCopy
     //! The bit in position i of bitmask reduceAxes corresponds to explicit dimension i of the result.
//! E.g., the least significant bit corresponds to the first explicit dimension and the next to least
//! significant bit corresponds to the second explicit dimension.
- //!
//! \param keepDimensions The boolean that specifies whether or not to keep the reduced dimensions in the
//! output of the layer.
//!
@@ -7321,7 +6560,7 @@ class INetworkDefinition : public INoCopy
//!
//! \see IReduceLayer
//!
- //! \warning If output is an Int32 shape tensor, ReduceOperation::kAVG is unsupported.
+ //! \warning If output is an Int32 or Int64 shape tensor, ReduceOperation::kAVG is unsupported.
//!
//! \return The new reduce layer, or nullptr if it could not be created.
//!
@@ -7356,8 +6595,6 @@ class INetworkDefinition : public INoCopy
//!
//! \see ITopKLayer
//!
- //! \warning Int32 tensors are not valid input tensors.
- //!
//! \return The new TopK layer, or nullptr if it could not be created.
//!
ITopKLayer* addTopK(ITensor& input, TopKOperation op, int32_t k, uint32_t reduceAxes) noexcept
@@ -7407,6 +6644,7 @@ class INetworkDefinition : public INoCopy
//!
//! \warning The bounds tensor cannot have the last dimension be the wildcard character.
//! \warning Int32 tensors are not valid input tensors.
+ //! \warning The input and bounds tensors should be 3D tensors.
//!
//! \return The new RaggedSoftMax layer, or nullptr if it could not be created.
//!
@@ -7465,90 +6703,16 @@ class INetworkDefinition : public INoCopy
     //! Otherwise the output is a tensor of real values and the output type will
     //! follow TensorRT's normal precision rules.
//!
- //! If tensors in the network have an implicit batch dimension, the constant
- //! is broadcast over that dimension.
- //!
//! If a wildcard dimension is used, the volume of the runtime dimensions must equal
//! the number of weights specified.
//!
//! \warning DataType::kUINT8 not supported.
//!
- IConstantLayer* addConstant(Dims dimensions, Weights weights) noexcept
+ IConstantLayer* addConstant(Dims const& dimensions, Weights weights) noexcept
{
return mImpl->addConstant(dimensions, weights);
}
- //!
- //! \brief Add an \p layerCount deep RNN layer to the network with \p hiddenSize internal states that can
- //! take a batch with fixed or variable sequence lengths.
- //!
- //! \param input The input tensor to the layer (see below).
- //! \param layerCount The number of layers in the RNN.
- //! \param hiddenSize Size of the internal hidden state for each layer.
- //! \param maxSeqLen Maximum sequence length for the input.
- //! \param op The type of RNN to execute.
- //!
- //! By default, the layer is configured with RNNDirection::kUNIDIRECTION and RNNInputMode::kLINEAR.
- //! To change these settings, use IRNNv2Layer::setDirection() and IRNNv2Layer::setInputMode().
- //!
- //! %Weights and biases for the added layer should be set using
- //! IRNNv2Layer::setWeightsForGate() and IRNNv2Layer::setBiasForGate() prior
- //! to building an engine using this network.
- //!
- //! The input tensors must be of the type DataType::kFLOAT or DataType::kHALF.
- //! The layout of the weights is row major and must be the same datatype as the input tensor.
- //! \p weights contain 8 matrices and \p bias contains 8 vectors.
- //!
- //! See IRNNv2Layer::setWeightsForGate() and IRNNv2Layer::setBiasForGate() for details on the required input
- //! format for \p weights and \p bias.
- //!
- //! The \p input ITensor should contain zero or more index dimensions `{N1, ..., Np}`, followed by
- //! two dimensions, defined as follows:
- //! - `S_max` is the maximum allowed sequence length (number of RNN iterations)
- //! - `E` specifies the embedding length (unless RNNInputMode::kSKIP is set, in which case it should match
- //! getHiddenSize()).
- //!
- //! By default, all sequences in the input are assumed to be size \p maxSeqLen. To provide explicit sequence
- //! lengths for each input sequence in the batch, use IRNNv2Layer::setSequenceLengths().
- //!
- //! The RNN layer outputs up to three tensors.
- //!
- //! The first output tensor is the output of the final RNN layer across all timesteps, with dimensions
- //! `{N1, ..., Np, S_max, H}`:
- //!
- //! - `N1..Np` are the index dimensions specified by the input tensor
- //! - `S_max` is the maximum allowed sequence length (number of RNN iterations)
- //! - `H` is an output hidden state (equal to getHiddenSize() or 2x getHiddenSize())
- //!
- //! The second tensor is the final hidden state of the RNN across all layers, and if the RNN
- //! is an LSTM (i.e. getOperation() is RNNOperation::kLSTM), then the third tensor is the final cell state
- //! of the RNN across all layers. Both the second and third output tensors have dimensions
- //! `{N1, ..., Np, L, H}`:
- //!
- //! - `N1..Np` are the index dimensions specified by the input tensor
- //! - `L` is the number of layers in the RNN, equal to getLayerCount() if getDirection is
- //! RNNDirection::kUNIDIRECTION,
- //! and 2x getLayerCount() if getDirection is RNNDirection::kBIDIRECTION. In the bi-directional
- //! case, layer `l`'s final forward hidden state is stored in `L = 2*l`, and
- //! final backward hidden state is stored in `L= 2*l + 1`.
- //! - `H` is the hidden state for each layer, equal to getHiddenSize().
- //!
- //! \see IRNNv2Layer
- //!
- //! \deprecated Deprecated prior to TensorRT 8.0 and will be removed in 9.0. Superseded by
- //! INetworkDefinition::addLoop().
- //!
- //! \warning RNN inputs do not support wildcard dimensions or explicit batch size networks.
- //! \warning Int32 tensors are not valid input tensors, only for sequence lengths.
- //!
- //! \return The new RNN layer, or nullptr if it could not be created.
- //!
- TRT_DEPRECATED IRNNv2Layer* addRNNv2(
- ITensor& input, int32_t layerCount, int32_t hiddenSize, int32_t maxSeqLen, RNNOperation op) noexcept
- {
- return mImpl->addRNNv2(input, layerCount, hiddenSize, maxSeqLen, op);
- }
-
//!
//! \brief Add an identity layer.
//!
@@ -7624,6 +6788,25 @@ class INetworkDefinition : public INoCopy
return mImpl->addPluginV2(inputs, nbInputs, plugin);
}
+ //!
+ //! \brief Add a plugin layer implementing the IPluginV3 interface to the network.
+ //!
+ //! \param inputs The input tensors to the layer.
+ //! \param nbInputs The number of input tensors.
+ //! \param shapeInputs Shape tensor inputs to the layer.
+ //! \param nbShapeInputs The number of shape tensor inputs.
+ //! \param plugin The layer plugin.
+ //!
+ //! \see IPluginV3Layer
+ //!
+ //! \return The new plugin layer, or nullptr if it could not be created.
+ //!
+ IPluginV3Layer* addPluginV3(ITensor* const* inputs, int32_t nbInputs, ITensor* const* shapeInputs,
+ int32_t nbShapeInputs, IPluginV3& plugin) noexcept
+ {
+ return mImpl->addPluginV3(inputs, nbInputs, shapeInputs, nbShapeInputs, plugin);
+ }
+
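// --- Editorial usage sketch (not part of the header): adding an IPluginV3 plugin layer. ---
// A hedged sketch; `plugin` is assumed to be an already-created IPluginV3 instance, and the data/shape
// input tensors are assumed to exist in `network`.
inline nvinfer1::IPluginV3Layer* addMyPluginV3(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& data,
    nvinfer1::ITensor& shape, nvinfer1::IPluginV3& plugin)
{
    nvinfer1::ITensor* inputs[] = {&data};       // regular (device) inputs
    nvinfer1::ITensor* shapeInputs[] = {&shape}; // shape tensor inputs; may be empty for many plugins
    return network.addPluginV3(inputs, 1, shapeInputs, 1, plugin);
}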
//!
//! \brief Add a slice layer to the network.
//!
@@ -7638,7 +6821,7 @@ class INetworkDefinition : public INoCopy
//!
//! \return The new slice layer, or nullptr if it could not be created.
//!
- ISliceLayer* addSlice(ITensor& input, Dims start, Dims size, Dims stride) noexcept
+ ISliceLayer* addSlice(ITensor& input, Dims const& start, Dims const& size, Dims const& stride) noexcept
{
return mImpl->addSlice(input, start, size, stride);
}
@@ -7700,21 +6883,39 @@ class INetworkDefinition : public INoCopy
//!
//! \brief Query whether the network was created with an implicit batch dimension.
//!
- //! \return True if tensors have implicit batch dimension, false otherwise.
- //!
- //! This is a network-wide property. Either all tensors in the network
- //! have an implicit batch dimension or none of them do.
- //!
- //! hasImplicitBatchDimension() is true if and only if this INetworkDefinition
- //! was created with createNetworkV2() without NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag.
+ //! \return Always false since TensorRT 10.0 does not support an implicit batch dimension.
//!
//! \see createNetworkV2
//!
- bool hasImplicitBatchDimension() const noexcept
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch is not supported since TensorRT 10.0.
+ //!
+ TRT_DEPRECATED bool hasImplicitBatchDimension() const noexcept
{
return mImpl->hasImplicitBatchDimension();
}
+ //!
+ //! \brief Get the network definition creation flags for this network definition object. Defaults to 0.
+ //!
+ //! \return The network definition creation options as a bitmask.
+ //!
+ NetworkDefinitionCreationFlags getFlags() const noexcept
+ {
+ return mImpl->getFlags();
+ }
+
+ //!
+ //! \brief Returns true if the network definition creation flag is set
+ //!
+ //! \see getFlags()
+ //!
+ //! \return True if flag is set, false if unset.
+ //!
+ bool getFlag(NetworkDefinitionCreationFlag networkDefinitionCreationFlag) const noexcept
+ {
+ return mImpl->getFlag(networkDefinitionCreationFlag);
+ }
+
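// --- Editorial usage sketch (not part of the header): querying network creation flags. ---
// A hedged sketch; it assumes the network was created via IBuilder::createNetworkV2 and that
// NetworkDefinitionCreationFlag::kSTRONGLY_TYPED is the flag of interest.
inline bool isStronglyTyped(nvinfer1::INetworkDefinition const& network)
{
    // getFlags() returns the full bitmask; getFlag() tests a single creation flag.
    return network.getFlag(nvinfer1::NetworkDefinitionCreationFlag::kSTRONGLY_TYPED);
}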
//!
//! \brief Enable tensor's value to be computed by IExecutionContext::getShapeBinding.
//!
@@ -7726,7 +6927,6 @@ class INetworkDefinition : public INoCopy
//!
//! \warning It is an error to mark a network input as a shape output.
//!
- //! \see isShapeBinding(), getShapeBinding()
//!
bool markOutputForShapes(ITensor& tensor) noexcept
{
@@ -7781,7 +6981,7 @@ class INetworkDefinition : public INoCopy
//! \return The new convolution layer, or nullptr if it could not be created.
//!
IConvolutionLayer* addConvolutionNd(
- ITensor& input, int32_t nbOutputMaps, Dims kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
+ ITensor& input, int64_t nbOutputMaps, Dims const& kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
{
return mImpl->addConvolutionNd(input, nbOutputMaps, kernelSize, kernelWeights, biasWeights);
}
@@ -7800,7 +7000,7 @@ class INetworkDefinition : public INoCopy
//!
//! \return The new pooling layer, or nullptr if it could not be created.
//!
- IPoolingLayer* addPoolingNd(ITensor& input, PoolingType type, Dims windowSize) noexcept
+ IPoolingLayer* addPoolingNd(ITensor& input, PoolingType type, Dims const& windowSize) noexcept
{
return mImpl->addPoolingNd(input, type, windowSize);
}
@@ -7823,7 +7023,7 @@ class INetworkDefinition : public INoCopy
//! \return The new deconvolution layer, or nullptr if it could not be created.
//!
IDeconvolutionLayer* addDeconvolutionNd(
- ITensor& input, int32_t nbOutputMaps, Dims kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
+ ITensor& input, int64_t nbOutputMaps, Dims kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
{
return mImpl->addDeconvolutionNd(input, nbOutputMaps, kernelSize, kernelWeights, biasWeights);
}
@@ -7865,6 +7065,7 @@ class INetworkDefinition : public INoCopy
return mImpl->addScaleNd(input, mode, shift, scale, power, channelAxis);
}
+ //!
//! \brief Add a resize layer to the network.
//!
//! \param input The input tensor to the layer.
@@ -7881,35 +7082,35 @@ class INetworkDefinition : public INoCopy
}
//!
- //! \brief True if network is an explicit precision network
+ //! \brief Add a loop to the network.
//!
- //! \deprecated Deprecated in TensorRT 8.0.
+ //! An ILoop provides a way to specify a recurrent subgraph.
//!
- //! \see createNetworkV2
+ //! \return Pointer to ILoop that can be used to add loop-boundary layers for the loop.
//!
- //! \return True if network has explicit precision, false otherwise.
+ //! \see ILoop
//!
- TRT_DEPRECATED bool hasExplicitPrecision() const noexcept
+ ILoop* addLoop() noexcept
{
- return mImpl->hasExplicitPrecision();
+ return mImpl->addLoop();
}
//!
- //! \brief Add a loop to the network.
+ //! \brief Add an if-then-else to the network.
//!
- //! An ILoop provides a way to specify a recurrent subgraph.
+ //! An IIfConditional provides a way to conditionally execute parts of the network.
//!
- //! \return Pointer to ILoop that can be used to add loop boundary layers for the loop,
- //! or nullptr if network has an implicit batch dimension or this version
- //! of TensorRT does not support loops.
+ //! \return Pointer to the IIfConditional that can be used to add conditional-boundary layers
+ //! for the if-then-else.
//!
- //! The network must not have an implicit batch dimension.
+ //! \see IIfConditional
//!
- ILoop* addLoop() noexcept
+ IIfConditional* addIfConditional() noexcept
{
- return mImpl->addLoop();
+ return mImpl->addIfConditional();
}
+ //!
//! \brief Add a select layer to the network.
//!
//! \param condition The condition tensor to the layer. Must have type DataType::kBOOL.
@@ -7938,8 +7139,6 @@ class INetworkDefinition : public INoCopy
//!
//! then the output dimensions are [1,3,0,9].
//!
- //! The network must not have an implicit batch dimension.
- //!
//! The inputs are shape tensors if the output is a shape tensor.
//!
//! \see ISelectLayer
@@ -7967,29 +7166,58 @@ class INetworkDefinition : public INoCopy
return mImpl->addAssertion(condition, message);
}
+ //!
//! \brief Add a fill layer to the network.
//!
- //! \param dimensions The output tensor dimensions.
+ //! \param dimensions The output tensor dimensions if input 0 is missing.
//! \param op The fill operation that the layer applies.
//!
- //! \warning For FillOperation::kLINSPACE, dimensions.nbDims must be 1.
+ //! \warning For FillOperation::kLINSPACE, dimensions.nbDims must be 1 for static start/delta. If delta is provided
+ //! as a 1D tensor, the length of delta must match dimensions.nbDims.
//!
//! This layer is non-deterministic across subsequent calls as the same inputs will produce different
//! output tensors if \p op is either FillOperation::kRANDOM_UNIFORM or FillOperation::kRANDOM_NORMAL
     //! due to random state being shared across calls. The output tensors generated are deterministic when
//! starting from the same initial state.
//!
- //! The network must not have an implicit batch dimension.
- //!
//! \see IFillLayer
//!
//! \return The new fill layer, or nullptr if it could not be created.
//!
- IFillLayer* addFill(Dims dimensions, FillOperation op) noexcept
+ //! \deprecated Deprecated in TensorRT 9.0. Superseded by three-argument addFill.
+ //!
+ TRT_DEPRECATED IFillLayer* addFill(Dims const& dimensions, FillOperation op) noexcept
{
return mImpl->addFill(dimensions, op);
}
+ //!
+ //! \brief Add a fill layer to the network.
+ //!
+ //! \param dimensions The output tensor dimensions if input 0 is missing.
+ //! \param op The fill operation that the layer applies.
+ //! \param outputType Optional output tensor data type, must be DataType::kFLOAT, DataType::kHALF, DataType::kINT32,
+ //! or DataType::kINT64. This parameter is only used for static alpha/beta. Future calls to set output type using
+ //! setToType or setOutputType must be consistent.
+ //!
+ //! \warning For FillOperation::kLINSPACE, dimensions.nbDims must be 1 for static start/delta. If delta is provided
+ //! as a 1D tensor, the length of delta must match dimensions.nbDims.
+ //!
+ //! This layer is non-deterministic across subsequent calls as the same inputs will produce different
+ //! output tensors if \p op is either FillOperation::kRANDOM_UNIFORM or FillOperation::kRANDOM_NORMAL
+ //! due to random state being shared across calls. The output tensors generated are deterministic when
+ //! starting from the same initial state.
+ //!
+ //! \see IFillLayer
+ //!
+ //! \return The new fill layer, or nullptr if it could not be created.
+ //!
+ IFillLayer* addFill(Dims const& dimensions, FillOperation op, DataType outputType) noexcept
+ {
+ return mImpl->addFillV2(dimensions, op, outputType);
+ }
+
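// --- Editorial usage sketch (not part of the header): a static linspace fill with an explicit output type. ---
// A hedged sketch; the dimensions and the alpha/beta values below are illustrative only.
inline nvinfer1::IFillLayer* addLinspaceFill(nvinfer1::INetworkDefinition& network)
{
    nvinfer1::Dims dims{};
    dims.nbDims = 1;
    dims.d[0] = 8; // produce 8 values
    auto* fill = network.addFill(dims, nvinfer1::FillOperation::kLINSPACE, nvinfer1::DataType::kFLOAT);
    if (fill != nullptr)
    {
        fill->setAlpha(0.0); // start value
        fill->setBeta(1.0);  // delta between consecutive values
    }
    return fill;
}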
+ //!
//! \brief Add a padding layer to the network. Only 2D padding is currently supported.
//!
//! \param input The input tensor to the layer.
@@ -8000,13 +7228,12 @@ class INetworkDefinition : public INoCopy
//!
//! \return The new padding layer, or nullptr if it could not be created.
//!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by addSlice().
- //!
- TRT_DEPRECATED IPaddingLayer* addPaddingNd(ITensor& input, Dims prePadding, Dims postPadding) noexcept
+ IPaddingLayer* addPaddingNd(ITensor& input, Dims const& prePadding, Dims const& postPadding) noexcept
{
return mImpl->addPaddingNd(input, prePadding, postPadding);
}
+ //!
//! \brief Associate a name with all current uses of the given weights.
//!
//! The name must be set after the Weights are used in the network.
@@ -8072,17 +7299,40 @@ class INetworkDefinition : public INoCopy
//!
//! \see IDequantizeLayer
//!
- //! \p input tensor data type must be DataType::kFLOAT.
+ //! \p input tensor data type must be DataType::kINT8/DataType::kFP8.
//! \p scale tensor data type must be DataType::kFLOAT. The subgraph which terminates with the \p scale tensor must
//! be a build-time constant.
//!
     //! \return The new dequantization layer, or nullptr if it could not be created.
//!
- IDequantizeLayer* addDequantize(ITensor& input, ITensor& scale) noexcept
+ //! \deprecated Deprecated in TensorRT 9.0. Superseded by three-argument addDequantize.
+ //!
+ TRT_DEPRECATED IDequantizeLayer* addDequantize(ITensor& input, ITensor& scale) noexcept
{
return mImpl->addDequantize(input, scale);
}
+ //!
+ //! \brief Add a dequantization layer to the network.
+ //!
+ //! \param input The input tensor to be dequantized.
+ //! \param scale A tensor with the scale value.
+ //!
+ //! \see IDequantizeLayer
+ //!
+ //! \p input tensor data type must be DataType::kINT8/DataType::kFP8/DataType::kINT4.
+ //! \p scale tensor data type defaults to DataType::kFLOAT. For strongly typed networks, it must be the same as the
+ //! output data type. The subgraph which terminates with the \p scale tensor must be a build-time constant.
+ //! \p outputType output tensor data type, default value is DataType::kFLOAT. Future calls to set output type using
+ //! setToType or setOutputType must be consistent. For strongly typed networks, it must be the same as the scale data type.
+ //!
+    //! \return The new dequantization layer, or nullptr if it could not be created.
+ //!
+ IDequantizeLayer* addDequantize(ITensor& input, ITensor& scale, DataType outputType) noexcept
+ {
+ return mImpl->addDequantizeV2(input, scale, outputType);
+ }
+
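// --- Editorial usage sketch (not part of the header): explicit dequantization with an output type. ---
// A hedged sketch; `quantized` is assumed to be a DataType::kINT8 tensor and `scale` a build-time
// constant DataType::kFLOAT tensor already present in the network.
inline nvinfer1::IDequantizeLayer* addExplicitDequantize(
    nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& quantized, nvinfer1::ITensor& scale)
{
    // The third argument selects the dequantized output type.
    return network.addDequantize(quantized, scale, nvinfer1::DataType::kFLOAT);
}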
//!
//! \brief Add a Scatter layer to the network with specified mode and axis=0.
//!
@@ -8111,32 +7361,41 @@ class INetworkDefinition : public INoCopy
//!
//! \see IQuantizeLayer
//!
- //! \p input tensor data type must be DataType::kFLOAT.
+ //! \p input tensor data type must be DataType::kFLOAT/DataType::kHALF.
//! \p scale tensor data type must be DataType::kFLOAT. The subgraph which terminates with the \p scale tensor must
//! be a build-time constant.
//!
//! \return The new quantization layer, or nullptr if it could not be created.
//!
- IQuantizeLayer* addQuantize(ITensor& input, ITensor& scale) noexcept
+ //! \deprecated Deprecated in TensorRT 9.0. Superseded by three-argument addQuantize.
+ //!
+ TRT_DEPRECATED IQuantizeLayer* addQuantize(ITensor& input, ITensor& scale) noexcept
{
return mImpl->addQuantize(input, scale);
}
//!
- //! \brief Add an If-conditional layer to the network.
+ //! \brief Add a quantization layer to the network.
//!
- //! An IIfConditional provides a way to conditionally execute parts of the network.
+ //! \param input The input tensor to be quantized.
+ //! \param scale A tensor with the scale value.
//!
- //! \see IIfConditional
+ //! \see IQuantizeLayer
//!
- //! \return The new conditional layer, or nullptr if network has an implicit batch dimension
- //! or this version of TensorRT does not support conditional execution.
+ //! \p input tensor data type must be DataType::kFLOAT/DataType::kHALF/DataType::kBF16.
+ //! \p scale tensor data type defaults to DataType::kFLOAT. For strongly typed networks, it must have the same data
+ //! type as the input. The subgraph which terminates with the \p scale tensor must be a build-time constant.
+ //! \p outputType output tensor data type, must be DataType::kINT8 (default), DataType::kFP8 or DataType::kINT4.
+ //! Future calls to set output type using setToType or setOutputType must be consistent.
//!
- IIfConditional* addIfConditional() noexcept
+ //! \return The new quantization layer, or nullptr if it could not be created.
+ //!
+ IQuantizeLayer* addQuantize(ITensor& input, ITensor& scale, DataType outputType) noexcept
{
- return mImpl->addIfConditional();
+ return mImpl->addQuantizeV2(input, scale, outputType);
}
+ //!
//! \brief Add an Einsum layer to the network.
//!
//! \param inputs The input tensors to the layer.
@@ -8151,10 +7410,12 @@ class INetworkDefinition : public INoCopy
return mImpl->addEinsum(inputs, nbInputs, equation);
}
+ //!
//! \brief Add a GridSample layer to the network.
//!
//! \param input The input tensor to the layer.
//! \param grid The grid tensor to the layer.
+ //!
//! \see IGridSampleLayer
//!
//! Creates a GridSample layer with a InterpolationMode::kLINEAR, unaligned corners,
@@ -8223,8 +7484,7 @@ class INetworkDefinition : public INoCopy
//!
//! \return The new normalization layer, or nullptr if it could not be created.
//!
- INormalizationLayer* addNormalization(
- ITensor& input, ITensor& scale, ITensor& bias, uint32_t axesMask) noexcept
+ INormalizationLayer* addNormalization(ITensor& input, ITensor& scale, ITensor& bias, uint32_t axesMask) noexcept
{
return mImpl->addNormalization(input, scale, bias, axesMask);
}
@@ -8245,16 +7505,16 @@ class INetworkDefinition : public INoCopy
};
//!
-//! enum CalibrationAlgoType
+//! \enum CalibrationAlgoType
//!
//! \brief Version of calibration algorithm to use.
//!
enum class CalibrationAlgoType : int32_t
{
- kLEGACY_CALIBRATION = 0,
- kENTROPY_CALIBRATION = 1,
- kENTROPY_CALIBRATION_2 = 2,
- kMINMAX_CALIBRATION = 3,
+ kLEGACY_CALIBRATION = 0, //!< Legacy calibration
+ kENTROPY_CALIBRATION = 1, //!< Legacy entropy calibration
+ kENTROPY_CALIBRATION_2 = 2, //!< Entropy calibration
+ kMINMAX_CALIBRATION = 3, //!< Minmax calibration
};
//!
@@ -8279,7 +7539,7 @@ constexpr inline int32_t EnumMax() noexcept
//! the distribution of activations. It may optionally implement a method for caching the calibration result for reuse
//! on subsequent runs.
//!
-class IInt8Calibrator
+class IInt8Calibrator : public IVersionedInterface
{
public:
//!
@@ -8287,7 +7547,9 @@ class IInt8Calibrator
//!
//! \return The batch size.
//!
- virtual int32_t getBatchSize() const noexcept = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch support is removed in TensorRT 10.0.
+ //!
+ TRT_DEPRECATED virtual int32_t getBatchSize() const noexcept = 0;
//!
//! \brief Get a batch of input for calibration.
@@ -8298,6 +7560,7 @@ class IInt8Calibrator
//! containing each network input data.
//! \param names The names of the network input for each pointer in the binding array.
//! \param nbBindings The number of pointers in the bindings array.
+ //!
//! \return False if there are no more batches for calibration.
//!
//! \see getBatchSize()
@@ -8337,16 +7600,22 @@ class IInt8Calibrator
//!
virtual CalibrationAlgoType getAlgorithm() noexcept = 0;
- virtual ~IInt8Calibrator() noexcept = default;
+ ~IInt8Calibrator() noexcept override = default;
};
-//!
-//! Entropy calibrator. This is the Legacy Entropy calibrator. It is less complicated than the legacy calibrator and
-//! produces better results.
-//!
+namespace v_1_0
+{
class IInt8EntropyCalibrator : public IInt8Calibrator
{
public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IInt8EntropyCalibrator", 1, 0};
+ }
+
//!
//! Signal that this is the entropy calibrator.
//!
@@ -8355,16 +7624,36 @@ class IInt8EntropyCalibrator : public IInt8Calibrator
return CalibrationAlgoType::kENTROPY_CALIBRATION;
}
- virtual ~IInt8EntropyCalibrator() noexcept = default;
+ ~IInt8EntropyCalibrator() noexcept override = default;
};
+} // namespace v_1_0
//!
-//! Entropy calibrator 2. This is the preferred calibrator. This is the required calibrator for DLA, as it supports per
-//! activation tensor scaling.
+//! \class IInt8EntropyCalibrator
+//!
+//! \brief Entropy calibrator.
+//!
+//! This is the Legacy Entropy calibrator. It is less complicated than the legacy calibrator and
+//! produces better results.
+//!
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IInt8EntropyCalibrator, not
+//! v_1_0::IInt8EntropyCalibrator.
//!
+using IInt8EntropyCalibrator = v_1_0::IInt8EntropyCalibrator;
+
+namespace v_1_0
+{
class IInt8EntropyCalibrator2 : public IInt8Calibrator
{
public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IInt8EntropyCalibrator2", 1, 0};
+ }
+
//!
//! Signal that this is the entropy calibrator 2.
//!
@@ -8373,15 +7662,36 @@ class IInt8EntropyCalibrator2 : public IInt8Calibrator
return CalibrationAlgoType::kENTROPY_CALIBRATION_2;
}
- virtual ~IInt8EntropyCalibrator2() noexcept = default;
+ ~IInt8EntropyCalibrator2() noexcept override = default;
};
+} // namespace v_1_0
//!
-//! MinMax Calibrator. It supports per activation tensor scaling.
+//! \class IInt8EntropyCalibrator2
+//!
+//! \brief Entropy calibrator 2.
+//!
+//! This is the preferred calibrator. This is the required calibrator for DLA, as it supports per
+//! activation tensor scaling.
+//!
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IInt8EntropyCalibrator2, not
+//! v_1_0::IInt8EntropyCalibrator2.
//!
+using IInt8EntropyCalibrator2 = v_1_0::IInt8EntropyCalibrator2;
+
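// --- Editorial sketch (not part of the header): a minimal IInt8EntropyCalibrator2 skeleton. ---
// A hedged outline only; a real calibrator must copy calibration batches to device memory and
// usually caches the calibration table. All names below are illustrative.
class MyEntropyCalibrator : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    int32_t getBatchSize() const noexcept override { return 1; }
    bool getBatch(void* bindings[], char const* names[], int32_t nbBindings) noexcept override
    {
        return false; // no more batches; a real implementation fills `bindings` with device pointers
    }
    void const* readCalibrationCache(std::size_t& length) noexcept override
    {
        length = 0;
        return nullptr; // returning nullptr forces calibration instead of using a cached table
    }
    void writeCalibrationCache(void const* ptr, std::size_t length) noexcept override {}
};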
+namespace v_1_0
+{
class IInt8MinMaxCalibrator : public IInt8Calibrator
{
public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IInt8MinMaxCalibrator", 1, 0};
+ }
+
//!
//! Signal that this is the MinMax Calibrator.
//!
@@ -8390,16 +7700,35 @@ class IInt8MinMaxCalibrator : public IInt8Calibrator
return CalibrationAlgoType::kMINMAX_CALIBRATION;
}
- virtual ~IInt8MinMaxCalibrator() noexcept = default;
+ ~IInt8MinMaxCalibrator() noexcept override = default;
};
+} // namespace v_1_0
//!
-//! Legacy calibrator left for backward compatibility with TensorRT 2.0. This calibrator requires user parameterization,
-//! and is provided as a fallback option if the other calibrators yield poor results.
+//! \class IInt8MinMaxCalibrator
+//!
+//! \brief MinMax Calibrator.
+//!
+//! It supports per activation tensor scaling.
+//!
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IInt8MinMaxCalibrator, not
+//! v_1_0::IInt8MinMaxCalibrator.
//!
+using IInt8MinMaxCalibrator = v_1_0::IInt8MinMaxCalibrator;
+
+namespace v_1_0
+{
class IInt8LegacyCalibrator : public IInt8Calibrator
{
public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+        return InterfaceInfo{"IInt8LegacyCalibrator", 1, 0};
+ }
+
//!
//! Signal that this is the legacy calibrator.
//!
@@ -8448,8 +7777,22 @@ class IInt8LegacyCalibrator : public IInt8Calibrator
//!
virtual void writeHistogramCache(void const* ptr, std::size_t length) noexcept = 0;
- virtual ~IInt8LegacyCalibrator() noexcept = default;
+ ~IInt8LegacyCalibrator() noexcept override = default;
};
+} // namespace v_1_0
+
+//!
+//! \class IInt8LegacyCalibrator
+//!
+//! \brief Legacy calibrator.
+//!
+//! This calibrator requires user parameterization,
+//! and is provided as a fallback option if the other calibrators yield poor results.
+//!
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IInt8LegacyCalibrator, not
+//! v_1_0::IInt8LegacyCalibrator.
+//!
+using IInt8LegacyCalibrator = v_1_0::IInt8LegacyCalibrator;
//!
//! \class IAlgorithmIOInfo
@@ -8464,19 +7807,6 @@ class IInt8LegacyCalibrator : public IInt8Calibrator
class IAlgorithmIOInfo : public INoCopy
{
public:
- //!
- //! \brief Return TensorFormat of the input/output of algorithm.
- //!
- //! \deprecated Deprecated in TensorRT 8.6. The strides, data type, and vectorization
- //! information is sufficient to uniquely identify tensor formats.
- //!
- //! \return the tensor format
- //!
- TRT_DEPRECATED TensorFormat getTensorFormat() const noexcept
- {
- return mImpl->getTensorFormat();
- }
-
//!
//! \brief Return DataType of the input/output of algorithm.
//!
@@ -8572,6 +7902,7 @@ class IAlgorithmContext : public INoCopy
public:
//!
//! \brief Return name of the algorithm node.
+ //!
//! This is a unique identifier for the IAlgorithmContext.
//!
char const* getName() const noexcept
@@ -8581,6 +7912,7 @@ class IAlgorithmContext : public INoCopy
//!
//! \brief Get the minimum / optimum / maximum dimensions for input or output tensor.
+ //!
//! \param index Index of the input or output of the algorithm. Incremental numbers assigned to indices of inputs
//! and the outputs.
//! \param select Which of the minimum, optimum, or maximum dimensions to be queried.
@@ -8613,9 +7945,11 @@ class IAlgorithmContext : public INoCopy
//!
//! \class IAlgorithm
+//!
//! \brief Describes a variation of execution of a layer.
//! An algorithm is represented by IAlgorithmVariant and the IAlgorithmIOInfo for each of its inputs and outputs.
-//! An algorithm can be selected or reproduced using AlgorithmSelector::selectAlgorithms()."
+//! An algorithm can be selected or reproduced using AlgorithmSelector::selectAlgorithms().
+//!
//! \see IAlgorithmIOInfo, IAlgorithmVariant, IAlgorithmSelector::selectAlgorithms()
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
@@ -8623,21 +7957,6 @@ class IAlgorithmContext : public INoCopy
class IAlgorithm : public INoCopy
{
public:
- //!
- //! \brief Returns the format of an Algorithm input or output. Algorithm inputs are incrementally numbered first,
- //! followed by algorithm outputs.
- //! \param index Index of the input or output of the algorithm. Incremental numbers assigned to indices of inputs
- //! and the outputs.
- //!
- //! \return a reference to IAlgorithmIOInfo specified by index or the first algorithm if index is out of range.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by IAlgorithm::getAlgorithmIOInfoByIndex().
- //!
- TRT_DEPRECATED IAlgorithmIOInfo const& getAlgorithmIOInfo(int32_t index) const noexcept
- {
- return mImpl->getAlgorithmIOInfo(index);
- }
-
//!
//! \brief Returns the algorithm variant.
//!
@@ -8665,6 +7984,7 @@ class IAlgorithm : public INoCopy
//!
//! \brief Returns the format of an Algorithm input or output. Algorithm inputs are incrementally numbered first,
//! followed by algorithm outputs.
+ //!
//! \param index Index of the input or output of the algorithm. Incremental numbers assigned to indices of inputs
//! and the outputs.
//!
@@ -8680,17 +8000,18 @@ class IAlgorithm : public INoCopy
apiv::VAlgorithm* mImpl;
}; // IAlgorithm
-//!
-//! \class IAlgorithmSelector
-//!
-//! \brief Interface implemented by application for selecting and reporting algorithms of a layer provided by the
-//! builder.
-//! \note A layer in context of algorithm selection may be different from ILayer in INetworkDefiniton.
-//! For example, an algorithm might be implementing a conglomeration of multiple ILayers in INetworkDefinition.
-//!
-class IAlgorithmSelector
+namespace v_1_0
+{
+class IAlgorithmSelector : public IVersionedInterface
{
public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IAlgorithmSelector", 1, 0};
+ }
//!
//! \brief Select Algorithms for a layer from the given list of algorithm choices.
//!
@@ -8702,11 +8023,12 @@ class IAlgorithmSelector
//!
//! \note TensorRT uses its default algorithm selection to choose from the list provided.
//! If return value is 0, TensorRT's default algorithm selection is used unless
- //! BuilderFlag::kREJECT_EMPTY_ALGORITHMS (or the deprecated BuilderFlag::kSTRICT_TYPES) is set.
+ //! BuilderFlag::kREJECT_EMPTY_ALGORITHMS is set.
//! The list of choices is valid only for this specific algorithm context.
//!
virtual int32_t selectAlgorithms(IAlgorithmContext const& context, IAlgorithm const* const* choices,
int32_t nbChoices, int32_t* selection) noexcept = 0;
+
//!
//! \brief Called by TensorRT to report choices it made.
//!
@@ -8722,6 +8044,19 @@ class IAlgorithmSelector
virtual ~IAlgorithmSelector() noexcept = default;
};
+} // namespace v_1_0
+
+//!
+//! \class IAlgorithmSelector
+//!
+//! \brief Interface implemented by application for selecting and reporting algorithms of a layer provided by the
+//! builder.
+//! \note A layer in the context of algorithm selection may be different from ILayer in INetworkDefinition.
+//! For example, an algorithm might be implementing a conglomeration of multiple ILayers in INetworkDefinition.
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IAlgorithmSelector, not
+//! v_1_0::IAlgorithmSelector
+//!
+using IAlgorithmSelector = v_1_0::IAlgorithmSelector;
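// --- Editorial sketch (not part of the header): an IAlgorithmSelector that defers to TensorRT. ---
// A hedged outline; returning 0 from selectAlgorithms lets TensorRT use its default selection
// (unless BuilderFlag::kREJECT_EMPTY_ALGORITHMS is set, as noted above).
class DefaultSelector : public nvinfer1::IAlgorithmSelector
{
public:
    int32_t selectAlgorithms(nvinfer1::IAlgorithmContext const& context, nvinfer1::IAlgorithm const* const* choices,
        int32_t nbChoices, int32_t* selection) noexcept override
    {
        return 0; // no explicit choice; TensorRT falls back to its own algorithm selection
    }
    void reportAlgorithms(nvinfer1::IAlgorithmContext const* const* contexts,
        nvinfer1::IAlgorithm const* const* choices, int32_t nbAlgorithms) noexcept override
    {
        // A real selector could log the reported choices here; left empty in this sketch.
    }
};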
//!
//! \brief Represents one or more QuantizationFlag values using binary OR
@@ -8774,34 +8109,31 @@ using BuilderFlags = uint32_t;
//!
enum class BuilderFlag : int32_t
{
- kFP16 = 0, //!< Enable FP16 layer selection, with FP32 fallback.
- kINT8 = 1, //!< Enable Int8 layer selection, with FP32 fallback with FP16 fallback if kFP16 also specified.
- kDEBUG = 2, //!< Enable debugging of layers via synchronizing after every layer.
- kGPU_FALLBACK = 3, //!< Enable layers marked to execute on GPU if layer cannot execute on DLA.
+ //! Enable FP16 layer selection, with FP32 fallback.
+ kFP16 = 0,
- //! Legacy flag with effect similar to setting all of these three flags:
- //!
- //! * kPREFER_PRECISION_CONSTRAINTS
- //! * kDIRECT_IO
- //! * kREJECT_EMPTY_ALGORITHMS
- //!
- //! except that if the direct I/O requirement cannot be met and kDIRECT_IO was not explicitly set,
- //! instead of the build failing, the build falls back as if kDIRECT_IO was not set.
- //!
- //! \deprecated Deprecated in TensorRT 8.2.
- //!
- kSTRICT_TYPES TRT_DEPRECATED_ENUM = 4,
+ //! Enable Int8 layer selection, with FP32 fallback with FP16 fallback if kFP16 also specified.
+ kINT8 = 1,
- kREFIT = 5, //!< Enable building a refittable engine.
- kDISABLE_TIMING_CACHE = 6, //!< Disable reuse of timing information across identical layers.
+ //! Enable debugging of layers via synchronizing after every layer.
+ kDEBUG = 2,
+
+ //! Enable layers marked to execute on GPU if layer cannot execute on DLA.
+ kGPU_FALLBACK = 3,
+
+ //! Enable building a refittable engine.
+ kREFIT = 4,
+
+ //! Disable reuse of timing information across identical layers.
+ kDISABLE_TIMING_CACHE = 5,
//! Allow (but not require) computations on tensors of type DataType::kFLOAT to use TF32.
//! TF32 computes inner products by rounding the inputs to 10-bit mantissas before
//! multiplying, but accumulates the sum using 23-bit mantissas. Enabled by default.
- kTF32 = 7,
+ kTF32 = 6,
//! Allow the builder to examine weights and use optimized functions when weights have suitable sparsity.
- kSPARSE_WEIGHTS = 8,
+ kSPARSE_WEIGHTS = 7,
//! Change the allowed parameters in the EngineCapability::kSTANDARD flow to
//! match the restrictions that EngineCapability::kSAFETY check against for DeviceType::kGPU
@@ -8809,52 +8141,97 @@ enum class BuilderFlag : int32_t
//! is forced to true if EngineCapability::kSAFETY at build time if it is unset.
//!
//! This flag is only supported in NVIDIA Drive(R) products.
- kSAFETY_SCOPE = 9,
+ kSAFETY_SCOPE = 8,
//! Require that layers execute in specified precisions. Build fails otherwise.
- kOBEY_PRECISION_CONSTRAINTS = 10,
+ kOBEY_PRECISION_CONSTRAINTS = 9,
//! Prefer that layers execute in specified precisions.
//! Fall back (with warning) to another precision if build would otherwise fail.
- kPREFER_PRECISION_CONSTRAINTS = 11,
+ kPREFER_PRECISION_CONSTRAINTS = 10,
//! Require that no reformats be inserted between a layer and a network I/O tensor
//! for which ITensor::setAllowedFormats was called.
//! Build fails if a reformat is required for functional correctness.
- kDIRECT_IO = 12,
+ kDIRECT_IO = 11,
//! Fail if IAlgorithmSelector::selectAlgorithms returns an empty set of algorithms.
- kREJECT_EMPTY_ALGORITHMS = 13,
-
- //! Enable heuristic-based tactic selection for shorter engine generation time. The engine may not
- //! be as performant as when built with a profiling-based builder.
- //!
- //! This flag is only supported by NVIDIA Ampere and later GPUs.
- //! \deprecated Superseded by builder optimization level 2. Deprecated in TensorRT 8.6
- kENABLE_TACTIC_HEURISTIC = 14,
+ kREJECT_EMPTY_ALGORITHMS = 12,
//! Restrict to lean runtime operators to provide version forward compatibility
//! for the plan.
//!
- //! Using this flag with ICudaEngine::serialize() and BuilderFlag::kREFIT would result in error.
//! This flag is only supported by NVIDIA Volta and later GPUs.
//! This flag is not supported in NVIDIA Drive(R) products.
- //! This flag is not supported with implicit batch mode. Network must be created with
- //! NetworkDefinitionCreationFlag::kEXPLICIT_BATCH.
- kVERSION_COMPATIBLE = 15,
+ kVERSION_COMPATIBLE = 13,
//! Exclude lean runtime from the plan when version forward compatability is enabled.
//! By default, this flag is unset, so the lean runtime will be included in the plan.
//!
//! If BuilderFlag::kVERSION_COMPATIBLE is not set then the value of this flag will be ignored.
- //!
- //! This flag is not supported with implicit batch mode. Network must be created with
- //! NetworkDefinitionCreationFlag::kEXPLICIT_BATCH.
- kEXCLUDE_LEAN_RUNTIME = 16,
+ kEXCLUDE_LEAN_RUNTIME = 14,
//! Enable FP8 layer selection, with FP32 fallback.
- //! \warning kFP8 is not supported yet and will result in an error or undefined behavior.
- kFP8 = 17
+ //!
+ //! This flag is not supported with hardware-compatibility mode.
+ //!
+ //! \see HardwareCompatibilityLevel
+ kFP8 = 15,
+
+ //! Emit error when a tactic being timed is not present in the timing cache.
+ //! This flag has an effect only when IBuilderConfig has an associated ITimingCache.
+ kERROR_ON_TIMING_CACHE_MISS = 16,
+
+ //! Enable DataType::kBF16 layer selection, with FP32 fallback.
+ //! This flag is only supported by NVIDIA Ampere and later GPUs.
+ kBF16 = 17,
+
+ //! Disable caching of JIT-compilation results during engine build.
+ //! By default, JIT-compiled code will be serialized as part of the timing cache, which may significantly increase
+ //! the cache size. Setting this flag prevents the code from being serialized. This flag has an effect only when
+    //! BuilderFlag::kDISABLE_TIMING_CACHE is not set.
+ kDISABLE_COMPILATION_CACHE = 18,
+
+ //! Strip the refittable weights from the engine plan file.
+ kSTRIP_PLAN = 19,
+
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by kSTRIP_PLAN.
+ kWEIGHTLESS TRT_DEPRECATED_ENUM = kSTRIP_PLAN,
+
+ //! Create a refittable engine under the assumption that the refit weights will be identical to those provided at
+ //! build time. The resulting engine will have the same performance as a non-refittable one. All refittable weights
+ //! can be refitted through the refit API, but if the refit weights are not identical to the build-time weights,
+ //! behavior is undefined. When used alongside 'kSTRIP_PLAN', this flag will result in a small plan file for which
+ //! weights are later supplied via refitting. This enables use of a single set of weights with different inference
+ //! backends, or with TensorRT plans for multiple GPU architectures.
+ kREFIT_IDENTICAL = 20,
+
+ //!
+ //! \brief Enable weight streaming for the current engine.
+ //!
+ //! Weight streaming from the host enables execution of models that do not fit
+ //! in GPU memory by allowing TensorRT to intelligently stream network weights
+ //! from the CPU DRAM. Please see ICudaEngine::getMinimumWeightStreamingBudget
+ //! for the default memory budget when this flag is enabled.
+ //!
+ //! Enabling this feature changes the behavior of
+ //! IRuntime::deserializeCudaEngine to allocate the entire network’s weights
+ //! on the CPU DRAM instead of GPU memory. Then,
+ //! ICudaEngine::createExecutionContext will determine the optimal split of
+ //! weights between the CPU and GPU and place weights accordingly.
+ //!
+ //! Future TensorRT versions may enable this flag by default.
+ //!
+ //! \warning Enabling this flag may marginally increase build time.
+ //!
+ //! \warning Enabling this feature will significantly increase the latency of
+ //! ICudaEngine::createExecutionContext.
+ //!
+ //! \see IRuntime::deserializeCudaEngine,
+ //! ICudaEngine::getMinimumWeightStreamingBudget,
+ //! ICudaEngine::setWeightStreamingBudget
+ //!
+ kWEIGHT_STREAMING = 21,
};
//!
@@ -8865,7 +8242,7 @@ enum class BuilderFlag : int32_t
template <>
constexpr inline int32_t EnumMax() noexcept
{
- return 18;
+ return 22;
}
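// --- Editorial usage sketch (not part of the header): enabling a few builder flags. ---
// A hedged sketch; `config` is assumed to come from IBuilder::createBuilderConfig(), and the
// chosen flags are illustrative only.
inline void configureBuilderFlags(nvinfer1::IBuilderConfig& config)
{
    config.setFlag(nvinfer1::BuilderFlag::kFP16);       // allow FP16 kernels with FP32 fallback
    config.setFlag(nvinfer1::BuilderFlag::kSTRIP_PLAN); // strip refittable weights from the plan
    config.setFlag(nvinfer1::BuilderFlag::kREFIT_IDENTICAL);
}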
//!
@@ -8946,7 +8323,6 @@ enum class MemoryPoolType : int32_t
{
//!
//! kWORKSPACE is used by TensorRT to store intermediate buffers within an operation.
- //! This is equivalent to the deprecated IBuilderConfig::setMaxWorkspaceSize and overrides that value.
//! This defaults to max device memory. Set to a smaller value to restrict tactics that use over the
//! threshold en masse. For more targeted removal of tactics use the IAlgorithmSelector
//! interface.
@@ -8957,7 +8333,7 @@ enum class MemoryPoolType : int32_t
//! kDLA_MANAGED_SRAM is a fast software managed RAM used by DLA to communicate within a layer.
//! The size of this pool must be at least 4 KiB and must be a power of 2.
//! This defaults to 1 MiB.
- //! Orin has capacity of 1 MiB per core, and Xavier shares 4 MiB across all of its accelerator cores.
+ //! Orin has capacity of 1 MiB per core.
//!
kDLA_MANAGED_SRAM = 1,
@@ -8983,6 +8359,17 @@ enum class MemoryPoolType : int32_t
//! cudaGetDeviceProperties.embedded is true, and 100% otherwise.
//!
kTACTIC_DRAM = 4,
+
+ //!
+ //! kTACTIC_SHARED_MEMORY defines the maximum shared memory size utilized for executing
+ //! the backend CUDA kernel implementation. Adjust this value to restrict tactics that exceed
+ //! the specified threshold en masse. The default value is device max capability. This value must
+ //! be less than 1GiB.
+ //!
+ //! Updating this flag will override the shared memory limit set by \ref HardwareCompatibilityLevel,
+ //! which defaults to 48KiB.
+ //!
+ kTACTIC_SHARED_MEMORY = 5,
};
//!
@@ -8993,7 +8380,7 @@ enum class MemoryPoolType : int32_t
template <>
constexpr inline int32_t EnumMax() noexcept
{
- return 5;
+ return 6;
}
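// --- Editorial usage sketch (not part of the header): restricting memory pools. ---
// A hedged sketch; the 1 GiB workspace and 48 KiB shared-memory limits are illustrative values.
inline void limitMemoryPools(nvinfer1::IBuilderConfig& config)
{
    config.setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1ULL << 30);          // 1 GiB scratch
    config.setMemoryPoolLimit(nvinfer1::MemoryPoolType::kTACTIC_SHARED_MEMORY, 48 << 10); // 48 KiB
}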
//!
@@ -9006,40 +8393,12 @@ constexpr inline int32_t EnumMax() noexcept
//!
enum class PreviewFeature : int32_t
{
- //!
- //! Optimize runtime dimensions with TensorRT's DL Compiler.
- //! Potentially reduces run time and decreases device memory usage and engine size.
- //! Models most likely to benefit from enabling kFASTER_DYNAMIC_SHAPES_0805 are transformer-based models,
- //! and models containing dynamic control flows.
- //!
- //! The default value for this flag is on.
- //!
- //! \deprecated Turning it off is deprecated in TensorRT 8.6. The flag kFASTER_DYNAMIC_SHAPES_0805 will be removed in 9.0.
- //!
- kFASTER_DYNAMIC_SHAPES_0805 TRT_DEPRECATED_ENUM = 0,
-
- //!
- //! Disable usage of cuDNN/cuBLAS/cuBLASLt tactics in the TensorRT core library.
- //!
- //! When the flag is enabled, TensorRT core will not use these tactics even if they are specified in
- //! \ref IBuilderConfig::setTacticSources(), but cudnnContext and cublasContext handles will still be passed to
- //! plugins via IPluginV2Ext::attachToContext() if the appropriate tactic sources are set.
- //!
- //! This allows users to experiment with disabling external library tactics without having to modify their
- //! application's plugins to support nullptr handles.
- //!
- //! The default value for this flag is on.
- //!
- //! \see TacticSource
- //!
- kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 = 1,
-
//!
//! Allows optimization profiles to be shared across execution contexts.
- //! This flag defaults to false and will become the default behavior in TensorRT 9.0.
- //! At that point this flag will do nothing.
//!
- kPROFILE_SHARING_0806 = 2,
+    //! \deprecated Deprecated in TensorRT 10.0. The default value for this flag is on and cannot be changed.
+ //!
+ kPROFILE_SHARING_0806 TRT_DEPRECATED_ENUM = 0,
};
namespace impl
{
@@ -9051,13 +8410,20 @@ namespace impl
template <>
struct EnumMaxImpl
{
- static constexpr int32_t kVALUE = 3;
+ static constexpr int32_t kVALUE = 1;
};
} // namespace impl
-//! Describes requirements of compatibility with GPU architectures other than that of the GPU on which the engine was
-//! built. Levels except kNONE are only supported for engines built on NVIDIA Ampere and later GPUs.
-//! Note that compatibility with future hardware depends on CUDA forward compatibility support.
+//!
+//! \enum HardwareCompatibilityLevel
+//!
+//! \brief Describes requirements of compatibility with GPU architectures other than that of the GPU on which the engine was
+//! built.
+//!
+//! Levels except kNONE are only supported for engines built on NVIDIA Ampere and later GPUs.
+//!
+//! \warning Note that compatibility with future hardware depends on CUDA forward compatibility support.
+//!
enum class HardwareCompatibilityLevel : int32_t
{
//! Do not require hardware compatibility with GPU architectures other than that of the GPU on which the engine was
@@ -9085,48 +8451,105 @@ struct EnumMaxImpl
};
} // namespace impl
-//!
-//! \class IBuilderConfig
-//!
-//! \brief Holds properties for configuring a builder to produce an engine.
-//!
-//! \see BuilderFlags
-//!
-class IBuilderConfig : public INoCopy
+namespace v_1_0
+{
+class IProgressMonitor : public IVersionedInterface
{
public:
- virtual ~IBuilderConfig() noexcept = default;
+ IProgressMonitor() = default;
+ virtual ~IProgressMonitor() noexcept = default;
//!
- //! \brief Set the number of minimization iterations used when timing layers.
+ //! \brief Return version information associated with this interface. Applications must not override this method.
//!
- //! When timing layers, the builder minimizes over a set of average times for layer execution. This parameter
- //! controls the number of iterations used in minimization. The builder may sometimes run layers for more
- //! iterations to improve timing accuracy if this parameter is set to a small value and the runtime of the
- //! layer is short.
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IProgressMonitor", 1, 0};
+ }
+
//!
- //! \see getMinTimingIterations()
+ //! \brief Signal that a phase of the optimizer has started.
//!
- //! \deprecated Deprecated in TensorRT 8.4. Superseded by setAvgTimingIterations().
+ //! \param phaseName The name of this phase for tracking purposes.
+ //! \param parentPhase The parent phase that this phase belongs to, or nullptr if there is no parent.
+ //! \param nbSteps The number of steps that are involved in this phase.
//!
- TRT_DEPRECATED virtual void setMinTimingIterations(int32_t minTiming) noexcept
- {
- mImpl->setMinTimingIterations(minTiming);
- }
+ //! The phaseStart function signals to the application that the current phase is beginning, and that it has a
+    //! certain number of steps to perform. If \p parentPhase is nullptr, then the phaseStart is beginning an
+    //! independent phase, and if \p parentPhase is specified, then the current phase, specified by \p phaseName, is
+ //! within the scope of the parent phase. \p nbSteps will always be a positive number. The phaseStart function
+ //! implies that the first step is being executed. TensorRT will signal when each step is complete.
+ //!
+ //! Phase names are human readable English strings which are unique within a single phase hierarchy but which can be
+ //! reused once the previous instance has completed. Phase names and their hierarchies may change between versions
+ //! of TensorRT.
+ //!
+ //! \see phaseFinish
+ //!
+ virtual void phaseStart(char const* phaseName, char const* parentPhase, int32_t nbSteps) noexcept = 0;
//!
- //! \brief Query the number of minimization iterations.
+ //! \brief Signal that a step of an optimizer phase has finished.
//!
- //! By default the minimum number of iterations is 1.
+ //! \param phaseName The name of the innermost phase being executed.
+ //! \param step The step number that was completed.
//!
- //! \see setMinTimingIterations()
+ //! The stepComplete function signals to the application that TensorRT has finished the current \p step for the
+ //! phase \p phaseName, and will move on to the next step if there is one. The application can return false to have
+ //! TensorRT exit the build early. The step value increases on subsequent calls in the range [0, nbSteps).
//!
- //! \deprecated Deprecated in TensorRT 8.4. Superseded by getAvgTimingIterations().
+ //! \return true to continue to the next step or false to stop the build.
//!
- TRT_DEPRECATED virtual int32_t getMinTimingIterations() const noexcept
- {
- return mImpl->getMinTimingIterations();
- }
+ virtual bool stepComplete(char const* phaseName, int32_t step) noexcept = 0;
+
+ //!
+ //! \brief Signal that a phase of the optimizer has finished.
+ //!
+ //! \param phaseName The name of the phase that has finished.
+ //!
+ //! The phaseFinish function signals to the application that the phase is complete. This function may be called
+ //! before all steps in the range [0, nbSteps) have been reported to stepComplete. This scenario can be triggered by
+ //! error handling, internal optimizations, or when stepComplete returns false to request cancellation of the build.
+ //!
+ //! \see phaseStart
+ //!
+ virtual void phaseFinish(char const* phaseName) noexcept = 0;
+
+}; // class IProgressMonitor
+} // namespace v_1_0
+
+//!
+//! \class IProgressMonitor
+//!
+//! \brief Application-implemented progress reporting interface for TensorRT.
+//!
+//! The IProgressMonitor is a user-defined object that TensorRT uses to report back when an internal algorithm has
+//! started or finished a phase to help provide feedback on the progress of the optimizer.
+//!
+//! The IProgressMonitor will trigger its start function when a phase is entered and will trigger its finish function
+//! when that phase is exited. Each phase consists of one or more steps. When each step is completed, the stepComplete
+//! function is triggered. This allows an application using the builder to communicate progress relative to when
+//! the optimization is expected to complete.
+//!
+//! The implementation of IProgressMonitor must be thread-safe so that it can be called from multiple internal threads.
+//! The lifetime of the IProgressMonitor must exceed the lifetime of all TensorRT objects that use it.
+//!
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IProgressMonitor, not
+//! v_1_0::IProgressMonitor.
+//!
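+//! A minimal sketch (hypothetical application code; the class and variable names are illustrative) of an
+//! IProgressMonitor that prints progress to stdout and never cancels the build:
+//!
+//! \code
+//! // Assumes <cstdio> is included.
+//! class ConsolePrinter : public nvinfer1::IProgressMonitor
+//! {
+//! public:
+//!     void phaseStart(char const* phaseName, char const* parentPhase, int32_t nbSteps) noexcept override
+//!     {
+//!         std::printf("start %s (%d steps)\n", phaseName, nbSteps);
+//!     }
+//!     bool stepComplete(char const* phaseName, int32_t step) noexcept override
+//!     {
+//!         std::printf("  %s: step %d done\n", phaseName, step);
+//!         return true; // return false here to cancel the build early
+//!     }
+//!     void phaseFinish(char const* phaseName) noexcept override
+//!     {
+//!         std::printf("finish %s\n", phaseName);
+//!     }
+//! };
+//!
+//! ConsolePrinter monitor;
+//! config->setProgressMonitor(&monitor); // config is an IBuilderConfig*; keep monitor alive for the whole build
+//! \endcode
+//!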
+using IProgressMonitor = v_1_0::IProgressMonitor;
+
+//!
+//! \class IBuilderConfig
+//!
+//! \brief Holds properties for configuring a builder to produce an engine.
+//!
+//! \see BuilderFlags
+//!
+class IBuilderConfig : public INoCopy
+{
+public:
+ virtual ~IBuilderConfig() noexcept = default;
//!
//! \brief Set the number of averaging iterations used when timing layers.
@@ -9196,38 +8619,6 @@ class IBuilderConfig : public INoCopy
return mImpl->getInt8Calibrator();
}
- //!
- //! \brief Set the maximum workspace size.
- //!
- //! \param workspaceSize The maximum GPU temporary memory which the engine can use at execution time.
- //!
- //! \see getMaxWorkspaceSize()
- //!
- //! \deprecated Deprecated in TensorRT 8.3. Superseded by IBuilderConfig::setMemoryPoolLimit() with
- //! MemoryPoolType::kWORKSPACE.
- //!
- TRT_DEPRECATED void setMaxWorkspaceSize(std::size_t workspaceSize) noexcept
- {
- mImpl->setMaxWorkspaceSize(workspaceSize);
- }
-
- //!
- //! \brief Get the maximum workspace size.
- //!
- //! By default the workspace size is the size of total global memory in the device.
- //!
- //! \return The maximum workspace size.
- //!
- //! \see setMaxWorkspaceSize()
- //!
- //! \deprecated Deprecated in TensorRT 8.3. Superseded by IBuilderConfig::getMemoryPoolLimit() with
- //! MemoryPoolType::kWORKSPACE.
- //!
- TRT_DEPRECATED std::size_t getMaxWorkspaceSize() const noexcept
- {
- return mImpl->getMaxWorkspaceSize();
- }
-
//!
//! \brief Set the build mode flags to turn on builder options for this network.
//!
@@ -9295,12 +8686,13 @@ class IBuilderConfig : public INoCopy
//!
//! \brief Set the device that this layer must execute on.
+ //!
//! \param layer which layer to execute.
//! \param deviceType that this layer must execute on.
//! If DeviceType is not set or is reset, TensorRT will use the default DeviceType set in the builder.
//!
//! \note The device type for a layer must be compatible with the safety flow (if specified).
- //! For example a layer cannot be marked for DLA execution while the builder is configured for kSAFE_GPU.
+ //! For example, a layer cannot be marked for DLA execution while the builder is configured for kSAFETY.
//!
//! \see getDeviceType()
//!
@@ -9311,6 +8703,7 @@ class IBuilderConfig : public INoCopy
//!
//! \brief Get the device that this layer executes on.
+ //!
//! \return Returns DeviceType of the layer.
//!
DeviceType getDeviceType(ILayer const* layer) const noexcept
@@ -9320,7 +8713,9 @@ class IBuilderConfig : public INoCopy
//!
//! \brief whether the DeviceType has been explicitly set for this layer
+ //!
//! \return true if device type is not default
+ //!
//! \see setDeviceType() getDeviceType() resetDeviceType()
//!
bool isDeviceTypeSet(ILayer const* layer) const noexcept
@@ -9340,6 +8735,7 @@ class IBuilderConfig : public INoCopy
//!
//! \brief Checks if a layer can run on DLA.
+ //!
//! \return true if the layer can run on DLA, otherwise false.
//!
bool canRunOnDLA(ILayer const* layer) const noexcept
@@ -9349,6 +8745,7 @@ class IBuilderConfig : public INoCopy
//!
//! \brief Sets the DLA core used by the network. Defaults to -1.
+ //!
//! \param dlaCore The DLA core to execute the engine on, in the range [0,getNbDlaCores()).
//!
//! This function is used to specify which DLA core to use via indexing, if multiple DLA cores are available.
@@ -9364,6 +8761,7 @@ class IBuilderConfig : public INoCopy
//!
//! \brief Get the DLA core that the engine executes on.
+ //!
//! \return assigned DLA core or -1 for DLA not present or unset.
//!
int32_t getDLACore() const noexcept
@@ -9374,6 +8772,7 @@ class IBuilderConfig : public INoCopy
//!
//! \brief Sets the default DeviceType to be used by the builder. It ensures that all the layers that can run on
//! this device will run on it, unless setDeviceType is used to override the default DeviceType for a layer.
+ //!
//! \see getDefaultDeviceType()
//!
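+ //! A typical DLA configuration sketch (illustrative; assumes a valid IBuilderConfig* named config):
+ //!
+ //! \code
+ //! config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
+ //! config->setDLACore(0);                                  // run on the first DLA core
+ //! config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);  // fall back to GPU for unsupported layers
+ //! \endcode
+ //!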
void setDefaultDeviceType(DeviceType deviceType) noexcept
@@ -9401,20 +8800,6 @@ class IBuilderConfig : public INoCopy
mImpl->reset();
}
- //!
- //! \brief Delete this IBuilderConfig.
- //!
- //! De-allocates any internally allocated memory.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
-
//!
//! \brief Set the cuda stream that is used to profile this network.
//!
@@ -9447,6 +8832,7 @@ class IBuilderConfig : public INoCopy
//! a single optimization profile are not supported for refittable engines.
//!
//! \param profile The new optimization profile, which must satisfy profile->isValid() == true
+ //!
//! \return The index of the optimization profile (starting from 0) if the input is valid, or -1 if the input is
//! not valid.
//!
@@ -9518,6 +8904,7 @@ class IBuilderConfig : public INoCopy
//!
//! \param profile The new calibration profile, which must satisfy profile->isValid() == true or be nullptr.
//! MIN and MAX values will be overwritten by kOPT.
+ //!
//! \return True if the calibration profile was set correctly.
//!
bool setCalibrationProfile(IOptimizationProfile const* profile) noexcept
@@ -9783,6 +9170,19 @@ class IBuilderConfig : public INoCopy
//! which is currently 5. Setting it to greater than the maximum level results in behavior identical to the
//! maximum level.
//!
+ //! Below are descriptions of each builder optimization level:
+ //!
+ //! - Level 0: This enables the fastest compilation by disabling dynamic kernel generation and selecting the first
+ //! tactic that succeeds in execution. This will also not respect a timing cache.
+ //! - Level 1: Available tactics are sorted by heuristics, but only the top tactics are tested to select the best.
+ //! If a dynamic kernel is generated, its compile optimization is low.
+ //! - Level 2: Available tactics are sorted by heuristics, but only the fastest tactics are tested to select the
+ //! best.
+ //! - Level 3: Apply heuristics to see if a static precompiled kernel is applicable or if a new one has to be
+ //! compiled dynamically.
+ //! - Level 4: Always compiles a dynamic kernel.
+ //! - Level 5: Always compiles a dynamic kernel and compares it to static kernels.
+ //!
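+ //! For example (illustrative; assumes a valid IBuilderConfig* named config):
+ //!
+ //! \code
+ //! config->setBuilderOptimizationLevel(4); // spend more build time searching for faster kernels
+ //! \endcode
+ //!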
//! \param level The optimization level to set to. Must be non-negative.
//!
//! \see getBuilderOptimizationLevel
@@ -9804,6 +9204,7 @@ class IBuilderConfig : public INoCopy
return mImpl->getBuilderOptimizationLevel();
}
+ //!
//! \brief Set the hardware compatibility level.
//!
//! Hardware compatibility allows an engine to run on GPU
@@ -9908,38 +9309,65 @@ class IBuilderConfig : public INoCopy
return mImpl->getMaxAuxStreams();
}
+ //!
+ //! \brief Sets the progress monitor for building a network.
+ //!
+ //! \param monitor The progress monitor to assign to the IBuilderConfig.
+ //!
+ //! The progress monitor signals to the application when different phases of
+ //! the compiler are being executed. Setting to nullptr unsets the monitor so
+ //! that the application is not signaled.
+ //!
+ //! \see IBuilderConfig::getProgressMonitor
+ //!
+ void setProgressMonitor(IProgressMonitor* monitor) noexcept
+ {
+ return mImpl->setProgressMonitor(monitor);
+ }
+
+ //!
+ //! \return The progress monitor set by the application or nullptr.
+ //!
+ //! \see IBuilderConfig::setProgressMonitor
+ //!
+ IProgressMonitor* getProgressMonitor() const noexcept
+ {
+ return mImpl->getProgressMonitor();
+ }
+
protected:
apiv::VBuilderConfig* mImpl;
};
+//!
//! \brief Represents one or more NetworkDefinitionCreationFlag flags
//! using binary OR operations.
-//! e.g., 1U << NetworkDefinitionCreationFlag::kEXPLICIT_BATCH
+//! e.g., 1U << NetworkDefinitionCreationFlag::kSTRONGLY_TYPED
//!
//! \see IBuilder::createNetworkV2
//!
using NetworkDefinitionCreationFlags = uint32_t;
+//!
//! \enum NetworkDefinitionCreationFlag
//!
//! \brief List of immutable network properties expressed at network creation time.
//! NetworkDefinitionCreationFlag is used with createNetworkV2() to specify immutable properties of the network.
-//! Creating a network without NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag has been deprecated.
//!
//! \see IBuilder::createNetworkV2
//!
enum class NetworkDefinitionCreationFlag : int32_t
{
- //! Mark the network to be an explicit batch network.
- //! Dynamic shape support requires that the kEXPLICIT_BATCH flag is set.
- //! With dynamic shapes, any of the input dimensions can vary at run-time,
- //! and there are no implicit dimensions in the network specification.
- //! Varying dimensions are specified by using the wildcard dimension value -1.
- kEXPLICIT_BATCH = 0,
-
- //! Deprecated. This flag has no effect now, but is only kept for backward compatability.
+ //! Ignored because networks are always "explicit batch" in TensorRT 10.0.
//!
- kEXPLICIT_PRECISION TRT_DEPRECATED_ENUM = 1,
+ //! \deprecated Deprecated in TensorRT 10.0.
+ kEXPLICIT_BATCH TRT_DEPRECATED_ENUM = 0,
+
+ //! Mark the network to be strongly typed.
+ //! Every tensor in the network has a data type defined in the network following only type inference rules and the
+ //! inputs/operator annotations. Setting layer precision and layer output types is not allowed, and the network
+ //! output types will be inferred based on the input types and the type inference rules.
+ kSTRONGLY_TYPED = 1,
};
//!
@@ -9965,36 +9393,6 @@ class IBuilder : public INoCopy
public:
virtual ~IBuilder() noexcept = default;
- //!
- //! \brief Set the maximum batch size. This has no effect for networks created with explicit batch dimension mode.
- //!
- //! \param batchSize The maximum batch size which can be used at execution time, and also the batch size for which
- //! the engine will be optimized.
- //!
- //! \deprecated Deprecated in TensorRT 8.4.
- //!
- //! \see getMaxBatchSize()
- //!
- TRT_DEPRECATED void setMaxBatchSize(int32_t batchSize) noexcept
- {
- mImpl->setMaxBatchSize(batchSize);
- }
-
- //!
- //! \brief Get the maximum batch size.
- //!
- //! \return The maximum batch size.
- //!
- //! \deprecated Deprecated in TensorRT 8.4.
- //!
- //! \see setMaxBatchSize()
- //! \see getMaxDLABatchSize()
- //!
- TRT_DEPRECATED int32_t getMaxBatchSize() const noexcept
- {
- return mImpl->getMaxBatchSize();
- }
-
//!
//! \brief Determine whether the platform has fast native fp16.
//!
@@ -10011,18 +9409,6 @@ class IBuilder : public INoCopy
return mImpl->platformHasFastInt8();
}
- //!
- //! \brief Destroy this object.
- //!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
-
//!
//! \brief Get the maximum batch size DLA can support.
//! For any tensor, the total volume of index dimensions combined (dimensions other than CHW) with the requested
@@ -10045,6 +9431,7 @@ class IBuilder : public INoCopy
//!
//! \brief Set the GPU allocator.
+ //!
//! \param allocator Set the GPU allocator to be used by the builder. All GPU memory acquired will use this
//! allocator. If NULL is passed, the default allocator will be used.
//!
@@ -10070,30 +9457,19 @@ class IBuilder : public INoCopy
}
//!
- //! \brief Builds an engine for the given INetworkDefinition and given IBuilderConfig.
- //!
- //! It enables the builder to build multiple engines based on the same network definition, but with different
- //! builder configurations.
+ //! \brief Create a network definition object
//!
- //! \note This function will synchronize the cuda stream returned by \p config.getProfileStream() before returning.
+ //! Creates a network definition object with immutable properties specified using the flags parameter.
//!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by IBuilder::buildSerializedNetwork().
+ //! createNetworkV2 supports creating a network with properties from NetworkDefinitionCreationFlags.
//!
- TRT_DEPRECATED nvinfer1::ICudaEngine* buildEngineWithConfig(
- INetworkDefinition& network, IBuilderConfig& config) noexcept
- {
- return mImpl->buildEngineWithConfig(network, config);
- }
-
- //! \brief Create a network definition object
+ //! createNetworkV2 supports dynamic shapes and explicit batch dimensions by default.
//!
- //! Creates a network definition object with immutable properties specified using the flags parameter.
- //! CreateNetworkV2 supports dynamic shapes and explicit batch dimensions when used with
- //! NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag.
- //! Creating a network without NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag has been deprecated.
+ //! createNetworkV2 with NetworkDefinitionCreationFlag::kSTRONGLY_TYPED flag supports creating a strongly typed plan
+ //! where tensor data types are inferred from network input types and operator type specification.
//!
//! \param flags Bitset of NetworkDefinitionCreationFlags specifying network properties combined with bitwise OR.
- //! e.g., 1U << NetworkDefinitionCreationFlag::kEXPLICIT_BATCH
+ //! e.g., 1U << NetworkDefinitionCreationFlag::kSTRONGLY_TYPED
//!
//! \see INetworkDefinition, NetworkDefinitionCreationFlags
//!
@@ -10102,6 +9478,7 @@ class IBuilder : public INoCopy
return mImpl->createNetworkV2(flags);
}
+ //!
//! \brief Create a new optimization profile.
//!
//! If the network has any dynamic input tensors, the appropriate calls to setDimensions() must be made.
@@ -10127,7 +9504,7 @@ class IBuilder : public INoCopy
//! If an error recorder is not set, messages will be sent to the global log stream.
//!
//! \param recorder The error recorder to register with this interface.
- //
+ //!
//! \see getErrorRecorder()
//!
void setErrorRecorder(IErrorRecorder* recorder) noexcept
@@ -10202,8 +9579,6 @@ class IBuilder : public INoCopy
//!
//! \note This function will synchronize the cuda stream returned by \p config.getProfileStream() before returning.
//!
- //! This function is only supported in NVIDIA Drive(R) products.
- //!
bool isNetworkSupported(INetworkDefinition const& network, IBuilderConfig const& config) const noexcept
{
return mImpl->isNetworkSupported(network, config);
@@ -10221,7 +9596,9 @@ class IBuilder : public INoCopy
//!
//! \brief Set the maximum number of threads.
+ //!
//! \param maxThreads The maximum number of threads that can be used by the builder.
+ //!
//! \return True if successful, false otherwise.
//!
//! The default value is 1 and includes the current thread.
diff --git a/include/NvInferConsistency.h b/include/NvInferConsistency.h
index a70249b6..5096c3f4 100644
--- a/include/NvInferConsistency.h
+++ b/include/NvInferConsistency.h
@@ -44,7 +44,7 @@ class IConsistencyChecker
public:
//!
//! \brief Check that a blob that was input to the createConsistencyChecker method represents a valid engine.
- //
+ //!
//! \return true if the original blob encoded an engine that belongs to valid engine domain with
//! target capability EngineCapability::kSAFETY, false otherwise.
//!
diff --git a/include/NvInferImpl.h b/include/NvInferImpl.h
index 38617246..1c2dbff8 100644
--- a/include/NvInferImpl.h
+++ b/include/NvInferImpl.h
@@ -26,11 +26,40 @@
namespace nvinfer1
{
+namespace v_1_0
+{
+class IProgressMonitor;
+}
+using IProgressMonitor = v_1_0::IProgressMonitor;
+
+namespace v_1_0
+{
+class IAlgorithmSelector;
+}
+using IAlgorithmSelector = v_1_0::IAlgorithmSelector;
+
+namespace v_1_0
+{
+class IProfiler;
+}
+using IProfiler = v_1_0::IProfiler;
+
+namespace v_1_0
+{
+class IOutputAllocator;
+}
+using IOutputAllocator = v_1_0::IOutputAllocator;
+
+namespace v_1_0
+{
+class IDebugListener;
+}
+using IDebugListener = v_1_0::IDebugListener;
+
class IActivationLayer;
class IAlgorithm;
class IAlgorithmContext;
class IAlgorithmIOInfo;
-class IAlgorithmSelector;
class IAlgorithmVariant;
class IAssertionLayer;
class IBuilder;
@@ -48,7 +77,6 @@ class IElementWiseLayer;
class IEngineInspector;
class IExecutionContext;
class IFillLayer;
-class IFullyConnectedLayer;
class IGatherLayer;
class IGridSampleLayer;
class IHostMemory;
@@ -70,7 +98,6 @@ class INMSLayer;
class INonZeroLayer;
class IOneHotLayer;
class IOptimizationProfile;
-class IOutputAllocator;
class IPaddingLayer;
class IParametricReLULayer;
class IPlugin;
@@ -79,19 +106,27 @@ class IPluginFactory;
class IPluginLayer;
class IPluginRegistry;
class IPluginV2Layer;
+
+namespace v_1_0
+{
+class IPluginV3;
+} // namespace v_1_0
+using IPluginV3 = v_1_0::IPluginV3;
+
+class IPluginV3Layer;
class IPoolingLayer;
-class IProfiler;
class IQuantizeLayer;
class IRaggedSoftMaxLayer;
class IRecurrenceLayer;
class IReduceLayer;
+class IRefitter;
class IResizeLayer;
class IReverseSequenceLayer;
-class IRNNv2Layer;
class IRuntime;
class IScaleLayer;
class IScatterLayer;
class ISelectLayer;
+class ISerializationConfig;
class IShapeLayer;
class IShuffleLayer;
class ISliceLayer;
@@ -130,13 +165,10 @@ enum class ResizeCoordinateTransformation : int32_t;
enum class InterpolationMode : int32_t;
enum class ResizeRoundMode : int32_t;
enum class ResizeSelector : int32_t;
-enum class RNNDirection : int32_t;
-enum class RNNGateType : int32_t;
-enum class RNNInputMode : int32_t;
-enum class RNNOperation : int32_t;
enum class ScaleMode : int32_t;
enum class ScatterMode : int32_t;
enum class SampleMode : int32_t;
+enum class SerializationFlag : int32_t;
enum class TensorIOMode : int32_t;
enum class TensorLocation : int32_t;
enum class TopKOperation : int32_t;
@@ -145,6 +177,7 @@ enum class UnaryOperation : int32_t;
enum class WeightsRole : int32_t;
enum class PreviewFeature : int32_t;
enum class HardwareCompatibilityLevel : int32_t;
+enum class ExecutionContextAllocationStrategy : int32_t;
using TacticSources = uint32_t;
using TensorFormats = uint32_t;
@@ -152,8 +185,7 @@ using BuilderFlags = uint32_t;
using NetworkDefinitionCreationFlags = uint32_t;
using QuantizationFlags = uint32_t;
using TempfileControlFlags = uint32_t;
-using ResizeMode = InterpolationMode;
-using SliceMode = SampleMode;
+using SerializationFlags = uint32_t;
//!
//! \file NvInferImpl.h
@@ -184,23 +216,28 @@ class VDimensionExpr : public VRoot
{
public:
virtual bool isConstant() const = 0;
- virtual int32_t getConstantValue() const = 0;
+ virtual int64_t getConstantValue() const = 0;
+ virtual bool isSizeTensor() const = 0;
};
class VExprBuilder : public VRoot
{
public:
- virtual IDimensionExpr const* constant(int32_t value) = 0;
+ virtual IDimensionExpr const* constant(int64_t value) = 0;
virtual IDimensionExpr const* operation(
DimensionOperation op, IDimensionExpr const& first, IDimensionExpr const& second)
= 0;
+ virtual IDimensionExpr const* declareSizeTensor(
+ int32_t outputIndex, IDimensionExpr const& opt, IDimensionExpr const& upper)
+ = 0;
};
class VRuntime : public VRoot
{
public:
- virtual nvinfer1::ICudaEngine* deserializeCudaEngine(
- void const* blob, std::size_t size, IPluginFactory* pluginFactory) noexcept = 0;
+ virtual IRuntime* getPImpl() noexcept = 0;
+ virtual nvinfer1::ICudaEngine* deserializeCudaEngine(void const* blob, std::size_t size) noexcept = 0;
+ virtual nvinfer1::ICudaEngine* deserializeCudaEngine(IStreamReader& streamReader) noexcept = 0;
virtual void setDLACore(int32_t dlaCore) noexcept = 0;
virtual int32_t getDLACore() const noexcept = 0;
virtual int32_t getNbDLACores() const noexcept = 0;
@@ -214,7 +251,6 @@ class VRuntime : public VRoot
virtual char const* getTemporaryDirectory() const noexcept = 0;
virtual void setTempfileControlFlags(TempfileControlFlags) noexcept = 0;
virtual TempfileControlFlags getTempfileControlFlags() const noexcept = 0;
- virtual IRuntime* getPImpl() noexcept = 0;
virtual IPluginRegistry& getPluginRegistry() noexcept = 0;
virtual void setPluginRegistryParent(IPluginRegistry* parent) noexcept = 0;
virtual IRuntime* loadRuntime(char const* path) noexcept = 0;
@@ -225,6 +261,7 @@ class VRuntime : public VRoot
class VRefitter : public VRoot
{
public:
+ virtual IRefitter* getPImpl() noexcept = 0;
virtual bool setWeights(char const* layerName, WeightsRole role, const Weights weights) noexcept = 0;
virtual bool refitCudaEngine() noexcept = 0;
virtual int32_t getMissing(int32_t size, char const** layerNames, WeightsRole* roles) noexcept = 0;
@@ -241,12 +278,20 @@ class VRefitter : public VRoot
virtual ILogger* getLogger() const noexcept = 0;
virtual bool setMaxThreads(int32_t maxThreads) noexcept = 0;
virtual int32_t getMaxThreads() const noexcept = 0;
+ virtual bool setNamedWeightsWithLocation(char const* name, Weights weights, TensorLocation location) noexcept = 0;
+ virtual Weights getNamedWeights(char const* weightsName) const noexcept = 0;
+ virtual TensorLocation getWeightsLocation(char const* weightsName) const noexcept = 0;
+ virtual bool unsetNamedWeights(char const* weightsName) noexcept = 0;
+ virtual void setWeightsValidation(bool weightsValidation) noexcept = 0;
+ virtual bool getWeightsValidation() const noexcept = 0;
+ virtual bool refitCudaEngineAsync(cudaStream_t stream) noexcept = 0;
+ virtual Weights getWeightsPrototype(char const* weightsName) const noexcept = 0;
};
class VOptimizationProfile : public VRoot
{
public:
- virtual bool setDimensions(char const* inputName, OptProfileSelector select, Dims dims) noexcept = 0;
+ virtual bool setDimensions(char const* inputName, OptProfileSelector select, Dims const& dims) noexcept = 0;
virtual Dims getDimensions(char const* inputName, OptProfileSelector select) const noexcept = 0;
virtual bool setShapeValues(
char const* inputName, OptProfileSelector select, int32_t const* values, int32_t nbValues) noexcept = 0;
@@ -260,33 +305,17 @@ class VOptimizationProfile : public VRoot
class VCudaEngine : public VRoot
{
public:
- virtual int32_t getNbBindings() const noexcept = 0;
- virtual int32_t getBindingIndex(char const* name) const noexcept = 0;
- virtual char const* getBindingName(int32_t bindingIndex) const noexcept = 0;
- virtual bool bindingIsInput(int32_t bindingIndex) const noexcept = 0;
- virtual Dims getBindingDimensions(int32_t bindingIndex) const noexcept = 0;
- virtual DataType getBindingDataType(int32_t bindingIndex) const noexcept = 0;
- virtual int32_t getMaxBatchSize() const noexcept = 0;
+ virtual ICudaEngine* getPImpl() noexcept = 0;
virtual int32_t getNbLayers() const noexcept = 0;
virtual IHostMemory* serialize() const noexcept = 0;
- virtual IExecutionContext* createExecutionContext() noexcept = 0;
- virtual TensorLocation getLocation(int32_t bindingIndex) const noexcept = 0;
+ virtual IExecutionContext* createExecutionContext(ExecutionContextAllocationStrategy strategy) noexcept = 0;
virtual IExecutionContext* createExecutionContextWithoutDeviceMemory() noexcept = 0;
virtual size_t getDeviceMemorySize() const noexcept = 0;
virtual bool isRefittable() const noexcept = 0;
- virtual int32_t getBindingBytesPerComponent(int32_t bindingIndex) const noexcept = 0;
- virtual int32_t getBindingComponentsPerElement(int32_t bindingIndex) const noexcept = 0;
- virtual TensorFormat getBindingFormat(int32_t bindingIndex) const noexcept = 0;
- virtual char const* getBindingFormatDesc(int32_t bindingIndex) const noexcept = 0;
- virtual int32_t getBindingVectorizedDim(int32_t bindingIndex) const noexcept = 0;
virtual char const* getName() const noexcept = 0;
virtual int32_t getNbOptimizationProfiles() const noexcept = 0;
- virtual Dims getProfileDimensions(
- int32_t bindingIndex, int32_t profileIndex, OptProfileSelector select) const noexcept = 0;
- virtual int32_t const* getProfileShapeValues(
- int32_t profileIndex, int32_t inputIndex, OptProfileSelector select) const noexcept = 0;
- virtual bool isShapeBinding(int32_t bindingIndex) const noexcept = 0;
- virtual bool isExecutionBinding(int32_t bindingIndex) const noexcept = 0;
+ virtual int32_t const* getProfileTensorValues(
+ char const* tensorName, int32_t profileIndex, OptProfileSelector select) const noexcept = 0;
virtual EngineCapability getEngineCapability() const noexcept = 0;
virtual void setErrorRecorder(IErrorRecorder* recorder) noexcept = 0;
virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
@@ -309,7 +338,6 @@ class VCudaEngine : public VRoot
virtual int32_t getNbIOTensors() const noexcept = 0;
virtual char const* getIOTensorName(int32_t index) const noexcept = 0;
virtual HardwareCompatibilityLevel getHardwareCompatibilityLevel() const noexcept = 0;
- virtual ICudaEngine* getPImpl() noexcept = 0;
virtual int32_t getNbAuxStreams() const noexcept = 0;
virtual int32_t getTensorBytesPerComponentV2(char const* tensorName, int32_t profileIndex) const noexcept = 0;
@@ -317,15 +345,25 @@ class VCudaEngine : public VRoot
virtual TensorFormat getTensorFormatV2(char const* tensorName, int32_t profileIndex) const noexcept = 0;
virtual char const* getTensorFormatDescV2(char const* tensorName, int32_t profileIndex) const noexcept = 0;
virtual int32_t getTensorVectorizedDimV2(char const* tensorName, int32_t profileIndex) const noexcept = 0;
+
+ virtual ISerializationConfig* createSerializationConfig() noexcept = 0;
+ virtual IHostMemory* serializeWithConfig(ISerializationConfig& config) const noexcept = 0;
+
+ virtual size_t getDeviceMemorySizeForProfile(int32_t profileIndex) const noexcept = 0;
+ virtual IRefitter* createRefitter(ILogger& logger) noexcept = 0;
+
+ virtual bool setWeightStreamingBudget(int64_t gpuMemoryBudget) noexcept = 0;
+ virtual int64_t getWeightStreamingBudget() const noexcept = 0;
+ virtual int64_t getMinimumWeightStreamingBudget() const noexcept = 0;
+ virtual int64_t getStreamableWeightsSize() const noexcept = 0;
+
+ virtual bool isDebugTensor(char const* name) const noexcept = 0;
};
class VExecutionContext : public VRoot
{
public:
- virtual bool execute(int32_t batchSize, void* const* bindings) noexcept = 0;
- virtual bool enqueue(
- int32_t batchSize, void* const* bindings, cudaStream_t stream, cudaEvent_t* inputConsumed) noexcept
- = 0;
+ virtual IExecutionContext* getPImpl() noexcept = 0;
virtual void setDebugSync(bool sync) noexcept = 0;
virtual bool getDebugSync() const noexcept = 0;
virtual void setProfiler(IProfiler*) noexcept = 0;
@@ -334,19 +372,12 @@ class VExecutionContext : public VRoot
virtual void setName(char const* name) noexcept = 0;
virtual char const* getName() const noexcept = 0;
virtual void setDeviceMemory(void* memory) noexcept = 0;
- virtual Dims getStrides(int32_t bindingIndex) const noexcept = 0;
- virtual bool setOptimizationProfile(int32_t profileIndex) noexcept = 0;
virtual int32_t getOptimizationProfile() const noexcept = 0;
- virtual bool setBindingDimensions(int32_t bindingIndex, Dims dimensions) noexcept = 0;
- virtual Dims getBindingDimensions(int32_t bindingIndex) const noexcept = 0;
- virtual bool setInputShapeBinding(int32_t bindingIndex, int32_t const* data) noexcept = 0;
- virtual bool getShapeBinding(int32_t bindingIndex, int32_t* data) const noexcept = 0;
virtual bool allInputDimensionsSpecified() const noexcept = 0;
virtual bool allInputShapesSpecified() const noexcept = 0;
virtual void setErrorRecorder(IErrorRecorder* recorder) noexcept = 0;
virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
virtual bool executeV2(void* const* bindings) noexcept = 0;
- virtual bool enqueueV2(void* const* bindings, cudaStream_t stream, cudaEvent_t* inputConsumed) noexcept = 0;
virtual bool setOptimizationProfileAsync(int32_t profileIndex, cudaStream_t stream) noexcept = 0;
virtual void setEnqueueEmitsProfile(bool enqueueEmitsProfile) noexcept = 0;
virtual bool getEnqueueEmitsProfile() const noexcept = 0;
@@ -357,6 +388,7 @@ class VExecutionContext : public VRoot
virtual bool setTensorAddress(char const* tensorName, void* data) noexcept = 0;
virtual void const* getTensorAddress(char const* tensorName) const noexcept = 0;
virtual bool setInputTensorAddress(char const* tensorName, void const* data) noexcept = 0;
+ virtual bool setOutputTensorAddress(char const* tensorName, void* data) noexcept = 0;
virtual int32_t inferShapes(int32_t nbMaxNames, char const** tensorNames) noexcept = 0;
virtual bool setInputConsumedEvent(cudaEvent_t event) noexcept = 0;
virtual cudaEvent_t getInputConsumedEvent() const noexcept = 0;
@@ -371,20 +403,25 @@ class VExecutionContext : public VRoot
virtual size_t getPersistentCacheLimit() const noexcept = 0;
virtual bool setNvtxVerbosity(ProfilingVerbosity verbosity) noexcept = 0;
virtual ProfilingVerbosity getNvtxVerbosity() const noexcept = 0;
- virtual IExecutionContext* getPImpl() noexcept = 0;
virtual void setAuxStreams(cudaStream_t* auxStreams, int32_t nbStreams) noexcept = 0;
+ virtual bool setDebugListener(IDebugListener* listener) noexcept = 0;
+ virtual IDebugListener* getDebugListener() noexcept = 0;
+ virtual bool setTensorDebugState(char const* name, bool flag) noexcept = 0;
+ virtual bool getDebugState(char const* name) const noexcept = 0;
+ virtual bool setAllTensorsDebugState(bool flag) noexcept = 0;
+ virtual size_t updateDeviceMemorySizeForShapes() noexcept = 0;
};
class VEngineInspector : public VRoot
{
public:
+ virtual IEngineInspector* getPImpl() noexcept = 0;
virtual bool setExecutionContext(IExecutionContext const* context) noexcept = 0;
virtual IExecutionContext const* getExecutionContext() const noexcept = 0;
virtual char const* getLayerInformation(int32_t layerIndex, LayerInformationFormat format) const noexcept = 0;
virtual char const* getEngineInformation(LayerInformationFormat format) const noexcept = 0;
virtual void setErrorRecorder(IErrorRecorder* recorder) noexcept = 0;
virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
- virtual IEngineInspector* getPImpl() noexcept = 0;
};
class VTensor : public VRoot
@@ -392,7 +429,7 @@ class VTensor : public VRoot
public:
virtual void setName(char const* name) noexcept = 0;
virtual char const* getName() const noexcept = 0;
- virtual void setDimensions(Dims dimensions) noexcept = 0;
+ virtual void setDimensions(Dims const& dimensions) noexcept = 0;
virtual Dims getDimensions() const noexcept = 0;
virtual void setType(DataType type) noexcept = 0;
virtual DataType getType() const noexcept = 0;
@@ -440,49 +477,30 @@ class VLayer : public VRoot
class VConvolutionLayer : public VRoot
{
public:
- virtual void setKernelSize(DimsHW kernelSize) noexcept = 0;
- virtual DimsHW getKernelSize() const noexcept = 0;
- virtual void setNbOutputMaps(int32_t nbOutputMaps) noexcept = 0;
- virtual int32_t getNbOutputMaps() const noexcept = 0;
- virtual void setStride(DimsHW stride) noexcept = 0;
- virtual DimsHW getStride() const noexcept = 0;
- virtual void setPadding(DimsHW padding) noexcept = 0;
- virtual DimsHW getPadding() const noexcept = 0;
- virtual void setNbGroups(int32_t nbGroups) noexcept = 0;
- virtual int32_t getNbGroups() const noexcept = 0;
+ virtual void setNbOutputMaps(int64_t nbOutputMaps) noexcept = 0;
+ virtual int64_t getNbOutputMaps() const noexcept = 0;
+ virtual void setNbGroups(int64_t nbGroups) noexcept = 0;
+ virtual int64_t getNbGroups() const noexcept = 0;
virtual void setKernelWeights(Weights weights) noexcept = 0;
virtual Weights getKernelWeights() const noexcept = 0;
virtual void setBiasWeights(Weights weights) noexcept = 0;
virtual Weights getBiasWeights() const noexcept = 0;
- virtual void setDilation(DimsHW dilation) noexcept = 0;
- virtual DimsHW getDilation() const noexcept = 0;
- virtual void setPrePadding(Dims padding) noexcept = 0;
+ virtual void setPrePadding(Dims const& padding) noexcept = 0;
virtual Dims getPrePadding() const noexcept = 0;
- virtual void setPostPadding(Dims padding) noexcept = 0;
+ virtual void setPostPadding(Dims const& padding) noexcept = 0;
virtual Dims getPostPadding() const noexcept = 0;
virtual void setPaddingMode(PaddingMode paddingMode) noexcept = 0;
virtual PaddingMode getPaddingMode() const noexcept = 0;
- virtual void setKernelSizeNd(Dims kernelSize) noexcept = 0;
+ virtual void setKernelSizeNd(Dims const& kernelSize) noexcept = 0;
virtual Dims getKernelSizeNd() const noexcept = 0;
- virtual void setStrideNd(Dims stride) noexcept = 0;
+ virtual void setStrideNd(Dims const& stride) noexcept = 0;
virtual Dims getStrideNd() const noexcept = 0;
- virtual void setPaddingNd(Dims padding) noexcept = 0;
+ virtual void setPaddingNd(Dims const& padding) noexcept = 0;
virtual Dims getPaddingNd() const noexcept = 0;
- virtual void setDilationNd(Dims dilation) noexcept = 0;
+ virtual void setDilationNd(Dims const& dilation) noexcept = 0;
virtual Dims getDilationNd() const noexcept = 0;
};
-class VFullyConnectedLayer : public VRoot
-{
-public:
- virtual void setNbOutputChannels(int32_t nbOutputs) noexcept = 0;
- virtual int32_t getNbOutputChannels() const noexcept = 0;
- virtual void setKernelWeights(Weights weights) noexcept = 0;
- virtual Weights getKernelWeights() const noexcept = 0;
- virtual void setBiasWeights(Weights weights) noexcept = 0;
- virtual Weights getBiasWeights() const noexcept = 0;
-};
-
class VActivationLayer : public VRoot
{
public:
@@ -499,35 +517,29 @@ class VPoolingLayer : public VRoot
public:
virtual void setPoolingType(PoolingType type) noexcept = 0;
virtual PoolingType getPoolingType() const noexcept = 0;
- virtual void setWindowSize(DimsHW windowSize) noexcept = 0;
- virtual DimsHW getWindowSize() const noexcept = 0;
- virtual void setStride(DimsHW stride) noexcept = 0;
- virtual DimsHW getStride() const noexcept = 0;
- virtual void setPadding(DimsHW padding) noexcept = 0;
- virtual DimsHW getPadding() const noexcept = 0;
virtual void setBlendFactor(float blendFactor) noexcept = 0;
virtual float getBlendFactor() const noexcept = 0;
virtual void setAverageCountExcludesPadding(bool exclusive) noexcept = 0;
virtual bool getAverageCountExcludesPadding() const noexcept = 0;
- virtual void setPrePadding(Dims padding) noexcept = 0;
+ virtual void setPrePadding(Dims const& padding) noexcept = 0;
virtual Dims getPrePadding() const noexcept = 0;
- virtual void setPostPadding(Dims padding) noexcept = 0;
+ virtual void setPostPadding(Dims const& padding) noexcept = 0;
virtual Dims getPostPadding() const noexcept = 0;
virtual void setPaddingMode(PaddingMode paddingMode) noexcept = 0;
virtual PaddingMode getPaddingMode() const noexcept = 0;
- virtual void setWindowSizeNd(Dims windowSize) noexcept = 0;
+ virtual void setWindowSizeNd(Dims const& windowSize) noexcept = 0;
virtual Dims getWindowSizeNd() const noexcept = 0;
- virtual void setStrideNd(Dims stride) noexcept = 0;
+ virtual void setStrideNd(Dims const& stride) noexcept = 0;
virtual Dims getStrideNd() const noexcept = 0;
- virtual void setPaddingNd(Dims padding) noexcept = 0;
+ virtual void setPaddingNd(Dims const& padding) noexcept = 0;
virtual Dims getPaddingNd() const noexcept = 0;
};
class VLRNLayer : public VRoot
{
public:
- virtual void setWindowSize(int32_t windowSize) noexcept = 0;
- virtual int32_t getWindowSize() const noexcept = 0;
+ virtual void setWindowSize(int64_t windowSize) noexcept = 0;
+ virtual int64_t getWindowSize() const noexcept = 0;
virtual void setAlpha(float alpha) noexcept = 0;
virtual float getAlpha() const noexcept = 0;
virtual void setBeta(float beta) noexcept = 0;
@@ -568,33 +580,27 @@ class VConcatenationLayer : public VRoot
class VDeconvolutionLayer : public VRoot
{
public:
- virtual void setKernelSize(DimsHW kernelSize) noexcept = 0;
- virtual DimsHW getKernelSize() const noexcept = 0;
- virtual void setNbOutputMaps(int32_t nbOutputMaps) noexcept = 0;
- virtual int32_t getNbOutputMaps() const noexcept = 0;
- virtual void setStride(DimsHW stride) noexcept = 0;
- virtual DimsHW getStride() const noexcept = 0;
- virtual void setPadding(DimsHW padding) noexcept = 0;
- virtual DimsHW getPadding() const noexcept = 0;
- virtual void setNbGroups(int32_t nbGroups) noexcept = 0;
- virtual int32_t getNbGroups() const noexcept = 0;
+ virtual void setNbOutputMaps(int64_t nbOutputMaps) noexcept = 0;
+ virtual int64_t getNbOutputMaps() const noexcept = 0;
+ virtual void setNbGroups(int64_t nbGroups) noexcept = 0;
+ virtual int64_t getNbGroups() const noexcept = 0;
virtual void setKernelWeights(Weights weights) noexcept = 0;
virtual Weights getKernelWeights() const noexcept = 0;
virtual void setBiasWeights(Weights weights) noexcept = 0;
virtual Weights getBiasWeights() const noexcept = 0;
- virtual void setPrePadding(Dims padding) noexcept = 0;
+ virtual void setPrePadding(Dims const& padding) noexcept = 0;
virtual Dims getPrePadding() const noexcept = 0;
- virtual void setPostPadding(Dims padding) noexcept = 0;
+ virtual void setPostPadding(Dims const& padding) noexcept = 0;
virtual Dims getPostPadding() const noexcept = 0;
virtual void setPaddingMode(PaddingMode paddingMode) noexcept = 0;
virtual PaddingMode getPaddingMode() const noexcept = 0;
- virtual void setKernelSizeNd(Dims kernelSize) noexcept = 0;
+ virtual void setKernelSizeNd(Dims const& kernelSize) noexcept = 0;
virtual Dims getKernelSizeNd() const noexcept = 0;
- virtual void setStrideNd(Dims stride) noexcept = 0;
+ virtual void setStrideNd(Dims const& stride) noexcept = 0;
virtual Dims getStrideNd() const noexcept = 0;
- virtual void setPaddingNd(Dims padding) noexcept = 0;
+ virtual void setPaddingNd(Dims const& padding) noexcept = 0;
virtual Dims getPaddingNd() const noexcept = 0;
- virtual void setDilationNd(Dims dilation) noexcept = 0;
+ virtual void setDilationNd(Dims const& dilation) noexcept = 0;
virtual Dims getDilationNd() const noexcept = 0;
};
@@ -616,31 +622,6 @@ class VGatherLayer : public VRoot
virtual GatherMode getMode() const noexcept = 0;
};
-class VRNNv2Layer : public VRoot
-{
-public:
- virtual int32_t getLayerCount() const noexcept = 0;
- virtual int32_t getHiddenSize() const noexcept = 0;
- virtual int32_t getMaxSeqLength() const noexcept = 0;
- virtual int32_t getDataLength() const noexcept = 0;
- virtual void setSequenceLengths(ITensor& seqLengths) noexcept = 0;
- virtual ITensor* getSequenceLengths() const noexcept = 0;
- virtual void setOperation(RNNOperation op) noexcept = 0;
- virtual RNNOperation getOperation() const noexcept = 0;
- virtual void setInputMode(RNNInputMode op) noexcept = 0;
- virtual RNNInputMode getInputMode() const noexcept = 0;
- virtual void setDirection(RNNDirection op) noexcept = 0;
- virtual RNNDirection getDirection() const noexcept = 0;
- virtual void setWeightsForGate(int32_t layerIndex, RNNGateType gate, bool isW, Weights weights) noexcept = 0;
- virtual Weights getWeightsForGate(int32_t layerIndex, RNNGateType gate, bool isW) const noexcept = 0;
- virtual void setBiasForGate(int32_t layerIndex, RNNGateType gate, bool isW, Weights bias) noexcept = 0;
- virtual Weights getBiasForGate(int32_t layerIndex, RNNGateType gate, bool isW) const noexcept = 0;
- virtual void setHiddenState(ITensor& hidden) noexcept = 0;
- virtual ITensor* getHiddenState() const noexcept = 0;
- virtual void setCellState(ITensor& cell) noexcept = 0;
- virtual ITensor* getCellState() const noexcept = 0;
-};
-
class VPluginLayer : public VRoot
{
public:
@@ -653,6 +634,12 @@ class VPluginV2Layer : public VRoot
virtual IPluginV2& getPlugin() noexcept = 0;
};
+class VPluginV3Layer : public VRoot
+{
+public:
+ virtual IPluginV3& getPlugin() noexcept = 0;
+};
+
class VUnaryLayer : public VRoot
{
public:
@@ -674,13 +661,9 @@ class VReduceLayer : public VRoot
class VPaddingLayer : public VRoot
{
public:
- virtual void setPrePadding(DimsHW padding) noexcept = 0;
- virtual DimsHW getPrePadding() const noexcept = 0;
- virtual void setPostPadding(DimsHW padding) noexcept = 0;
- virtual DimsHW getPostPadding() const noexcept = 0;
- virtual void setPrePaddingNd(Dims padding) noexcept = 0;
+ virtual void setPrePaddingNd(Dims const& padding) noexcept = 0;
virtual Dims getPrePaddingNd() const noexcept = 0;
- virtual void setPostPaddingNd(Dims padding) noexcept = 0;
+ virtual void setPostPaddingNd(Dims const& padding) noexcept = 0;
virtual Dims getPostPaddingNd() const noexcept = 0;
};
@@ -689,7 +672,7 @@ class VShuffleLayer : public VRoot
public:
virtual void setFirstTranspose(Permutation const& permutation) noexcept = 0;
virtual Permutation const& getFirstTranspose() const noexcept = 0;
- virtual void setReshapeDimensions(Dims dimensions) noexcept = 0;
+ virtual void setReshapeDimensions(Dims const& dimensions) noexcept = 0;
virtual Dims getReshapeDimensions() const noexcept = 0;
virtual void setSecondTranspose(Permutation const& permutation) noexcept = 0;
virtual Permutation const& getSecondTranspose() const noexcept = 0;
@@ -700,14 +683,14 @@ class VShuffleLayer : public VRoot
class VSliceLayer : public VRoot
{
public:
- virtual void setStart(Dims start) noexcept = 0;
+ virtual void setStart(Dims const& start) noexcept = 0;
virtual Dims getStart() const noexcept = 0;
- virtual void setSize(Dims size) noexcept = 0;
+ virtual void setSize(Dims const& size) noexcept = 0;
virtual Dims getSize() const noexcept = 0;
- virtual void setStride(Dims stride) noexcept = 0;
+ virtual void setStride(Dims const& stride) noexcept = 0;
virtual Dims getStride() const noexcept = 0;
- virtual void setMode(SliceMode mode) noexcept = 0;
- virtual SliceMode getMode() const noexcept = 0;
+ virtual void setMode(SampleMode mode) noexcept = 0;
+ virtual SampleMode getMode() const noexcept = 0;
};
class VShapeLayer : public VRoot
@@ -760,7 +743,7 @@ class VConstantLayer : public VRoot
public:
virtual void setWeights(Weights weights) noexcept = 0;
virtual Weights getWeights() const noexcept = 0;
- virtual void setDimensions(Dims dimensions) noexcept = 0;
+ virtual void setDimensions(Dims const& dimensions) noexcept = 0;
virtual Dims getDimensions() const noexcept = 0;
};
@@ -772,14 +755,12 @@ class VParametricReLULayer : public VRoot
class VResizeLayer : public VRoot
{
public:
- virtual void setOutputDimensions(Dims dimensions) noexcept = 0;
+ virtual void setOutputDimensions(Dims const& dimensions) noexcept = 0;
virtual Dims getOutputDimensions() const noexcept = 0;
virtual void setScales(float const* scales, int32_t nbScales) noexcept = 0;
virtual int32_t getScales(int32_t size, float* scales) const noexcept = 0;
- virtual void setResizeMode(ResizeMode resizeMode) noexcept = 0;
- virtual ResizeMode getResizeMode() const noexcept = 0;
- virtual void setAlignCorners(bool alignCorners) noexcept = 0;
- virtual bool getAlignCorners() const noexcept = 0;
+ virtual void setResizeMode(InterpolationMode interpolationMode) noexcept = 0;
+ virtual InterpolationMode getResizeMode() const noexcept = 0;
virtual void setCoordinateTransformation(ResizeCoordinateTransformation coordTransform) noexcept = 0;
virtual ResizeCoordinateTransformation getCoordinateTransformation() const noexcept = 0;
virtual void setSelectorForSinglePixel(ResizeSelector selector) noexcept = 0;
@@ -881,7 +862,7 @@ class VAssertionLayer : public VRoot
class VFillLayer : public VRoot
{
public:
- virtual void setDimensions(Dims dimensions) noexcept = 0;
+ virtual void setDimensions(Dims const& dimensions) noexcept = 0;
virtual Dims getDimensions() const noexcept = 0;
virtual void setOperation(FillOperation op) noexcept = 0;
virtual FillOperation getOperation() const noexcept = 0;
@@ -889,6 +870,13 @@ class VFillLayer : public VRoot
virtual double getAlpha() const noexcept = 0;
virtual void setBeta(double beta) noexcept = 0;
virtual double getBeta() const noexcept = 0;
+ virtual void setAlphaInt64(int64_t alpha) noexcept = 0;
+ virtual int64_t getAlphaInt64() const noexcept = 0;
+ virtual void setBetaInt64(int64_t beta) noexcept = 0;
+ virtual int64_t getBetaInt64() const noexcept = 0;
+ virtual bool isAlphaBetaInt64() const noexcept = 0;
+ virtual DataType getToType() const noexcept = 0;
+ virtual void setToType(DataType toType) noexcept = 0;
};
class VQuantizeLayer : public VRoot
@@ -896,6 +884,8 @@ class VQuantizeLayer : public VRoot
public:
virtual int32_t getAxis() const noexcept = 0;
virtual void setAxis(int32_t axis) noexcept = 0;
+ virtual DataType getToType() const noexcept = 0;
+ virtual void setToType(DataType toType) noexcept = 0;
};
class VDequantizeLayer : public VRoot
@@ -903,6 +893,8 @@ class VDequantizeLayer : public VRoot
public:
virtual int32_t getAxis() const noexcept = 0;
virtual void setAxis(int32_t axis) noexcept = 0;
+ virtual DataType getToType() const noexcept = 0;
+ virtual void setToType(DataType toType) noexcept = 0;
};
class VScatterLayer : public VRoot
@@ -965,8 +957,8 @@ class VNormalizationLayer : public VRoot
virtual float getEpsilon() const noexcept = 0;
virtual void setAxes(uint32_t axesMask) noexcept = 0;
virtual uint32_t getAxes() const noexcept = 0;
- virtual void setNbGroups(int32_t nbGroups) noexcept = 0;
- virtual int32_t getNbGroups() const noexcept = 0;
+ virtual void setNbGroups(int64_t nbGroups) noexcept = 0;
+ virtual int64_t getNbGroups() const noexcept = 0;
virtual void setComputePrecision(DataType type) noexcept = 0;
virtual DataType getComputePrecision() const noexcept = 0;
}; // class VNormalizationLayer
@@ -974,26 +966,16 @@ class VNormalizationLayer : public VRoot
class VNetworkDefinition : public VRoot
{
public:
- virtual ITensor* addInput(char const* name, DataType type, Dims dimensions) noexcept = 0;
+ virtual ITensor* addInput(char const* name, DataType type, Dims const& dimensions) noexcept = 0;
virtual void markOutput(ITensor& tensor) noexcept = 0;
- virtual IConvolutionLayer* addConvolution(ITensor& input, int32_t nbOutputMaps, DimsHW kernelSize,
- Weights kernelWeights, Weights biasWeights) noexcept = 0;
- virtual IFullyConnectedLayer* addFullyConnected(
- ITensor& input, int32_t nbOutputs, Weights kernelWeights, Weights biasWeights) noexcept
- = 0;
virtual IActivationLayer* addActivation(ITensor& input, ActivationType type) noexcept = 0;
- virtual IPoolingLayer* addPooling(ITensor& input, PoolingType type, DimsHW windowSize) noexcept = 0;
- virtual ILRNLayer* addLRN(ITensor& input, int32_t window, float alpha, float beta, float k) noexcept = 0;
- virtual IScaleLayer* addScale(ITensor& input, ScaleMode mode, Weights shift, Weights scale, Weights power) noexcept
- = 0;
+ virtual ILRNLayer* addLRN(ITensor& input, int64_t window, float alpha, float beta, float k) noexcept = 0;
+ virtual IScaleLayer* addScale(
+ ITensor& input, ScaleMode mode, Weights shift, Weights scale, Weights power) noexcept = 0;
virtual ISoftMaxLayer* addSoftMax(ITensor& input) noexcept = 0;
virtual IConcatenationLayer* addConcatenation(ITensor* const* inputs, int32_t nbInputs) noexcept = 0;
- virtual IDeconvolutionLayer* addDeconvolution(
- ITensor& input, int32_t nbOutputMaps, DimsHW kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
- = 0;
virtual IElementWiseLayer* addElementWise(ITensor& input1, ITensor& input2, ElementWiseOperation op) noexcept = 0;
virtual IUnaryLayer* addUnary(ITensor& input, UnaryOperation operation) noexcept = 0;
- virtual IPaddingLayer* addPadding(ITensor& input, DimsHW prePadding, DimsHW postPadding) noexcept = 0;
virtual IShuffleLayer* addShuffle(ITensor& input) noexcept = 0;
virtual int32_t getNbLayers() const noexcept = 0;
virtual ILayer* getLayer(int32_t index) const noexcept = 0;
@@ -1008,16 +990,15 @@ class VNetworkDefinition : public VRoot
virtual IGatherLayer* addGather(ITensor& data, ITensor& indices, int32_t axis) noexcept = 0;
virtual IRaggedSoftMaxLayer* addRaggedSoftMax(ITensor& input, ITensor& bounds) noexcept = 0;
virtual IMatrixMultiplyLayer* addMatrixMultiply(
- ITensor& input0, MatrixOperation op0, ITensor& input1, MatrixOperation op1) noexcept
- = 0;
- virtual IConstantLayer* addConstant(Dims dimensions, Weights weights) noexcept = 0;
- virtual IRNNv2Layer* addRNNv2(
- ITensor& input, int32_t layerCount, int32_t hiddenSize, int32_t maxSeqLen, RNNOperation op) noexcept = 0;
+ ITensor& input0, MatrixOperation op0, ITensor& input1, MatrixOperation op1) noexcept = 0;
+ virtual IConstantLayer* addConstant(Dims const& dimensions, Weights weights) noexcept = 0;
virtual IIdentityLayer* addIdentity(ITensor& input) noexcept = 0;
virtual void removeTensor(ITensor& tensor) noexcept = 0;
virtual void unmarkOutput(ITensor& tensor) noexcept = 0;
virtual IPluginV2Layer* addPluginV2(ITensor* const* inputs, int32_t nbInputs, IPluginV2& plugin) noexcept = 0;
- virtual ISliceLayer* addSlice(ITensor& input, Dims start, Dims size, Dims stride) noexcept = 0;
+ virtual IPluginV3Layer* addPluginV3(ITensor* const* inputs, int32_t nbInputs, ITensor* const* shapeInputs,
+ int32_t nbShapeInputs, IPluginV3& plugin) noexcept = 0;
+ virtual ISliceLayer* addSlice(ITensor& input, Dims const& start, Dims const& size, Dims const& stride) noexcept = 0;
virtual void setName(char const* name) noexcept = 0;
virtual char const* getName() const noexcept = 0;
virtual IShapeLayer* addShape(ITensor& input) noexcept = 0;
@@ -1026,21 +1007,19 @@ class VNetworkDefinition : public VRoot
virtual bool unmarkOutputForShapes(ITensor& tensor) noexcept = 0;
virtual IParametricReLULayer* addParametricReLU(ITensor& input, ITensor& slope) noexcept = 0;
virtual IConvolutionLayer* addConvolutionNd(
- ITensor& input, int32_t nbOutputMaps, Dims kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
+ ITensor& input, int64_t nbOutputMaps, Dims const& kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
= 0;
- virtual IPoolingLayer* addPoolingNd(ITensor& input, PoolingType type, Dims windowSize) noexcept = 0;
+ virtual IPoolingLayer* addPoolingNd(ITensor& input, PoolingType type, Dims const& windowSize) noexcept = 0;
virtual IDeconvolutionLayer* addDeconvolutionNd(
- ITensor& input, int32_t nbOutputMaps, Dims kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
+ ITensor& input, int64_t nbOutputMaps, Dims const& kernelSize, Weights kernelWeights, Weights biasWeights) noexcept
= 0;
virtual IScaleLayer* addScaleNd(
- ITensor& input, ScaleMode mode, Weights shift, Weights scale, Weights power, int32_t channelAxis) noexcept
- = 0;
+ ITensor& input, ScaleMode mode, Weights shift, Weights scale, Weights power, int32_t channelAxis) noexcept = 0;
virtual IResizeLayer* addResize(ITensor& input) noexcept = 0;
- virtual bool hasExplicitPrecision() const noexcept = 0;
virtual ILoop* addLoop() noexcept = 0;
virtual ISelectLayer* addSelect(ITensor& condition, ITensor& thenInput, ITensor& elseInput) noexcept = 0;
- virtual IFillLayer* addFill(Dims dimensions, FillOperation op) noexcept = 0;
- virtual IPaddingLayer* addPaddingNd(ITensor& input, Dims prePadding, Dims postPadding) noexcept = 0;
+ virtual IFillLayer* addFill(Dims const& dimensions, FillOperation op) noexcept = 0;
+ virtual IPaddingLayer* addPaddingNd(ITensor& input, Dims const& prePadding, Dims const& postPadding) noexcept = 0;
virtual bool setWeightsName(Weights weights, char const* name) noexcept = 0;
virtual void setErrorRecorder(IErrorRecorder* recorder) noexcept = 0;
virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
@@ -1060,12 +1039,19 @@ class VNetworkDefinition : public VRoot
ITensor& input, ITensor& scale, ITensor& bias, uint32_t axesMask) noexcept = 0;
virtual ICastLayer* addCast(ITensor& input, DataType toType) noexcept = 0;
virtual IBuilder& getBuilder() const noexcept = 0;
+ virtual NetworkDefinitionCreationFlags getFlags() const noexcept = 0;
+ virtual bool getFlag(NetworkDefinitionCreationFlag networkDefinitionCreationFlag) const noexcept = 0;
+ virtual IQuantizeLayer* addQuantizeV2(ITensor& input, ITensor& scale, DataType outputType) noexcept = 0;
+ virtual IDequantizeLayer* addDequantizeV2(ITensor& input, ITensor& scale, DataType outputType) noexcept = 0;
+ virtual IFillLayer* addFillV2(Dims const& dimensions, FillOperation op, DataType outputType) noexcept = 0;
+ virtual bool markDebug(ITensor& tensor) noexcept = 0;
+ virtual bool unmarkDebug(ITensor& tensor) noexcept = 0;
+ virtual bool isDebugTensor(nvinfer1::ITensor const& tensor) const noexcept = 0;
};
class VAlgorithmIOInfo : public VRoot
{
public:
- virtual TensorFormat getTensorFormat() const noexcept = 0;
virtual DataType getDataType() const noexcept = 0;
virtual Dims getStrides() const noexcept = 0;
virtual int64_t getVectorizedDim() const noexcept = 0;
@@ -1091,7 +1077,6 @@ class VAlgorithmContext : public VRoot
class VAlgorithm : public VRoot
{
public:
- virtual IAlgorithmIOInfo const& getAlgorithmIOInfo(int32_t index) const noexcept = 0;
virtual IAlgorithmVariant const& getAlgorithmVariant() const noexcept = 0;
virtual float getTimingMSec() const noexcept = 0;
virtual std::size_t getWorkspaceSize() const noexcept = 0;
@@ -1109,16 +1094,12 @@ class VTimingCache : public VRoot
class VBuilderConfig : public VRoot
{
public:
- virtual void setMinTimingIterations(int32_t minTiming) noexcept = 0;
- virtual int32_t getMinTimingIterations() const noexcept = 0;
virtual void setAvgTimingIterations(int32_t avgTiming) noexcept = 0;
virtual int32_t getAvgTimingIterations() const noexcept = 0;
virtual void setEngineCapability(EngineCapability capability) noexcept = 0;
virtual EngineCapability getEngineCapability() const noexcept = 0;
virtual void setInt8Calibrator(IInt8Calibrator* calibrator) noexcept = 0;
virtual IInt8Calibrator* getInt8Calibrator() const noexcept = 0;
- virtual void setMaxWorkspaceSize(std::size_t workspaceSize) noexcept = 0;
- virtual std::size_t getMaxWorkspaceSize() const noexcept = 0;
virtual void setFlags(BuilderFlags builderFlags) noexcept = 0;
virtual BuilderFlags getFlags() const noexcept = 0;
virtual void clearFlag(BuilderFlag builderFlag) noexcept = 0;
@@ -1167,21 +1148,29 @@ class VBuilderConfig : public VRoot
virtual int32_t getNbPluginsToSerialize() const noexcept = 0;
virtual void setMaxAuxStreams(int32_t nbStreams) noexcept = 0;
virtual int32_t getMaxAuxStreams() const noexcept = 0;
+ virtual void setProgressMonitor(IProgressMonitor* monitor) noexcept = 0;
+ virtual IProgressMonitor* getProgressMonitor() const noexcept = 0;
+};
+
+class VSerializationConfig : public VRoot
+{
+public:
+ virtual bool setFlags(SerializationFlags serializationFlags) noexcept = 0;
+ virtual SerializationFlags getFlags() const noexcept = 0;
+ virtual bool clearFlag(SerializationFlag serializationFlag) noexcept = 0;
+ virtual bool setFlag(SerializationFlag serializationFlag) noexcept = 0;
+ virtual bool getFlag(SerializationFlag serializationFlag) const noexcept = 0;
};
class VBuilder : public VRoot
{
public:
- virtual void setMaxBatchSize(int32_t batchSize) noexcept = 0;
- virtual int32_t getMaxBatchSize() const noexcept = 0;
virtual bool platformHasFastFp16() const noexcept = 0;
virtual bool platformHasFastInt8() const noexcept = 0;
virtual int32_t getMaxDLABatchSize() const noexcept = 0;
virtual int32_t getNbDLACores() const noexcept = 0;
virtual void setGpuAllocator(IGpuAllocator* allocator) noexcept = 0;
virtual nvinfer1::IBuilderConfig* createBuilderConfig() noexcept = 0;
- virtual nvinfer1::ICudaEngine* buildEngineWithConfig(INetworkDefinition& network, IBuilderConfig& config) noexcept
- = 0;
virtual nvinfer1::INetworkDefinition* createNetworkV2(NetworkDefinitionCreationFlags flags) noexcept = 0;
virtual nvinfer1::IOptimizationProfile* createOptimizationProfile() noexcept = 0;
virtual void setErrorRecorder(IErrorRecorder* recorder) noexcept = 0;
diff --git a/include/NvInferLegacyDims.h b/include/NvInferLegacyDims.h
index 9c757043..204d17a8 100644
--- a/include/NvInferLegacyDims.h
+++ b/include/NvInferLegacyDims.h
@@ -36,6 +36,7 @@ namespace nvinfer1
{
//!
//! \class Dims2
+//!
//! \brief Descriptor for two-dimensional data.
//!
class Dims2 : public Dims
@@ -55,12 +56,12 @@ class Dims2 : public Dims
//! \param d0 The first element.
//! \param d1 The second element.
//!
- Dims2(int32_t d0, int32_t d1)
+ Dims2(int64_t d0, int64_t d1)
{
nbDims = 2;
d[0] = d0;
d[1] = d1;
- for (int32_t i{nbDims}; i < Dims::MAX_DIMS; ++i)
+ for (int64_t i{nbDims}; i < Dims::MAX_DIMS; ++i)
{
d[i] = 0;
}
@@ -69,6 +70,7 @@ class Dims2 : public Dims
//!
//! \class DimsHW
+//!
//! \brief Descriptor for two-dimensional spatial data.
//!
class DimsHW : public Dims2
@@ -88,7 +90,7 @@ class DimsHW : public Dims2
//! \param height the height of the data
//! \param width the width of the data
//!
- DimsHW(int32_t height, int32_t width)
+ DimsHW(int64_t height, int64_t width)
: Dims2(height, width)
{
}
@@ -98,7 +100,7 @@ class DimsHW : public Dims2
//!
//! \return The height.
//!
- int32_t& h()
+ int64_t& h()
{
return d[0];
}
@@ -108,7 +110,7 @@ class DimsHW : public Dims2
//!
//! \return The height.
//!
- int32_t h() const
+ int64_t h() const
{
return d[0];
}
@@ -118,7 +120,7 @@ class DimsHW : public Dims2
//!
//! \return The width.
//!
- int32_t& w()
+ int64_t& w()
{
return d[1];
}
@@ -128,7 +130,7 @@ class DimsHW : public Dims2
//!
//! \return The width.
//!
- int32_t w() const
+ int64_t w() const
{
return d[1];
}
@@ -136,6 +138,7 @@ class DimsHW : public Dims2
//!
//! \class Dims3
+//!
//! \brief Descriptor for three-dimensional data.
//!
class Dims3 : public Dims2
@@ -156,7 +159,7 @@ class Dims3 : public Dims2
//! \param d1 The second element.
//! \param d2 The third element.
//!
- Dims3(int32_t d0, int32_t d1, int32_t d2)
+ Dims3(int64_t d0, int64_t d1, int64_t d2)
: Dims2(d0, d1)
{
nbDims = 3;
@@ -166,6 +169,7 @@ class Dims3 : public Dims2
//!
//! \class Dims4
+//!
//! \brief Descriptor for four-dimensional data.
//!
class Dims4 : public Dims3
@@ -187,7 +191,7 @@ class Dims4 : public Dims3
//! \param d2 The third element.
//! \param d3 The fourth element.
//!
- Dims4(int32_t d0, int32_t d1, int32_t d2, int32_t d3)
+ Dims4(int64_t d0, int64_t d1, int64_t d2, int64_t d3)
: Dims3(d0, d1, d2)
{
nbDims = 4;
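
The hunks above widen the `Dims2`/`DimsHW`/`Dims3`/`Dims4` constructors and accessors from `int32_t` to `int64_t`. A minimal usage sketch, assuming only that the TensorRT 10 headers are on the include path; the values are illustrative:

```cpp
#include "NvInferLegacyDims.h"

#include <cstdint>
#include <iostream>

int main()
{
    // Dimensions are now 64-bit, so values above INT32_MAX are representable.
    nvinfer1::DimsHW hw{int64_t{1} << 32, 1024};
    nvinfer1::Dims4 nchw{1, 3, hw.h(), hw.w()};

    std::cout << "h=" << hw.h() << " w=" << hw.w() << " nbDims=" << nchw.nbDims << "\n";
    return 0;
}
```
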
diff --git a/include/NvInferPluginUtils.h b/include/NvInferPluginUtils.h
index c501f8e5..bfc924e5 100644
--- a/include/NvInferPluginUtils.h
+++ b/include/NvInferPluginUtils.h
@@ -33,142 +33,118 @@ namespace plugin
{
//!
-//! \brief The Permute plugin layer permutes the input tensor by changing the memory order of the data.
-//! Quadruple defines a structure that contains an array of 4 integers. They can represent the permute orders or the
-//! strides in each dimension.
-//!
-typedef struct
-{
- int32_t data[4];
-} Quadruple;
-
+//! \struct PriorBoxParameters
//!
//! \brief The PriorBox plugin layer generates the prior boxes of designated sizes and aspect ratios across all
-//! dimensions (H x W). PriorBoxParameters defines a set of parameters for creating the PriorBox plugin layer. It
-//! contains:
-//! \param minSize Minimum box size in pixels. Can not be nullptr.
-//! \param maxSize Maximum box size in pixels. Can be nullptr.
-//! \param aspectRatios Aspect ratios of the boxes. Can be nullptr.
-//! \param numMinSize Number of elements in minSize. Must be larger than 0.
-//! \param numMaxSize Number of elements in maxSize. Can be 0 or same as numMinSize.
-//! \param numAspectRatios Number of elements in aspectRatios. Can be 0.
-//! \param flip If true, will flip each aspect ratio. For example, if there is an aspect ratio "r", the aspect ratio
-//! "1.0/r" will be generated as well.
-//! \param clip If true, will clip the prior so that it is within [0,1].
-//! \param variance Variance for adjusting the prior boxes.
-//! \param imgH Image height. If 0, then the H dimension of the data tensor will be used.
-//! \param imgW Image width. If 0, then the W dimension of the data tensor will be used.
-//! \param stepH Step in H. If 0, then (float)imgH/h will be used where h is the H dimension of the 1st input tensor.
-//! \param stepW Step in W. If 0, then (float)imgW/w will be used where w is the W dimension of the 1st input tensor.
-//! \param offset Offset to the top left corner of each cell.
+//! dimensions (H x W).
+//!
+//! PriorBoxParameters defines a set of parameters for creating the PriorBox plugin layer.
//!
struct PriorBoxParameters
{
- float *minSize, *maxSize, *aspectRatios;
- int32_t numMinSize, numMaxSize, numAspectRatios;
- bool flip;
- bool clip;
- float variance[4];
- int32_t imgH, imgW;
- float stepH, stepW;
- float offset;
+ float *minSize; //!< Minimum box size in pixels. Can not be nullptr.
+ float *maxSize; //!< Maximum box size in pixels. Can be nullptr.
+ float *aspectRatios; //!< Aspect ratios of the boxes. Can be nullptr.
+ int32_t numMinSize; //!< Number of elements in minSize. Must be larger than 0.
+ int32_t numMaxSize; //!< Number of elements in maxSize. Can be 0 or same as numMinSize.
+ int32_t numAspectRatios; //!< Number of elements in aspectRatios. Can be 0.
+ bool flip; //!< If true, will flip each aspect ratio. For example,
+ //!< if there is an aspect ratio "r", the aspect ratio "1.0/r" will be generated as well.
+ bool clip; //!< If true, will clip the prior so that it is within [0,1].
+ float variance[4]; //!< Variance for adjusting the prior boxes.
+ int32_t imgH; //!< Image height. If 0, then the H dimension of the data tensor will be used.
+ int32_t imgW; //!< Image width. If 0, then the W dimension of the data tensor will be used.
+ float stepH; //!< Step in H. If 0, then (float)imgH/h will be used where h is the H dimension of the 1st input tensor.
+ float stepW; //!< Step in W. If 0, then (float)imgW/w will be used where w is the W dimension of the 1st input tensor.
+ float offset; //!< Offset to the top left corner of each cell.
};
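
For orientation, a hedged sketch of populating `PriorBoxParameters` according to the field comments above. The helper function and all values are made up, and creating the actual plugin through the plugin registry is not shown:

```cpp
#include "NvInferPluginUtils.h"

// Illustrative helper (not part of TensorRT): builds a PriorBoxParameters with
// one minimum size, no maximum sizes, and two aspect ratios.
nvinfer1::plugin::PriorBoxParameters makePriorBoxParams(float* minSizes, float* aspectRatios)
{
    nvinfer1::plugin::PriorBoxParameters p{};
    p.minSize = minSizes;          // must not be nullptr
    p.maxSize = nullptr;           // optional
    p.aspectRatios = aspectRatios; // optional
    p.numMinSize = 1;
    p.numMaxSize = 0;
    p.numAspectRatios = 2;
    p.flip = true;                 // also generate 1/r for each aspect ratio r
    p.clip = true;                 // clip priors to [0,1]
    p.variance[0] = 0.1F;
    p.variance[1] = 0.1F;
    p.variance[2] = 0.2F;
    p.variance[3] = 0.2F;
    p.imgH = 0;                    // 0: take H from the data tensor
    p.imgW = 0;                    // 0: take W from the data tensor
    p.stepH = 0.0F;                // 0: derive as imgH / feature-map H
    p.stepW = 0.0F;
    p.offset = 0.5F;
    return p;
}
```
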
+//!
+//! \struct RPROIParams
//!
//! \brief RPROIParams is used to create the RPROIPlugin instance.
-//! It contains:
-//! \param poolingH Height of the output in pixels after ROI pooling on feature map.
-//! \param poolingW Width of the output in pixels after ROI pooling on feature map.
-//! \param featureStride Feature stride; ratio of input image size to feature map size. Assuming that max pooling layers
-//! in the neural network use square filters.
-//! \param preNmsTop Number of proposals to keep before applying NMS.
-//! \param nmsMaxOut Number of remaining proposals after applying NMS.
-//! \param anchorsRatioCount Number of anchor box ratios.
-//! \param anchorsScaleCount Number of anchor box scales.
-//! \param iouThreshold IoU (Intersection over Union) threshold used for the NMS step.
-//! \param minBoxSize Minimum allowed bounding box size before scaling, used for anchor box calculation.
-//! \param spatialScale Spatial scale between the input image and the last feature map.
//!
struct RPROIParams
{
- int32_t poolingH;
- int32_t poolingW;
- int32_t featureStride;
- int32_t preNmsTop;
- int32_t nmsMaxOut;
- int32_t anchorsRatioCount;
- int32_t anchorsScaleCount;
- float iouThreshold;
- float minBoxSize;
- float spatialScale;
+ int32_t poolingH; //!< Height of the output in pixels after ROI pooling on feature map.
+ int32_t poolingW; //!< Width of the output in pixels after ROI pooling on feature map.
+ int32_t featureStride; //!< Feature stride; ratio of input image size to feature map size.
+ //!< Assuming that max pooling layers in the neural network use square filters.
+ int32_t preNmsTop; //!< Number of proposals to keep before applying NMS.
+ int32_t nmsMaxOut; //!< Number of remaining proposals after applying NMS.
+ int32_t anchorsRatioCount; //!< Number of anchor box ratios.
+ int32_t anchorsScaleCount; //!< Number of anchor box scales.
+ float iouThreshold; //!< IoU (Intersection over Union) threshold used for the NMS step.
+ float minBoxSize; //!< Minimum allowed bounding box size before scaling, used for anchor box calculation.
+ float spatialScale; //!< Spatial scale between the input image and the last feature map.
};
-
+//!
+//! \struct GridAnchorParameters
//!
//! \brief The Anchor Generator plugin layer generates the prior boxes of designated sizes and aspect ratios across all dimensions (H x W).
//! GridAnchorParameters defines a set of parameters for creating the plugin layer for all feature maps.
-//! It contains:
-//! \param minScale Scale of anchors corresponding to finest resolution.
-//! \param maxScale Scale of anchors corresponding to coarsest resolution.
-//! \param aspectRatios List of aspect ratios to place on each grid point.
-//! \param numAspectRatios Number of elements in aspectRatios.
-//! \param H Height of feature map to generate anchors for.
-//! \param W Width of feature map to generate anchors for.
-//! \param variance Variance for adjusting the prior boxes.
//!
struct GridAnchorParameters
{
- float minSize, maxSize;
- float* aspectRatios;
- int32_t numAspectRatios, H, W;
- float variance[4];
+ float minSize; //!< Scale of anchors corresponding to finest resolution.
+ float maxSize; //!< Scale of anchors corresponding to coarsest resolution.
+ float* aspectRatios; //!< List of aspect ratios to place on each grid point.
+ int32_t numAspectRatios; //!< Number of elements in aspectRatios.
+ int32_t H; //!< Height of feature map to generate anchors for.
+ int32_t W; //!< Width of feature map to generate anchors for.
+ float variance[4]; //!< Variance for adjusting the prior boxes.
};
//!
//! \enum CodeTypeSSD
+//!
//! \brief The type of encoding used for decoding the bounding boxes and loc_data.
//!
+//! \deprecated Deprecated in TensorRT 10.0. DetectionOutput plugin is deprecated.
+//!
enum class CodeTypeSSD : int32_t
{
- CORNER = 0, //!< Use box corners.
- CENTER_SIZE = 1, //!< Use box centers and size.
- CORNER_SIZE = 2, //!< Use box centers and size.
- TF_CENTER = 3 //!< Use box centers and size but flip x and y coordinates.
+ CORNER TRT_DEPRECATED_ENUM = 0, //!< Use box corners.
+ CENTER_SIZE TRT_DEPRECATED_ENUM = 1, //!< Use box centers and size.
+ CORNER_SIZE TRT_DEPRECATED_ENUM = 2, //!< Use box centers and size.
+ TF_CENTER TRT_DEPRECATED_ENUM = 3 //!< Use box centers and size but flip x and y coordinates.
};
//!
-//! \brief The DetectionOutput plugin layer generates the detection output based on location and confidence predictions by doing non maximum suppression.
-//! This plugin first decodes the bounding boxes based on the anchors generated. It then performs non_max_suppression on the decoded bounding boxes.
+//! \struct DetectionOutputParameters
+//!
+//! \brief The DetectionOutput plugin layer generates the detection output
+//! based on location and confidence predictions by doing non maximum suppression.
+//!
+//! This plugin first decodes the bounding boxes based on the anchors generated.
+//! It then performs non_max_suppression on the decoded bounding boxes.
//! DetectionOutputParameters defines a set of parameters for creating the DetectionOutput plugin layer.
-//! It contains:
-//! \param shareLocation If true, bounding box are shared among different classes.
-//! \param varianceEncodedInTarget If true, variance is encoded in target. Otherwise we need to adjust the predicted offset accordingly.
-//! \param backgroundLabelId Background label ID. If there is no background class, set it as -1.
-//! \param numClasses Number of classes to be predicted.
-//! \param topK Number of boxes per image with top confidence scores that are fed into the NMS algorithm.
-//! \param keepTopK Number of total bounding boxes to be kept per image after NMS step.
-//! \param confidenceThreshold Only consider detections whose confidences are larger than a threshold.
-//! \param nmsThreshold Threshold to be used in NMS.
-//! \param codeType Type of coding method for bbox.
-//! \param inputOrder Specifies the order of inputs {loc_data, conf_data, priorbox_data}.
-//! \param confSigmoid Set to true to calculate sigmoid of confidence scores.
-//! \param isNormalized Set to true if bounding box data is normalized by the network.
-//! \param isBatchAgnostic Defaults to true. Set to false if prior boxes are unique per batch
-//!
-struct DetectionOutputParameters
+//!
+//! \deprecated Deprecated in TensorRT 10.0. DetectionOutput plugin is deprecated.
+//!
+struct TRT_DEPRECATED DetectionOutputParameters
{
- bool shareLocation, varianceEncodedInTarget;
- int32_t backgroundLabelId, numClasses, topK, keepTopK;
- float confidenceThreshold, nmsThreshold;
- CodeTypeSSD codeType;
- int32_t inputOrder[3];
- bool confSigmoid;
- bool isNormalized;
- bool isBatchAgnostic{true};
+    bool shareLocation;           //!< If true, bounding boxes are shared among different classes.
+ bool varianceEncodedInTarget; //!< If true, variance is encoded in target.
+ //!< Otherwise we need to adjust the predicted offset accordingly.
+ int32_t backgroundLabelId; //!< Background label ID. If there is no background class, set it as -1.
+ int32_t numClasses; //!< Number of classes to be predicted.
+ int32_t topK; //!< Number of boxes per image with top confidence scores that are fed
+ //!< into the NMS algorithm.
+ int32_t keepTopK; //!< Number of total bounding boxes to be kept per image after NMS step.
+ float confidenceThreshold; //!< Only consider detections whose confidences are larger than a threshold.
+ float nmsThreshold; //!< Threshold to be used in NMS.
+ CodeTypeSSD codeType; //!< Type of coding method for bbox.
+ int32_t inputOrder[3]; //!< Specifies the order of inputs {loc_data, conf_data, priorbox_data}.
+ bool confSigmoid; //!< Set to true to calculate sigmoid of confidence scores.
+ bool isNormalized; //!< Set to true if bounding box data is normalized by the network.
+ bool isBatchAgnostic{true}; //!< Defaults to true. Set to false if prior boxes are unique per batch.
};
//!
-//! \brief When performing yolo9000, softmaxTree is helping to do softmax on confidence scores, for element to get the precise classification through word-tree structured classification definition.
+//! \brief When performing yolo9000, softmaxTree helps to perform softmax on confidence scores,
+//! so that each element obtains a precise classification through the word-tree structured classification definition.
//!
struct softmaxTree
{
@@ -178,53 +154,48 @@ struct softmaxTree
int32_t* child;
int32_t* group;
char** name;
-
int32_t groups;
int32_t* groupSize;
int32_t* groupOffset;
};
//!
-//! \brief The Region plugin layer performs region proposal calculation: generate 5 bounding boxes per cell (for yolo9000, generate 3 bounding boxes per cell).
-//! For each box, calculating its probablities of objects detections from 80 pre-defined classifications (yolo9000 has 9418 pre-defined classifications,
-//! and these 9418 items are organized as work-tree structure).
+//! \brief The Region plugin layer performs region proposal calculation.
+//!
+//! Generate 5 bounding boxes per cell (for yolo9000, generate 3 bounding boxes per cell).
+//! For each box, it calculates the probabilities of object detection from 80 pre-defined classifications
+//! (yolo9000 has 9418 pre-defined classifications, and these 9418 items are organized as a word-tree structure).
//! RegionParameters defines a set of parameters for creating the Region plugin layer.
-//! \param num Number of predicted bounding box for each grid cell.
-//! \param coords Number of coordinates for a bounding box.
-//! \param classes Number of classifications to be predicted.
-//! \param smTree Helping structure to do softmax on confidence scores.
//!
struct RegionParameters
{
- int32_t num;
- int32_t coords;
- int32_t classes;
- softmaxTree* smTree;
+ int32_t num; //!< Number of predicted bounding box for each grid cell.
+ int32_t coords; //!< Number of coordinates for a bounding box.
+ int32_t classes; //!< Number of classifications to be predicted.
+ softmaxTree* smTree; //!< Helping structure to do softmax on confidence scores.
};
//!
//! \brief The NMSParameters are used by the BatchedNMSPlugin for performing
//! the non_max_suppression operation over boxes for object detection networks.
-//! \param shareLocation If set to true, the boxes inputs are shared across all
-//! classes. If set to false, the boxes input should account for per class box data.
-//! \param backgroundLabelId Label ID for the background class. If there is no background class, set it as -1
-//! \param numClasses Number of classes in the network.
-//! \param topK Number of bounding boxes to be fed into the NMS step.
-//! \param keepTopK Number of total bounding boxes to be kept per image after NMS step.
-//! Should be less than or equal to the topK value.
-//! \param scoreThreshold Scalar threshold for score (low scoring boxes are removed).
-//! \param iouThreshold scalar threshold for IOU (new boxes that have high IOU overlap
-//! with previously selected boxes are removed).
-//! \param isNormalized Set to false, if the box coordinates are not
-//! normalized, i.e. not in the range [0,1]. Defaults to false.
//!
-
-struct NMSParameters
+//! \deprecated Deprecated in TensorRT 10.0. BatchedNMSPlugin plugin is deprecated.
+//!
+struct TRT_DEPRECATED NMSParameters
{
- bool shareLocation;
- int32_t backgroundLabelId, numClasses, topK, keepTopK;
- float scoreThreshold, iouThreshold;
- bool isNormalized;
+ bool shareLocation; //!< If set to true, the boxes inputs are shared across all classes.
+ //!< If set to false, the boxes input should account for per class box data.
+ int32_t backgroundLabelId; //!< Label ID for the background class.
+ //!< If there is no background class, set it as -1
+ int32_t numClasses; //!< Number of classes in the network.
+ int32_t topK; //!< Number of bounding boxes to be fed into the NMS step.
+ int32_t keepTopK; //!< Number of total bounding boxes to be kept per image after NMS step.
+ //!< Should be less than or equal to the topK value.
+ float scoreThreshold; //!< Scalar threshold for score (low scoring boxes are removed).
+ float iouThreshold; //!< A scalar threshold for IOU (new boxes that have high IOU overlap
+ //!< with previously selected boxes are removed).
+ bool isNormalized; //!< Set to false, if the box coordinates are not normalized,
+ //!< i.e. not in the range [0,1]. Defaults to false.
};
} // namespace plugin
diff --git a/include/NvInferRuntime.h b/include/NvInferRuntime.h
index 925531e0..04434931 100644
--- a/include/NvInferRuntime.h
+++ b/include/NvInferRuntime.h
@@ -69,7 +69,6 @@ class INoCopy
//! network operations that are DLA compatible and the resulting serialized engine can be executed using standalone
//! DLA runtime APIs. See sampleCudla for an example of integrating cuDLA APIs with TensorRT APIs.
//!
-
enum class EngineCapability : int32_t
{
//!
@@ -78,9 +77,6 @@ enum class EngineCapability : int32_t
//!
kSTANDARD = 0,
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by kSTANDARD.
- kDEFAULT TRT_DEPRECATED_ENUM = kSTANDARD,
-
//!
//! Safety: TensorRT flow with restrictions targeting the safety runtime.
//! See safety documentation for list of supported layers and formats.
@@ -89,18 +85,12 @@ enum class EngineCapability : int32_t
//! This flag is only supported in NVIDIA Drive(R) products.
kSAFETY = 1,
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by kSAFETY.
- kSAFE_GPU TRT_DEPRECATED_ENUM = kSAFETY,
-
//!
//! DLA Standalone: TensorRT flow with restrictions targeting external, to TensorRT, DLA runtimes.
//! See DLA documentation for list of supported layers and formats.
//! This flow supports only DeviceType::kDLA.
//!
kDLA_STANDALONE = 2,
-
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by kDLA_STANDALONE.
- kSAFE_DLA TRT_DEPRECATED_ENUM = kDLA_STANDALONE,
};
namespace impl
@@ -167,17 +157,6 @@ class IHostMemory : public INoCopy
{
return mImpl->type();
}
- //!
- //! Destroy the allocated memory.
- //!
- //! \deprecated Deprecated in TRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
protected:
apiv::VHostMemory* mImpl;
@@ -215,6 +194,7 @@ constexpr inline int32_t EnumMax() noexcept
//!
//! \enum TensorLocation
+//!
//! \brief The location for tensor data storage, device or host.
//!
enum class TensorLocation : int32_t
@@ -236,27 +216,33 @@ struct EnumMaxImpl
//!
//! \class IDimensionExpr
//!
-//! An IDimensionExpr represents an integer expression constructed from constants,
+//! \brief An IDimensionExpr represents an integer expression constructed from constants,
//! input dimensions, and binary operations. These expressions are can be used
-//! in overrides of IPluginV2DynamicExt::getOutputDimensions to define output
+//! in overrides of IPluginV2DynamicExt::getOutputDimensions or IPluginV3OneBuild::getOutputShapes() to define output
//! dimensions in terms of input dimensions.
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
-//! \see DimensionOperation, IPluginV2DynamicExt::getOutputDimensions
+//! \see DimensionOperation, IPluginV2DynamicExt::getOutputDimensions, IPluginV3OneBuild::getOutputShapes()
//!
class IDimensionExpr : public INoCopy
{
public:
- //! Return true if expression is a build-time constant.
+ //!
+ //! \brief Return true if expression is a build-time constant.
+ //!
bool isConstant() const noexcept
{
return mImpl->isConstant();
}
+ //!
+ //! \brief Get the value of the constant.
+ //!
//! If isConstant(), returns value of the constant.
-    //! If !isConstant(), return std::numeric_limits<int32_t>::min().
-    int32_t getConstantValue() const noexcept
+    //! If !isConstant(), return std::numeric_limits<int64_t>::min().
+ //!
+ int64_t getConstantValue() const noexcept
{
return mImpl->getConstantValue();
}
@@ -264,20 +250,31 @@ class IDimensionExpr : public INoCopy
protected:
apiv::VDimensionExpr* mImpl;
virtual ~IDimensionExpr() noexcept = default;
+
+public:
+ //!
+ //! \brief Return true if this denotes the value of a size tensor.
+ //!
+ //! \return True if this was created with method IExprBuilder::declareSizeTensor, false otherwise
+ //!
+ bool isSizeTensor() const noexcept
+ {
+ return mImpl->isSizeTensor();
+ }
};
//!
//! \class IExprBuilder
//!
-//! Object for constructing IDimensionExpr.
+//! \brief Object for constructing IDimensionExpr.
//!
//! There is no public way to construct an IExprBuilder. It appears as an argument to
-//! method IPluginV2DynamicExt::getOutputDimensions(). Overrides of that method can use
-//! that IExprBuilder argument to construct expressions that define output dimensions
-//! in terms of input dimensions.
+//! method IPluginV2DynamicExt::getOutputDimensions() and IPluginV3OneBuild::getOutputShapes(). Overrides of that
+//! method can use that IExprBuilder argument to construct expressions that define output dimensions in terms of input
+//! dimensions.
//!
//! Clients should assume that any values constructed by the IExprBuilder are destroyed
-//! after IPluginV2DynamicExt::getOutputDimensions() returns.
+//! after IPluginV2DynamicExt::getOutputDimensions() or IPluginV3OneBuild::getOutputShapes() returns.
//!
//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
//!
@@ -286,14 +283,20 @@ class IDimensionExpr : public INoCopy
class IExprBuilder : public INoCopy
{
public:
- //! Return pointer to IDimensionExp for given value.
- IDimensionExpr const* constant(int32_t value) noexcept
+ //!
+    //! \brief Return pointer to IDimensionExpr for given value.
+ //!
+ IDimensionExpr const* constant(int64_t value) noexcept
{
return mImpl->constant(value);
}
+ //!
+ //! \brief Get the operation.
+ //!
    //! Return pointer to IDimensionExpr that represents the given operation applied to first and second.
//! Returns nullptr if op is not a valid DimensionOperation.
+ //!
IDimensionExpr const* operation(
DimensionOperation op, IDimensionExpr const& first, IDimensionExpr const& second) noexcept
{
@@ -303,12 +306,42 @@ class IExprBuilder : public INoCopy
protected:
apiv::VExprBuilder* mImpl;
virtual ~IExprBuilder() noexcept = default;
+
+public:
+ //!
+ //! \brief Declare a size tensor at the given output index, with the specified auto-tuning formula and upper bound.
+ //!
+ //! A size tensor allows a plugin to have output dimensions that cannot be computed solely from input dimensions.
+ //! For example, suppose a plugin implements the equivalent of INonZeroLayer for 2D input. The plugin can
+ //! have one output for the indices of non-zero elements, and a second output containing the number of non-zero
+ //! elements. Suppose the input has size [M,N] and has K non-zero elements. The plugin can write K to the second
+    //! output. When telling TensorRT that the first output has shape [2,K], the plugin uses IExprBuilder::constant() and
+    //! IExprBuilder::declareSizeTensor(1,...) to create the IDimensionExprs that denote 2 and K, respectively.
+    //!
+    //! TensorRT also needs to know the value of K to use for auto-tuning and an upper bound on K so that it can
+    //! allocate memory for the output tensor. In the example, suppose typically half of the plugin's input elements
+    //! are non-zero, and all the elements might be non-zero. Then using M*N/2 might be a good expression for the opt
+    //! parameter, and M*N for the upper bound. IDimensionExpr objects for these expressions can be constructed from
+    //! the IDimensionExpr objects for the input dimensions.
+ //!
+ //! \param outputIndex index of a plugin output that is a size tensor.
+ //! \param opt formula for computing auto-tuning value. Must not depend on a size tensor.
+ //! \param upper Upper bound on the size tensor.
+ //!
+ //! \return IDimensionExpr denoting the value of the size tensor.
+ //!
+ //! \see IPluginV3OneBuild::getOutputShapes()
+ //!
+ IDimensionExpr const* declareSizeTensor(int32_t outputIndex, IDimensionExpr const& opt, IDimensionExpr const& upper)
+ {
+ return mImpl->declareSizeTensor(outputIndex, opt, upper);
+ }
};
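
The `declareSizeTensor()` description above (a NonZero-style plugin whose first output has a data-dependent shape) corresponds roughly to a `getOutputShapes()` override like the sketch below. Only the TensorRT types are real; the free-standing function is a stand-in for the member override, and error handling is minimal:

```cpp
#include "NvInferRuntime.h"

using namespace nvinfer1;

// Sketch of IPluginV3OneBuild::getOutputShapes() for a 2D NonZero-like plugin:
// output 0 holds the [2, K] indices, output 1 is a 0-D size tensor holding K.
int32_t getOutputShapesSketch(DimsExprs const* inputs, int32_t nbInputs, DimsExprs const* /*shapeInputs*/,
    int32_t /*nbShapeInputs*/, DimsExprs* outputs, int32_t nbOutputs, IExprBuilder& exprBuilder) noexcept
{
    if (nbInputs != 1 || nbOutputs != 2 || inputs[0].nbDims != 2)
    {
        return -1;
    }
    // M * N: total number of input elements.
    IDimensionExpr const* numElems
        = exprBuilder.operation(DimensionOperation::kPROD, *inputs[0].d[0], *inputs[0].d[1]);
    // Auto-tuning value: assume roughly half of the elements are non-zero.
    IDimensionExpr const* opt
        = exprBuilder.operation(DimensionOperation::kFLOOR_DIV, *numElems, *exprBuilder.constant(2));

    // Output 1 is the size tensor; it is declared 0-D and its value K bounds output 0.
    IDimensionExpr const* k = exprBuilder.declareSizeTensor(1, *opt, *numElems);
    outputs[1].nbDims = 0;

    // Output 0 has shape [2, K].
    outputs[0].nbDims = 2;
    outputs[0].d[0] = exprBuilder.constant(2);
    outputs[0].d[1] = k;
    return 0;
}
```
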
//!
//! \class DimsExprs
//!
-//! Analog of class Dims with expressions instead of constants for the dimensions.
+//! \brief Analog of class Dims with expressions instead of constants for the dimensions.
//!
class DimsExprs
{
@@ -318,9 +351,9 @@ class DimsExprs
};
//!
-//! \class DynamicPluginTensorDesc
+//! \struct DynamicPluginTensorDesc
//!
-//! Summarizes tensors that a plugin might see for an input or output.
+//! \brief Summarizes tensors that a plugin might see for an input or output.
//!
struct DynamicPluginTensorDesc
{
@@ -332,27 +365,42 @@ struct DynamicPluginTensorDesc
//! Upper bounds on tensor’s dimensions
Dims max;
+
+ //! Optimum value of tensor’s dimensions specified for auto-tuning
+ Dims opt;
};
//!
//! \class IPluginV2DynamicExt
//!
-//! Similar to IPluginV2Ext, but with support for dynamic shapes.
+//! \brief Similar to IPluginV2Ext, but with support for dynamic shapes.
//!
//! Clients should override the public methods, including the following inherited methods:
//!
-//! virtual int32_t getNbOutputs() const noexcept = 0;
-//! virtual nvinfer1::DataType getOutputDataType(int32_t index, nvinfer1::DataType const* inputTypes, int32_t
-//! nbInputs) const noexcept = 0; virtual size_t getSerializationSize() const noexcept = 0; virtual void
-//! serialize(void* buffer) const noexcept = 0; virtual void destroy() noexcept = 0; virtual void
-//! setPluginNamespace(char const* pluginNamespace) noexcept = 0; virtual char const* getPluginNamespace() const
-//! noexcept = 0;
+//! * virtual int32_t getNbOutputs() const noexcept = 0;
+//!
+//! * virtual DataType getOutputDataType(int32_t index, DataType const* inputTypes,
+//! int32_t nbInputs) const noexcept = 0;
+//!
+//! * virtual size_t getSerializationSize() const noexcept = 0;
//!
-//! For getOutputDataType, the inputTypes will always be DataType::kFLOAT or DataType::kINT32,
+//! * virtual void serialize(void* buffer) const noexcept = 0;
+//!
+//! * virtual void destroy() noexcept = 0;
+//!
+//! * virtual void setPluginNamespace(char const* pluginNamespace) noexcept = 0;
+//!
+//! * virtual char const* getPluginNamespace() const noexcept = 0;
+//!
+//! For weakly typed networks, the inputTypes will always be DataType::kFLOAT or DataType::kINT32,
//! and the returned type is canonicalized to DataType::kFLOAT if it is DataType::kHALF or DataType:kINT8.
+//! For strongly typed networks, inputTypes are inferred from previous operations, and getOutputDataType
+//! specifies the returned type based on the inputTypes.
//! Details about the floating-point precision are elicited later by method supportsFormatCombination.
//!
-class IPluginV2DynamicExt : public nvinfer1::IPluginV2Ext
+//! \deprecated Deprecated in TensorRT 10.0. Please implement IPluginV3 instead.
+//!
+class TRT_DEPRECATED IPluginV2DynamicExt : public nvinfer1::IPluginV2Ext
{
public:
IPluginV2DynamicExt* clone() const noexcept override = 0;
@@ -385,7 +433,7 @@ class IPluginV2DynamicExt : public nvinfer1::IPluginV2Ext
int32_t outputIndex, DimsExprs const* inputs, int32_t nbInputs, IExprBuilder& exprBuilder) noexcept = 0;
//!
- //! Limit on number of format combinations accepted.
+ //! \brief Limit on number of format combinations accepted.
//!
static constexpr int32_t kFORMAT_COMBINATION_LIMIT = 100;
@@ -406,18 +454,18 @@ class IPluginV2DynamicExt : public nvinfer1::IPluginV2Ext
//!
//! * A definition for a plugin that supports only FP16 NCHW:
//!
- //! return inOut.format[pos] == TensorFormat::kLINEAR && inOut.type[pos] == DataType::kHALF;
+ //! return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kHALF;
//!
//! * A definition for a plugin that supports only FP16 NCHW for its two inputs,
//! and FP32 NCHW for its single output:
//!
- //! return inOut.format[pos] == TensorFormat::kLINEAR && (inOut.type[pos] == (pos < 2 ? DataType::kHALF :
+ //! return inOut[pos].format == TensorFormat::kLINEAR && (inOut[pos].type == (pos < 2 ? DataType::kHALF :
//! DataType::kFLOAT));
//!
//! * A definition for a "polymorphic" plugin with two inputs and one output that supports
//! any format or type, but the inputs and output must have the same format and type:
//!
- //! return pos == 0 || (inOut.format[pos] == inOut.format[0] && inOut.type[pos] == inOut.type[0]);
+ //! return pos == 0 || (inOut[pos].format == inOut.format[0] && inOut[pos].type == inOut[0].type);
//!
//! Warning: TensorRT will stop asking for formats once it finds kFORMAT_COMBINATION_LIMIT on combinations.
//!
@@ -450,9 +498,8 @@ class IPluginV2DynamicExt : public nvinfer1::IPluginV2Ext
//! * IExecutionContext will call this during the next subsequent instance enqueue[V2]() or execute[V2]() if:
//! - The batch size is changed from previous call of execute()/enqueue() if hasImplicitBatchDimension() returns
//! true.
- //! - The optimization profile is changed via setOptimizationProfile() or setOptimizationProfileAsync().
- //! - An input shape binding is changed via setInputShapeBinding().
- //! - An input execution binding is changed via setBindingDimensions().
+ //! - The optimization profile is changed via setOptimizationProfileAsync().
+ //! - An input execution binding is changed via setInputShape().
//! \warning The execution phase is timing critical during IExecutionContext but is not part of the timing loop when
//! called from IBuilder. Performance bottlenecks of configurePlugin won't show up during engine building but will
//! be visible during execution after calling functions that trigger layer resource updates.
@@ -510,53 +557,644 @@ class IPluginV2DynamicExt : public nvinfer1::IPluginV2Ext
private:
// Following are obsolete base class methods, and must not be implemented or used.
+ //!
+ //! \brief Set plugin configuration
+ //!
void configurePlugin(Dims const*, int32_t, Dims const*, int32_t, DataType const*, DataType const*, bool const*,
bool const*, PluginFormat, int32_t) noexcept override final
{
}
+ //!
+ //! \brief Check if provided data type is supported
+ //!
bool supportsFormat(DataType, PluginFormat) const noexcept override final
{
return false;
}
+ //!
+ //! \brief Get output dimensions.
+ //!
Dims getOutputDimensions(int32_t, Dims const*, int32_t) noexcept override final
{
return Dims{-1, {}};
}
- bool isOutputBroadcastAcrossBatch(int32_t, bool const*, int32_t) const noexcept override final
+ //!
+    //! \brief Whether the output is broadcast across the batch.
+ //!
+ //! \warning Expected to return false as implicit batch support was removed in TensorRT 10.0.
+ //!
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch support is removed in TensorRT 10.0.
+ //!
+ TRT_DEPRECATED bool isOutputBroadcastAcrossBatch(int32_t, bool const*, int32_t) const noexcept override final
{
return false;
}
- bool canBroadcastInputAcrossBatch(int32_t) const noexcept override final
+ //!
+    //! \brief Whether an input can be broadcast across the batch.
+ //!
+ //! \warning Expected to return false as implicit batch support was removed in TensorRT 10.0.
+ //!
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch support is removed in TensorRT 10.0.
+ //!
+ TRT_DEPRECATED bool canBroadcastInputAcrossBatch(int32_t) const noexcept override final
{
return true;
}
- size_t getWorkspaceSize(int32_t) const noexcept override final
- {
- return 0;
- }
+ //!
+ //! \brief Get required workspace size in bytes.
+ //!
+ size_t getWorkspaceSize(int32_t) const noexcept override final
+ {
+ return 0;
+ }
+
+ //!
+ //! \brief Run inference.
+ //!
+ int32_t enqueue(int32_t, void const* const*, void* const*, void*, cudaStream_t) noexcept override final
+ {
+ return 1;
+ }
+};
+
+//!
+//! \class IPluginResourceContext
+//!
+//! \brief Interface for plugins to access per context resources provided by TensorRT
+//!
+//! There is no public way to construct an IPluginResourceContext. It appears as an argument to
+//! IPluginV3OneRuntime::attachToContext(). Overrides of that method can use the IPluginResourceContext object to access
+//! any available per context resources.
+//!
+//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
+//!
+//! \see IPluginV3OneRuntime::attachToContext()
+//!
+class IPluginResourceContext
+{
+public:
+ //! \brief Get the GPU allocator associated with the resource context
+ //!
+ //! \see IPluginV3OneRuntime::attachToContext()
+ //!
+ virtual IGpuAllocator* getGpuAllocator() const noexcept = 0;
+
+ //! \brief Get the error recorder associated with the resource context
+ //!
+ //! \see IPluginV3OneRuntime::attachToContext()
+ //!
+ virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
+ virtual ~IPluginResourceContext() noexcept = default;
+
+protected:
+ IPluginResourceContext() = default;
+ IPluginResourceContext(IPluginResourceContext const&) = default;
+ IPluginResourceContext(IPluginResourceContext&&) = default;
+ IPluginResourceContext& operator=(IPluginResourceContext const&) & = default;
+ IPluginResourceContext& operator=(IPluginResourceContext&&) & = default;
+};
+
+namespace v_1_0
+{
+class IPluginCapability : public IVersionedInterface
+{
+};
+} // namespace v_1_0
+
+//!
+//! \class IPluginCapability
+//!
+//! \brief Base class for plugin capability interfaces
+//!
+//! IPluginCapability represents a split in TensorRT V3 plugins to sub-objects that expose different types of
+//! capabilities a plugin may have, as opposed to a single interface which defines all capabilities and behaviors of a
+//! plugin.
+//!
+//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
+//!
+//! \see PluginCapabilityType
+//!
+using IPluginCapability = v_1_0::IPluginCapability;
+
+namespace v_1_0
+{
+class IPluginV3 : public IVersionedInterface
+{
+public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"PLUGIN", 1, 0};
+ }
+
+ //! \brief Return a pointer to plugin object implementing the specified PluginCapabilityType.
+ //!
+ //! \note IPluginV3 objects added for the build phase (through addPluginV3()) must return valid objects for
+ //! PluginCapabilityType::kCORE, PluginCapabilityType::kBUILD and PluginCapabilityType::kRUNTIME.
+ //!
+ //! \note IPluginV3 objects added for the runtime phase must return valid objects for
+ //! PluginCapabilityType::kCORE and PluginCapabilityType::kRUNTIME.
+ //!
+ //! \see TensorRTPhase
+ //! \see IPluginCreatorV3One::createPlugin()
+ //!
+ virtual IPluginCapability* getCapabilityInterface(PluginCapabilityType type) noexcept = 0;
+
+ //!
+ //! \brief Clone the plugin object. This copies over internal plugin parameters and returns a new plugin object with
+ //! these parameters. The cloned object must be in a fully initialized state.
+ //!
+ //! \note The cloned object must return valid objects through getCapabilityInterface() for at least the same
+ //! PluginCapabilityTypes as the original object.
+ //!
+ //! \return A cloned plugin object in an initialized state with the same parameters as the current object.
+ //! nullptr must be returned if the cloning fails.
+ //!
+ virtual IPluginV3* clone() noexcept = 0;
+};
+
+} // namespace v_1_0
+
+//!
+//! \class IPluginV3
+//!
+//! \brief Plugin class for the V3 generation of user-implemented layers.
+//!
+//! IPluginV3 acts as a wrapper around the plugin capability interfaces that define the actual behavior of the plugin.
+//!
+//! \see IPluginCapability
+//! \see IPluginCreatorV3One
+//! \see IPluginRegistry
+//!
+using IPluginV3 = v_1_0::IPluginV3;
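
A frequently used (though not required) layout is a single class that derives from `IPluginV3` and the three capability interfaces and hands out itself from `getCapabilityInterface()`. A hedged sketch with everything except the dispatch elided; the class name is hypothetical, and the remaining pure-virtual methods would still have to be implemented:

```cpp
#include "NvInferRuntime.h"

using namespace nvinfer1;

// Fragment of a hypothetical plugin; the IPluginV3One* overrides are omitted,
// so this illustrates only the capability-dispatch pattern.
class MyPlugin : public IPluginV3, public IPluginV3OneCore, public IPluginV3OneBuild, public IPluginV3OneRuntime
{
public:
    IPluginCapability* getCapabilityInterface(PluginCapabilityType type) noexcept override
    {
        switch (type)
        {
        case PluginCapabilityType::kCORE: return static_cast<IPluginV3OneCore*>(this);
        case PluginCapabilityType::kBUILD: return static_cast<IPluginV3OneBuild*>(this);
        case PluginCapabilityType::kRUNTIME: return static_cast<IPluginV3OneRuntime*>(this);
        }
        return nullptr;
    }

    // clone() and the core/build/runtime capability methods are implemented here
    // in a real plugin; they are left out of this sketch.
};
```
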
+
+namespace v_1_0
+{
+class IPluginV3OneCore : public IPluginCapability
+{
+public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"PLUGIN_V3ONE_CORE", 1, 0};
+ }
+
+ //!
+ //! \brief Return the plugin name. Should match the plugin name returned by the corresponding plugin creator.
+ //!
+ //! \see IPluginCreatorV3One::getPluginName()
+ //!
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
+ //!
+ virtual AsciiChar const* getPluginName() const noexcept = 0;
+
+ //!
+ //! \brief Return the plugin version. Should match the plugin version returned by the corresponding plugin creator.
+ //!
+ //! \see IPluginCreatorV3One::getPluginVersion()
+ //!
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
+ //!
+ virtual AsciiChar const* getPluginVersion() const noexcept = 0;
+
+ //!
+ //! \brief Return the namespace of the plugin object. Should match the plugin namespace returned by the
+ //! corresponding plugin creator.
+ //!
+ //! \see IPluginCreatorV3One::getPluginNamespace()
+ //!
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
+ //!
+ virtual AsciiChar const* getPluginNamespace() const noexcept = 0;
+};
+
+class IPluginV3OneBuild : public IPluginCapability
+{
+public:
+ //!
+ //! \brief The default maximum number of format combinations that will be timed by TensorRT during the build phase
+ //!
+ //! \see getFormatCombinationLimit
+ //!
+ static constexpr int32_t kDEFAULT_FORMAT_COMBINATION_LIMIT = 100;
+
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"PLUGIN_V3ONE_BUILD", 1, 0};
+ }
+
+ //!
+ //! \brief Configure the plugin.
+ //!
+ //! configurePlugin() can be called multiple times in the build phase during creation of an engine by IBuilder.
+ //!
+ //! configurePlugin() is called when a plugin is being prepared for profiling but not for any
+ //! specific input size. This provides an opportunity for the plugin to make algorithmic choices on the basis of
+ //! input and output formats, along with the bound of possible dimensions. The min, opt and max value of the
+ //! DynamicPluginTensorDesc correspond to the kMIN, kOPT and kMAX value of the current profile that the plugin is
+ //! being profiled for, with the desc.dims field corresponding to the dimensions of plugin specified at network
+ //! creation. Wildcard dimensions may exist during this phase in the desc.dims field.
+ //!
+ //! \param in The input tensors attributes that are used for configuration.
+ //! \param nbInputs Number of input tensors.
+ //! \param out The output tensors attributes that are used for configuration.
+ //! \param nbOutputs Number of output tensors.
+ //!
+ virtual int32_t configurePlugin(DynamicPluginTensorDesc const* in, int32_t nbInputs,
+ DynamicPluginTensorDesc const* out, int32_t nbOutputs) noexcept = 0;
+
+ //!
+ //! \brief Provide the data types of the plugin outputs if the input tensors have the data types provided.
+ //!
+ //! \param outputTypes Pre-allocated array to which the output data types should be written.
+ //! \param nbOutputs The number of output tensors. This matches the value returned from getNbOutputs().
+ //! \param inputTypes The input data types.
+ //! \param nbInputs The number of input tensors.
+ //!
+ //! \return 0 for success, else non-zero (which will cause engine termination). The returned code will be reported
+ //! through the error recorder.
+ //!
+ //! \note Provide `DataType::kFLOAT`s if the layer has no inputs. The data type for any size tensor outputs must be
+ //! `DataType::kINT32`. The returned data types must each have a format that is supported by the plugin.
+ //!
+ //! \warning DataType:kBOOL and DataType::kUINT8 are not supported.
+ //!
+ virtual int32_t getOutputDataTypes(
+ DataType* outputTypes, int32_t nbOutputs, const DataType* inputTypes, int32_t nbInputs) const noexcept = 0;
+
+ //!
+ //! \brief Provide expressions for computing dimensions of the output tensors from dimensions of the input tensors.
+ //!
+ //! \param inputs Expressions for dimensions of the input tensors
+ //! \param nbInputs The number of input tensors
+ //! \param shapeInputs Expressions for values of the shape tensor inputs
+ //! \param nbShapeInputs The number of shape tensor inputs
+ //! \param outputs Pre-allocated array to which the output dimensions must be written
+ //! \param exprBuilder Object for generating new dimension expressions
+ //!
+ //! \note Any size tensor outputs must be declared to be 0-D.
+ //!
+ //! \note The declaration of shapeInputs as DimsExprs is slightly abusive, because the "dimensions"
+ //! are actually the values of the shape tensor. For example, if the input shape tensor
+ //! is a 2x3 matrix, the DimsExprs will have six "dimensions": the three values from the first
+ //! row of the matrix followed by the three values from the second row of the matrix.
+ //!
+ //! \return 0 for success, else non-zero (which will cause engine termination). Returned code will be reported
+ //! through the error recorder.
+ //!
+ virtual int32_t getOutputShapes(DimsExprs const* inputs, int32_t nbInputs, DimsExprs const* shapeInputs,
+ int32_t nbShapeInputs, DimsExprs* outputs, int32_t nbOutputs, IExprBuilder& exprBuilder) noexcept = 0;
+
+ //!
+ //! \brief Return true if plugin supports the format and datatype for the input/output indexed by pos.
+ //!
+ //! For this method inputs are numbered 0.. (nbInputs - 1) and outputs are numbered nbInputs.. (nbInputs + nbOutputs
+ //! - 1). Using this numbering, pos is an index into InOut, where 0 <= pos < nbInputs + nbOutputs - 1.
+ //!
+ //! TensorRT invokes this method to ask if the input/output indexed by pos supports the format/datatype specified
+ //! by inOut[pos].format and inOut[pos].type. The override should return true if that format/datatype at inOut[pos]
+ //! are supported by the plugin. If support is conditional on other input/output formats/datatypes, the plugin can
+ //! make its result conditional on the formats/datatypes in inOut[0.. pos - 1], which will be set to values
+    //! that the plugin supports. The override should not inspect inOut[pos + 1.. nbInputs + nbOutputs - 1],
+ //! which will have invalid values. In other words, the decision for pos must be based on inOut[0..pos] only.
+ //!
+ //! Some examples:
+ //!
+ //! * A definition for a plugin that supports only FP16 NCHW:
+ //!
+ //! return inOut.format[pos] == TensorFormat::kLINEAR && inOut.type[pos] == DataType::kHALF;
+ //!
+ //! * A definition for a plugin that supports only FP16 NCHW for its two inputs,
+ //! and FP32 NCHW for its single output:
+ //!
+    //!         return inOut.format[pos] == TensorFormat::kLINEAR && (inOut.type[pos] == (pos < 2 ? DataType::kHALF :
+    //!         DataType::kFLOAT));
+ //!
+ //! * A definition for a "polymorphic" plugin with two inputs and one output that supports
+ //! any format or type, but the inputs and output must have the same format and type:
+ //!
+ //! return pos == 0 || (inOut.format[pos] == inOut.format[0] && inOut.type[pos] == inOut.type[0]);
+ //!
+ //! \warning TensorRT will stop querying once it finds getFormatCombinationLimit() of combinations.
+ //!
+ //! \see getFormatCombinationLimit
+ //!
+ virtual bool supportsFormatCombination(
+ int32_t pos, DynamicPluginTensorDesc const* inOut, int32_t nbInputs, int32_t nbOutputs) noexcept = 0;
+
+ //!
+ //! \brief Get the number of outputs from the plugin.
+ //!
+ //! \return The number of outputs, which must be a positive integer.
+ //!
+ virtual int32_t getNbOutputs() const noexcept = 0;
+
+ //!
+ //! \brief Find the workspace size required by the layer.
+ //!
+ //! This function is called after the plugin is configured, and possibly during execution.
+ //! The result should be a sufficient workspace size to deal with inputs and outputs of the given size
+ //! or any smaller problem.
+ //!
+ //! \return The workspace size.
+ //!
+ virtual size_t getWorkspaceSize(DynamicPluginTensorDesc const* inputs, int32_t nbInputs,
+ DynamicPluginTensorDesc const* outputs, int32_t nbOutputs) const noexcept
+ {
+ return 0;
+ }
+
+ //!
+ //! \brief Query for any custom tactics that the plugin intends to use
+ //!
+ //! For each format combination supported by the plugin (up to a maximum indicated by getFormatCombinationLimit()),
+ //! the plugin will be timed for each tactic advertised through this method.
+ //!
+ //! \param tactics Pre-allocated buffer to which the tactic values should be written
+ //! \param nbTactics The number of tactics advertised through getNbTactics()
+ //!
+ //! \note The provided tactic values must be unique and non-zero. The tactic value 0 is reserved for the default
+ //! tactic attached to each format combination.
+ //!
+ //! \return 0 for success, else non-zero (which will cause engine termination). The returned code will be reported
+ //! through the error recorder.
+ //!
+ virtual int32_t getValidTactics(int32_t* tactics, int32_t nbTactics) noexcept
+ {
+ return 0;
+ }
+
+ //!
+ //! \brief Query for the number of custom tactics the plugin intends to use
+ //!
+ virtual int32_t getNbTactics() noexcept
+ {
+ return 0;
+ }
+
+ //!
+ //! \brief Called to query the suffix to use for the timing cache ID. May be called anytime after plugin creation.
+ //!
+ //! \return Suffix to use for timing cache ID, considering only the creation state of the plugin.
+ //! Returning nullptr will disable timing caching for the plugin altogether.
+ //!
+ //! \note If timing caching is enabled for the plugin (by returning non-null), the I/O shape and format information
+ //! will be automatically considered to form the prefix of the timing cache ID. Therefore, only other factors
+ //! determining the creation state of the plugin, such as its attribute values, should be considered to compose the
+ //! return value.
+ //!
+ virtual char const* getTimingCacheID() noexcept
+ {
+ return nullptr;
+ }
+
+ //!
+ //! \brief Return the maximum number of format combinations that will be timed by TensorRT during the build phase
+ //!
+ virtual int32_t getFormatCombinationLimit() noexcept
+ {
+ return kDEFAULT_FORMAT_COMBINATION_LIMIT;
+ }
+
+ //!
+ //! \brief Query for a string representing the configuration of the plugin. May be called anytime after
+ //! plugin creation.
+ //!
+ //! \return A string representing the plugin's creation state, especially with regard to its attribute values.
+ //!
+ virtual char const* getMetadataString() noexcept
+ {
+ return nullptr;
+ }
+};
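
Note that this `supportsFormatCombination()` receives `DynamicPluginTensorDesc` (unlike the `IPluginV2DynamicExt` variant shown earlier, which receives `PluginTensorDesc`), so the per-position type and format are reached through the `desc` member. A hedged sketch of the FP16, linear-format case from the examples above; the free-standing function stands in for the member override:

```cpp
#include "NvInferRuntime.h"

using namespace nvinfer1;

// Sketch of an IPluginV3OneBuild::supportsFormatCombination() override for a
// plugin that accepts only FP16 tensors in linear (NCHW) layout at every position.
bool supportsFp16LinearOnly(
    int32_t pos, DynamicPluginTensorDesc const* inOut, int32_t /*nbInputs*/, int32_t /*nbOutputs*/) noexcept
{
    PluginTensorDesc const& desc = inOut[pos].desc;
    return desc.format == TensorFormat::kLINEAR && desc.type == DataType::kHALF;
}
```
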
+
+class IPluginV3OneRuntime : public IPluginCapability
+{
+public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"PLUGIN_V3ONE_RUNTIME", 1, 0};
+ }
+
+ //!
+ //! \brief Set the tactic to be used in the subsequent call to enqueue(). If no custom tactics were advertised, this
+ //! will have a value of 0, which is designated as the default tactic.
+ //!
+ //! \return 0 for success, else non-zero (which will cause engine termination). The returned code will be reported
+ //! through the error recorder.
+ //!
+ virtual int32_t setTactic(int32_t tactic) noexcept
+ {
+ return 0;
+ }
+
+ //!
+ //! \brief Called when a plugin is being prepared for execution for specific dimensions. This could
+ //! happen multiple times in the execution phase, both during creation of an engine by IBuilder and execution of an
+ //! engine by IExecutionContext.
+ //! * IBuilder will call this function once per profile, with `in` resolved to the values specified by the
+ //! kOPT field of the current profile.
+ //! * IExecutionContext will call this during the next subsequent instance of enqueueV3() or executeV2() if:
+    //!   - The optimization profile is changed via setOptimizationProfileAsync().
+ //! - An input binding is changed via setInputTensorAddress() or setTensorAddress() or setInputShape().
+ //! \warning The execution phase is timing critical during IExecutionContext but is not part of the timing loop when
+ //! called from IBuilder. Performance bottlenecks of onShapeChange() will not show up during engine building but
+ //! will be visible during execution if any triggering functions are called.
+ //!
+ //! \param in The input tensors attributes that are used for configuration.
+ //! \param nbInputs Number of input tensors.
+ //! \param out The output tensors attributes that are used for configuration.
+ //! \param nbOutputs Number of output tensors.
+ //!
+ virtual int32_t onShapeChange(
+ PluginTensorDesc const* in, int32_t nbInputs, PluginTensorDesc const* out, int32_t nbOutputs) noexcept = 0;
+
+ //!
+ //! \brief Execute the layer.
+ //!
+ //! \param inputDesc how to interpret the memory for the input tensors.
+ //! \param outputDesc how to interpret the memory for the output tensors.
+ //! \param inputs The memory for the input tensors.
+ //! \param outputs The memory for the output tensors.
+ //! \param workspace Workspace for execution.
+ //! \param stream The stream in which to execute the kernels.
+ //!
+ //! \return 0 for success, else non-zero (which will cause engine termination). The returned code will be reported
+ //! through the error recorder.
+ //!
+ virtual int32_t enqueue(PluginTensorDesc const* inputDesc, PluginTensorDesc const* outputDesc,
+ void const* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) noexcept = 0;
+
+ //!
+    //! \brief Clone the plugin, attach the cloned plugin object to an execution context and grant the cloned plugin
+ //! access to some context resources.
+ //!
+ //! This function is called automatically for each plugin when a new execution context is created. The plugin may
+ //! use resources provided by the IPluginResourceContext until the plugin is deleted by TensorRT.
+ //!
+    //! If the plugin needs per-context resources, they can be allocated here.
+ //!
+ //! \param context A resource context that exposes methods to get access to execution context specific resources.
+ //! A different resource context is guaranteed for each different execution context to which the
+ //! plugin is attached.
+ //! \see IPluginResourceContext
+ //!
+ //! \note This method should clone the entire IPluginV3 object, not just the runtime interface
+ //!
+    //! \return A clone of the IPluginV3 object on whose runtime interface this method was invoked, attached to the
+    //! provided resource context.
+ //!
+ virtual IPluginV3* attachToContext(IPluginResourceContext* context) noexcept = 0;
+
+ //!
+ //! \brief Get the plugin fields which should be serialized.
+ //!
+ //! \note The set of plugin fields returned does not necessarily need to match that advertised through
+ //! getFieldNames() of the corresponding plugin creator.
+
+ //! \note To serialize arbitrary plugin data, use a PluginField of
+ //! PluginFieldType::kUNKNOWN, with the length of the PluginField set to the correct number of bytes.
+ //!
+ virtual PluginFieldCollection const* getFieldsToSerialize() noexcept = 0;
+};
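
`getFieldsToSerialize()` is how a V3 plugin persists its state: it returns a `PluginFieldCollection` that TensorRT stores with the engine and later hands back to the creator in the runtime phase. A minimal sketch, assuming a single float attribute named "alpha"; the class and member names are hypothetical:

```cpp
#include "NvInferRuntime.h"

#include <vector>

using namespace nvinfer1;

// Fragment of a hypothetical plugin showing only the serialization hook of
// IPluginV3OneRuntime.
class MyPluginSerializationPart
{
public:
    PluginFieldCollection const* getFieldsToSerialize() noexcept
    {
        mFields.clear();
        mFields.emplace_back("alpha", &mAlpha, PluginFieldType::kFLOAT32, 1);
        mFC.nbFields = static_cast<int32_t>(mFields.size());
        mFC.fields = mFields.data();
        return &mFC;
    }

private:
    float mAlpha{1.0F};               // hypothetical plugin attribute
    std::vector<PluginField> mFields; // storage backing the returned collection
    PluginFieldCollection mFC{};
};
```
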
+} // namespace v_1_0
+
+//!
+//! \class IPluginV3OneCore
+//!
+//! \brief A plugin capability interface that enables the core capability (PluginCapabilityType::kCORE).
+//!
+//! \see IPluginCapability
+//! \see PluginCapabilityType
+//! \see IPluginV3::getCapabilityInterface()
+//!
+using IPluginV3OneCore = v_1_0::IPluginV3OneCore;
+
+//!
+//! \class IPluginV3OneBuild
+//!
+//! \brief A plugin capability interface that enables the build capability (PluginCapabilityType::kBUILD). Exposes
+//! methods that allow the expression of the build time properties and behavior of a plugin.
+//!
+//! \see IPluginCapability
+//! \see PluginCapabilityType
+//! \see IPluginV3::getCapabilityInterface()
+//!
+using IPluginV3OneBuild = v_1_0::IPluginV3OneBuild;
+
+//!
+//! \class IPluginV3OneRuntime
+//!
+//! \brief A plugin capability interface that enables the runtime capability (PluginCapabilityType::kRUNTIME). Exposes
+//! methods that allow the expression of the runtime properties and behavior of a plugin.
+//!
+//! \see IPluginCapability
+//! \see PluginCapabilityType
+//! \see IPluginV3::getCapabilityInterface()
+//!
+using IPluginV3OneRuntime = v_1_0::IPluginV3OneRuntime;
+
+namespace v_1_0
+{
+class IPluginCreatorV3One : public IPluginCreatorInterface
+{
+public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"PLUGIN CREATOR_V3ONE", 1, 0};
+ }
+
+ //!
+ //! \brief Return a plugin object. Return nullptr in case of error.
+ //!
+ //! \param name A NULL-terminated name string of length 1024 or less, including the NULL terminator.
+ //! \param fc A pointer to a collection of fields needed for constructing the plugin.
+ //! \param phase The TensorRT phase in which the plugin is being created
+ //!
+ //! When the phase is TensorRTPhase::kRUNTIME, the PluginFieldCollection provided for serialization by the plugin's
+ //! runtime interface will be passed as fc.
+ //!
+ //! \note The returned plugin object must be in an initialized state
+ //!
+ virtual IPluginV3* createPlugin(
+ AsciiChar const* name, PluginFieldCollection const* fc, TensorRTPhase phase) noexcept = 0;
+
+ //!
+ //! \brief Return a list of fields that need to be passed to createPlugin() when creating a plugin for use in the
+ //! TensorRT build phase.
+ //!
+ //! \see PluginFieldCollection
+ //!
+ virtual PluginFieldCollection const* getFieldNames() noexcept = 0;
+
+ //!
+ //! \brief Return the plugin name.
+ //!
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including
+ //! the NULL terminator.
+ //!
+ virtual AsciiChar const* getPluginName() const noexcept = 0;
+
+ //!
+ //! \brief Return the plugin version.
+ //!
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including
+ //! the NULL terminator.
+ //!
+ virtual AsciiChar const* getPluginVersion() const noexcept = 0;
+
+ //!
+ //! \brief Return the plugin namespace.
+ //!
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including
+ //! the NULL terminator.
+ //!
+ virtual AsciiChar const* getPluginNamespace() const noexcept = 0;
+
+ IPluginCreatorV3One() = default;
+ virtual ~IPluginCreatorV3One() = default;
- int32_t enqueue(int32_t, void const* const*, void* const*, void*, cudaStream_t) noexcept override final
- {
- return 1;
- }
+protected:
+ IPluginCreatorV3One(IPluginCreatorV3One const&) = default;
+ IPluginCreatorV3One(IPluginCreatorV3One&&) = default;
+ IPluginCreatorV3One& operator=(IPluginCreatorV3One const&) & = default;
+ IPluginCreatorV3One& operator=(IPluginCreatorV3One&&) & = default;
};
+} // namespace v_1_0
//!
-//! \class IProfiler
-//!
-//! \brief Application-implemented interface for profiling.
+//! \class IPluginCreatorV3One
//!
-//! When this class is added to an execution context, the profiler will be called once per layer for each invocation of
-//! executeV2()/enqueueV2()/enqueueV3().
+//! \brief A plugin creator class capable of producing IPluginV3 objects
//!
-//! It is not recommended to run inference with profiler enabled when the inference execution time is critical since the
-//! profiler may affect execution time negatively.
+//! \see IPluginV3
+//! \see IPluginRegistry
//!
+using IPluginCreatorV3One = v_1_0::IPluginCreatorV3One;
+
+namespace v_1_0
+{
class IProfiler
{
public:
@@ -571,17 +1209,32 @@ class IProfiler
virtual ~IProfiler() noexcept {}
};
+} // namespace v_1_0
+
+//!
+//! \class IProfiler
+//!
+//! \brief Application-implemented interface for profiling.
+//!
+//! When this class is added to an execution context, the profiler will be called once per layer for each invocation of
+//! executeV2()/enqueueV3().
+//!
+//! It is not recommended to run inference with profiler enabled when the inference execution time is critical since the
+//! profiler may affect execution time negatively.
+//!
+using IProfiler = v_1_0::IProfiler;
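+
+// A minimal profiler sketch, assuming the reportLayerTime callback declared in the elided portion of the
+// IProfiler class above; SimpleProfiler and context are illustrative names:
+//
+//     struct SimpleProfiler : public nvinfer1::IProfiler
+//     {
+//         void reportLayerTime(char const* layerName, float ms) noexcept override
+//         {
+//             std::printf("%s: %.3f ms\n", layerName, ms);
+//         }
+//     };
+//
+//     SimpleProfiler profiler;
+//     context->setProfiler(&profiler); // called once per layer for each executeV2()/enqueueV3()
+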
//!
//! \enum WeightsRole
+//!
//! \brief How a layer uses particular Weights.
//!
//! The power weights of an IScaleLayer are omitted. Refitting those is not supported.
//!
enum class WeightsRole : int32_t
{
- kKERNEL = 0, //!< kernel for IConvolutionLayer, IDeconvolutionLayer, or IFullyConnectedLayer
- kBIAS = 1, //!< bias for IConvolutionLayer, IDeconvolutionLayer, or IFullyConnectedLayer
+ kKERNEL = 0, //!< kernel for IConvolutionLayer or IDeconvolutionLayer
+ kBIAS = 1, //!< bias for IConvolutionLayer or IDeconvolutionLayer
kSHIFT = 2, //!< shift part of IScaleLayer
kSCALE = 3, //!< scale part of IScaleLayer
kCONSTANT = 4, //!< weights for IConstantLayer
@@ -602,8 +1255,8 @@ constexpr inline int32_t EnumMax() noexcept
//!
enum class DeviceType : int32_t
{
- kGPU, //!< GPU Device
- kDLA, //!< DLA Core
+ kGPU = 0, //!< GPU Device
+ kDLA = 1, //!< DLA Core
};
//! Maximum number of elements in DeviceType enum. \see DeviceType
@@ -641,6 +1294,7 @@ constexpr inline int32_t EnumMax() noexcept
return 2;
}
+//!
//! \brief Represents a collection of one or more TempfileControlFlag values combined using bitwise-OR operations.
//!
//! \see TempfileControlFlag,
@@ -660,29 +1314,9 @@ class IRuntime : public INoCopy
public:
virtual ~IRuntime() noexcept = default;
- //!
- //! \brief Deserialize an engine from a stream.
- //!
- //! If an error recorder has been set for the runtime, it will also be passed to the engine.
- //!
- //! \param blob The memory that holds the serialized engine.
- //! \param size The size of the memory in bytes.
- //! \param pluginFactory The plugin factory, if any plugins are used by the network, otherwise nullptr.
- //!
- //! \return The engine, or nullptr if it could not be deserialized.
- //!
- //! \deprecated Deprecated in TensorRT 8.0.
- //!
- //! \warning IPluginFactory is no longer supported, therefore pluginFactory must be a nullptr.
- //!
- TRT_DEPRECATED nvinfer1::ICudaEngine* deserializeCudaEngine(
- void const* blob, std::size_t size, IPluginFactory* pluginFactory) noexcept
- {
- return mImpl->deserializeCudaEngine(blob, size, nullptr);
- }
-
//!
//! \brief Sets the DLA core used by the network. Defaults to -1.
+ //!
//! \param dlaCore The DLA core to execute the engine on, in the range [0,getNbDlaCores()).
//!
//! This function is used to specify which DLA core to use via indexing, if multiple DLA cores are available.
@@ -698,6 +1332,7 @@ class IRuntime : public INoCopy
//!
//! \brief Get the DLA core that the engine executes on.
+ //!
//! \return assigned DLA core or -1 for DLA not present or unset.
//!
int32_t getDLACore() const noexcept
@@ -713,20 +1348,9 @@ class IRuntime : public INoCopy
return mImpl->getNbDLACores();
}
- //!
- //! \brief Destroy this object.
- //!
- //! \deprecated Deprecated in TRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
-
//!
//! \brief Set the GPU allocator.
+ //!
//! \param allocator Set the GPU allocator to be used by the runtime. All GPU memory acquired will use this
//! allocator. If NULL is passed, the default allocator will be used.
//!
@@ -774,7 +1398,7 @@ class IRuntime : public INoCopy
}
//!
- //! \brief Deserialize an engine from a stream.
+ //! \brief Deserialize an engine from host memory.
//!
//! If an error recorder has been set for the runtime, it will also be passed to the engine.
//!
@@ -785,7 +1409,25 @@ class IRuntime : public INoCopy
//!
ICudaEngine* deserializeCudaEngine(void const* blob, std::size_t size) noexcept
{
- return mImpl->deserializeCudaEngine(blob, size, nullptr);
+ return mImpl->deserializeCudaEngine(blob, size);
+ }
+
+ //!
+ //! \brief Deserialize an engine from a stream.
+ //!
+ //! If an error recorder has been set for the runtime, it will also be passed to the
+ //! engine.
+ //!
+ //! This deserialization path will reduce host memory usage when weight streaming is enabled.
+ //!
+ //! \param streamReader a read-only stream from which TensorRT will deserialize a
+ //! previously serialized engine.
+ //!
+ //! \return The engine, or nullptr if it could not be deserialized.
+ //!
+ ICudaEngine* deserializeCudaEngine(IStreamReader& streamReader)
+ {
+ return mImpl->deserializeCudaEngine(streamReader);
}
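+
+ // Sketch of the stream-reader path above (MyFileReader is a hypothetical IStreamReader implementation
+ // whose read() pulls bytes from a file; runtime is an IRuntime created elsewhere):
+ //
+ //     MyFileReader reader{"model.engine"};
+ //     nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(reader);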
//!
@@ -800,6 +1442,7 @@ class IRuntime : public INoCopy
//!
//! \brief Set the maximum number of threads.
+ //!
//! \param maxThreads The maximum number of threads that can be used by the runtime.
//! \return True if successful, false otherwise.
//!
@@ -973,9 +1616,11 @@ class IRefitter : public INoCopy
//!
//! * There is no such layer by that name.
//! * The layer does not have weights with the specified role.
- //! * The number of weights is inconsistent with the layer’s original specification.
+ //! * The count of weights is inconsistent with the layer’s original specification.
+ //! * The type of weights is inconsistent with the layer’s original specification.
//!
- //! Modifying the weights before method refit() completes will result in undefined behavior.
+ //! Modifying the weights before method refitCudaEngine or refitCudaEngineAsync returns will result in undefined
+ //! behavior.
//!
//! \warning The string layerName must be null-terminated, and be at most 4096 bytes including the terminator.
//!
@@ -985,14 +1630,16 @@ class IRefitter : public INoCopy
}
//!
- //! \brief Updates associated engine. Return true if successful.
+ //! \brief Refits associated engine.
//!
- //! Failure occurs if getMissing() != 0 before the call.
+ //! \return True on success, or false if new weights validation fails or getMissingWeights() != 0 before the call.
+ //! If false is returned, a subset of weights may have been refitted.
//!
//! The behavior is undefined if the engine has pending enqueued work.
+ //! Provided weights on CPU or GPU can be unset and released, or updated after refitCudaEngine returns.
//!
- //! Extant IExecutionContexts associated with the engine should not be used afterwards.
- //! Instead, create new IExecutionContexts after refitting.
+ //! IExecutionContexts associated with the engine remain valid for use afterwards. There is no need to set the same
+ //! weights repeatedly for multiple refit calls as the weights memory can be updated directly instead.
//!
bool refitCudaEngine() noexcept
{
@@ -1037,16 +1684,6 @@ class IRefitter : public INoCopy
return mImpl->getAll(size, layerNames, roles);
}
- //!
- //! \deprecated Deprecated in TRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
-
//!
//! Update dynamic range for a tensor.
//!
@@ -1155,9 +1792,13 @@ class IRefitter : public INoCopy
//! Possible reasons for rejection are:
//!
//! * The name of weights is nullptr or does not correspond to any refittable weights.
- //! * The number of weights is inconsistent with the original specification.
+ //! * The count of the weights is inconsistent with the count returned from calling getWeightsPrototype() with the
+ //! same name.
+ //! * The type of the weights is inconsistent with the type returned from calling getWeightsPrototype() with the
+ //! same name.
//!
- //! Modifying the weights before method refitCudaEngine() completes will result in undefined behavior.
+ //! Modifying the weights before method refitCudaEngine or refitCudaEngineAsync returns will result in undefined
+ //! behavior.
//!
//! \warning The string name must be null-terminated, and be at most 4096 bytes including the terminator.
//!
@@ -1214,7 +1855,9 @@ class IRefitter : public INoCopy
//!
//! \brief Set the maximum number of threads.
+ //!
//! \param maxThreads The maximum number of threads that can be used by the refitter.
+ //!
//! \return True if successful, false otherwise.
//!
//! The default value is 1 and includes the current thread.
@@ -1240,6 +1883,145 @@ class IRefitter : public INoCopy
return mImpl->getMaxThreads();
}
+ //!
+ //! \brief Specify new weights on a specified device of given name.
+ //!
+ //! \param name The name of the weights to be refitted.
+ //! \param weights The new weights on the specified device.
+ //! \param location The location (host vs. device) of the new weights.
+ //!
+ //! \return True on success, or false if new weights are rejected.
+ //! Possible reasons for rejection are:
+ //!
+ //! * The name of the weights is nullptr or does not correspond to any refittable weights.
+ //! * The count of the weights is inconsistent with the count returned from calling getWeightsPrototype() with the
+ //! same name.
+ //! * The type of the weights is inconsistent with the type returned from calling getWeightsPrototype() with the
+ //! same name.
+ //!
+ //! It is allowed to provide some weights on CPU and others on GPU.
+ //! Modifying the weights before the method refitCudaEngine() or refitCudaEngineAsync() completes will result in
+ //! undefined behavior.
+ //!
+ //! \warning The string name must be null-terminated, and be at most 4096 bytes including the terminator.
+ //!
+ bool setNamedWeights(char const* name, Weights weights, TensorLocation location) noexcept
+ {
+ return mImpl->setNamedWeightsWithLocation(name, weights, location);
+ }
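+
+ // Sketch: refit with device-resident weights ("fc1.weight", devicePtr, and count are illustrative
+ // placeholders; the device buffer must stay valid until the refit call returns):
+ //
+ //     nvinfer1::Weights w{nvinfer1::DataType::kFLOAT, devicePtr, count};
+ //     refitter->setNamedWeights("fc1.weight", w, nvinfer1::TensorLocation::kDEVICE);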
+
+ //!
+ //! \brief Get weights associated with the given name.
+ //!
+ //! \param weightsName The name of the weights to be refitted.
+ //!
+ //! \return Weights associated with the given name.
+ //!
+ //! If the weights were never set, returns null weights and reports an error to the refitter errorRecorder.
+ //!
+ //! \warning The string weightsName must be null-terminated, and be at most 4096 bytes including the terminator.
+ //!
+ Weights getNamedWeights(char const* weightsName) const noexcept
+ {
+ return mImpl->getNamedWeights(weightsName);
+ }
+
+ //!
+ //! \brief Get location for the weights associated with the given name.
+ //!
+ //! \param weightsName The name of the weights to be refitted.
+ //!
+ //! \return Location for the weights associated with the given name.
+ //!
+ //! If the weights were never set, returns TensorLocation::kHOST and reports an error to the refitter errorRecorder.
+ //!
+ //! \warning The string weightsName must be null-terminated, and be at most 4096 bytes including the terminator.
+ //!
+ TensorLocation getWeightsLocation(char const* weightsName) const noexcept
+ {
+ return mImpl->getWeightsLocation(weightsName);
+ }
+
+ //!
+ //! \brief Unset weights associated with the given name.
+ //!
+ //! \param weightsName The name of the weights to be refitted.
+ //!
+ //! \return False if the weights were never set, true otherwise.
+ //!
+ //! Unset weights before releasing them.
+ //!
+ //! \warning The string weightsName must be null-terminated, and be at most 4096 bytes including the terminator.
+ //!
+ bool unsetNamedWeights(char const* weightsName) noexcept
+ {
+ return mImpl->unsetNamedWeights(weightsName);
+ }
+
+ //!
+ //! \brief Set whether to validate weights during refitting.
+ //!
+ //! \param weightsValidation Indicate whether to validate weights during refitting.
+ //!
+ //! When set to true, TensorRT validates weights during the refit call when converting FP32 weights to FP16/BF16
+ //! or when sparsifying weights. If the provided weights are unsuitable for a transformation, TensorRT issues a
+ //! warning and continues for minor issues (such as overflow during a narrowing conversion), or issues an error
+ //! and stops the refit for severe issues (such as attempting to sparsify dense weights). The flag is true by
+ //! default; set it to false for faster refitting performance.
+ //!
+ void setWeightsValidation(bool weightsValidation) noexcept
+ {
+ return mImpl->setWeightsValidation(weightsValidation);
+ }
+
+ //!
+ //! \brief Get whether to validate weights values during refitting.
+ //!
+ bool getWeightsValidation() const noexcept
+ {
+ return mImpl->getWeightsValidation();
+ }
+
+ //!
+ //! \brief Enqueue weights refitting of the associated engine on the given stream.
+ //!
+ //! \param stream The stream to enqueue the weights updating task.
+ //!
+ //! \return True on success, or false if new weights validation fails or getMissingWeights() != 0 before the call.
+ //! If false is returned, a subset of weights may have been refitted.
+ //!
+ //! The behavior is undefined if the engine has pending enqueued work on a different stream from the provided one.
+ //! Provided weights on CPU can be unset and released, or updated after refitCudaEngineAsync returns.
+ //! Freeing or updating of the provided weights on GPU can be enqueued on the same stream after refitCudaEngineAsync
+ //! returns.
+ //!
+ //! IExecutionContexts associated with the engine remain valid for use afterwards. There is no need to set the same
+ //! weights repeatedly for multiple refit calls as the weights memory can be updated directly instead. The weights
+ //! updating task should use the same stream as the one used for the refit call.
+ //!
+ bool refitCudaEngineAsync(cudaStream_t stream) noexcept
+ {
+ return mImpl->refitCudaEngineAsync(stream);
+ }
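+
+ // Sketch: enqueue the refit and subsequent inference on the same stream (refitter, context, and stream
+ // are assumed to exist; error handling is omitted):
+ //
+ //     if (refitter->refitCudaEngineAsync(stream))
+ //     {
+ //         context->enqueueV3(stream); // work enqueued afterwards on this stream sees the updated weights
+ //     }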
+
+ //!
+ //! \brief Get the Weights prototype associated with the given name.
+ //!
+ //! \param weightsName The name of the weights to be refitted.
+ //!
+ //! \return Weights prototype associated with the given name.
+ //!
+ //! The type and count of the weights prototype are the same as those of the weights used for engine building.
+ //! The values property of a weights prototype is nullptr. The count of the weights prototype is -1 when the name
+ //! of the weights is nullptr or does not correspond to any refittable weights.
+ //!
+ //! \warning The string weightsName must be null-terminated, and be at most 4096 bytes including the terminator.
+ //!
+ Weights getWeightsPrototype(char const* weightsName) const noexcept
+ {
+ return mImpl->getWeightsPrototype(weightsName);
+ }
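+
+ // Sketch: size a replacement buffer from the prototype ("fc1.weight" is an illustrative name; the
+ // prototype's values pointer is always nullptr):
+ //
+ //     nvinfer1::Weights proto = refitter->getWeightsPrototype("fc1.weight");
+ //     std::vector<float> newValues(proto.count); // assumes proto.type == DataType::kFLOAT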
+
protected:
apiv::VRefitter* mImpl;
};
@@ -1324,7 +2106,7 @@ class IOptimizationProfile : public INoCopy
//!
//! \warning The string inputName must be null-terminated, and be at most 4096 bytes including the terminator.
//!
- bool setDimensions(char const* inputName, OptProfileSelector select, Dims dims) noexcept
+ bool setDimensions(char const* inputName, OptProfileSelector select, Dims const& dims) noexcept
{
return mImpl->setDimensions(inputName, select, dims);
}
@@ -1345,18 +2127,19 @@ class IOptimizationProfile : public INoCopy
//! \brief Set the minimum / optimum / maximum values for an input shape tensor.
//!
//! This function must be called three times for every input tensor t that is a shape tensor (t.isShape() == true).
- //! This implies that the datatype of t is DataType::kINT32, the rank is either 0 or 1, and the dimensions of t
- //! are fixed at network definition time. This function must not be called for any input tensor that is not a
- //! shape tensor.
+ //! This implies that the dimensions of t are fixed at network definition time and the volume does not exceed 64.
+ //! This function must not be called for any input tensor that is not a shape tensor.
//!
//! Each time this function is called for the same input tensor, the same nbValues must be supplied (either 1
//! if the tensor rank is 0, or dims.d[0] if the rank is 1). Furthermore, if minVals, optVals, maxVals are the
//! minimum, optimum, and maximum values, it must be true that minVals[i] <= optVals[i] <= maxVals[i] for
//! i = 0, ..., nbValues - 1. Execution of the network must be valid for the optVals.
//!
- //! Shape tensors are tensors that contribute to shape calculations in some way, and can contain
- //! any int32_t values appropriate for the network. Shape tensors of other data types (e.g. float) are not
- //! supported. Examples:
+ //! Shape tensors are tensors that contribute to shape calculations in some way. While input shape tensors can be
+ //! type kBOOL, kINT32, or kINT64, the values used to set the minimum, optimum, and maximum values must fit in int32_t.
+ //! Boolean values are represented as 0 for false and 1 for true.
+ //!
+ //! Examples:
//!
//! * A shape tensor used as the second input to IShuffleLayer can contain a -1 wildcard.
//! The corresponding minVal[i] should be -1.
@@ -1372,6 +2155,7 @@ class IOptimizationProfile : public INoCopy
//! \param inputName The input tensor name
//! \param select Whether to set the minimum, optimum, or maximum input values.
//! \param values An array of length nbValues containing the minimum, optimum, or maximum shape tensor elements.
+ //! For multidimensional tensors, the array is in row-major order.
//! \param nbValues The length of the value array, which must equal the number of shape tensor elements (>= 1)
//!
//! \return false if an inconsistency was detected (e.g. nbValues does not match a previous call for the same
@@ -1470,20 +2254,23 @@ class IOptimizationProfile : public INoCopy
//!
//! \brief List of tactic sources for TensorRT.
//!
-//! \see TacticSources, IBuilderConfig::setTacticSources(), IBuilderConfig::getTacticSources(),
-//! PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805
+//! \see TacticSources, IBuilderConfig::setTacticSources(), IBuilderConfig::getTacticSources()
//!
enum class TacticSource : int32_t
{
- //! cuBLAS tactics. Enabled by default.
- //! \note Disabling kCUBLAS will cause the cublas handle passed to plugins in attachToContext to be null.
- kCUBLAS = 0,
- //! cuBLAS LT tactics.
- //! Enabled for x86 platforms and only enabled for non-x86 platforms when CUDA >= 11.0 by default.
- kCUBLAS_LT = 1,
- //! cuDNN tactics. Enabled by default.
+ //! cuBLAS tactics. Disabled by default.
+ //! \note Disabling kCUBLAS will cause the cuBLAS handle passed to plugins in attachToContext to be null.
+ //! \deprecated Deprecated in TensorRT 10.0.
+ kCUBLAS TRT_DEPRECATED_ENUM = 0,
+
+ //! cuBLAS LT tactics. Enabled by default.
+ //! \deprecated Deprecated in TensorRT 9.0.
+ kCUBLAS_LT TRT_DEPRECATED_ENUM = 1,
+
+ //! cuDNN tactics. Disabled by default.
//! \note Disabling kCUDNN will cause the cuDNN handle passed to plugins in attachToContext to be null.
- kCUDNN = 2,
+ //! \deprecated Deprecated in TensorRT 10.0.
+ kCUDNN TRT_DEPRECATED_ENUM = 2,
//! Enables convolution tactics implemented with edge mask tables. These tactics tradeoff memory for performance by
//! consuming additional memory space proportional to the input size.
@@ -1523,11 +2310,6 @@ enum class ProfilingVerbosity : int32_t
kLAYER_NAMES_ONLY = 0, //!< Print only the layer names. This is the default setting.
kNONE = 1, //!< Do not print any layer information.
kDETAILED = 2, //!< Print detailed layer information including layer names and layer parameters.
-
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by kLAYER_NAMES_ONLY.
- kDEFAULT TRT_DEPRECATED_ENUM = kLAYER_NAMES_ONLY,
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by kDETAILED.
- kVERBOSE TRT_DEPRECATED_ENUM = kDETAILED
};
//! Maximum number of profile verbosity levels in ProfilingVerbosity enum. \see ProfilingVerbosity
@@ -1538,127 +2320,154 @@ constexpr inline int32_t EnumMax() noexcept
}
//!
-//! \class ICudaEngine
+//! \brief Represents one or more SerializationFlag values using binary OR
+//! operations, e.g., 1U << SerializationFlag::kEXCLUDE_LEAN_RUNTIME
//!
-//! \brief An engine for executing inference on a built network, with functionally unsafe features.
+//! \see ISerializationConfig::setFlags(), ISerializationConfig::getFlags()
//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
+using SerializationFlags = uint32_t;
+
//!
-class ICudaEngine : public INoCopy
+//! \enum SerializationFlag
+//!
+//! \brief List of valid flags that the engine can enable when serializing the bytes.
+//!
+//! \see ISerializationConfig::setFlags(), ISerializationConfig::getFlags()
+//!
+enum class SerializationFlag : int32_t
+{
+ kEXCLUDE_WEIGHTS = 0, //!< Exclude the weights that can be refitted.
+ kEXCLUDE_LEAN_RUNTIME = 1, //!< Exclude the lean runtime.
+};
+
+//! Maximum number of serialization flags in SerializationFlag enum. \see SerializationFlag
+template <>
+constexpr inline int32_t EnumMax() noexcept
+{
+ return 2;
+}
+
+//!
+//! \class ISerializationConfig
+//!
+//! \brief Holds properties for configuring an engine to serialize the binary.
+//!
+//! \see SerializationFlag
+//!
+class ISerializationConfig : public INoCopy
{
public:
- virtual ~ICudaEngine() noexcept = default;
+ virtual ~ISerializationConfig() noexcept = default;
//!
- //! \brief Get the number of binding indices.
+ //! \brief Set the serialization flags to turn on for this config.
+ //!
+ //! The flags are listed in the SerializationFlag enum.
//!
- //! There are separate binding indices for each optimization profile.
- //! This method returns the total over all profiles.
- //! If the engine has been built for K profiles, the first getNbBindings() / K bindings are used by profile
- //! number 0, the following getNbBindings() / K bindings are used by profile number 1 etc.
+ //! \param serializationFlags The serialization flags for an engine.
//!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getNbIOTensors.
+ //! \note This function overrides any previously set flags, rather than bitwise ORing in the new flags.
//!
- //! \see getBindingIndex()
+ //! \see getFlags()
//!
- TRT_DEPRECATED int32_t getNbBindings() const noexcept
+ bool setFlags(SerializationFlags serializationFlags) noexcept
{
- return mImpl->getNbBindings();
+ return mImpl->setFlags(serializationFlags);
}
//!
- //! \brief Retrieve the binding index for a named tensor.
- //!
- //! IExecutionContext::enqueueV2() and IExecutionContext::executeV2() require an array of buffers.
- //!
- //! Engine bindings map from tensor names to indices in this array.
- //! Binding indices are assigned at engine build time, and take values in the range [0 ... n-1] where n is the total
- //! number of inputs and outputs.
- //!
- //! To get the binding index of the name in an optimization profile with index k > 0,
- //! mangle the name by appending " [profile k]", as described for method getBindingName().
- //!
- //! \param name The tensor name.
- //! \return The binding index for the named tensor, or -1 if the provided name does not map to an input or output
- //! tensor.
+ //! \brief Get the serialization flags for this config.
//!
- //! \warning The string name must be null-terminated, and be at most 4096 bytes including the terminator.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by name-based methods. Use them instead of binding-index
- //! based methods.
+ //! \return The serialization flags as a bitmask.
//!
- //! \see getNbBindings() getBindingName()
+ //! \see setFlags()
//!
- TRT_DEPRECATED int32_t getBindingIndex(char const* name) const noexcept
+ SerializationFlags getFlags() const noexcept
{
- return mImpl->getBindingIndex(name);
+ return mImpl->getFlags();
}
//!
- //! \brief Retrieve the name corresponding to a binding index.
- //!
- //! This is the reverse mapping to that provided by getBindingIndex().
- //!
- //! For optimization profiles with an index k > 0, the name is mangled by appending
- //! " [profile k]", with k written in decimal. For example, if the tensor in the
- //! INetworkDefinition had the name "foo", and bindingIndex refers to that tensor in the
- //! optimization profile with index 3, getBindingName returns "foo [profile 3]".
+ //! \brief Clear a serialization flag.
//!
- //! \param bindingIndex The binding index.
- //! \return The name corresponding to the index, or nullptr if the index is out of range.
+ //! Clears the specified serialization flag from the config.
//!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by name-based methods. Use them instead of binding-index
- //! based methods.
+ //! \see setFlags()
//!
- //! \see getBindingIndex()
- //!
- TRT_DEPRECATED char const* getBindingName(int32_t bindingIndex) const noexcept
+ bool clearFlag(SerializationFlag serializationFlag) noexcept
{
- return mImpl->getBindingName(bindingIndex);
+ return mImpl->clearFlag(serializationFlag);
}
//!
- //! \brief Determine whether a binding is an input binding.
- //!
- //! \param bindingIndex The binding index.
- //! \return True if the index corresponds to an input binding and the index is in range.
+ //! \brief Set a serialization flag.
//!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorIOMode().
+ //! Add the input serialization flag to the already enabled flags.
//!
- //! \see getTensorIOMode()
+ //! \see setFlags()
//!
- TRT_DEPRECATED bool bindingIsInput(int32_t bindingIndex) const noexcept
+ bool setFlag(SerializationFlag serializationFlag) noexcept
{
- return mImpl->bindingIsInput(bindingIndex);
+ return mImpl->setFlag(serializationFlag);
}
//!
- //! \brief Get the dimensions of a binding.
- //!
- //! \param bindingIndex The binding index.
- //! \return The dimensions of the binding if the index is in range, otherwise Dims().
- //! Has -1 for any dimension that varies within the optimization profile.
- //!
- //! For example, suppose an INetworkDefinition has an input with shape [-1,-1]
- //! that becomes a binding b in the engine. If the associated optimization profile
- //! specifies that b has minimum dimensions as [6,9] and maximum dimensions [7,9],
- //! getBindingDimensions(b) returns [-1,9], despite the second dimension being
- //! dynamic in the INetworkDefinition.
+ //! \brief Returns true if the serialization flag is set
//!
- //! Because each optimization profile has separate bindings, the returned value can
- //! differ across profiles. Consider another binding b' for the same network input,
- //! but for another optimization profile. If that other profile specifies minimum
- //! dimensions [5,8] and maximum dimensions [5,9], getBindingDimensions(b') returns [5,-1].
+ //! \see getFlags()
//!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorShape().
+ //! \return True if flag is set, false if unset.
//!
- //! \see getTensorShape()
- //!
- TRT_DEPRECATED Dims getBindingDimensions(int32_t bindingIndex) const noexcept
+ bool getFlag(SerializationFlag serializationFlag) const noexcept
{
- return mImpl->getBindingDimensions(bindingIndex);
+ return mImpl->getFlag(serializationFlag);
}
+protected:
+ apiv::VSerializationConfig* mImpl;
+};
+
+//!
+//! \enum ExecutionContextAllocationStrategy
+//!
+//! \brief Different memory allocation behaviors for IExecutionContext.
+//!
+//! IExecutionContext requires a block of device memory for internal activation tensors during inference. The user can
+//! either let the execution context manage the memory in various ways or allocate the memory themselves.
+//!
+//! \see ICudaEngine::createExecutionContext()
+//! \see IExecutionContext::setDeviceMemory()
+//!
+enum class ExecutionContextAllocationStrategy : int32_t
+{
+ kSTATIC = 0, //!< Default static allocation with the maximum size across all profiles.
+ kON_PROFILE_CHANGE = 1, //!< Reallocate for a profile when it's selected.
+ kUSER_MANAGED = 2, //!< The user supplies custom allocation to the execution context.
+};
+
+//!
+//! \brief Maximum number of memory allocation strategies in ExecutionContextAllocationStrategy enum.
+//!
+//! \see ExecutionContextAllocationStrategy
+//!
+template <>
+constexpr inline int32_t EnumMax() noexcept
+{
+ return 3;
+}
+
+//!
+//! \class ICudaEngine
+//!
+//! \brief An engine for executing inference on a built network, with functionally unsafe features.
+//!
+//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
+//!
+class ICudaEngine : public INoCopy
+{
+public:
+ virtual ~ICudaEngine() noexcept = default;
+
//!
//! \brief Get shape of an input or output tensor.
//!
@@ -1674,21 +2483,6 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorShape(tensorName);
}
- //!
- //! \brief Determine the required data type for a buffer from its binding index.
- //!
- //! \param bindingIndex The binding index.
- //! \return The type of the data in the buffer.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorDataType().
- //!
- //! \see getTensorDataType()
- //!
- TRT_DEPRECATED DataType getBindingDataType(int32_t bindingIndex) const noexcept
- {
- return mImpl->getBindingDataType(bindingIndex);
- }
-
//!
//! \brief Determine the required data type for a buffer from its tensor name.
//!
@@ -1704,22 +2498,6 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorDataType(tensorName);
}
- //!
- //! \brief Get the maximum batch size which can be used for inference. Should only be called if the engine is built
- //! from an INetworkDefinition with implicit batch dimension mode.
- //!
- //! \return The maximum batch size for this engine.
- //!
- //! \warning For an engine built from an INetworkDefinition with explicit batch dimension mode, this will always
- //! return 1.
- //!
- //! \deprecated Deprecated in TensorRT 8.4.
- //!
- TRT_DEPRECATED int32_t getMaxBatchSize() const noexcept
- {
- return mImpl->getMaxBatchSize();
- }
-
//!
//! \brief Get the number of layers in the network.
//!
@@ -1727,72 +2505,43 @@ class ICudaEngine : public INoCopy
//! may be combined or eliminated as the engine is optimized. This value can be useful when building per-layer
//! tables, such as when aggregating profiling data over a number of executions.
//!
- //! \return The number of layers in the network.
- //!
- int32_t getNbLayers() const noexcept
- {
- return mImpl->getNbLayers();
- }
-
- //!
- //! \brief Serialize the network to a stream.
- //!
- //! \return A IHostMemory object that contains the serialized engine.
- //!
- //! The network may be deserialized with IRuntime::deserializeCudaEngine().
- //!
- //! \see IRuntime::deserializeCudaEngine()
- //!
- IHostMemory* serialize() const noexcept
- {
- return mImpl->serialize();
- }
-
- //!
- //! \brief Create an execution context.
- //!
- //! The execution context created will call setOptimizationProfile(0) implicitly if there are
- //! no other execution contexts assigned to optimization profile 0. This functionality is
- //! deprecated in TensorRT 8.6 and will instead default all optimization profiles to 0 starting
- //! in TensorRT 9.0.
- //! If an error recorder has been set for the engine, it will also be passed to the execution context.
- //!
- //! \see IExecutionContext.
- //! \see IExecutionContext::setOptimizationProfile()
+ //! \return The number of layers in the network.
//!
- IExecutionContext* createExecutionContext() noexcept
+ int32_t getNbLayers() const noexcept
{
- return mImpl->createExecutionContext();
+ return mImpl->getNbLayers();
}
//!
- //! \brief Destroy this object;
+ //! \brief Serialize the network to a stream.
+ //!
+ //! \return A IHostMemory object that contains the serialized engine.
//!
- //! \deprecated Deprecated in TRT 8.0. Superseded by `delete`.
+ //! The network may be deserialized with IRuntime::deserializeCudaEngine().
//!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
+ //! \see IRuntime::deserializeCudaEngine()
//!
- TRT_DEPRECATED void destroy() noexcept
+ IHostMemory* serialize() const noexcept
{
- delete this;
+ return mImpl->serialize();
}
//!
- //! \brief Get location of binding
- //!
- //! This lets you know whether the binding should be a pointer to device or host memory.
- //!
- //! \param bindingIndex The binding index.
- //! \return The location of the bound tensor with given index.
+ //! \brief Create an execution context and specify the strategy for allocating internal activation memory.
//!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorLocation().
+ //! The default value for the allocation strategy is ExecutionContextAllocationStrategy::kSTATIC, which means the
+ //! context will pre-allocate a block of device memory that is sufficient for all profiles. The newly created
+ //! execution context will be assigned optimization profile 0. If an error recorder has been set for the engine, it
+ //! will also be passed to the execution context.
//!
- //! \see ITensor::setLocation() ITensor::getLocation()
- //! \see getTensorLocation()
+ //! \see IExecutionContext
+ //! \see IExecutionContext::setOptimizationProfileAsync()
+ //! \see ExecutionContextAllocationStrategy
//!
- TRT_DEPRECATED TensorLocation getLocation(int32_t bindingIndex) const noexcept
+ IExecutionContext* createExecutionContext(
+ ExecutionContextAllocationStrategy strategy = ExecutionContextAllocationStrategy::kSTATIC) noexcept
{
- return mImpl->getLocation(bindingIndex);
+ return mImpl->createExecutionContext(strategy);
}
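+
+ // Sketch: let the application own the activation memory (engine is assumed; cudaMalloc error handling
+ // is omitted):
+ //
+ //     auto* ctx = engine->createExecutionContext(nvinfer1::ExecutionContextAllocationStrategy::kUSER_MANAGED);
+ //     void* scratch{nullptr};
+ //     cudaMalloc(&scratch, engine->getDeviceMemorySize());
+ //     ctx->setDeviceMemory(scratch);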
//!
@@ -1846,17 +2595,20 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorIOMode(tensorName);
}
+ //!
//! \brief create an execution context without any device memory allocated
//!
//! The memory for execution of this device context must be supplied by the application.
//!
- IExecutionContext* createExecutionContextWithoutDeviceMemory() noexcept
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by createExecutionContext() with parameter.
+ //!
+ TRT_DEPRECATED IExecutionContext* createExecutionContextWithoutDeviceMemory() noexcept
{
return mImpl->createExecutionContextWithoutDeviceMemory();
}
//!
- //! \brief Return the amount of device memory required by an execution context.
+ //! \brief Return the maximum device memory required by the context over all profiles.
//!
//! \see IExecutionContext::setDeviceMemory()
//!
@@ -1866,30 +2618,23 @@ class ICudaEngine : public INoCopy
}
//!
- //! \brief Return true if an engine can be refit.
+ //! \brief Return the maximum device memory required by the context for a profile.
//!
- //! \see nvinfer1::createInferRefitter()
+ //! \see IExecutionContext::setDeviceMemory()
//!
- bool isRefittable() const noexcept
+ size_t getDeviceMemorySizeForProfile(int32_t profileIndex) const noexcept
{
- return mImpl->isRefittable();
+ return mImpl->getDeviceMemorySizeForProfile(profileIndex);
}
//!
- //! \brief Return the number of bytes per component of an element.
- //!
- //! The vector component size is returned if getBindingVectorizedDim() != -1.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorBytesPerComponent().
+ //! \brief Return true if an engine can be refit.
//!
- //! \see getBindingVectorizedDim()
- //! \see getTensorBytesPerComponent()
+ //! \see nvinfer1::createInferRefitter()
//!
- TRT_DEPRECATED int32_t getBindingBytesPerComponent(int32_t bindingIndex) const noexcept
+ bool isRefittable() const noexcept
{
- return mImpl->getBindingBytesPerComponent(bindingIndex);
+ return mImpl->isRefittable();
}
//!
@@ -1931,22 +2676,6 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorBytesPerComponentV2(tensorName, profileIndex);
}
- //!
- //! \brief Return the number of components included in one element.
- //!
- //! The number of elements in the vectors is returned if getBindingVectorizedDim() != -1.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorComponentsPerElement().
- //!
- //! \see getBindingVectorizedDim()
- //!
- TRT_DEPRECATED int32_t getBindingComponentsPerElement(int32_t bindingIndex) const noexcept
- {
- return mImpl->getBindingComponentsPerElement(bindingIndex);
- }
-
//!
//! \brief Return the number of components included in one element, or -1 if the provided name does not map to an
//! input or output tensor.
@@ -1986,20 +2715,6 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorComponentsPerElementV2(tensorName, profileIndex);
}
- //!
- //! \brief Return the binding format.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorFormat().
- //!
- //! \see getTensorFormat()
- //!
- TRT_DEPRECATED TensorFormat getBindingFormat(int32_t bindingIndex) const noexcept
- {
- return mImpl->getBindingFormat(bindingIndex);
- }
-
//!
//! \brief Return the tensor format, or TensorFormat::kLINEAR if the provided name does not map to an input or
//! output tensor.
@@ -2029,30 +2744,6 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorFormatV2(tensorName, profileIndex);
}
- //!
- //! \brief Return the human readable description of the tensor format, or nullptr if the provided name does not
- //! map to an input or output tensor.
- //!
- //! The description includes the order, vectorization, data type, and strides.
- //! Examples are shown as follows:
- //! Example 1: kCHW + FP32
- //! "Row major linear FP32 format"
- //! Example 2: kCHW2 + FP16
- //! "Two wide channel vectorized row major FP16 format"
- //! Example 3: kHWC8 + FP16 + Line Stride = 32
- //! "Channel major FP16 format where C % 8 == 0 and H Stride % 32 == 0"
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorFormatDesc().
- //!
- //! \see getTensorFormatDesc()
- //!
- TRT_DEPRECATED char const* getBindingFormatDesc(int32_t bindingIndex) const noexcept
- {
- return mImpl->getBindingFormatDesc(bindingIndex);
- }
-
//!
//! \brief Return the human readable description of the tensor format, or empty string if the provided name does not
//! map to an input or output tensor.
@@ -2060,9 +2751,9 @@ class ICudaEngine : public INoCopy
//! The description includes the order, vectorization, data type, and strides.
//! Examples are shown as follows:
//! Example 1: kCHW + FP32
- //! "Row major linear FP32 format"
+ //! "Row-major linear FP32 format"
//! Example 2: kCHW2 + FP16
- //! "Two wide channel vectorized row major FP16 format"
+ //! "Two-wide channel vectorized row-major FP16 format"
//! Example 3: kHWC8 + FP16 + Line Stride = 32
//! "Channel major FP16 format where C % 8 == 0 and H Stride % 32 == 0"
//!
@@ -2084,9 +2775,9 @@ class ICudaEngine : public INoCopy
//! The description includes the order, vectorization, data type, and strides.
//! Examples are shown as follows:
//! Example 1: kCHW + FP32
- //! "Row major linear FP32 format"
+ //! "Row-major linear FP32 format"
//! Example 2: kCHW2 + FP16
- //! "Two wide channel vectorized row major FP16 format"
+ //! "Two-wide channel vectorized row-major FP16 format"
//! Example 3: kHWC8 + FP16 + Line Stride = 32
//! "Channel major FP16 format where C % 8 == 0 and H Stride % 32 == 0"
//!
@@ -2100,22 +2791,6 @@ class ICudaEngine : public INoCopy
return mImpl->getTensorFormatDescV2(tensorName, profileIndex);
}
- //!
- //! \brief Return the dimension index that the buffer is vectorized, or -1 is the name is not found.
- //!
- //! Specifically -1 is returned if scalars per vector is 1.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorVectorizedDim().
- //!
- //! \see getTensorVectorizedDim()
- //!
- TRT_DEPRECATED int32_t getBindingVectorizedDim(int32_t bindingIndex) const noexcept
- {
- return mImpl->getBindingVectorizedDim(bindingIndex);
- }
-
//!
//! \brief Return the dimension index that the buffer is vectorized, or -1 if the provided name does not
//! map to an input or output tensor.
@@ -2169,45 +2844,12 @@ class ICudaEngine : public INoCopy
//!
//! \return Number of optimization profiles. It is always at least 1.
//!
- //! \see IExecutionContext::setOptimizationProfile()
+ //! \see IExecutionContext::setOptimizationProfileAsync()
int32_t getNbOptimizationProfiles() const noexcept
{
return mImpl->getNbOptimizationProfiles();
}
- //!
- //! \brief Get the minimum / optimum / maximum dimensions for a particular input binding under an optimization
- //! profile.
- //!
- //! \param bindingIndex The input binding index, which must belong to the given profile,
- //! or be between 0 and bindingsPerProfile-1 as described below.
- //!
- //! \param profileIndex The profile index, which must be between 0 and getNbOptimizationProfiles()-1.
- //!
- //! \param select Whether to query the minimum, optimum, or maximum dimensions for this binding.
- //!
- //! \return The minimum / optimum / maximum dimensions for this binding in this profile.
- //! If the profileIndex or bindingIndex are invalid, return Dims with nbDims=-1.
- //!
- //! For backwards compatibility with earlier versions of TensorRT, if the bindingIndex
- //! does not belong to the current optimization profile, but is between 0 and bindingsPerProfile-1,
- //! where bindingsPerProfile = getNbBindings()/getNbOptimizationProfiles,
- //! then a corrected bindingIndex is used instead, computed by:
- //!
- //! profileIndex * bindingsPerProfile + bindingIndex % bindingsPerProfile
- //!
- //! Otherwise the bindingIndex is considered invalid.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getProfileShape().
- //!
- //! \see getProfileShape()
- //!
- TRT_DEPRECATED Dims getProfileDimensions(
- int32_t bindingIndex, int32_t profileIndex, OptProfileSelector select) const noexcept
- {
- return mImpl->getProfileDimensions(bindingIndex, profileIndex, select);
- }
-
//!
//! \brief Get the minimum / optimum / maximum dimensions for an input tensor given its name under an optimization
//! profile.
@@ -2229,88 +2871,25 @@ class ICudaEngine : public INoCopy
}
//!
- //! \brief Get minimum / optimum / maximum values for an input shape binding under an optimization profile.
- //!
- //! \param profileIndex The profile index (must be between 0 and getNbOptimizationProfiles()-1)
- //!
- //! \param inputIndex The input index (must be between 0 and getNbBindings() - 1)
- //!
- //! \param select Whether to query the minimum, optimum, or maximum shape values for this binding.
- //!
- //! \return If the binding is an input shape binding, return a pointer to an array that has
- //! the same number of elements as the corresponding tensor, i.e. 1 if dims.nbDims == 0, or dims.d[0]
- //! if dims.nbDims == 1, where dims = getBindingDimensions(inputIndex). The array contains
- //! the elementwise minimum / optimum / maximum values for this shape binding under the profile.
- //! If either of the indices is out of range, or if the binding is not an input shape binding, return
- //! nullptr.
- //!
- //! For backwards compatibility with earlier versions of TensorRT, a bindingIndex that does not belong
- //! to the profile is corrected as described for getProfileDimensions().
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getShapeValues(). Difference between Execution and shape
- //! tensor is superficial since TensorRT 8.5.
- //!
- //! \see getProfileDimensions() getShapeValues()
- //!
- TRT_DEPRECATED int32_t const* getProfileShapeValues(
- int32_t profileIndex, int32_t inputIndex, OptProfileSelector select) const noexcept
- {
- return mImpl->getProfileShapeValues(profileIndex, inputIndex, select);
- }
-
- //!
- //! \brief True if tensor is required as input for shape calculations or output from them.
- //!
- //! TensorRT evaluates a network in two phases:
- //!
- //! 1. Compute shape information required to determine memory allocation requirements
- //! and validate that runtime sizes make sense.
- //!
- //! 2. Process tensors on the device.
- //!
- //! Some tensors are required in phase 1. These tensors are called "shape tensors", and always
- //! have type Int32 and no more than one dimension. These tensors are not always shapes
- //! themselves, but might be used to calculate tensor shapes for phase 2.
- //!
- //! isShapeBinding(i) returns true if the tensor is a required input or an output computed in phase 1.
- //! isExecutionBinding(i) returns true if the tensor is a required input or an output computed in phase 2.
- //!
- //! For example, if a network uses an input tensor with binding i as an addend
- //! to an IElementWiseLayer that computes the "reshape dimensions" for IShuffleLayer,
- //! then isShapeBinding(i) == true.
+ //! \brief Get the minimum / optimum / maximum values (not dimensions) for an input tensor given
+ //! its name under an optimization profile. These correspond to the values set using
+ //! IOptimizationProfile::setShapeValues when the engine was built.
//!
- //! It's possible to have a tensor be required by both phases. For instance, a tensor
- //! can be used for the "reshape dimensions" and as the indices for an IGatherLayer
- //! collecting floating-point data.
- //!
- //! It's also possible to have a tensor be required by neither phase, but nonetheless
- //! shows up in the engine's inputs. For example, if an input tensor is used only
- //! as an input to IShapeLayer, only its shape matters and its values are irrelevant.
- //!
- //! \deprecated Use name-based isShapeInferenceIO() instead to know whether a tensor is a shape tensor.
- //!
- //! \see isExecutionBinding() isShapeInferenceIO()
- //!
- TRT_DEPRECATED bool isShapeBinding(int32_t bindingIndex) const noexcept
- {
- return mImpl->isShapeBinding(bindingIndex);
- }
-
+ //! \param tensorName The name of an input tensor.
//!
- //! \brief True if pointer to tensor data is required for execution phase, false if nullptr can be supplied.
+ //! \param profileIndex The profile index, which must be between 0 and getNbOptimizationProfiles()-1.
//!
- //! For example, if a network uses an input tensor with binding i ONLY as the "reshape dimensions"
- //! input of IShuffleLayer, then isExecutionBinding(i) is false, and a nullptr can be
- //! supplied for it when calling IExecutionContext::execute or IExecutionContext::enqueue.
+ //! \param select Whether to query the minimum, optimum, or maximum values for this input tensor.
//!
- //! \deprecated No name-based equivalent replacement. Use getTensorLocation() instead to know the location of tensor
- //! data. Distinction between execution binding and shape binding is superficial since TensorRT 8.5.
+ //! \return The minimum / optimum / maximum values for an input tensor in this profile.
+ //! If the profileIndex is invalid or the provided name does not map to an input tensor, return nullptr.
//!
- //! \see isShapeBinding() getTensorLocation()
+ //! \warning The string tensorName must be null-terminated, and be at most 4096 bytes including the terminator.
//!
- TRT_DEPRECATED bool isExecutionBinding(int32_t bindingIndex) const noexcept
+ int32_t const* getProfileTensorValues(char const* tensorName, int32_t profileIndex, OptProfileSelector select) const
+ noexcept
{
- return mImpl->isExecutionBinding(bindingIndex);
+ return mImpl->getProfileTensorValues(tensorName, profileIndex, select);
}
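+
+ // Sketch: read back the optimum values chosen at build time for an input shape tensor ("shape" is an
+ // illustrative tensor name; profile 0 is assumed to exist):
+ //
+ //     int32_t const* opt = engine->getProfileTensorValues("shape", 0, nvinfer1::OptProfileSelector::kOPT);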
//!
@@ -2318,8 +2897,8 @@ class ICudaEngine : public INoCopy
//!
//! If the engine has EngineCapability::kSTANDARD, then all engine functionality is valid.
//! If the engine has EngineCapability::kSAFETY, then only the functionality in safe engine is valid.
- //! If the engine has EngineCapability::kDLA_STANDALONE, then only serialize, destroy, and const-accessor functions are
- //! valid.
+ //! If the engine has EngineCapability::kDLA_STANDALONE, then only serialize, destroy, and const-accessor functions
+ //! are valid.
//!
//! \return The EngineCapability flag that the engine was built for.
//!
@@ -2328,6 +2907,7 @@ class ICudaEngine : public INoCopy
return mImpl->getEngineCapability();
}
+ //!
//! \brief Set the ErrorRecorder for this interface
//!
//! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
@@ -2338,7 +2918,7 @@ class ICudaEngine : public INoCopy
//! If an error recorder is not set, messages will be sent to the global log stream.
//!
//! \param recorder The error recorder to register with this interface.
- //
+ //!
//! \see getErrorRecorder()
//!
void setErrorRecorder(IErrorRecorder* recorder) noexcept
@@ -2364,22 +2944,18 @@ class ICudaEngine : public INoCopy
//!
//! \brief Query whether the engine was built with an implicit batch dimension.
//!
- //! \return True if tensors have implicit batch dimension, false otherwise.
- //!
- //! This is an engine-wide property. Either all tensors in the engine
- //! have an implicit batch dimension or none of them do.
- //!
- //! hasImplicitBatchDimension() is true if and only if the INetworkDefinition
- //! from which this engine was built was created with createNetworkV2() without
- //! NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag.
+ //! \return Always false since TensorRT 10.0 does not support an implicit batch dimension.
//!
//! \see createNetworkV2
//!
- bool hasImplicitBatchDimension() const noexcept
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch is no longer supported since TensorRT 10.0.
+ //!
+ TRT_DEPRECATED bool hasImplicitBatchDimension() const noexcept
{
return mImpl->hasImplicitBatchDimension();
}
+ //!
//! \brief return the tactic sources required by this engine.
//!
//! The value returned is equal to zero or more tactics sources set
@@ -2395,6 +2971,7 @@ class ICudaEngine : public INoCopy
return mImpl->getTacticSources();
}
+ //!
//! \brief Return the \ref ProfilingVerbosity the builder config was set to when the engine was built.
//!
//! \return the profiling verbosity the builder config was set to when the engine was built.
@@ -2441,6 +3018,7 @@ class ICudaEngine : public INoCopy
return mImpl->getIOTensorName(index);
}
+ //!
//! \brief Return the hardware compatibility level of this engine.
//!
//! \return hardwareCompatibilityLevel The level of hardware
@@ -2468,36 +3046,166 @@ class ICudaEngine : public INoCopy
return mImpl->getNbAuxStreams();
}
+ //!
+ //! \brief Create a serialization configuration object.
+ //!
+ //! \see ISerializationConfig
+ //!
+ ISerializationConfig* createSerializationConfig() noexcept
+ {
+ return mImpl->createSerializationConfig();
+ }
+
+ //!
+ //! \brief Serialize the network to a stream with the provided SerializationConfig.
+ //!
+ //! \return An IHostMemory object that contains the serialized engine.
+ //!
+ //! The network may be deserialized with IRuntime::deserializeCudaEngine().
+ //!
+ //! \see IRuntime::deserializeCudaEngine()
+ //!
+ IHostMemory* serializeWithConfig(ISerializationConfig& config) const noexcept
+ {
+ return mImpl->serializeWithConfig(config);
+ }
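+
+ // Sketch: serialize a weight-stripped engine (engine is assumed to have been built refittable):
+ //
+ //     nvinfer1::ISerializationConfig* cfg = engine->createSerializationConfig();
+ //     cfg->setFlag(nvinfer1::SerializationFlag::kEXCLUDE_WEIGHTS);
+ //     nvinfer1::IHostMemory* blob = engine->serializeWithConfig(*cfg);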
+
+ //!
+ //! \brief Limit the maximum amount of GPU memory usable for network weights
+ //! in bytes.
+ //!
+ //! \param gpuMemoryBudget This parameter may take on 3 types of values:
+ //! -1: Allows TensorRT to choose the budget according to the streamable weights size.
+ //! Free CUDA memory will be queried at ::createExecutionContext and accordingly:
+ //! * If streamable weights all fit: weight streaming is not required and disabled.
+ //! * Otherwise: Budget is set to getMinimumWeightStreamingBudget
+ //! 0: (default) Disables weight streaming. The execution may fail if the network is too large for GPU memory.
+ //! >0: The maximum bytes of GPU memory that weights can occupy. It must be bounded by
+ //! [getMinimumWeightStreamingBudget, min(getStreamableWeightsSize - 1, free GPU memory)].
+ //!
+ //! By setting a weight limit, users can expect a GPU memory usage reduction
+ //! of |network weights| - gpuMemoryBudget bytes. Maximum memory savings occur
+ //! when gpuMemoryBudget is set to getMinimumWeightStreamingBudget.
+ //!
+ //! Streaming larger amounts of memory will likely result in lower performance
+ //! except in some boundary cases where streaming weights allows the user to
+ //! run larger batch sizes. The higher throughput offsets the increased
+ //! latency in these cases. Tuning the value of the memory limit is
+ //! recommended for best performance.
+ //!
+ //! \warning If weight streaming is active, then multiple concurrent IExecutionContexts will be forced to run serially.
+ //!
+ //! \warning GPU memory for the weights is allocated upon the first IExecutionContext's creation
+ //! and deallocated upon the last one's destruction.
+ //!
+ //! \warning BuilderFlag::kWEIGHT_STREAMING must be set during engine building.
+ //!
+ //! \return True if the memory limit is valid and the call was successful, otherwise false.
+ //!
+ //! \see BuilderFlag::kWEIGHT_STREAMING,
+ //! ICudaEngine::getWeightStreamingBudget
+ //! ICudaEngine::getMinimumWeightStreamingBudget,
+ //! ICudaEngine::getStreamableWeightsSize
+ //!
+ bool setWeightStreamingBudget(int64_t gpuMemoryBudget) noexcept
+ {
+ return mImpl->setWeightStreamingBudget(gpuMemoryBudget);
+ }
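+
+ // Sketch: enable streaming only when the weights do not fit (freeMem would come from cudaMemGetInfo;
+ // the engine is assumed to have been built with BuilderFlag::kWEIGHT_STREAMING):
+ //
+ //     int64_t const total = engine->getStreamableWeightsSize();
+ //     if (total > static_cast<int64_t>(freeMem))
+ //     {
+ //         engine->setWeightStreamingBudget(engine->getMinimumWeightStreamingBudget());
+ //     }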
+
+ //!
+ //! \brief Returns the current weight streaming device memory budget in bytes.
+ //!
+ //! \warning BuilderFlag::kWEIGHT_STREAMING must be set during engine building.
+ //!
+ //! \returns The weight streaming budget in bytes. Please see ::setWeightStreamingBudget for the possible
+ //! values.
+ //!
+ //! \see BuilderFlag::kWEIGHT_STREAMING,
+ //! ICudaEngine::setWeightStreamingBudget,
+ //! ICudaEngine::getMinimumWeightStreamingBudget,
+ //! ICudaEngine::getStreamableWeightsSize
+ //!
+ int64_t getWeightStreamingBudget() const noexcept
+ {
+ return mImpl->getWeightStreamingBudget();
+ }
+
+ //!
+ //! \brief The minimum number of bytes of GPU memory required by network
+ //! weights for successful weight streaming.
+ //!
+ //! This is a positive integer for engines with streamable weights because a
+ //! staging buffer on the GPU is required to temporarily hold the streamed
+ //! weights. The size of the staging buffer is determined by TensorRT and must
+ //! be at least as large as the size of the largest streamable weight in the
+ //! network.
+ //!
+ //! \warning BuilderFlag::kWEIGHT_STREAMING must be set during engine building.
+ //!
+ //!
+ //! \returns The minimum number of bytes of GPU memory required for streaming.
+ //!
+ //! \see ICudaEngine::setWeightStreamingBudget
+ //!
+ int64_t getMinimumWeightStreamingBudget() const noexcept
+ {
+ return mImpl->getMinimumWeightStreamingBudget();
+ }
+
+ //!
+ //! \brief Get the total size in bytes of all streamable weights.
+ //!
+ //! The set of streamable weights is a subset of all network weights. The
+ //! total size may exceed free GPU memory.
+ //!
+ //! Returns 0 if BuilderFlag::kWEIGHT_STREAMING is unset during engine building.
+ //!
+ //!
+ //! \returns The total size in bytes of all streamable weights.
+ //!
+ //! \see ICudaEngine::setWeightStreamingBudget
+ //!
+ int64_t getStreamableWeightsSize() const noexcept
+ {
+ return mImpl->getStreamableWeightsSize();
+ }
+
+ //!
+ //! \brief Check if a tensor is marked as a debug tensor.
+ //!
+ //! Determine whether the given name corresponds to a debug tensor.
+ //!
+ //! \returns True if tensor is a debug tensor, false otherwise.
+ //!
+ //! \see INetworkDefinition::markDebug
+ //!
+ bool isDebugTensor(char const* name) const noexcept
+ {
+ return mImpl->isDebugTensor(name);
+ }
+
protected:
apiv::VCudaEngine* mImpl;
};
-//!
-//! \class IOutputAllocator
-//!
-//! \brief Callback from ExecutionContext::enqueueV3()
-//!
-//! Clients should override the method reallocateOutput.
-//!
-//! \see IExecutionContext::enqueueV3()
-//!
-class IOutputAllocator
+namespace v_1_0
+{
+class IOutputAllocator : public IVersionedInterface
{
public:
//!
- //! \brief Return the API version of this IOutputAllocator.
- //!
- //! Do not override this method as it is used by the TensorRT library to maintain
- //! backwards-compatibility with IOutputAllocator. The value will change if Nvidia
- //! adds additional virtual methods to this class.
+ //! \brief Return version information associated with this interface. Applications must not override this method.
//!
- virtual int32_t getInterfaceVersion() const noexcept
+ InterfaceInfo getInterfaceInfo() const noexcept override
{
- return 1;
+ return {"IOutputAllocator", 1, 0};
}
//!
//! \brief Return a pointer to memory for an output tensor, or nullptr if memory cannot be allocated.
+ //! If the requested memory size exceeds the currentMemory size, the currentMemory can be freed as well.
+ //! If currentMemory is known to be big enough, one option is to return currentMemory.
//!
//! \param tensorName name of the output tensor.
 //! \param currentMemory points to the address set by IExecutionContext::setTensorAddress.
@@ -2506,13 +3214,45 @@ class IOutputAllocator
//!
//! \return A pointer to memory to use for the output tensor or nullptr.
//!
- //! If currentMemory is known to be big enough, one option is to return currentMemory.
- //!
//! To preallocate memory and have the engine fail if the preallocation is not big enough,
//! use IExecutionContext::setTensorAddress to set a pointer to the preallocated memory,
//! and have reallocateOutput return nullptr if that memory is not big enough.
//!
- virtual void* reallocateOutput(char const* tensorName, void* currentMemory, uint64_t size, uint64_t alignment) noexcept = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by reallocateOutputAsync with cudaStream_t argument
+ //!
+ TRT_DEPRECATED virtual void* reallocateOutput(
+ char const* tensorName, void* currentMemory, uint64_t size, uint64_t alignment) noexcept
+ {
+ return nullptr;
+ }
+
+ //!
+ //! \brief Return a pointer to memory for an output tensor, or nullptr if memory cannot be allocated.
+ //! If the requested memory size exceeds the currentMemory size, the currentMemory can be freed as well.
+ //! If currentMemory is known to be big enough, one option is to return currentMemory.
+ //!
+ //! \param tensorName name of the output tensor.
+ //! \param currentMemory points to the address set by IExecutionContext::setTensorAddress.
+ //! \param size number of bytes required. Always positive, even for an empty tensor.
+ //! \param alignment required alignment of the allocation.
+ //! \param stream The stream in which to execute the kernels.
+ //!
+ //! \return A pointer to memory to use for the output tensor or nullptr.
+ //!
+ //! To preallocate memory and have the engine fail if the preallocation is not big enough,
+ //! use IExecutionContext::setTensorAddress to set a pointer to the preallocated memory,
+ //! and have reallocateOutputAsync return nullptr if that memory is not big enough.
+ //!
+ //! The default definition exists for the sake of backward compatibility with earlier versions of TensorRT.
+ //! Eventually this method will become a pure virtual method that requires an override, and method
+ //! reallocateOutput() will disappear. Code moving away from TensorRT 9.x should override method
+ //! reallocateOutputAsync() and NOT override method reallocateOutput().
+ //!
+ virtual void* reallocateOutputAsync(
+ char const* tensorName, void* currentMemory, uint64_t size, uint64_t alignment, cudaStream_t /*stream*/)
+ {
+ return reallocateOutput(tensorName, currentMemory, size, alignment);
+ }
//!
//! \brief Called by TensorRT when the shape of the output tensor is known.
@@ -2523,92 +3263,79 @@ class IOutputAllocator
//! \param tensorName name of the tensor
//!
virtual void notifyShape(char const* tensorName, Dims const& dims) noexcept = 0;
-
- virtual ~IOutputAllocator() = default;
};
+} // namespace v_1_0
//!
-//! \class IExecutionContext
+//! \class IOutputAllocator
//!
-//! \brief Context for executing inference using an engine, with functionally unsafe features.
+//! \brief Callback from ExecutionContext::enqueueV3()
//!
-//! Multiple execution contexts may exist for one ICudaEngine instance, allowing the same
-//! engine to be used for the execution of multiple batches simultaneously. If the engine supports
-//! dynamic shapes, each execution context in concurrent use must use a separate optimization profile.
+//! \see IExecutionContext::enqueueV3()
//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-class IExecutionContext : public INoCopy
+using IOutputAllocator = v_1_0::IOutputAllocator;
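
A minimal allocator might look like the sketch below. The class name and growth strategy are illustrative, not part of the API; only reallocateOutputAsync() and notifyShape() are overridden, and the current buffer is assumed to be owned by this allocator (for instance, the tensor address was initially set to nullptr).

```cpp
// Sketch only: an IOutputAllocator that grows a stream-ordered device buffer on demand.
#include <cuda_runtime_api.h>
#include "NvInferRuntime.h"

class GrowingOutputAllocator : public nvinfer1::IOutputAllocator
{
public:
    void* reallocateOutputAsync(char const* /*tensorName*/, void* currentMemory, uint64_t size,
        uint64_t /*alignment*/, cudaStream_t stream) noexcept override
    {
        if (size <= mCapacity)
        {
            return currentMemory; // Existing buffer is already large enough.
        }
        if (currentMemory != nullptr)
        {
            cudaFreeAsync(currentMemory, stream); // Assumes this allocator owns currentMemory.
        }
        void* newMemory{nullptr};
        if (cudaMallocAsync(&newMemory, size, stream) != cudaSuccess)
        {
            mCapacity = 0;
            return nullptr; // Signals allocation failure to TensorRT.
        }
        mCapacity = size;
        return newMemory;
    }

    void notifyShape(char const* /*tensorName*/, nvinfer1::Dims const& dims) noexcept override
    {
        mShape = dims; // The final output shape becomes known here.
    }

    nvinfer1::Dims mShape{};
    uint64_t mCapacity{0};
};
```

An instance is attached per output tensor, e.g. with IExecutionContext::setOutputAllocator(), before calling enqueueV3().
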
+
+namespace v_1_0
+{
+class IDebugListener : public IVersionedInterface
{
public:
- virtual ~IExecutionContext() noexcept = default;
-
- //!
- //! \brief Synchronously execute inference on a batch.
- //!
- //! This method requires an array of input and output buffers. The mapping from tensor names to indices
- //! can be queried using ICudaEngine::getBindingIndex()
- //!
- //! \param batchSize The batch size. This is at most the max batch size value supplied to the builder when the
- //! engine was built. If the network is created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag, please use
- //! executeV2() instead, and this batchSize argument has no effect.
- //! \param bindings An array of pointers to input and output buffers for the network.
//!
- //! \return True if execution succeeded.
- //!
- //! \deprecated Deprecated in TensorRT 8.4. Superseded by executeV2() if the network is created with
- //! NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag.
- //!
- //! \warning This function will trigger layer resource updates if hasImplicitBatchDimension()
- //! returns true and batchSize changes between subsequent calls, possibly resulting
- //! in performance bottlenecks.
- //!
- //! \see ICudaEngine::getBindingIndex() ICudaEngine::getMaxBatchSize()
+ //! \brief Return version information associated with this interface. Applications must not override this method.
//!
- TRT_DEPRECATED bool execute(int32_t batchSize, void* const* bindings) noexcept
+ InterfaceInfo getInterfaceInfo() const noexcept override
{
- return mImpl->execute(batchSize, bindings);
+ return {"IDebugListener", 1, 0};
}
//!
- //! \brief Enqueue inference of a batch on a stream.
- //!
- //! This method requires an array of input and output buffers. The mapping from tensor names to indices can be
- //! queried using ICudaEngine::getBindingIndex()
- //!
- //! \param batchSize The batch size. This is at most the max batch size value supplied to the builder when the
- //! engine was built. If the network is created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag, please use
- //! enqueueV3() instead, and this batchSize argument has no effect.
- //! \param bindings An array of pointers to input and output buffers for the network.
- //! \param stream A cuda stream on which the inference kernels will be enqueued.
- //! \param inputConsumed An optional event which will be signaled when the input buffers can be refilled with new
- //! data.
- //!
- //! \return True if the kernels were enqueued successfully.
- //!
- //! \deprecated Deprecated in TensorRT 8.4. Superseded by enqueueV2() if the network is created with
- //! NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag.
+ //! \brief Callback function that is called when a debug tensor’s value is updated and the debug state of the tensor
+ //! is set to true. Content in the given address is only guaranteed to be valid for the duration of the callback.
//!
- //! \see ICudaEngine::getBindingIndex() ICudaEngine::getMaxBatchSize()
+ //! \param location TensorLocation of the tensor.
+ //! \param addr pointer to buffer.
+ //! \param type data type of the tensor.
+ //! \param shape shape of the tensor.
+ //! \param name name of the tensor.
+ //! \param stream CUDA stream object.
//!
- //! \warning Calling enqueue() in from the same IExecutionContext object with different CUDA streams concurrently
- //! results in undefined behavior. To perform inference concurrently in multiple streams, use one execution
- //! context per stream.
+ //! \return True on success, false otherwise.
//!
- //! \warning This function will trigger layer resource updates if hasImplicitBatchDimension()
- //! returns true and batchSize changes between subsequent calls, possibly resulting in performance
- //! bottlenecks.
- //!
- TRT_DEPRECATED bool enqueue(
- int32_t batchSize, void* const* bindings, cudaStream_t stream, cudaEvent_t* inputConsumed) noexcept
- {
- return mImpl->enqueue(batchSize, bindings, stream, inputConsumed);
- }
+ virtual bool processDebugTensor(void const* addr, TensorLocation location, DataType type, Dims const& shape,
+ char const* name, cudaStream_t stream)
+ = 0;
+
+ ~IDebugListener() override = default;
+};
+} // namespace v_1_0
+
+//!
+//! \class IDebugListener
+//!
+//! \brief User-implemented callback for notification when value of a debug tensor is updated.
+//!
+using IDebugListener = v_1_0::IDebugListener;
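
As a sketch (not part of the API), a listener that merely logs each update might look as follows; copying out the tensor contents, which are only valid for the duration of the callback, is left as a comment.

```cpp
// Sketch only: an IDebugListener that logs metadata of each updated debug tensor.
#include <cstdio>
#include <cuda_runtime_api.h>
#include "NvInferRuntime.h"

class LoggingDebugListener : public nvinfer1::IDebugListener
{
public:
    bool processDebugTensor(void const* addr, nvinfer1::TensorLocation location, nvinfer1::DataType type,
        nvinfer1::Dims const& shape, char const* name, cudaStream_t stream) override
    {
        int64_t volume = 1;
        for (int32_t i = 0; i < shape.nbDims; ++i)
        {
            volume *= shape.d[i];
        }
        std::printf("debug tensor '%s': %lld elements, location %d, data type %d\n", name,
            static_cast<long long>(volume), static_cast<int>(location), static_cast<int>(type));
        // The buffer at addr is only valid during this callback; copy it out on
        // `stream` (e.g. with cudaMemcpyAsync) if it must be inspected later.
        static_cast<void>(addr);
        static_cast<void>(stream);
        return true;
    }
};
```
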
+
+//!
+//! \class IExecutionContext
+//!
+//! \brief Context for executing inference using an engine, with functionally unsafe features.
+//!
+//! Multiple execution contexts may exist for one ICudaEngine instance, allowing the same
+//! engine to be used for the execution of multiple batches simultaneously. If the engine supports
+//! dynamic shapes, each execution context in concurrent use must use a separate optimization profile.
+//!
+//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
+class IExecutionContext : public INoCopy
+{
+public:
+ virtual ~IExecutionContext() noexcept = default;
//!
//! \brief Set the debug sync flag.
//!
//! If this flag is set to true, the engine will log the successful execution for each kernel during executeV2(). It
- //! has no effect when using enqueueV2()/enqueueV3().
+ //! has no effect when using enqueueV3().
//!
//! \see getDebugSync()
//!
@@ -2657,18 +3384,6 @@ class IExecutionContext : public INoCopy
return mImpl->getEngine();
}
- //!
- //! \brief Destroy this object.
- //!
- //! \deprecated Deprecated in TRT 8.0. Superseded by `delete`.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED void destroy() noexcept
- {
- delete this;
- }
-
//!
//! \brief Set the name of the execution context.
//!
@@ -2697,42 +3412,23 @@ class IExecutionContext : public INoCopy
//! \brief Set the device memory for use by this execution context.
//!
//! The memory must be aligned with cuda memory alignment property (using cudaGetDeviceProperties()), and its size
- //! must be at least that returned by getDeviceMemorySize(). Setting memory to nullptr is acceptable if
- //! getDeviceMemorySize() returns 0. If using enqueueV2()/enqueueV3() to run the network, the memory is in use from
- //! the invocation of enqueueV2()/enqueueV3() until network execution is complete. If using executeV2(), it is in
- //! use until executeV2() returns. Releasing or otherwise using the memory for other purposes during this time will
- //! result in undefined behavior.
- //!
- //! \see ICudaEngine::getDeviceMemorySize() ICudaEngine::createExecutionContextWithoutDeviceMemory()
+ //! must be large enough for performing inference with the given network inputs. getDeviceMemorySize() and
+ //! getDeviceMemorySizeForProfile() report upper bounds of the size. Setting memory to nullptr is acceptable if the
+ //! reported size is 0. If using enqueueV3() to run the network, the memory is in use from the invocation of
+ //! enqueueV3() until network execution is complete. If using executeV2(), it is in use until executeV2() returns.
+ //! Releasing or otherwise using the memory for other purposes during this time will result in undefined behavior.
+ //!
+ //! \see ICudaEngine::getDeviceMemorySize()
+ //! \see ICudaEngine::getDeviceMemorySizeForProfile()
+ //! \see ExecutionContextAllocationStrategy
+ //! \see ICudaEngine::createExecutionContext()
+ //! \see ICudaEngine::createExecutionContextWithoutDeviceMemory()
//!
void setDeviceMemory(void* memory) noexcept
{
mImpl->setDeviceMemory(memory);
}
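
For context, a sketch of the user-managed-memory path referenced above, assuming createExecutionContextWithoutDeviceMemory() and the engine-wide upper bound from getDeviceMemorySize(); deallocation and error reporting are omitted.

```cpp
// Sketch only: back an execution context with application-owned scratch memory.
#include <cuda_runtime_api.h>
#include "NvInferRuntime.h"

nvinfer1::IExecutionContext* createContextWithUserMemory(nvinfer1::ICudaEngine& engine, void*& deviceMemory)
{
    nvinfer1::IExecutionContext* context = engine.createExecutionContextWithoutDeviceMemory();
    if (context == nullptr)
    {
        return nullptr;
    }
    // getDeviceMemorySize() is an upper bound over all profiles and input shapes.
    size_t const size = engine.getDeviceMemorySize();
    if (size > 0 && cudaMalloc(&deviceMemory, size) != cudaSuccess)
    {
        delete context;
        return nullptr;
    }
    // The buffer must remain valid, and unused elsewhere, while this context executes.
    context->setDeviceMemory(size > 0 ? deviceMemory : nullptr);
    return context;
}
```
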
- //!
- //! \brief Return the strides of the buffer for the given binding.
- //!
- //! The strides are in units of elements, not components or bytes.
- //! For example, for TensorFormat::kHWC8, a stride of one spans 8 scalars.
- //!
- //! Note that strides can be different for different execution contexts
- //! with dynamic shapes.
- //!
- //! If the bindingIndex is invalid or there are dynamic dimensions that have not been
- //! set yet, returns Dims with Dims::nbDims = -1.
- //!
- //! \param bindingIndex The binding index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorStrides().
- //!
- //! \see getTensorStrides()
- //!
- TRT_DEPRECATED Dims getStrides(int32_t bindingIndex) const noexcept
- {
- return mImpl->getStrides(bindingIndex);
- }
-
//!
//! \brief Return the strides of the buffer for the given tensor name.
//!
@@ -2755,50 +3451,13 @@ class IExecutionContext : public INoCopy
}
public:
- //!
- //! \brief Select an optimization profile for the current context.
- //!
- //! \param profileIndex Index of the profile. It must lie between 0 and
- //! getEngine().getNbOptimizationProfiles() - 1
- //!
- //! The selected profile will be used in subsequent calls to executeV2()/enqueueV2()/enqueueV3().
- //!
- //! When an optimization profile is switched via this API, TensorRT may
- //! enqueue GPU memory copy operations required to set up the new profile during the subsequent
- //! enqueueV2()/enqueueV3() operations. To avoid these calls during enqueueV2()/enqueueV3(), use
- //! setOptimizationProfileAsync() instead.
- //!
- //! If the associated CUDA engine does not have inputs with dynamic shapes, this method need not be
- //! called, in which case the default profile index of 0 will be used (this is particularly
- //! the case for all safe engines).
- //!
- //! setOptimizationProfile() must be called before calling setBindingDimensions() and
- //! setInputShapeBinding() for all dynamic input tensors or input shape tensors, which in
- //! turn must be called before executeV2()/enqueueV2()/enqueueV3().
- //!
- //! \warning This function will trigger layer resource updates on the next
- //! call of enqueueV2()/enqueueV3()/executeV2(), possibly resulting in performance bottlenecks.
- //!
- //! \return true if the call succeeded, else false (e.g. input out of range)
- //!
- //! \deprecated Superseded by setOptimizationProfileAsync. Deprecated prior to TensorRT 8.0 and will be
- //! removed in 9.0.
- //!
- //! \see ICudaEngine::getNbOptimizationProfiles() IExecutionContext::setOptimizationProfileAsync()
- //!
- TRT_DEPRECATED
- bool setOptimizationProfile(int32_t profileIndex) noexcept
- {
- return mImpl->setOptimizationProfile(profileIndex);
- }
-
//!
//! \brief Get the index of the currently selected optimization profile.
//!
//! If the profile index has not been set yet (implicitly to 0 if no other execution context has been set to
//! profile 0, or explicitly for all subsequent contexts), an invalid value of -1 will be returned
- //! and all calls to enqueueV2()/enqueueV3()/executeV2() will fail until a valid profile index has been set.
- //! This behavior is deprecated in TensorRT 8.6 and in TensorRT 9.0, all profiles will default to optimization
+ //! and all calls to enqueueV3()/executeV2() will fail until a valid profile index has been set.
+ //! This behavior is deprecated in TensorRT 8.6; all profiles will default to optimization
//! profile 0 and -1 will no longer be returned.
//!
int32_t getOptimizationProfile() const noexcept
@@ -2806,45 +3465,6 @@ class IExecutionContext : public INoCopy
return mImpl->getOptimizationProfile();
}
- //!
- //! \brief Set the dynamic dimensions of an input binding.
- //!
- //! \param bindingIndex index of an input tensor whose dimensions must be compatible with
- //! the network definition (i.e. only the wildcard dimension -1 can be replaced with a
- //! new dimension >= 0).
- //!
- //! \param dimensions specifies the dimensions of the input tensor. It must be in the valid
- //! range for the currently selected optimization profile, and the corresponding engine must
- //! not be safety-certified.
- //!
- //! This method requires the engine to be built without an implicit batch dimension.
- //! This method will fail unless a valid optimization profile is defined for the current
- //! execution context (getOptimizationProfile() must not be -1).
- //!
- //! For all dynamic non-output bindings (which have at least one wildcard dimension of -1),
- //! this method needs to be called before either enqueueV2() or executeV2() may be called.
- //! This can be checked using the method allInputDimensionsSpecified().
- //!
- //! \warning This function will trigger layer resource updates on the next
- //! call of enqueueV2()/executeV2(), possibly resulting in performance bottlenecks,
- //! if the dimensions are different than the previous set dimensions.
- //!
- //! \return false if an error occurs (e.g. bindingIndex is out of range for the currently selected
- //! optimization profile or binding dimension is inconsistent with min-max range of the
- //! optimization profile), else true. Note that the network can still be invalid for certain
- //! combinations of input shapes that lead to invalid output shapes. To confirm the correctness
- //! of the network input shapes, check whether the output binding has valid
- //! dimensions using getBindingDimensions() on the output bindingIndex.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by setInputShape().
- //!
- //! \see setInputShape()
- //!
- TRT_DEPRECATED bool setBindingDimensions(int32_t bindingIndex, Dims dimensions) noexcept
- {
- return mImpl->setBindingDimensions(bindingIndex, dimensions);
- }
-
//!
//! \brief Set shape of given input.
//!
@@ -2863,39 +3483,6 @@ class IExecutionContext : public INoCopy
return mImpl->setInputShape(tensorName, dims);
}
- //!
- //! \brief Get the dynamic dimensions of a binding.
- //!
- //! If the engine was built with an implicit batch dimension, same as ICudaEngine::getBindingDimensions.
- //!
- //! If setBindingDimensions() has been called on this binding (or if there are no
- //! dynamic dimensions), all dimensions will be positive. Otherwise, it is necessary to
- //! call setBindingDimensions() before enqueueV2() or executeV2() may be called.
- //!
- //! If the bindingIndex is out of range, an invalid Dims with nbDims == -1 is returned.
- //! The same invalid Dims will be returned if the engine was not built with an implicit
- //! batch dimension and if the execution context is not currently associated with a valid
- //! optimization profile (i.e. if getOptimizationProfile() returns -1).
- //!
- //! If ICudaEngine::bindingIsInput(bindingIndex) is false, then both
- //! allInputDimensionsSpecified() and allInputShapesSpecified() must be true
- //! before calling this method.
- //!
- //! \return Currently selected binding dimensions
- //!
- //! For backwards compatibility with earlier versions of TensorRT, a bindingIndex that does not belong
- //! to the current profile is corrected as described for ICudaEngine::getProfileDimensions.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorShape().
- //!
- //! \see ICudaEngine::getProfileDimensions()
- //! \see getTensorShape()
- //!
- TRT_DEPRECATED Dims getBindingDimensions(int32_t bindingIndex) const noexcept
- {
- return mImpl->getBindingDimensions(bindingIndex);
- }
-
//!
//! \brief Return the shape of the given input or output.
//!
@@ -2933,78 +3520,17 @@ class IExecutionContext : public INoCopy
return mImpl->getTensorShape(tensorName);
}
- //!
- //! \brief Set values of input tensor required by shape calculations.
- //!
- //! \param bindingIndex index of an input tensor for which
- //! ICudaEngine::isShapeBinding(bindingIndex) and ICudaEngine::bindingIsInput(bindingIndex)
- //! are both true.
- //!
- //! \param data pointer to values of the input tensor. The number of values should be
- //! the product of the dimensions returned by getBindingDimensions(bindingIndex).
- //!
- //! If ICudaEngine::isShapeBinding(bindingIndex) and ICudaEngine::bindingIsInput(bindingIndex)
- //! are both true, this method must be called before enqueueV2() or executeV2() may be called.
- //! This method will fail unless a valid optimization profile is defined for the current
- //! execution context (getOptimizationProfile() must not be -1).
- //!
- //! \warning This function will trigger layer resource updates on the next call of
- //! enqueueV2()/executeV2(), possibly resulting in performance bottlenecks, if the
- //! shapes are different than the previous set shapes.
- //!
- //! \return false if an error occurs (e.g. bindingIndex is out of range for the currently selected
- //! optimization profile or shape data is inconsistent with min-max range of the
- //! optimization profile), else true. Note that the network can still be invalid for certain
- //! combinations of input shapes that lead to invalid output shapes. To confirm the correctness
- //! of the network input shapes, check whether the output binding has valid
- //! dimensions using getBindingDimensions() on the output bindingIndex.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by setInputTensorAddress() or setTensorAddress().
- //!
- //! \see setInputTensorAddress() setTensorAddress()
- //!
- TRT_DEPRECATED bool setInputShapeBinding(int32_t bindingIndex, int32_t const* data) noexcept
- {
- return mImpl->setInputShapeBinding(bindingIndex, data);
- }
-
- //!
- //! \brief Get values of an input tensor required for shape calculations or an output tensor produced by shape
- //! calculations.
- //!
- //! \param bindingIndex index of an input or output tensor for which
- //! ICudaEngine::isShapeBinding(bindingIndex) is true.
- //!
- //! \param data pointer to where values will be written. The number of values written is
- //! the product of the dimensions returned by getBindingDimensions(bindingIndex).
- //!
- //! If ICudaEngine::bindingIsInput(bindingIndex) is false, then both
- //! allInputDimensionsSpecified() and allInputShapesSpecified() must be true
- //! before calling this method. The method will also fail if no valid optimization profile
- //! has been set for the current execution context, i.e. if getOptimizationProfile() returns -1.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorAddress() or getOutputTensorAddress().
- //!
- //! \see isShapeBinding() getTensorAddress() getOutputTensorAddress()
- //!
- TRT_DEPRECATED bool getShapeBinding(int32_t bindingIndex, int32_t* data) const noexcept
- {
- return mImpl->getShapeBinding(bindingIndex, data);
- }
-
//!
//! \brief Whether all dynamic dimensions of input tensors have been specified
//!
//! \return True if all dynamic dimensions of input tensors have been specified
- //! by calling setBindingDimensions().
+ //! by calling setInputShape().
//!
//! Trivially true if network has no dynamically shaped input tensors.
//!
//! Does not work with name-base interfaces eg. IExecutionContext::setInputShape(). Use
//! IExecutionContext::inferShapes() instead.
//!
- //! \see setBindingDimensions(bindingIndex,dimensions)
- //!
bool allInputDimensionsSpecified() const noexcept
{
return mImpl->allInputDimensionsSpecified();
@@ -3020,9 +3546,9 @@ class IExecutionContext : public INoCopy
//! Does not work with name-base interfaces eg. IExecutionContext::setInputShape(). Use
//! IExecutionContext::inferShapes() instead.
//!
- //! \see isShapeBinding(bindingIndex)
+ //! \deprecated Deprecated in TensorRT 10.0. setInputShapeBinding() is removed since TensorRT 10.0.
//!
- bool allInputShapesSpecified() const noexcept
+ TRT_DEPRECATED bool allInputShapesSpecified() const noexcept
{
return mImpl->allInputShapesSpecified();
}
@@ -3038,7 +3564,7 @@ class IExecutionContext : public INoCopy
//! If an error recorder is not set, messages will be sent to the global log stream.
//!
//! \param recorder The error recorder to register with this interface.
- //
+ //!
//! \see getErrorRecorder()
//!
void setErrorRecorder(IErrorRecorder* recorder) noexcept
@@ -3062,52 +3588,22 @@ class IExecutionContext : public INoCopy
}
//!
- //! \brief Synchronously execute inference a network.
+ //! \brief Synchronously execute a network.
+ //!
+ //! This method requires an array of input and output buffers. The mapping
+ //! from indices to tensor names can be queried using ICudaEngine::getIOTensorName().
//!
- //! This method requires an array of input and output buffers. The mapping from tensor names to indices can be
- //! queried using ICudaEngine::getBindingIndex().
- //! This method only works for execution contexts built with full dimension networks.
//! \param bindings An array of pointers to input and output buffers for the network.
//!
//! \return True if execution succeeded.
//!
- //! \see ICudaEngine::getBindingIndex() ICudaEngine::getMaxBatchSize()
+ //! \see ICudaEngine::getIOTensorName()
//!
bool executeV2(void* const* bindings) noexcept
{
return mImpl->executeV2(bindings);
}
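
A sketch of how the bindings array is typically assembled in I/O-tensor index order. ICudaEngine::getNbIOTensors() is assumed from the TensorRT 10 runtime API, and `buffers` is a hypothetical map from tensor names to preallocated device pointers of sufficient size.

```cpp
// Sketch only: build the executeV2() bindings array from tensor names.
#include <map>
#include <string>
#include <vector>
#include "NvInferRuntime.h"

bool runSynchronously(nvinfer1::IExecutionContext& context, std::map<std::string, void*> const& buffers)
{
    nvinfer1::ICudaEngine const& engine = context.getEngine();
    std::vector<void*> bindings(static_cast<size_t>(engine.getNbIOTensors()), nullptr);
    for (int32_t i = 0; i < engine.getNbIOTensors(); ++i)
    {
        // Index i of the bindings array corresponds to the tensor named getIOTensorName(i).
        bindings[static_cast<size_t>(i)] = buffers.at(engine.getIOTensorName(i));
    }
    return context.executeV2(bindings.data());
}
```
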
- //!
- //! \brief Enqueue inference on a stream.
- //!
- //! This method requires an array of input and output buffers. The mapping from tensor names to indices can be
- //! queried using ICudaEngine::getBindingIndex().
- //! This method only works for execution contexts built with full dimension networks.
- //! \param bindings An array of pointers to input and output buffers for the network.
- //! \param stream A cuda stream on which the inference kernels will be enqueued
- //! \param inputConsumed An optional event which will be signaled when the input buffers can be refilled with new
- //! data
- //!
- //! \return True if the kernels were enqueued successfully.
- //!
- //! \deprecated Superseded by enqueueV3(). Deprecated in TensorRT 8.5
- //!
- //! \see ICudaEngine::getBindingIndex() ICudaEngine::getMaxBatchSize() IExecutionContext::enqueueV3()
- //!
- //! \note Calling enqueueV2() with a stream in CUDA graph capture mode has a known issue. If dynamic shapes are
- //! used, the first enqueueV2() call after a setInputShapeBinding() call will cause failure in stream capture
- //! due to resource allocation. Please call enqueueV2() once before capturing the graph.
- //!
- //! \warning Calling enqueueV2() in from the same IExecutionContext object with different CUDA streams concurrently
- //! results in undefined behavior. To perform inference concurrently in multiple streams, use one execution
- //! context per stream.
- //!
- TRT_DEPRECATED bool enqueueV2(void* const* bindings, cudaStream_t stream, cudaEvent_t* inputConsumed) noexcept
- {
- return mImpl->enqueueV2(bindings, stream, inputConsumed);
- }
-
//!
//! \brief Select an optimization profile for the current context with async
//! semantics.
@@ -3123,24 +3619,22 @@ class IExecutionContext : public INoCopy
//! application’s responsibility to guarantee that synchronization between
//! the profile sync stream and the enqueue stream occurs.
//!
- //! The selected profile will be used in subsequent calls to executeV2()/enqueueV2()/enqueueV3().
+ //! The selected profile will be used in subsequent calls to executeV2()/enqueueV3().
//! If the associated CUDA engine has inputs with dynamic shapes, the optimization profile must
- //! be set with its corresponding profileIndex before calling execute or enqueue. If no execution
- //! context is assigned optimization profile 0 and a new context is created for an engine,
- //! setOptimizationProfile(0) is called implicitly. This functionality is deprecated in TensorRT 8.6
- //! and will instead default all optimization profiles to 0 starting in TensorRT 9.0.
+ //! be set with its corresponding profileIndex before calling execute or enqueue. The newly created execution
+ //! context will be assigned optimization profile 0.
//!
//! If the associated CUDA engine does not have inputs with dynamic shapes,
//! this method need not be called, in which case the default profile index
//! of 0 will be used.
//!
//! setOptimizationProfileAsync() must be called before calling
- //! setBindingDimensions() and setInputShapeBinding() for all dynamic input
+ //! setInputShape() for all dynamic input
//! tensors or input shape tensors, which in turn must be called before
- //! executeV2()/enqueueV2()/enqueueV3().
+ //! executeV2()/enqueueV3().
//!
//! \warning This function will trigger layer resource updates on the next call of
- //! enqueueV2()/executeV2()/enqueueV3(), possibly resulting in performance bottlenecks.
+ //! executeV2()/enqueueV3(), possibly resulting in performance bottlenecks.
//!
//! \warning Not synchronizing the stream used at enqueue with the stream
//! used to set optimization profile asynchronously using this API will
@@ -3149,7 +3643,6 @@ class IExecutionContext : public INoCopy
//! \return true if the call succeeded, else false (e.g. input out of range)
//!
//! \see ICudaEngine::getNbOptimizationProfiles()
- //! \see IExecutionContext::setOptimizationProfile()
bool setOptimizationProfileAsync(int32_t profileIndex, cudaStream_t stream) noexcept
{
return mImpl->setOptimizationProfileAsync(profileIndex, stream);
@@ -3165,6 +3658,7 @@ class IExecutionContext : public INoCopy
//!
//! \see IExecutionContext::getEnqueueEmitsProfile()
//! \see IExecutionContext::reportToProfiler()
+ //!
void setEnqueueEmitsProfile(bool enqueueEmitsProfile) noexcept
{
mImpl->setEnqueueEmitsProfile(enqueueEmitsProfile);
@@ -3176,6 +3670,7 @@ class IExecutionContext : public INoCopy
//! \return The enqueueEmitsProfile state.
//!
//! \see IExecutionContext::setEnqueueEmitsProfile()
+ //!
bool getEnqueueEmitsProfile() const noexcept
{
return mImpl->getEnqueueEmitsProfile();
@@ -3205,6 +3700,7 @@ class IExecutionContext : public INoCopy
//!
//! \see IExecutionContext::setEnqueueEmitsProfile()
//! \see IExecutionContext::getEnqueueEmitsProfile()
+ //!
bool reportToProfiler() const noexcept
{
return mImpl->reportToProfiler();
@@ -3228,8 +3724,7 @@ class IExecutionContext : public INoCopy
//! Before calling enqueueV3(), each input must have a non-null address and
//! each output must have a non-null address or an IOutputAllocator to set it later.
//!
- //! If the TensorLocation of the tensor is kHOST, the pointer must point to a host buffer of sufficient size. For
- //! shape tensors, the only supported data type is int32_t.
+ //! If the TensorLocation of the tensor is kHOST, the pointer must point to a host buffer of sufficient size.
//! If the TensorLocation of the tensor is kDEVICE, the pointer must point to a device buffer of sufficient size and
//! alignment, or be nullptr if the tensor is an output tensor that will be allocated by IOutputAllocator.
//!
@@ -3245,7 +3740,7 @@ class IExecutionContext : public INoCopy
//!
//! \warning The string tensorName must be null-terminated, and be at most 4096 bytes including the terminator.
//!
- //! \see setInputTensorAddress() getTensorShape() setOutputAllocator() IOutputAllocator
+ //! \see setInputTensorAddress() setOutputTensorAddress() getTensorShape() setOutputAllocator() IOutputAllocator
//!
bool setTensorAddress(char const* tensorName, void* data) noexcept
{
@@ -3269,6 +3764,29 @@ class IExecutionContext : public INoCopy
return mImpl->getTensorAddress(tensorName);
}
+ //!
+ //! \brief Set the memory address for a given output tensor.
+ //!
+ //! \param tensorName The name of an output tensor.
+ //! \param data The pointer to the buffer to which to write the output.
+ //!
+ //! \return True on success, false if the provided name does not map to an output tensor, does not meet alignment
+ //! requirements, or some other error occurred.
+ //!
+ //! Output addresses can also be set using method setTensorAddress. This method is provided for applications which
+ //! prefer to use different methods for setting input and output tensors.
+ //!
+ //! See setTensorAddress() for alignment and data type constraints.
+ //!
+ //! \warning The string tensorName must be null-terminated, and be at most 4096 bytes including the terminator.
+ //!
+ //! \see setTensorAddress()
+ //!
+ bool setOutputTensorAddress(char const* tensorName, void* data) noexcept
+ {
+ return mImpl->setOutputTensorAddress(tensorName, data);
+ }
+
//!
//! \brief Set memory address for given input.
//!
@@ -3343,6 +3861,23 @@ class IExecutionContext : public INoCopy
return mImpl->inferShapes(nbMaxNames, tensorNames);
}
+ //!
+ //! \brief Recompute the internal activation buffer sizes based on the current input shapes, and return the total
+ //! amount of memory required.
+ //!
+ //! Users can allocate the device memory based on the size returned and provide the memory to TRT with
+ //! IExecutionContext::setDeviceMemory(). All input shapes and the optimization profile to use must be specified
+ //! before calling this function; otherwise, the partition will be invalidated.
+ //!
+ //! \return Total amount of memory required on success, 0 if error occurred.
+ //!
+ //! \see IExecutionContext::setDeviceMemory()
+ //!
+ size_t updateDeviceMemorySizeForShapes() noexcept
+ {
+ return mImpl->updateDeviceMemorySizeForShapes();
+ }
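
As a sketch of this flow (shape-specific sizing rather than the engine-wide upper bound), assuming the input shapes have already been set with setInputShape() and that the context does not own its device memory:

```cpp
// Sketch only: size user-managed scratch memory from the current input shapes.
#include <cuda_runtime_api.h>
#include "NvInferRuntime.h"

bool provideActivationMemory(nvinfer1::IExecutionContext& context, void*& deviceMemory)
{
    // All input shapes and the optimization profile must be set before this call.
    size_t const required = context.updateDeviceMemorySizeForShapes();
    if (required == 0)
    {
        return false; // 0 indicates an error, e.g. shapes not fully specified.
    }
    if (cudaMalloc(&deviceMemory, required) != cudaSuccess)
    {
        return false;
    }
    context.setDeviceMemory(deviceMemory);
    return true;
}
```
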
+
//!
//! \brief Mark input as consumed.
//!
@@ -3462,11 +3997,18 @@ class IExecutionContext : public INoCopy
//! Input tensor can be released after the setInputConsumedEvent whereas output tensors require stream
//! synchronization.
//!
+ //! \warning Using the default stream may lead to performance issues due to additional cudaDeviceSynchronize() calls
+ //! by TensorRT to ensure correct synchronization. Please use a non-default stream instead.
+ //!
+ //! \warning If the Engine is streaming weights, enqueueV3 will become synchronous, and
+ //! the graph will not be capturable.
+ //!
bool enqueueV3(cudaStream_t stream) noexcept
{
return mImpl->enqueueV3(stream);
}
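
Putting the pieces together, a typical enqueueV3() flow might look like the sketch below. The tensor names "input" and "output" and the 1x3x224x224 shape are illustrative; dIn and dOut are assumed to be device buffers of sufficient size.

```cpp
// Sketch only: name-based I/O setup followed by asynchronous execution.
#include <cuda_runtime_api.h>
#include "NvInferRuntime.h"

bool infer(nvinfer1::IExecutionContext& context, void* dIn, void* dOut, cudaStream_t stream)
{
    // For dynamic shapes, the input shape must be set before execution.
    nvinfer1::Dims inputDims{};
    inputDims.nbDims = 4;
    inputDims.d[0] = 1;
    inputDims.d[1] = 3;
    inputDims.d[2] = 224;
    inputDims.d[3] = 224;
    if (!context.setInputShape("input", inputDims))
    {
        return false;
    }
    if (!context.setTensorAddress("input", dIn) || !context.setTensorAddress("output", dOut))
    {
        return false;
    }
    if (!context.enqueueV3(stream))
    {
        return false;
    }
    // Outputs are valid only after the stream has been synchronized (or an event observed).
    return cudaStreamSynchronize(stream) == cudaSuccess;
}
```
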
+ //!
//! \brief Set the maximum size for persistent cache usage.
//!
//! This function sets the maximum persistent L2 cache that this execution context may use for activation caching.
@@ -3496,7 +4038,7 @@ class IExecutionContext : public INoCopy
//!
//! \brief Set the verbosity of the NVTX markers in the execution context.
//!
- //! Building with kDETAILED verbosity will generally increase latency in enqueueV2/enqueueV3(). Call this method
+ //! Building with kDETAILED verbosity will generally increase latency in enqueueV3(). Call this method
//! to select NVTX verbosity in this execution context at runtime.
//!
//! The default is the verbosity with which the engine was built, and the verbosity may not be raised above that
@@ -3560,6 +4102,70 @@ class IExecutionContext : public INoCopy
mImpl->setAuxStreams(auxStreams, nbStreams);
}
+ //!
+ //! \brief Set DebugListener for this execution context.
+ //!
+ //! \param listener DebugListener for this execution context.
+ //!
+ //! \return True if successful, false otherwise.
+ //!
+ bool setDebugListener(IDebugListener* listener) noexcept
+ {
+ return mImpl->setDebugListener(listener);
+ }
+
+ //!
+ //! \brief Get the DebugListener of this execution context.
+ //!
+ //! \return DebugListener of this execution context.
+ //!
+ IDebugListener* getDebugListener() noexcept
+ {
+ return mImpl->getDebugListener();
+ }
+
+ //!
+ //! \brief Set debug state of tensor given the tensor name.
+ //!
+ //! Turn the debug state of a tensor on or off.
+ //! A tensor with the given name must exist in the network, and the tensor must have
+ //! been marked as a debug tensor during build time. Otherwise, an error is thrown.
+ //!
+ //! \param name Name of target tensor.
+ //!
+ //! \param flag True if turning on the debug state, false if turning off the debug state of the tensor.
+ //!        The default is off.
+ //!
+ //! \return True if successful, false otherwise.
+ //!
+ bool setTensorDebugState(char const* name, bool flag) noexcept
+ {
+ return mImpl->setTensorDebugState(name, flag);
+ }
+
+ //!
+ //! \brief Turn the debug state of all debug tensors on or off.
+ //!
+ //! \param flag True if turning on the debug state, false if turning off the debug state. The default is off.
+ //!
+ //! \return True if successful, false otherwise.
+ bool setAllTensorsDebugState(bool flag) noexcept
+ {
+ return mImpl->setAllTensorsDebugState(flag);
+ }
+
+ //!
+ //! \brief Get the debug state of a tensor given its name.
+ //!
+ //! \param name Name of the tensor.
+ //!
+ //! \return True if there is a debug tensor with the given name and its debug state is turned on.
+ //!
+ bool getDebugState(char const* name) const noexcept
+ {
+ return mImpl->getDebugState(name);
+ }
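
A sketch tying the debug-tensor APIs together: the tensor name "attention_scores" is hypothetical and must have been marked with INetworkDefinition::markDebug() at build time, and `listener` is any IDebugListener implementation, such as the LoggingDebugListener sketched earlier.

```cpp
// Sketch only: attach a debug listener and enable selected debug tensors.
#include "NvInferRuntime.h"

bool enableDebugging(nvinfer1::IExecutionContext& context, nvinfer1::IDebugListener& listener)
{
    if (!context.setDebugListener(&listener))
    {
        return false;
    }
    // Debug states default to off; turn on a single tensor if it was marked at build time.
    if (context.getEngine().isDebugTensor("attention_scores"))
    {
        return context.setTensorDebugState("attention_scores", true);
    }
    // Otherwise enable every tensor that was marked as a debug tensor.
    return context.setAllTensorsDebugState(true);
}
```
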
+
protected:
apiv::VExecutionContext* mImpl;
}; // class IExecutionContext
@@ -3693,7 +4299,7 @@ class IEngineInspector : public INoCopy
//! If an error recorder is not set, messages will be sent to the global log stream.
//!
//! \param recorder The error recorder to register with this interface.
- //
+ //!
//! \see getErrorRecorder()
//!
void setErrorRecorder(IErrorRecorder* recorder) noexcept
@@ -3829,6 +4435,169 @@ class ILoggerFinder
virtual ~ILoggerFinder() = default;
};
+//! DO NOT REFER TO namespace v_1_0 IN CODE. ALWAYS USE nvinfer1 INSTEAD.
+//! The name v_1_0 may change in future versions of TensorRT.
+namespace v_1_0
+{
+
+class IGpuAsyncAllocator : public IGpuAllocator
+{
+public:
+ IGpuAsyncAllocator() = default;
+ ~IGpuAsyncAllocator() override = default;
+
+ //!
+ //! \brief A thread-safe callback implemented by the application to handle stream-ordered asynchronous
+ //! acquisition of GPU memory.
+ //!
+ //! \param size The size of the memory block required (in bytes).
+ //! \param alignment The required alignment of memory. Alignment will be zero
+ //! or a power of 2 not exceeding the alignment guaranteed by cudaMalloc.
+ //! Thus this allocator can be safely implemented with cudaMalloc/cudaFree.
+ //! An alignment value of zero indicates any alignment is acceptable.
+ //! \param flags Reserved for future use. In the current release, 0 will be passed.
+ //!
+ //! \param stream Specifies the CUDA stream for the asynchronous allocation. If nullptr or 0 is
+ //! passed, the default stream will be used.
+ //!
+ //! \return If the allocation was successful, the start address of a device memory block of the requested size.
+ //! If an allocation request of size 0 is made, nullptr must be returned.
+ //! If an allocation request cannot be satisfied, nullptr must be returned.
+ //! If a non-null address is returned, it is guaranteed to have the specified alignment.
+ //!
+ //! \note The implementation must guarantee thread safety for concurrent allocateAsync/deallocateAsync
+ //! requests.
+ //!
+ //! \note The implementation is not required to be asynchronous. It is permitted to synchronize,
+ //! albeit doing so will lose the performance advantage of asynchronous allocation.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
+ //!
+ void* allocateAsync(uint64_t const size, uint64_t const alignment, AllocatorFlags const flags,
+ cudaStream_t /*stream*/) noexcept override = 0;
+
+ //!
+ //! \brief A thread-safe callback implemented by the application to handle stream-ordered asynchronous
+ //! release of GPU memory.
+ //!
+ //! TensorRT may pass a nullptr to this function if it was previously returned by allocate().
+ //!
+ //! \param memory A memory address that was previously returned by an allocate() or reallocate() call of the same
+ //! allocator object.
+ //!
+ //! \param stream Specifies the CUDA stream for the asynchronous deallocation. If nullptr or 0 is
+ //! passed, the default stream will be used.
+ //!
+ //! \return True if the acquired memory is released successfully.
+ //!
+ //! \note The implementation must guarantee thread safety for concurrent allocateAsync/deallocateAsync
+ //! requests.
+ //!
+ //! \note The implementation is not required to be asynchronous. It is permitted to synchronize,
+ //! albeit doing so will lose the performance advantage of asynchronous deallocation.
+ //! Either way, it is critical that it not actually free the memory until the current
+ //! stream position is reached.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
+ bool deallocateAsync(void* const memory, cudaStream_t /*stream*/) noexcept override = 0;
+
+ //!
+ //! \brief A thread-safe callback implemented by the application to handle acquisition of GPU memory.
+ //!
+ //! \param size The size of the memory block required (in bytes).
+ //! \param alignment The required alignment of memory. Alignment will be zero
+ //! or a power of 2 not exceeding the alignment guaranteed by cudaMalloc.
+ //! Thus this allocator can be safely implemented with cudaMalloc/cudaFree.
+ //! An alignment value of zero indicates any alignment is acceptable.
+ //! \param flags Reserved for future use. In the current release, 0 will be passed.
+ //!
+ //! \return If the allocation was successful, the start address of a device memory block of the requested size.
+ //! If an allocation request of size 0 is made, nullptr must be returned.
+ //! If an allocation request cannot be satisfied, nullptr must be returned.
+ //! If a non-null address is returned, it is guaranteed to have the specified alignment.
+ //!
+ //! \note The implementation must guarantee thread safety for concurrent allocateAsync/deallocateAsync/reallocate requests.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by allocateAsync
+ //!
+ TRT_DEPRECATED void* allocate(
+ uint64_t const size, uint64_t const alignment, AllocatorFlags const flags) noexcept override
+ {
+ return allocateAsync(size, alignment, flags, nullptr);
+ }
+
+ //!
+ //! \brief A thread-safe callback implemented by the application to handle release of GPU memory.
+ //!
+ //! TensorRT may pass a nullptr to this function if it was previously returned by allocate().
+ //!
+ //! \param memory A memory address that was previously returned by an allocate() or reallocate() call of the same
+ //! allocator object.
+ //!
+ //! \return True if the acquired memory is released successfully.
+ //!
+ //! \note The implementation must guarantee thread safety for concurrent allocate/reallocate/deallocate
+ //! requests.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by deallocateAsync
+ //!
+ TRT_DEPRECATED bool deallocate(void* const memory) noexcept override
+ {
+ return deallocateAsync(memory, nullptr);
+ }
+
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return {"IGpuAllocator", 1, 0};
+ }
+};
+} // namespace v_1_0
+
+//!
+//! \class IGpuAsyncAllocator
+//!
+//! \brief Application-implemented class for controlling asynchronous (stream ordered) memory allocation on the GPU.
+//!
+//! \warning The lifetime of an IGpuAsyncAllocator object must exceed that of all objects that use it.
+//!
+//! The advantage of deriving from IGpuAsyncAllocator instead of IGpuAllocator is that you only have
+//! to override two methods: allocateAsync() and deallocateAsync() to implement an allocator with
+//! asynchronous capability, whereas deriving from IGpuAllocator requires overriding four methods,
+//! including two deprecated methods.
+//!
+//! \see IGpuAllocator
+using IGpuAsyncAllocator = v_1_0::IGpuAsyncAllocator;
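
A minimal derivation might look like the following sketch, built on cudaMallocAsync/cudaFreeAsync. Only the two pure-virtual methods are overridden; the deprecated synchronous entry points fall back to them automatically.

```cpp
// Sketch only: a stream-ordered allocator derived from IGpuAsyncAllocator.
#include <cuda_runtime_api.h>
#include "NvInferRuntime.h"

class StreamOrderedAllocator : public nvinfer1::IGpuAsyncAllocator
{
public:
    void* allocateAsync(uint64_t const size, uint64_t const /*alignment*/,
        nvinfer1::AllocatorFlags const /*flags*/, cudaStream_t stream) noexcept override
    {
        if (size == 0)
        {
            return nullptr; // Size-0 requests must return nullptr.
        }
        void* memory{nullptr};
        // cudaMallocAsync returns memory suitably aligned for any device allocation.
        return (cudaMallocAsync(&memory, size, stream) == cudaSuccess) ? memory : nullptr;
    }

    bool deallocateAsync(void* const memory, cudaStream_t stream) noexcept override
    {
        if (memory == nullptr)
        {
            return true; // TensorRT may pass nullptr; treat it as a successful no-op.
        }
        return cudaFreeAsync(memory, stream) == cudaSuccess;
    }
};
```

An instance would typically be registered before building or deserialization, for example via IRuntime::setGpuAllocator().
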
+
} // namespace nvinfer1
+//!
+//! \brief Return the library major version number.
+//!
+extern "C" TENSORRTAPI int32_t getInferLibMajorVersion() noexcept;
+//!
+//! \brief Return the library minor version number.
+//!
+extern "C" TENSORRTAPI int32_t getInferLibMinorVersion() noexcept;
+//!
+//! \brief Return the library patch version number.
+//!
+extern "C" TENSORRTAPI int32_t getInferLibPatchVersion() noexcept;
+//!
+//! \brief Return the library build version number.
+//!
+extern "C" TENSORRTAPI int32_t getInferLibBuildVersion() noexcept;
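
For illustration, these runtime queries can be combined with the header-side version macros (NV_TENSORRT_MAJOR/MINOR/PATCH, assumed visible via NvInferVersion.h) to verify that the loaded library matches the headers the application was compiled against:

```cpp
// Sketch only: compare compile-time and run-time TensorRT versions.
#include <cstdio>
#include "NvInferRuntime.h"

void reportTensorRTVersion()
{
    std::printf("compiled against TensorRT %d.%d.%d, loaded library is %d.%d.%d (build %d)\n",
        NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR, NV_TENSORRT_PATCH,
        getInferLibMajorVersion(), getInferLibMinorVersion(), getInferLibPatchVersion(),
        getInferLibBuildVersion());
}
```
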
+
#endif // NV_INFER_RUNTIME_H
diff --git a/include/NvInferRuntimeBase.h b/include/NvInferRuntimeBase.h
index 2701240d..60006e6c 100644
--- a/include/NvInferRuntimeBase.h
+++ b/include/NvInferRuntimeBase.h
@@ -68,17 +68,24 @@
//! NvInferSafeRuntime.h (for the safety runtime).
//!
-// forward declare some CUDA types to avoid an include dependency
+//! Forward declare some CUDA types to avoid an include dependency.
extern "C"
{
- //! Forward declaration of cublasContext to use in other interfaces
+ //! Forward declaration of cublasContext to use in other interfaces.
struct cublasContext;
- //! Forward declaration of cudnnContext to use in other interfaces
+ //! Forward declaration of cudnnContext to use in other interfaces.
struct cudnnContext;
}
-#define NV_TENSORRT_VERSION nvinfer1::kNV_TENSORRT_VERSION_IMPL
+//! Construct a single integer denoting TensorRT version.
+//! Usable in preprocessor expressions.
+#define NV_TENSORRT_VERSION_INT(major, minor, patch) ((major) *10000L + (minor) *100L + (patch) *1L)
+
+//! TensorRT version as a single integer.
+//! Usable in preprocessor expressions.
+#define NV_TENSORRT_VERSION NV_TENSORRT_VERSION_INT(NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR, NV_TENSORRT_PATCH)
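
Because both macros are usable in preprocessor expressions, version-dependent code can be guarded at compile time, as in this sketch:

```cpp
// Sketch only: compile-time guard on the TensorRT version.
#include "NvInferRuntimeBase.h"

#if NV_TENSORRT_VERSION >= NV_TENSORRT_VERSION_INT(10, 0, 0)
// Paths that rely on TensorRT 10.x behavior (64-bit Dims, versioned interfaces, ...).
#else
// Fallback paths for older releases.
#endif
```
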
+
//!
//! \namespace nvinfer1
//!
@@ -86,22 +93,19 @@ extern "C"
//!
namespace nvinfer1
{
-
-static constexpr int32_t kNV_TENSORRT_VERSION_IMPL
- = (NV_TENSORRT_MAJOR * 1000) + (NV_TENSORRT_MINOR * 100) + NV_TENSORRT_PATCH; // major, minor, patch
-
//! char_t is the type used by TensorRT to represent all valid characters.
using char_t = char;
//! AsciiChar is the type used by TensorRT to represent valid ASCII characters.
-//! This type is used by IPluginV2, PluginField, IPluginCreator, IPluginRegistry, and
-//! ILogger due to their use in automotive safety context.
+//! This type is widely used in automotive safety context.
using AsciiChar = char_t;
//! Forward declare IErrorRecorder for use in other interfaces.
+namespace v_1_0
+{
class IErrorRecorder;
-//! Forward declare IGpuAllocator for use in other interfaces.
-class IGpuAllocator;
+}
+using IErrorRecorder = v_1_0::IErrorRecorder;
namespace impl
{
@@ -126,7 +130,7 @@ enum class DataType : int32_t
//! 32-bit floating point format.
kFLOAT = 0,
- //! IEEE 16-bit floating-point format.
+ //! IEEE 16-bit floating-point format -- has a 5 bit exponent and 11 bit significand.
kHALF = 1,
//! Signed 8-bit integer representing a quantized floating-point value.
@@ -148,15 +152,22 @@ enum class DataType : int32_t
//! to equivalent floating point values.
//! {kFLOAT, kHALF} to kUINT8 conversion will convert the floating point values
//! to integer values by truncating towards zero. This conversion has undefined behavior for
- //! floating point values outside the range [0.0f, 256.0f) after truncation.
+ //! floating point values outside the range [0.0F, 256.0F) after truncation.
//! kUINT8 conversions are not supported for {kINT8, kINT32, kBOOL}.
kUINT8 = 5,
//! Signed 8-bit floating point with
//! 1 sign bit, 4 exponent bits, 3 mantissa bits, and exponent-bias 7.
- //! \warning kFP8 is not supported yet and will result in an error or undefined behavior.
- kFP8 = 6
+ kFP8 = 6,
+
+ //! Brain float -- has an 8 bit exponent and 8 bit significand.
+ kBF16 = 7,
+ //! Signed 64-bit integer type.
+ kINT64 = 8,
+
+ //! Signed 4-bit integer type.
+ kINT4 = 9,
};
namespace impl
@@ -165,8 +176,8 @@ namespace impl
template <>
struct EnumMaxImpl
{
- // Declaration of kVALUE that represents maximum number of elements in DataType enum
- static constexpr int32_t kVALUE = 7;
+ //! Declaration of kVALUE that represents the maximum number of elements in the DataType enum.
+ static constexpr int32_t kVALUE = 10;
};
} // namespace impl
@@ -174,29 +185,29 @@ struct EnumMaxImpl
//! \class Dims
//! \brief Structure to define the dimensions of a tensor.
//!
-//! TensorRT can also return an invalid dims structure. This structure is represented by nbDims == -1
-//! and d[i] == 0 for all d.
+//! TensorRT can also return an "invalid dims" structure. This structure is
+//! represented by nbDims == -1 and d[i] == 0 for all i.
//!
-//! TensorRT can also return an "unknown rank" dims structure. This structure is represented by nbDims == -1
-//! and d[i] == -1 for all d.
+//! TensorRT can also return an "unknown rank" dims structure. This structure is
+//! represented by nbDims == -1 and d[i] == -1 for all i.
//!
-class Dims32
+class Dims64
{
public:
//! The maximum rank (number of dimensions) supported for a tensor.
static constexpr int32_t MAX_DIMS{8};
+
//! The rank (number of dimensions).
int32_t nbDims;
+
//! The extent of each dimension.
- int32_t d[MAX_DIMS];
+ int64_t d[MAX_DIMS];
};
//!
-//! Alias for Dims32.
-//!
-//! \warning: This alias might change in the future.
+//! Alias for Dims64.
//!
-using Dims = Dims32;
+using Dims = Dims64;
//!
//! \enum TensorFormat
@@ -207,94 +218,95 @@ using Dims = Dims32;
//!
//! \see IPluginV2::supportsFormat(), safe::ICudaEngine::getBindingFormat()
//!
+//! Many of the formats are **vector-major** or **vector-minor**. These formats specify
+//! a vector dimension and scalars per vector.
+//! For example, suppose that the tensor has dimensions [M,N,C,H,W],
+//! the vector dimension is C and there are V scalars per vector.
+//!
+//! * A **vector-major** format splits the vectorized dimension into two axes in the
+//! memory layout. The vectorized dimension is replaced by an axis of length ceil(C/V)
+//! and a new dimension of length V is appended. For the example tensor, the memory layout
+//! is equivalent to an array with dimensions [M][N][ceil(C/V)][H][W][V].
+//! Tensor coordinate (m,n,c,h,w) maps to array location [m][n][c/V][h][w][c\%V].
+//!
+//! * A **vector-minor** format moves the vectorized dimension to become the last axis
+//! in the memory layout. For the example tensor, the memory layout is equivalent to an
+//! array with dimensions [M][N][H][W][ceil(C/V)*V]. Tensor coordinate (m,n,c,h,w) maps
+//! to array location subscript [m][n][h][w][c].
+//!
+//! In interfaces that refer to "components per element", that's the value of V above. A sketch of the
+//! vector-major offset computation appears after this enum.
+//!
//! For more information about data formats, see the topic "Data Format Description" located in the
-//! TensorRT Developer Guide.
+//! TensorRT Developer Guide. https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#data-format-desc
//!
enum class TensorFormat : int32_t
{
- //! Row major linear format.
- //! For a tensor with dimensions {N, C, H, W} or {numbers, channels,
- //! columns, rows}, the dimensional index corresponds to {3, 2, 1, 0}
- //! and thus the order is W minor.
+ //! Memory layout is similar to an array in C or C++.
+ //! The stride of each dimension is the product of the dimensions after it.
+ //! The last dimension has unit stride.
//!
//! For DLA usage, the tensor sizes are limited to C,H,W in the range [1,8192].
- //!
kLINEAR = 0,
- //! Two wide channel vectorized row major format. This format is bound to
- //! FP16. It is only available for dimensions >= 3.
- //! For a tensor with dimensions {N, C, H, W},
- //! the memory layout is equivalent to a C array with dimensions
- //! [N][(C+1)/2][H][W][2], with the tensor coordinates (n, c, h, w)
- //! mapping to array subscript [n][c/2][h][w][c%2].
+ //! Vector-major format with two scalars per vector.
+ //! Vector dimension is third to last.
+ //!
+ //! This format requires FP16 or BF16 and at least three dimensions.
kCHW2 = 1,
- //! Eight channel format where C is padded to a multiple of 8. This format
- //! is bound to FP16. It is only available for dimensions >= 3.
- //! For a tensor with dimensions {N, C, H, W},
- //! the memory layout is equivalent to the array with dimensions
- //! [N][H][W][(C+7)/8*8], with the tensor coordinates (n, c, h, w)
- //! mapping to array subscript [n][h][w][c].
+ //! Vector-minor format with eight scalars per vector.
+ //! Vector dimension is third to last.
+ //! This format requires FP16 or BF16 and at least three dimensions.
kHWC8 = 2,
- //! Four wide channel vectorized row major format. This format is bound to
- //! INT8 or FP16. It is only available for dimensions >= 3.
- //! For INT8, the C dimension must be a build-time constant.
- //! For a tensor with dimensions {N, C, H, W},
- //! the memory layout is equivalent to a C array with dimensions
- //! [N][(C+3)/4][H][W][4], with the tensor coordinates (n, c, h, w)
- //! mapping to array subscript [n][c/4][h][w][c%4].
+ //! Vector-major format with four scalars per vector.
+ //! Vector dimension is third to last.
+ //!
+ //! This format requires INT8 or FP16 and at least three dimensions.
+ //! For INT8, the length of the vector dimension must be a build-time constant.
//!
//! Deprecated usage:
//!
//! If running on the DLA, this format can be used for acceleration
- //! with the caveat that C must be equal or lesser than 4.
+ //! with the caveat that C must be less than or equal to 4.
//! If used as DLA input and the build option kGPU_FALLBACK is not specified,
- //! it needs to meet line stride requirement of DLA format. Column stride in bytes should
- //! be a multiple of 32 on Xavier and 64 on Orin.
+ //! it needs to meet line stride requirement of DLA format. Column stride in
+ //! bytes must be a multiple of 64 on Orin.
kCHW4 = 3,
- //! Sixteen wide channel vectorized row major format. This format is bound
- //! to FP16. It is only available for dimensions >= 3.
- //! For a tensor with dimensions {N, C, H, W},
- //! the memory layout is equivalent to a C array with dimensions
- //! [N][(C+15)/16][H][W][16], with the tensor coordinates (n, c, h, w)
- //! mapping to array subscript [n][c/16][h][w][c%16].
+ //! Vector-major format with 16 scalars per vector.
+ //! Vector dimension is third to last.
+ //!
+ //! This format requires INT8 or FP16 and at least three dimensions.
//!
//! For DLA usage, this format maps to the native feature format for FP16,
//! and the tensor sizes are limited to C,H,W in the range [1,8192].
- //!
kCHW16 = 4,
- //! Thirty-two wide channel vectorized row major format. This format is
- //! only available for dimensions >= 3.
- //! For a tensor with dimensions {N, C, H, W},
- //! the memory layout is equivalent to a C array with dimensions
- //! [N][(C+31)/32][H][W][32], with the tensor coordinates (n, c, h, w)
- //! mapping to array subscript [n][c/32][h][w][c%32].
+ //! Vector-major format with 32 scalars per vector.
+ //! Vector dimension is third to last.
+ //!
+ //! This format requires at least three dimensions.
//!
//! For DLA usage, this format maps to the native feature format for INT8,
//! and the tensor sizes are limited to C,H,W in the range [1,8192].
kCHW32 = 5,
- //! Eight channel format where C is padded to a multiple of 8. This format
- //! is bound to FP16, and it is only available for dimensions >= 4.
- //! For a tensor with dimensions {N, C, D, H, W},
- //! the memory layout is equivalent to an array with dimensions
- //! [N][D][H][W][(C+7)/8*8], with the tensor coordinates (n, c, d, h, w)
- //! mapping to array subscript [n][d][h][w][c].
+ //! Vector-minor format with eight scalars per vector.
+ //! Vector dimension is fourth to last.
+ //!
+ //! This format requires FP16 or BF16 and at least four dimensions.
kDHWC8 = 6,
- //! Thirty-two wide channel vectorized row major format. This format is
- //! bound to FP16 and INT8 and is only available for dimensions >= 4.
- //! For a tensor with dimensions {N, C, D, H, W},
- //! the memory layout is equivalent to a C array with dimensions
- //! [N][(C+31)/32][D][H][W][32], with the tensor coordinates (n, c, d, h, w)
- //! mapping to array subscript [n][c/32][d][h][w][c%32].
+ //! Vector-major format with 32 scalars per vector.
+ //! Vector dimension is fourth to last.
+ //!
+ //! This format requires FP16 or INT8 and at least four dimensions.
kCDHW32 = 7,
- //! Non-vectorized channel-last format. This format is bound to either FP32 or UINT8,
- //! and is only available for dimensions >= 3.
+ //! Vector-minor format where channel dimension is third to last and unpadded.
+ //!
+ //! This format requires either FP32 or UINT8 and at least three dimensions.
kHWC = 8,
//! DLA planar format. For a tensor with dimension {N, C, H, W}, the W axis
@@ -309,46 +321,123 @@ enum class TensorFormat : int32_t
//! DLA image format. For a tensor with dimension {N, C, H, W} the C axis
//! always has unit stride. The stride for stepping along the H axis is rounded up
- //! to 32 bytes on Xavier and 64 bytes on Orin. C can only be 1, 3 or 4.
+ //! to 64 bytes on Orin. C can only be 1, 3 or 4.
//! If C == 1, it will map to grayscale format.
//! If C == 3 or C == 4, it will map to color image format. And if C == 3,
//! the stride for stepping along the W axis needs to be padded to 4 in elements.
//!
//! When C is {1, 3, 4}, then C' is {1, 4, 4} respectively,
//! the memory layout is equivalent to a C array with dimensions
- //! [N][H][roundUp(W, 32/C'/elementSize)][C'] on Xavier and [N][H][roundUp(W, 64/C'/elementSize)][C'] on Orin
+ //! [N][H][roundUp(W, 64/C'/elementSize)][C'] on Orin
//! where elementSize is 2 for FP16
//! and 1 for Int8. The tensor coordinates (n, c, h, w) mapping to array
//! subscript [n][h][w][c].
kDLA_HWC4 = 10,
- //! Sixteen channel format where C is padded to a multiple of 16. This format
- //! is bound to FP16. It is only available for dimensions >= 3.
- //! For a tensor with dimensions {N, C, H, W},
- //! the memory layout is equivalent to the array with dimensions
- //! [N][H][W][(C+15)/16*16], with the tensor coordinates (n, c, h, w)
- //! mapping to array subscript [n][h][w][c].
+ //! Vector-minor format with 16 scalars per vector.
+ //! Vector dimension is third to last.
+ //!
+ //! This format requires FP16 and at least three dimensions.
kHWC16 = 11,
- //! Non-vectorized channel-last format. This format is bound to FP32.
- //! It is only available for dimensions >= 4.
+ //! Vector-minor format with one scalar per vector.
+ //! Vector dimension is fourth to last.
+ //!
+ //! This format requires FP32 and at least four dimensions.
kDHWC = 12
};
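// Editorial sketch (not part of the header): how a vector-major layout such as
// kCHW4 addresses an element. For an {N, C, H, W} tensor with V scalars per
// vector (V = 4 for kCHW4), coordinate (n, c, h, w) maps to element
// [n][c / V][h][w][c % V] of an [N][(C + V - 1) / V][H][W][V] array, so the
// linear offset can be computed as below. The function name is hypothetical.
#include <cstdint>

constexpr int64_t vectorMajorOffset(
    int64_t n, int64_t c, int64_t h, int64_t w, int64_t C, int64_t H, int64_t W, int64_t V) noexcept
{
    // Number of vectors along the (padded) channel dimension.
    int64_t const vectors = (C + V - 1) / V;
    return (((n * vectors + c / V) * H + h) * W + w) * V + c % V;
}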
+using InterfaceKind = char const*;
+
+//!
+//! \class InterfaceInfo
+//!
+//! \brief Version information associated with a TRT interface
+//!
+class InterfaceInfo
+{
+public:
+ InterfaceKind kind;
+ int32_t major;
+ int32_t minor;
+};
+
+//!
+//! \enum APILanguage
+//!
+//! \brief Programming language used in the implementation of a TRT interface
+//!
+enum class APILanguage : int32_t
+{
+ kCPP = 0,
+ kPYTHON = 1
+};
+
+namespace impl
+{
+//! Maximum number of elements in APILanguage enum. \see APILanguage
+template <>
+struct EnumMaxImpl<APILanguage>
+{
+ //! Declaration of kVALUE that represents the maximum number of elements in the APILanguage enum.
+ static constexpr int32_t kVALUE = 2;
+};
+} // namespace impl
+
+//!
+//! \class IVersionedInterface
+//!
+//! \brief An Interface class for version control.
+//!
+class IVersionedInterface
+{
+public:
+ //!
+ //! \brief The language used to build the implementation of this Interface.
+ //!
+ //! Applications must not override this method.
+ //!
+ virtual APILanguage getAPILanguage() const noexcept
+ {
+ return APILanguage::kCPP;
+ }
+
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ virtual InterfaceInfo getInterfaceInfo() const noexcept = 0;
+
+ virtual ~IVersionedInterface() noexcept = default;
+
+protected:
+ IVersionedInterface() = default;
+ IVersionedInterface(IVersionedInterface const&) = default;
+ IVersionedInterface(IVersionedInterface&&) = default;
+ IVersionedInterface& operator=(IVersionedInterface const&) & = default;
+ IVersionedInterface& operator=(IVersionedInterface&&) & = default;
+};
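// Editorial sketch (not part of the header): a user-defined versioned interface
// only needs to implement getInterfaceInfo(); the kind string and version are
// chosen by the implementer. The class name MyInterface is hypothetical.
class MyInterface : public nvinfer1::IVersionedInterface
{
public:
    nvinfer1::InterfaceInfo getInterfaceInfo() const noexcept override
    {
        return nvinfer1::InterfaceInfo{"MyInterface", 1, 0};
    }
};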
+
namespace impl
{
//! Maximum number of elements in TensorFormat enum. \see TensorFormat
template <>
struct EnumMaxImpl<TensorFormat>
{
- //! Declaration of kVALUE that represents maximum number of elements in TensorFormat enum
+ //! Declaration of kVALUE that represents the maximum number of elements in the TensorFormat enum.
static constexpr int32_t kVALUE = 13;
};
} // namespace impl
+
+//!
+//! \enum AllocatorFlag
+//!
+//! \brief Allowed type of memory allocation.
+//!
enum class AllocatorFlag : int32_t
{
- kRESIZABLE = 0, //!< TensorRT may call realloc() on this allocation
+ //! TensorRT may call realloc() on this allocation.
+ kRESIZABLE = 0,
};
namespace impl
@@ -357,72 +446,53 @@ namespace impl
template <>
struct EnumMaxImpl<AllocatorFlag>
{
- static constexpr int32_t kVALUE = 1; //!< maximum number of elements in AllocatorFlag enum
+ //! Declaration of kVALUE that represents the maximum number of elements in the AllocatorFlag enum.
+ static constexpr int32_t kVALUE = 1;
};
} // namespace impl
using AllocatorFlags = uint32_t;
-//!
-//! \class IGpuAllocator
-//!
-//! \brief Application-implemented class for controlling allocation on the GPU.
-//!
-class IGpuAllocator
+//! DO NOT REFER TO namespace v_1_0 IN CODE. ALWAYS USE nvinfer1 INSTEAD.
+//! The name v_1_0 may change in future versions of TensorRT.
+namespace v_1_0
+{
+
+class IGpuAllocator : public IVersionedInterface
{
public:
//!
- //! A thread-safe callback implemented by the application to handle acquisition of GPU memory.
+ //! \brief A thread-safe callback implemented by the application to handle acquisition of GPU memory.
//!
- //! \param size The size of the memory required.
+ //! \param size The size of the memory block required (in bytes).
//! \param alignment The required alignment of memory. Alignment will be zero
//! or a power of 2 not exceeding the alignment guaranteed by cudaMalloc.
//! Thus this allocator can be safely implemented with cudaMalloc/cudaFree.
//! An alignment value of zero indicates any alignment is acceptable.
//! \param flags Reserved for future use. In the current release, 0 will be passed.
//!
- //! If an allocation request of size 0 is made, nullptr should be returned.
- //!
- //! If an allocation request cannot be satisfied, nullptr should be returned.
+ //! \return If the allocation was successful, the start address of a device memory block of the requested size.
+ //! If an allocation request of size 0 is made, nullptr must be returned.
+ //! If an allocation request cannot be satisfied, nullptr must be returned.
+ //! If a non-null address is returned, it is guaranteed to have the specified alignment.
//!
- //! \note The implementation must guarantee thread safety for concurrent allocate/free/reallocate/deallocate
+ //! \note The implementation must guarantee thread safety for concurrent allocate/reallocate/deallocate
//! requests.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
//!
- virtual void* allocate(uint64_t const size, uint64_t const alignment, AllocatorFlags const flags) noexcept = 0;
-
- //!
- //! A thread-safe callback implemented by the application to handle release of GPU memory.
- //!
- //! TensorRT may pass a nullptr to this function if it was previously returned by allocate().
- //!
- //! \param memory The acquired memory.
- //!
- //! \note The implementation must guarantee thread safety for concurrent allocate/free/reallocate/deallocate
- //! requests.
- //!
- //! \see deallocate()
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by allocateAsync
//!
- //! \deprecated Deprecated in TensorRT 8.0. Superseded by deallocate.
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
- //!
- TRT_DEPRECATED virtual void free(void* const memory) noexcept = 0;
+ TRT_DEPRECATED virtual void* allocate(
+ uint64_t const size, uint64_t const alignment, AllocatorFlags const flags) noexcept = 0;
- //!
- //! Destructor declared virtual as general good practice for a class with virtual methods.
- //! TensorRT never calls the destructor for an IGpuAllocator defined by the application.
- //!
- virtual ~IGpuAllocator() = default;
+ ~IGpuAllocator() override = default;
IGpuAllocator() = default;
//!
- //! A thread-safe callback implemented by the application to resize an existing allocation.
+ //! \brief A thread-safe callback implemented by the application to resize an existing allocation.
//!
//! Only allocations which were allocated with AllocatorFlag::kRESIZABLE will be resized.
//!
@@ -442,65 +512,161 @@ class IGpuAllocator
//!
//! TensorRT may call realloc to increase the buffer by relatively small amounts.
//!
- //! \param baseAddr the address of the original allocation.
- //! \param alignment The alignment used by the original allocation.
- //! \param newSize The new memory size required.
- //! \return the address of the reallocated memory
+ //! \param baseAddr the address of the original allocation, which will have been returned by previously calling
+ //! allocate() or reallocate() on the same object.
+ //! \param alignment The alignment used by the original allocation. This will be the same value that was previously
+ //! passed to the allocate() or reallocate() call that returned baseAddr.
+ //! \param newSize The new memory size required (in bytes).
+ //!
+ //! \return The address of the reallocated memory, or nullptr. If a non-null address is returned, it is
+ //! guaranteed to have the specified alignment.
//!
- //! \note The implementation must guarantee thread safety for concurrent allocate/free/reallocate/deallocate
+ //! \note The implementation must guarantee thread safety for concurrent allocate/reallocate/deallocate
//! requests.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
//!
- virtual void* reallocate(void* /*baseAddr*/, uint64_t /*alignment*/, uint64_t /*newSize*/) noexcept
+ virtual void* reallocate(void* const /*baseAddr*/, uint64_t /*alignment*/, uint64_t /*newSize*/) noexcept
{
return nullptr;
}
//!
- //! A thread-safe callback implemented by the application to handle release of GPU memory.
+ //! \brief A thread-safe callback implemented by the application to handle release of GPU memory.
//!
//! TensorRT may pass a nullptr to this function if it was previously returned by allocate().
//!
- //! \param memory The acquired memory.
+ //! \param memory A memory address that was previously returned by an allocate() or reallocate() call of the same
+ //! allocator object.
+ //!
//! \return True if the acquired memory is released successfully.
//!
- //! \note The implementation must guarantee thread safety for concurrent allocate/free/reallocate/deallocate
+ //! \note The implementation must guarantee thread safety for concurrent allocate/reallocate/deallocate
//! requests.
//!
- //! \note If user-implemented free() might hit an error condition, the user should override deallocate() as the
- //! primary implementation and override free() to call deallocate() for backwards compatibility.
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by deallocateAsync
+ //!
+ TRT_DEPRECATED virtual bool deallocate(void* const memory) noexcept = 0;
+
+ //!
+ //! \brief A thread-safe callback implemented by the application to handle stream-ordered acquisition of GPU memory.
+ //!
+ //! The default behavior is to call method allocate(), which is synchronous and thus loses
+ //! any performance benefits of asynchronous allocation. If you want the benefits of asynchronous
+ //! allocation, see discussion of IGpuAsyncAllocator vs. IGpuAllocator in the documentation
+ //! for nvinfer1::IGpuAllocator.
//!
- //! \see free()
+ //! \param size The size of the memory block required (in bytes).
+ //! \param alignment The required alignment of memory. Alignment will be zero
+ //! or a power of 2 not exceeding the alignment guaranteed by cudaMalloc.
+ //! Thus this allocator can be safely implemented with cudaMalloc/cudaFree.
+ //! An alignment value of zero indicates any alignment is acceptable.
+ //! \param flags Reserved for future use. In the current release, 0 will be passed.
+ //! \param stream specifies the cudaStream for asynchronous usage.
+ //!
+ //! \return If the allocation was successful, the start address of a device memory block of the requested size.
+ //! If an allocation request of size 0 is made, nullptr must be returned.
+ //! If an allocation request cannot be satisfied, nullptr must be returned.
+ //! If a non-null address is returned, it is guaranteed to have the specified alignment.
+ //!
+ //! \note The implementation must guarantee thread safety for concurrent allocate/reallocate/deallocate
+ //! requests.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
//!
- virtual bool deallocate(void* const memory) noexcept
+ virtual void* allocateAsync(
+ uint64_t const size, uint64_t const alignment, AllocatorFlags const flags, cudaStream_t /*stream*/) noexcept
{
- this->free(memory);
- return true;
+ return allocate(size, alignment, flags);
+ }
+ //!
+ //! \brief A thread-safe callback implemented by the application to handle stream-ordered release of GPU memory.
+ //!
+ //! The default behavior is to call method deallocate(), which is synchronous and thus loses
+ //! any performance benefits of asynchronous deallocation. If you want the benefits of asynchronous
+ //! deallocation, see discussion of IGpuAsyncAllocator vs. IGpuAllocator in the documentation
+ //! for nvinfer1::IGpuAllocator.
+ //!
+ //! TensorRT may pass a nullptr to this function if it was previously returned by allocate().
+ //!
+ //! \param memory A memory address that was previously returned by an allocate() or reallocate() call of the same
+ //! allocator object.
+ //! \param stream specifies the cudaStream for asynchronous usage.
+ //!
+ //! \return True if the acquired memory is released successfully.
+ //!
+ //! \note The implementation must guarantee thread safety for concurrent allocate/reallocate/deallocate
+ //! requests.
+ //!
+ //! \note The implementation is not required to be asynchronous. It is permitted to synchronize,
+ //! albeit doing so will lose the performance advantage of asynchronous deallocation.
+ //! Either way, it is critical that it not actually free the memory until the current
+ //! stream position is reached.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads.
+ //!
+ virtual bool deallocateAsync(void* const memory, cudaStream_t /*stream*/) noexcept
+ {
+ return deallocate(memory);
+ }
+
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return {"IGpuAllocator", 1, 0};
}
protected:
-// @cond SuppressDoxyWarnings
+ // @cond SuppressDoxyWarnings
IGpuAllocator(IGpuAllocator const&) = default;
IGpuAllocator(IGpuAllocator&&) = default;
IGpuAllocator& operator=(IGpuAllocator const&) & = default;
IGpuAllocator& operator=(IGpuAllocator&&) & = default;
-// @endcond
+ // @endcond
};
+} // namespace v_1_0
+
+//!
+//! \class IGpuAllocator
+//!
+//! \brief Application-implemented class for controlling allocation on the GPU.
+//!
+//! \warning The lifetime of an IGpuAllocator object must exceed that of all objects that use it.
+//!
+//! This class is intended as a base class for allocators that implement synchronous allocation.
+//! If you want the benefits of asynchronous allocation, you can do either of:
+//!
+//! * Derive your class from IGpuAllocator and override all four of its virtual methods
+//! for allocation/deallocation, including the two deprecated methods.
+//!
+//! * Derive your class from IGpuAsyncAllocator and override its two pure virtual
+//! methods for allocation/deallocation.
+//!
+//! The latter style is preferred because it does not tie code to deprecated methods.
+//!
+//! \see IGpuAsyncAllocator.
+//!
+using IGpuAllocator = v_1_0::IGpuAllocator;
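// Editorial sketch (not part of the header), assuming the CUDA runtime headers
// are available: a minimal synchronous allocator that overrides only the two
// pure virtual methods, so allocateAsync()/deallocateAsync() fall back to their
// synchronous defaults. The alignment parameter can be ignored because cudaMalloc
// already provides the strongest alignment the interface can request. The class
// name SyncGpuAllocator is hypothetical.
#include <cuda_runtime_api.h>

class SyncGpuAllocator : public nvinfer1::IGpuAllocator
{
public:
    void* allocate(uint64_t const size, uint64_t const /*alignment*/,
        nvinfer1::AllocatorFlags const /*flags*/) noexcept override
    {
        if (size == 0)
        {
            return nullptr; // Size-0 requests must return nullptr.
        }
        void* ptr{nullptr};
        return (cudaMalloc(&ptr, size) == cudaSuccess) ? ptr : nullptr;
    }

    bool deallocate(void* const memory) noexcept override
    {
        return cudaFree(memory) == cudaSuccess;
    }
};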
+
//!
//! \class ILogger
//!
//! \brief Application-implemented logging interface for the builder, refitter and runtime.
//!
//! The logger used to create an instance of IBuilder, IRuntime or IRefitter is used for all objects created through
-//! that interface. The logger should be valid until all objects created are released.
+//! that interface. The logger must be valid until all objects created are released.
//!
//! The Logger object implementation must be thread safe. All locking and synchronization is pushed to the
//! interface implementation and TensorRT does not hold any synchronization primitives when calling the interface
@@ -512,7 +678,7 @@ class ILogger
//!
//! \enum Severity
//!
- //! The severity corresponding to a log message.
+ //! \brief The severity corresponding to a log message.
//!
enum class Severity : int32_t
{
@@ -529,11 +695,17 @@ class ILogger
};
//!
- //! A callback implemented by the application to handle logging messages;
+ //! \brief A callback implemented by the application to handle logging messages.
//!
//! \param severity The severity of the message.
//! \param msg A null-terminated log message.
//!
+ //! \warning Loggers used in the safety certified runtime must set a maximum message length and truncate
+ //! messages exceeding this length. It is up to the implementer of the derived class to define
+ //! a suitable limit that will prevent buffer overruns, resource exhaustion, and other security
+ //! vulnerabilities in their implementation. The TensorRT safety certified runtime will never
+ //! emit messages longer than 1024 bytes.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
@@ -560,7 +732,7 @@ namespace impl
template <>
struct EnumMaxImpl<ILogger::Severity>
{
- //! Declaration of kVALUE that represents maximum number of elements in ILogger::Severity enum
+ //! Declaration of kVALUE that represents the maximum number of elements in the ILogger::Severity enum.
static constexpr int32_t kVALUE = 5;
};
} // namespace impl
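// Editorial sketch (not part of the header): an ILogger that filters by severity
// and bounds the length of emitted messages, in the spirit of the warning above.
// The class name ConsoleLogger is hypothetical.
#include <cstdio>

class ConsoleLogger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override
    {
        if (severity > Severity::kWARNING)
        {
            return; // Drop kINFO and kVERBOSE messages.
        }
        // Print at most 1024 characters of the message.
        std::fprintf(stderr, "[TRT] %.1024s\n", msg);
    }
};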
@@ -617,8 +789,8 @@ enum class ErrorCode : int32_t
kFAILED_INITIALIZATION = 6,
//!
- //! An error occurred during execution that caused TensorRT to end prematurely, either an asynchronous error or
- //! other execution errors reported by CUDA/DLA. In a dynamic system, the
+ //! An error occurred during execution that caused TensorRT to end prematurely, either an asynchronous error,
+ //! user cancellation, or other execution errors reported by CUDA/DLA. In a dynamic system, the
//! data can be thrown away and the next frame can be processed or execution can be retried.
//! This is either an execution error or a memory error.
//!
@@ -649,7 +821,7 @@ enum class ErrorCode : int32_t
//!
//! An error occurred due to the network not being supported on the device due to constraints of the hardware or
- //! system. An example is running a unsafe layer in a safety certified context, or a resource requirement for the
+ //! system. An example is running an unsafe layer in a safety certified context, or a resource requirement for the
//! current network is greater than the capabilities of the target device. The network is otherwise correct, but
//! the network and hardware combination is problematic. This can be recoverable.
//! Examples:
@@ -672,49 +844,36 @@ struct EnumMaxImpl
};
} // namespace impl
-//!
-//! \class IErrorRecorder
-//!
-//! \brief Reference counted application-implemented error reporting interface for TensorRT objects.
-//!
-//! The error reporting mechanism is a user defined object that interacts with the internal state of the object
-//! that it is assigned to in order to determine information about abnormalities in execution. The error recorder
-//! gets both an error enum that is more descriptive than pass/fail and also a string description that gives more
-//! detail on the exact failure modes. In the safety context, the error strings are all limited to 1024 characters
-//! in length.
-//!
-//! The ErrorRecorder gets passed along to any class that is created from another class that has an ErrorRecorder
-//! assigned to it. For example, assigning an ErrorRecorder to an IBuilder allows all INetwork's, ILayer's, and
-//! ITensor's to use the same error recorder. For functions that have their own ErrorRecorder accessor functions.
-//! This allows registering a different error recorder or de-registering of the error recorder for that specific
-//! object.
-//!
-//! The ErrorRecorder object implementation must be thread safe. All locking and synchronization is pushed to the
-//! interface implementation and TensorRT does not hold any synchronization primitives when calling the interface
-//! functions.
-//!
-//! The lifetime of the ErrorRecorder object must exceed the lifetime of all TensorRT objects that use it.
-//!
-class IErrorRecorder
+namespace v_1_0
+{
+class IErrorRecorder : public IVersionedInterface
{
public:
//!
- //! A typedef of a C-style string for reporting error descriptions.
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IErrorRecorder", 1, 0};
+ }
+
+ //!
+ //! \brief A typedef of a C-style string for reporting error descriptions.
//!
using ErrorDesc = char const*;
//!
- //! The length limit for an error description, excluding the '\0' string terminator.
+ //! \brief The length limit for an error description in bytes, excluding the '\0' string terminator.
//!
static constexpr size_t kMAX_DESC_LENGTH{127U};
//!
- //! A typedef of a 32bit integer for reference counting.
+ //! \brief A typedef of a 32-bit integer for reference counting.
//!
using RefCount = int32_t;
IErrorRecorder() = default;
- virtual ~IErrorRecorder() noexcept = default;
+ ~IErrorRecorder() noexcept override = default;
// Public API used to retrieve information from the error recorder.
@@ -723,13 +882,18 @@ class IErrorRecorder
//!
//! Determines the number of errors that occurred between the current point in execution
//! and the last time that the clear() was executed. Due to the possibility of asynchronous
- //! errors occuring, a TensorRT API can return correct results, but still register errors
- //! with the Error Recorder. The value of getNbErrors must monotonically increases until clear()
- //! is called.
+ //! errors occurring, a TensorRT API can return correct results, but still register errors
+ //! with the Error Recorder. The value of getNbErrors() must increment by 1 after each reportError()
+ //! call until clear() is called, or the maximum number of errors that can be stored is exceeded.
//!
//! \return Returns the number of errors detected, or 0 if there are no errors.
+ //! If the upper bound of errors that can be stored is exceeded, the upper bound value must
+ //! be returned.
+ //!
+ //! For example, if the error recorder can store up to 16 error descriptions but reportError() has
+ //! been called 20 times, getNbErrors() must return 16.
//!
- //! \see clear
+ //! \see clear(), hasOverflowed()
//!
//! \usage
//! - Allowed context for the API call
@@ -746,9 +910,10 @@ class IErrorRecorder
//! The errorIdx specifies what error code from 0 to getNbErrors()-1 that the application
//! wants to analyze and return the error code enum.
//!
- //! \return Returns the enum corresponding to errorIdx.
+ //! \return Returns the enum corresponding to errorIdx if errorIdx is in range (between 0 and getNbErrors()-1).
+ //! ErrorCode::kUNSPECIFIED_ERROR must be returned if errorIdx is not in range.
//!
- //! \see getErrorDesc, ErrorCode
+ //! \see getErrorDesc(), ErrorCode
//!
//! \usage
//! - Allowed context for the API call
@@ -765,11 +930,13 @@ class IErrorRecorder
//! For the error specified by the idx value, return the string description of the error. The
//! error string is a null-terminated C-style string. In the safety context there is a
//! constant length requirement to remove any dynamic memory allocations and the error message
- //! may be truncated. The format of the string is " - ".
+ //! will be truncated if it exceeds kMAX_DESC_LENGTH bytes.
+ //! The format of the string is "<EnumAsStr> - <Description>".
//!
- //! \return Returns a string representation of the error along with a description of the error.
+ //! \return Returns a string representation of the error along with a description of the error if errorIdx is in
+ //! range (between 0 and getNbErrors()-1). An empty string will be returned if errorIdx is not in range.
//!
- //! \see getErrorCode
+ //! \see getErrorCode()
//!
//! \usage
//! - Allowed context for the API call
@@ -797,11 +964,11 @@ class IErrorRecorder
//!
//! \brief Clear the error stack on the error recorder.
//!
- //! Removes all the tracked errors by the error recorder. This function must guarantee that after
+ //! Removes all the errors tracked by the error recorder. The implementation must guarantee that after
//! this function is called, and as long as no error occurs, the next call to getNbErrors will return
- //! zero.
+ //! zero and hasOverflowed will return false.
//!
- //! \see getNbErrors
+ //! \see getNbErrors(), hasOverflowed()
//!
//! \usage
//! - Allowed context for the API call
@@ -816,7 +983,9 @@ class IErrorRecorder
//! \brief Report an error to the error recorder with the corresponding enum and description.
//!
//! \param val The error code enum that is being reported.
- //! \param desc The string description of the error.
+ //! \param desc The string description of the error, which will be a NULL-terminated string of kMAX_DESC_LENGTH
+ //! bytes or less (excluding the NULL terminator). Descriptions that exceed this limit will be silently
+ //! truncated.
//!
//! Report an error to the user that has a given value and human readable description. The function returns false
//! if processing can continue, which implies that the reported error is not fatal. This does not guarantee that
@@ -827,6 +996,10 @@ class IErrorRecorder
//!
//! \return True if the error is determined to be fatal and processing of the current function must end.
//!
+ //! \warning If the error recorder's maximum number of storable errors is exceeded, the error description will be
+ //! silently dropped and the value returned by getNbErrors() will not be incremented. However, the return
+ //! value will still signal whether the error must be considered fatal.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
@@ -839,9 +1012,9 @@ class IErrorRecorder
//!
//! Increments the reference count for the object by one and returns the current value. This reference count allows
//! the application to know that an object inside of TensorRT has taken a reference to the ErrorRecorder. TensorRT
- //! guarantees that every call to IErrorRecorder::incRefCount will be paired with a call to
- //! IErrorRecorder::decRefCount when the reference is released. It is undefined behavior to destruct the
- //! ErrorRecorder when incRefCount has been called without a corresponding decRefCount.
+ //! guarantees that every call to IErrorRecorder::incRefCount() will be paired with a call to
+ //! IErrorRecorder::decRefCount() when the reference is released. It is undefined behavior to destruct the
+ //! ErrorRecorder when incRefCount() has been called without a corresponding decRefCount().
//!
//! \return The reference counted value after the increment completes.
//!
@@ -857,9 +1030,9 @@ class IErrorRecorder
//!
//! Decrements the reference count for the object by one and returns the current value. This reference count allows
//! the application to know that an object inside of TensorRT has taken a reference to the ErrorRecorder. TensorRT
- //! guarantees that every call to IErrorRecorder::decRefCount will be preceded by a call to
- //! IErrorRecorder::incRefCount. It is undefined behavior to destruct the ErrorRecorder when incRefCount has been
- //! called without a corresponding decRefCount.
+ //! guarantees that every call to IErrorRecorder::decRefCount() will be preceded by a call to
+ //! IErrorRecorder::incRefCount(). It is undefined behavior to destruct the ErrorRecorder when incRefCount() has been
+ //! called without a corresponding decRefCount().
//!
//! \return The reference counted value after the decrement completes.
//!
@@ -878,6 +1051,36 @@ class IErrorRecorder
IErrorRecorder& operator=(IErrorRecorder&&) & = default;
// @endcond
}; // class IErrorRecorder
+} // namespace v_1_0
+
+//!
+//! \class IErrorRecorder
+//!
+//! \brief Reference counted application-implemented error reporting interface for TensorRT objects.
+//!
+//! The error reporting mechanism is a user-defined object that interacts with the internal state of the object
+//! that it is assigned to in order to determine information about abnormalities in execution. The error recorder
+//! gets both an error enum that is more descriptive than pass/fail and also a string description that gives more
+//! detail on the exact failure modes. In the safety context, the error strings are all limited to 128 bytes
+//! or less in length, including the NULL terminator.
+//!
+//! The ErrorRecorder gets passed along to any class that is created from another class that has an ErrorRecorder
+//! assigned to it. For example, assigning an ErrorRecorder to an IBuilder allows all INetwork's, ILayer's, and
+//! ITensor's to use the same error recorder. For functions that have their own ErrorRecorder accessor functions.
+//! This allows registering a different error recorder or de-registering of the error recorder for that specific
+//! object.
+//!
+//! ErrorRecorder objects that are used in the safety runtime must define an implementation-dependent upper limit
+//! of errors whose information can be stored, and drop errors above this upper limit. The limit must fit in int32_t.
+//! The IErrorRecorder::hasOverflowed() method is used to signal that one or more errors have been dropped.
+//!
+//! The ErrorRecorder object implementation must be thread safe. All locking and synchronization is pushed to the
+//! interface implementation and TensorRT does not hold any synchronization primitives when calling the interface
+//! functions.
+//!
+//! The lifetime of the ErrorRecorder object must exceed the lifetime of all TensorRT objects that use it.
+//!
+using IErrorRecorder = v_1_0::IErrorRecorder;
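// Editorial sketch (not part of the header): a fixed-capacity IErrorRecorder that
// follows the overflow behavior described above. The class name
// SimpleErrorRecorder and the capacity of 16 are illustrative choices.
#include <atomic>
#include <cstdint>
#include <cstring>
#include <mutex>

class SimpleErrorRecorder : public nvinfer1::IErrorRecorder
{
public:
    int32_t getNbErrors() const noexcept override
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mNbErrors;
    }

    nvinfer1::ErrorCode getErrorCode(int32_t errorIdx) const noexcept override
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return inRange(errorIdx) ? mCodes[errorIdx] : nvinfer1::ErrorCode::kUNSPECIFIED_ERROR;
    }

    ErrorDesc getErrorDesc(int32_t errorIdx) const noexcept override
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return inRange(errorIdx) ? mDescs[errorIdx] : "";
    }

    bool hasOverflowed() const noexcept override
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mOverflowed;
    }

    void clear() noexcept override
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mNbErrors = 0;
        mOverflowed = false;
    }

    bool reportError(nvinfer1::ErrorCode val, ErrorDesc desc) noexcept override
    {
        std::lock_guard<std::mutex> lock(mMutex);
        if (mNbErrors < kCAPACITY)
        {
            mCodes[mNbErrors] = val;
            std::strncpy(mDescs[mNbErrors], desc, kMAX_DESC_LENGTH);
            mDescs[mNbErrors][kMAX_DESC_LENGTH] = '\0'; // Truncate over-long descriptions.
            ++mNbErrors;
        }
        else
        {
            mOverflowed = true; // The description is dropped; the count stays at the cap.
        }
        return false; // This sketch treats all reported errors as non-fatal.
    }

    RefCount incRefCount() noexcept override
    {
        return ++mRefCount;
    }

    RefCount decRefCount() noexcept override
    {
        return --mRefCount;
    }

private:
    static constexpr int32_t kCAPACITY{16};

    bool inRange(int32_t idx) const noexcept
    {
        return idx >= 0 && idx < mNbErrors;
    }

    mutable std::mutex mMutex;
    int32_t mNbErrors{0};
    bool mOverflowed{false};
    nvinfer1::ErrorCode mCodes[kCAPACITY]{};
    char mDescs[kCAPACITY][kMAX_DESC_LENGTH + 1]{};
    std::atomic<RefCount> mRefCount{0};
};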
//!
//! \enum TensorIOMode
@@ -896,6 +1099,116 @@ enum class TensorIOMode : int32_t
kOUTPUT = 2
};
+namespace v_1_0
+{
+class IStreamReader : public IVersionedInterface
+{
+public:
+ //!
+ //! TensorRT never calls the destructor for an IStreamReader defined by the
+ //! application.
+ //!
+ ~IStreamReader() override = default;
+ IStreamReader() = default;
+
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IStreamReader", 1, 0};
+ }
+
+ //!
+ //! \brief Read the next number of bytes in the stream.
+ //!
+ //! \param destination The memory to write to
+ //! \param nbBytes The number of bytes to read
+ //!
+ //! \returns The number of bytes read. Negative values will be considered an automatic error.
+ //!
+ virtual int64_t read(void* destination, int64_t nbBytes) = 0;
+
+protected:
+ IStreamReader(IStreamReader const&) = default;
+ IStreamReader(IStreamReader&&) = default;
+ IStreamReader& operator=(IStreamReader const&) & = default;
+ IStreamReader& operator=(IStreamReader&&) & = default;
+};
+} // namespace v_1_0
+
+//!
+//! \class IStreamReader
+//!
+//! \brief Application-implemented class for reading data in a stream-based manner.
+//!
+//! \note To ensure compatibility of source code with future versions of TensorRT, use IStreamReader, not
+//! v_1_0::IStreamReader
+//!
+using IStreamReader = v_1_0::IStreamReader;
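// Editorial sketch (not part of the header): an IStreamReader that feeds
// serialized data to TensorRT from a file. The class name FileStreamReader is
// hypothetical.
#include <fstream>

class FileStreamReader : public nvinfer1::IStreamReader
{
public:
    explicit FileStreamReader(char const* path)
        : mFile(path, std::ios::binary)
    {
    }

    int64_t read(void* destination, int64_t nbBytes) override
    {
        if (!mFile.good())
        {
            return -1; // A negative value signals an error to TensorRT.
        }
        mFile.read(static_cast<char*>(destination), nbBytes);
        return mFile.gcount(); // Number of bytes actually read.
    }

private:
    std::ifstream mFile;
};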
+
+namespace v_1_0
+{
+
+class IPluginResource : public IVersionedInterface
+{
+public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"IPluginResource", 1, 0};
+ }
+ //!
+ //! \brief Free the underlying resource
+ //!
+ //! This will only be called for IPluginResource objects that were produced from IPluginResource::clone()
+ //!
+ //! The IPluginResource object on which release() is called must still be in a clone-able state
+ //! after release() returns
+ //!
+ //! \return 0 for success, else non-zero
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: No; this method is not required to be thread-safe
+ //!
+ virtual int32_t release() noexcept = 0;
+
+ //!
+ //! \brief Clone the resource object
+ //!
+ //! \note Resource initialization (if any) may be skipped for non-cloned objects since only clones will be
+ //! registered by TensorRT
+ //!
+ //! \return Pointer to cloned object. nullptr if there was an issue.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes; this method is required to be thread-safe and may be called from multiple threads.
+ //!
+ virtual IPluginResource* clone() noexcept = 0;
+
+ ~IPluginResource() noexcept override = default;
+
+ IPluginResource() = default;
+ IPluginResource(IPluginResource const&) = default;
+ IPluginResource(IPluginResource&&) = default;
+ IPluginResource& operator=(IPluginResource const&) & = default;
+ IPluginResource& operator=(IPluginResource&&) & = default;
+}; // class IPluginResource
+} // namespace v_1_0
+
+//!
+//! \class IPluginResource
+//!
+//! \brief Interface for plugins to define custom resources that could be shared through the plugin registry
+//!
+//! \see IPluginRegistry::acquirePluginResource
+//! \see IPluginRegistry::releasePluginResource
+//!
+using IPluginResource = v_1_0::IPluginResource;
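// Editorial sketch (not part of the header): an IPluginResource owning a buffer
// that plugins can share through the registry. Per the contract above, release()
// frees the resource but leaves the object clone-able, and initialization is
// deferred to clones since only clones are registered. The class name
// BufferResource and its members are hypothetical.
#include <cstdint>
#include <new>

class BufferResource : public nvinfer1::IPluginResource
{
public:
    int32_t release() noexcept override
    {
        delete[] mData;  // Free the underlying resource...
        mData = nullptr; // ...while remaining clone-able.
        return 0;
    }

    BufferResource* clone() noexcept override
    {
        auto* cloned = new (std::nothrow) BufferResource();
        if (cloned != nullptr)
        {
            cloned->mData = new (std::nothrow) float[kSIZE]{};
            if (cloned->mData == nullptr)
            {
                delete cloned;
                cloned = nullptr; // Report cloning failure with nullptr.
            }
        }
        return cloned;
    }

    ~BufferResource() noexcept override
    {
        delete[] mData;
    }

private:
    static constexpr int32_t kSIZE{1024};
    float* mData{nullptr};
};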
+
namespace impl
{
//! Maximum number of elements in TensorIOMode enum. \see TensorIOMode
@@ -911,7 +1224,7 @@ struct EnumMaxImpl
//!
//! \brief Return the library version number.
//!
-//! The format is as for TENSORRT_VERSION: (TENSORRT_MAJOR * 1000) + (TENSORRT_MINOR * 100) + TENSOR_PATCH.
+//! The format is as for TENSORRT_VERSION: (MAJOR * 100 + MINOR) * 100 + PATCH
//!
extern "C" TENSORRTAPI int32_t getInferLibVersion() noexcept;
diff --git a/include/NvInferRuntimeCommon.h b/include/NvInferRuntimeCommon.h
index 9a317e65..65a3c220 100644
--- a/include/NvInferRuntimeCommon.h
+++ b/include/NvInferRuntimeCommon.h
@@ -22,9 +22,9 @@
//! \file NvInferRuntimeCommon.h
//!
//! This file provides the nvinfer1::IPluginRegistry interface, which will be moved to the NvInferRuntime.h header
-//! in TensorRT 9.0.
+//! in a future release.
//!
-//! \warning This file will be removed in TensorRT 9.0.
+//! \warning This file will be removed in a future release.
//!
//! \warning Do not directly include this file. Instead include NvInferRuntime.h
//!
@@ -50,15 +50,17 @@ namespace nvinfer1
//! \warning In the automotive safety context, be sure to call IPluginRegistry::setErrorRecorder() to register
//! an error recorder with the registry before using other methods in the registry.
//!
-
class IPluginRegistry
{
public:
- //! Pointer for plugin library handle.
+ //!
+ //! \brief Pointer for plugin library handle.
+ //!
using PluginLibraryHandle = void*;
+
//!
- //! \brief Register a plugin creator. Returns false if one with same type
- //! is already registered.
+ //! \brief Register a plugin creator implementing IPluginCreator. Returns false if any plugin creator with the same
+ //! name, version, and namespace is already registered.
//!
//! \warning The string pluginNamespace must be 1024 bytes or less including the NULL terminator and must be NULL
//! terminated.
@@ -67,17 +69,26 @@ class IPluginRegistry
//! - Allowed context for the API call
//! - Thread-safe: Yes; calls to this method will be synchronized by a mutex.
//!
- virtual bool registerCreator(IPluginCreator& creator, AsciiChar const* const pluginNamespace) noexcept = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by
+ //! IPluginRegistry::registerCreator(IPluginCreatorInterface&, AsciiChar const* const).
+ //!
+ TRT_DEPRECATED virtual bool registerCreator(
+ IPluginCreator& creator, AsciiChar const* const pluginNamespace) noexcept = 0;
//!
//! \brief Return all the registered plugin creators and the number of
//! registered plugin creators. Returns nullptr if none found.
//!
+ //! \warning If any plugin creators are registered or deregistered after calling this function, the returned pointer
+ //! is not guaranteed to be valid thereafter.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: No
//!
- virtual IPluginCreator* const* getPluginCreatorList(int32_t* const numCreators) const noexcept = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by IPluginRegistry::getAllCreators(int32_t* const).
+ //!
+ TRT_DEPRECATED virtual IPluginCreator* const* getPluginCreatorList(int32_t* const numCreators) const noexcept = 0;
//!
//! \brief Return plugin creator based on plugin name, version, and
@@ -86,13 +97,18 @@ class IPluginRegistry
//! \warning The strings pluginName, pluginVersion, and pluginNamespace must be 1024 bytes or less including the
//! NULL terminator and must be NULL terminated.
//!
+ //! \warning Returns nullptr if a plugin creator with matching name, version, and namespace is found, but is not a
+ //! descendant of IPluginCreator
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual IPluginCreator* getPluginCreator(AsciiChar const* const pluginName, AsciiChar const* const pluginVersion,
- AsciiChar const* const pluginNamespace = "") noexcept
- = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by IPluginRegistry::getCreator(AsciiChar const* const,
+ //! AsciiChar const* const, AsciiChar const* const).
+ //!
+ TRT_DEPRECATED virtual IPluginCreator* getPluginCreator(AsciiChar const* const pluginName,
+ AsciiChar const* const pluginVersion, AsciiChar const* const pluginNamespace = "") noexcept = 0;
// @cond SuppressDoxyWarnings
IPluginRegistry() = default;
@@ -100,7 +116,7 @@ class IPluginRegistry
IPluginRegistry(IPluginRegistry&&) = delete;
IPluginRegistry& operator=(IPluginRegistry const&) & = delete;
IPluginRegistry& operator=(IPluginRegistry&&) & = delete;
-// @endcond
+ // @endcond
protected:
virtual ~IPluginRegistry() noexcept = default;
@@ -115,7 +131,7 @@ class IPluginRegistry
//! a recorder has been registered.
//!
//! \param recorder The error recorder to register with this interface.
- //
+ //!
//! \see getErrorRecorder()
//!
//! \usage
@@ -142,21 +158,23 @@ class IPluginRegistry
virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
//!
- //! \brief Deregister a previously registered plugin creator.
+ //! \brief Deregister a previously registered plugin creator implementing IPluginCreator.
//!
//! Since there may be a desire to limit the number of plugins,
//! this function provides a mechanism for removing plugin creators registered in TensorRT.
//! The plugin creator that is specified by \p creator is removed from TensorRT and no longer tracked.
//!
//! \return True if the plugin creator was deregistered, false if it was not found in the registry or otherwise
- //! could
- //! not be deregistered.
+ //! could not be deregistered.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual bool deregisterCreator(IPluginCreator const& creator) noexcept = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Superseded by
+ //! IPluginRegistry::deregisterCreator(IPluginCreatorInterface const&).
+ //!
+ TRT_DEPRECATED virtual bool deregisterCreator(IPluginCreator const& creator) noexcept = 0;
//!
//! \brief Return whether the parent registry will be searched if a plugin is not found in this registry
@@ -194,6 +212,90 @@ class IPluginRegistry
//! \param handle the plugin library handle to deregister.
//!
virtual void deregisterLibrary(PluginLibraryHandle handle) noexcept = 0;
+
+ //!
+ //! \brief Register a plugin creator. Returns false if a plugin creator with the same type
+ //! is already registered.
+ //!
+ //! \warning The string pluginNamespace must be 1024 bytes or less including the NULL terminator and must be NULL
+ //! terminated.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes; calls to this method will be synchronized by a mutex.
+ //!
+ virtual bool registerCreator(IPluginCreatorInterface& creator, AsciiChar const* const pluginNamespace) noexcept = 0;
+
+ //!
+ //! \brief Return all registered plugin creators. Returns nullptr if none found.
+ //!
+ //! \warning If any plugin creators are registered or deregistered after calling this function, the returned pointer
+ //! is not guaranteed to be valid thereafter.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: No
+ //!
+ virtual IPluginCreatorInterface* const* getAllCreators(int32_t* const numCreators) const noexcept = 0;
+
+ //!
+ //! \brief Return a registered plugin creator based on plugin name, version, and namespace associated with the
+ //! plugin during network creation.
+ //!
+ //! \warning The strings pluginName, pluginVersion, and pluginNamespace must be 1024 bytes or less including the
+ //! NULL terminator and must be NULL terminated.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes
+ //!
+ virtual IPluginCreatorInterface* getCreator(AsciiChar const* const pluginName, AsciiChar const* const pluginVersion,
+ AsciiChar const* const pluginNamespace = "") noexcept = 0;
+
+ //!
+ //! \brief Deregister a previously registered plugin creator.
+ //!
+ //! Since there may be a desire to limit the number of plugins,
+ //! this function provides a mechanism for removing plugin creators registered in TensorRT.
+ //! The plugin creator that is specified by \p creator is removed from TensorRT and no longer tracked.
+ //!
+ //! \return True if the plugin creator was deregistered, false if it was not found in the registry or otherwise
+ //! could not be deregistered.
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes
+ //!
+ virtual bool deregisterCreator(IPluginCreatorInterface const& creator) noexcept = 0;
+
+ //!
+ //! \brief Get a plugin resource
+ //! \param key Key for identifying the resource. Cannot be null.
+ //! \param resource A plugin resource object. The object will only need to be valid until this method returns, as
+ //! only a clone of this object will be registered by TRT. Cannot be null.
+ //!
+ //! \return Registered plugin resource object
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes; calls to this method will be synchronized by a mutex.
+ //!
+ virtual IPluginResource* acquirePluginResource(AsciiChar const* key, IPluginResource* resource) noexcept = 0;
+
+ //!
+ //! \brief Decrement reference count for the resource with this key
+ //! If reference count goes to zero after decrement, release() will be invoked on the resource, the key will
+ //! be deregistered and the resource object will be deleted
+ //!
+ //! \param key Key that was used to register the resource. Cannot be null.
+ //!
+ //! \return 0 for success, else non-zero
+ //!
+ //! \usage
+ //! - Allowed context for the API call
+ //! - Thread-safe: Yes; calls to this method will be synchronized by a mutex.
+ //!
+ virtual int32_t releasePluginResource(AsciiChar const* key) noexcept = 0;
};
} // namespace nvinfer1
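// Editorial usage sketch (not part of the diff): enumerating plugin creators
// through the new getAllCreators() method on the global registry, obtained via
// TensorRT's getPluginRegistry() accessor (assuming NvInferRuntime.h is
// included). The function name listCreators is hypothetical.
void listCreators()
{
    nvinfer1::IPluginRegistry* const registry = getPluginRegistry();
    int32_t numCreators{0};
    nvinfer1::IPluginCreatorInterface* const* const creators = registry->getAllCreators(&numCreators);
    for (int32_t i = 0; i < numCreators; ++i)
    {
        // Inspect creators[i]; the returned list is only guaranteed valid until
        // the registry is modified again.
        static_cast<void>(creators[i]);
    }
}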
diff --git a/include/NvInferRuntimePlugin.h b/include/NvInferRuntimePlugin.h
index 208d4e88..ecae2ce9 100644
--- a/include/NvInferRuntimePlugin.h
+++ b/include/NvInferRuntimePlugin.h
@@ -45,12 +45,18 @@ namespace nvinfer1
//!
using PluginFormat = TensorFormat;
+//!
+//! \brief Bit at the plugin version to identify that it is a plugin.
+//!
+static constexpr int32_t kPLUGIN_VERSION_PYTHON_BIT = 0x40;
+
+//!
//! \struct PluginTensorDesc
//!
//! \brief Fields that a plugin might see for an input or output.
//!
//! Scale is only valid when data type is DataType::kINT8. TensorRT will set
-//! the value to -1.0f if it is invalid.
+//! the value to -1.0F if it is invalid.
//!
//! \see IPluginV2IOExt::supportsFormatCombination
//! \see IPluginV2IOExt::configurePlugin
@@ -67,6 +73,7 @@ struct PluginTensorDesc
float scale;
};
+//!
//! \struct PluginVersion
//!
//! \brief Definition of plugin versions.
@@ -83,8 +90,24 @@ enum class PluginVersion : uint8_t
kV2_IOEXT = 2,
//! IPluginV2DynamicExt
kV2_DYNAMICEXT = 3,
+ //! IPluginV2DynamicExt-based Python plugins
+ kV2_DYNAMICEXT_PYTHON = kPLUGIN_VERSION_PYTHON_BIT | 3
+};
+
+//!
+//! \enum PluginCreatorVersion
+//!
+//! \brief Enum to identify version of the plugin creator.
+//!
+enum class PluginCreatorVersion : int32_t
+{
+ //! IPluginCreator
+ kV1 = 0,
+ //! IPluginCreator-based Python plugin creators
+ kV1_PYTHON = kPLUGIN_VERSION_PYTHON_BIT
};
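// Editorial sketch (not part of the header): testing the Python bit defined above
// against a PluginVersion value. The helper name isPythonPlugin is hypothetical.
constexpr bool isPythonPlugin(nvinfer1::PluginVersion const version) noexcept
{
    return (static_cast<int32_t>(version) & nvinfer1::kPLUGIN_VERSION_PYTHON_BIT) != 0;
}

static_assert(isPythonPlugin(nvinfer1::PluginVersion::kV2_DYNAMICEXT_PYTHON), "Python bit set");
static_assert(!isPythonPlugin(nvinfer1::PluginVersion::kV2_DYNAMICEXT), "Python bit clear");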
+//!
//! \class IPluginV2
//!
//! \brief Plugin class for user-implemented layers.
@@ -108,6 +131,8 @@ class TRT_DEPRECATED IPluginV2
//! Do not override this method as it is used by the TensorRT library to maintain backwards-compatibility with
//! plugins.
//!
+ //! \return The TensorRT version in the format (major * 100 + minor) * 100 + patch.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, the implementation provided here is safe to call from any thread.
@@ -119,10 +144,11 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Return the plugin type. Should match the plugin name returned by the corresponding plugin creator
+ //!
//! \see IPluginCreator::getPluginName()
//!
- //! \warning The string returned must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -133,10 +159,11 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Return the plugin version. Should match the plugin version returned by the corresponding plugin creator
+ //!
//! \see IPluginCreator::getPluginVersion()
//!
- //! \warning The string returned must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -148,7 +175,7 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Get the number of outputs from the layer.
//!
- //! \return The number of outputs.
+ //! \return The number of outputs, which is a positive integer.
//!
//! This function is called by the implementations of INetworkDefinition and IBuilder. In particular, it is called
//! prior to any call to initialize().
@@ -163,9 +190,13 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Get the dimension of an output tensor.
//!
- //! \param index The index of the output tensor.
- //! \param inputs The input tensors.
- //! \param nbInputDims The number of input tensors.
+ //! \param index The index of the output tensor. Will lie in the valid range (between 0 and getNbOutputs()-1
+ //! inclusive).
+ //! \param inputs The input tensor dimensions. Will be the start address of a Dims array of length nbInputDims.
+ //! \param nbInputDims The number of input tensors. Will be a non-negative integer.
+ //!
+ //! \return The output tensor dimensions if the index is in the valid range.
+ //! An invalid value of Dims{-1, {}} must be returned if the index is not in the valid range.
//!
//! This function is called by the implementations of INetworkDefinition and IBuilder. In particular, it is called
//! prior to any call to initialize().
@@ -175,7 +206,7 @@ class TRT_DEPRECATED IPluginV2
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
//! when building networks on multiple devices sharing the same plugin.
//!
- //! \note In any non-IPluginV2DynamicExt plugin, batch size should not be included in the returned dimensions,
+ //! \note In any non-IPluginV2DynamicExt plugin, batch size must not be included in the returned dimensions,
//! even if the plugin is expected to be run in a network with explicit batch mode enabled.
//! Please see the TensorRT Developer Guide for more details on how plugin inputs and outputs behave.
//!
@@ -186,6 +217,7 @@ class TRT_DEPRECATED IPluginV2
//!
//! \param type DataType requested.
//! \param format PluginFormat requested.
+ //!
//! \return true if the plugin supports the type-format combination.
//!
//! This function is called by the implementations of INetworkDefinition, IBuilder, and
@@ -211,13 +243,14 @@ class TRT_DEPRECATED IPluginV2
//! This function is called by the builder prior to initialize(). It provides an opportunity for the layer to make
//! algorithm choices on the basis of its weights, dimensions, and maximum batch size.
//!
- //! \param inputDims The input tensor dimensions.
- //! \param nbInputs The number of inputs.
- //! \param outputDims The output tensor dimensions.
- //! \param nbOutputs The number of outputs.
+ //! \param inputDims The input tensor dimensions. Will be the start address of a Dims array of length nbInputs.
+ //! \param nbInputs The number of inputs. Will be a non-negative integer.
+ //! \param outputDims The output tensor dimensions. Will be the start address of a Dims array of length nbOutputs.
+ //! \param nbOutputs The number of outputs. Will be a positive integer identical to the return value of
+ //! getNbOutputs().
//! \param type The data type selected for the engine.
//! \param format The format selected for the engine.
- //! \param maxBatchSize The maximum batch size.
+ //! \param maxBatchSize The maximum batch size. Will be a positive integer.
//!
//! The dimensions passed here do not include the outermost batch size (i.e. for 2-D image networks, they will be
//! 3-dimensional CHW dimensions).
@@ -256,6 +289,7 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Release resources acquired during plugin layer initialization. This is called when the engine is
//! destroyed.
+ //!
//! \see initialize()
//!
//! \usage
@@ -270,10 +304,13 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Find the workspace size required by the layer.
//!
- //! This function is called during engine startup, after initialize(). The workspace size returned should be
+ //! This function is called during engine startup, after initialize(). The workspace size returned must be
//! sufficient for any batch size up to the maximum.
//!
- //! \return The workspace size.
+ //! \param maxBatchSize The maximum batch size, which will be a positive integer.
+ //!
+ //! \return The workspace size in bytes, i.e. the device memory size that the plugin requires for its internal
+ //! computations.
//!
//! \usage
//! - Allowed context for the API call
@@ -287,10 +324,15 @@ class TRT_DEPRECATED IPluginV2
//! \brief Execute the layer.
//!
//! \param batchSize The number of inputs in the batch.
- //! \param inputs The memory for the input tensors.
- //! \param outputs The memory for the output tensors.
- //! \param workspace Workspace for execution.
- //! \param stream The stream in which to execute the kernels.
+ //! \param inputs The memory for the input tensors. Will be an array of device addresses corresponding to input
+ //! tensors of length nbInputs, where nbInputs is the second parameter passed to configureWithFormat().
+ //! The i-th input tensor will have the dimensions inputDims[i], where inputDims is the first parameter
+ //! that was passed to configureWithFormat().
+ //! \param outputs The memory for the output tensors. Will be an array of device addresses corresponding to output
+ //! tensors of length getNbOutputs().
+ //! \param workspace Workspace for execution. Will be the start address of a device buffer whose length will be at
+ //! least getWorkspaceSize(batchSize).
+ //! \param stream The stream in which to execute the kernels. This will be a valid CUDA stream.
//!
//! \return 0 for success, else non-zero (which will cause engine termination).
//!
@@ -304,9 +346,9 @@ class TRT_DEPRECATED IPluginV2
= 0;
//!
- //! \brief Find the size of the serialization buffer required.
+ //! \brief Find the size of the serialization buffer required to store the plugin configuration in a binary file.
//!
- //! \return The size of the serialization buffer.
+ //! \return The size of the serialization buffer in bytes.
//!
//! \usage
//! - Allowed context for the API call
@@ -318,8 +360,8 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Serialize the layer.
//!
- //! \param buffer A pointer to a buffer to serialize data. Size of buffer must be equal to value returned by
- //! getSerializationSize.
+ //! \param buffer A pointer to a host buffer to serialize data. Size of buffer will be at least as large as the
+ //! value returned by getSerializationSize.
//!
//! \see getSerializationSize()
//!
@@ -346,7 +388,10 @@ class TRT_DEPRECATED IPluginV2
//!
//! The TensorRT runtime calls clone() to clone the plugin when an execution context is created for an engine,
//! after the engine has been created. The runtime does not call initialize() on the cloned plugin,
- //! so the cloned plugin should be created in an initialized state.
+ //! so the cloned plugin must be created in an initialized state.
+ //!
+ //! \return A cloned plugin object in an initialized state with the same parameters as the current object.
+ //! nullptr must be returned if the cloning fails, e.g. because of resource exhaustion.
//!
//! \usage
//! - Allowed context for the API call
@@ -358,12 +403,12 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Set the namespace that this plugin object belongs to. Ideally, all plugin
- //! objects from the same plugin library should have the same namespace.
+ //! objects from the same plugin library must have the same namespace.
//!
//! \param pluginNamespace The namespace for the plugin object.
//!
- //! \warning The string pluginNamespace must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \warning The string pluginNamespace will be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -375,6 +420,9 @@ class TRT_DEPRECATED IPluginV2
//!
//! \brief Return the namespace of the plugin object.
//!
+ //! \return The namespace string that was passed to setPluginNamespace(), possibly after truncation to 1024 bytes
+ //! if a longer string was passed. An empty string must be returned as default value.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
@@ -396,13 +444,14 @@ class TRT_DEPRECATED IPluginV2
// @endcond
};
+//!
//! \class IPluginV2Ext
//!
//! \brief Plugin class for user-implemented layers.
//!
//! Plugins are a mechanism for applications to implement custom layers. This
//! interface provides additional capabilities to the IPluginV2 interface by
-//! supporting different output data types and broadcast across batch.
+//! supporting different output data types and broadcast across batches.
//!
//! \see IPluginV2
//!
@@ -415,7 +464,15 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//!
//! \brief Return the DataType of the plugin output at the requested index.
//!
- //! The default behavior should be to return the type of the first input, or DataType::kFLOAT if the layer has no
+ //! \param index The output tensor index in the valid range between 0 and getNbOutputs()-1.
+ //! \param inputTypes The data types of the input tensors, stored in an array of length nbInputs.
+ //! \param nbInputs The number of input tensors. Will be a non-negative integer.
+ //!
+ //! \return The data type of the output tensor with the provided index if the input tensors have the data types
+ //! provided in inputTypes, provided the output tensor index is in the valid range. DataType::kFLOAT must be
+ //! returned if the index is not in the valid range.
+ //!
+ //! The default behavior must be to return the type of the first input, or DataType::kFLOAT if the layer has no
//! inputs. The returned data type must have a format that is supported by the plugin.
//!
//! \see supportsFormat()
@@ -431,11 +488,14 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
int32_t index, nvinfer1::DataType const* inputTypes, int32_t nbInputs) const noexcept
= 0;
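A hedged sketch of the documented default behavior, for a hypothetical class derived from IPluginV2Ext (`MyPluginV2Ext` is an illustrative name):

```cpp
nvinfer1::DataType MyPluginV2Ext::getOutputDataType(
    int32_t index, nvinfer1::DataType const* inputTypes, int32_t nbInputs) const noexcept
{
    // Out-of-range indices must map to DataType::kFLOAT per the contract above.
    if (index < 0 || index >= getNbOutputs())
    {
        return nvinfer1::DataType::kFLOAT;
    }
    // Default behavior: the type of the first input, or kFLOAT if there are no inputs.
    return (nbInputs > 0) ? inputTypes[0] : nvinfer1::DataType::kFLOAT;
}
```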
- //! \brief Return true if output tensor is broadcast across a batch.
//!
- //! \param outputIndex The index of the output
- //! \param inputIsBroadcasted The ith element is true if the tensor for the ith input is broadcast across a batch.
- //! \param nbInputs The number of inputs
+ //! \brief Return true if the output tensor is broadcast across a batch.
+ //!
+ //! \param outputIndex The index of the output tensor, which will be in the valid range between 0 and
+ //! getNbOutputs()-1.
+ //! \param inputIsBroadcasted A boolean array of length nbInputs. The i-th element will be true if and only if
+ //! the tensor for the i-th input is broadcast across a batch.
+ //! \param nbInputs The number of inputs. Will be a non-negative integer.
//!
//! The values in inputIsBroadcasted refer to broadcasting at the semantic level,
//! i.e. are unaffected by whether method canBroadcastInputAcrossBatch requests
@@ -446,18 +506,25 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
//! when building networks on multiple devices sharing the same plugin.
//!
- virtual bool isOutputBroadcastAcrossBatch(
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch support is removed in TensorRT 10.0.
+ //!
+ TRT_DEPRECATED virtual bool isOutputBroadcastAcrossBatch(
int32_t outputIndex, bool const* inputIsBroadcasted, int32_t nbInputs) const noexcept
= 0;
- //! \brief Return true if plugin can use input that is broadcast across batch without replication.
//!
- //! \param inputIndex Index of input that could be broadcast.
+ //! \brief Return true if the plugin can use an input tensor that is broadcast across batch without replication.
+ //!
+ //! \param inputIndex Index of input that could be broadcast. Will be in the valid range between 0 and
+ //! nbInputs - 1 where nbInputs is the maximum number of input tensors supported by this plugin.
+ //!
+ //! \return True if the index is in the valid range and the plugin is able to broadcast a single copy of this
+ //! input tensor across the batch, false otherwise.
//!
//! For each input whose tensor is semantically broadcast across a batch,
//! TensorRT calls this method before calling configurePlugin.
//! If canBroadcastInputAcrossBatch returns true, TensorRT will not replicate the input tensor;
- //! i.e., there will be a single copy that the plugin should share across the batch.
+ //! i.e., there will be a single copy that the plugin must share across the batch.
//! If it returns false, TensorRT will replicate the input tensor
//! so that it appears like a non-broadcasted tensor.
//!
@@ -468,7 +535,9 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
//! when building networks on multiple devices sharing the same plugin.
//!
- virtual bool canBroadcastInputAcrossBatch(int32_t inputIndex) const noexcept = 0;
+ //! \deprecated Deprecated in TensorRT 10.0. Implicit batch support is removed in TensorRT 10.0.
+ //!
+ TRT_DEPRECATED virtual bool canBroadcastInputAcrossBatch(int32_t inputIndex) const noexcept = 0;
//!
//! \brief Configure the layer with input and output data types.
@@ -476,20 +545,22 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//! This function is called by the builder prior to initialize(). It provides an opportunity for the layer to make
//! algorithm choices on the basis of its weights, dimensions, data types and maximum batch size.
//!
- //! \param inputDims The input tensor dimensions.
- //! \param nbInputs The number of inputs.
- //! \param outputDims The output tensor dimensions.
- //! \param nbOutputs The number of outputs.
- //! \param inputTypes The data types selected for the plugin inputs.
- //! \param outputTypes The data types selected for the plugin outputs.
+ //! \param inputDims The input tensor dimensions. Will be an array of length nbInputs.
+ //! \param nbInputs The number of inputs. Will be a non-negative integer.
+ //! \param outputDims The output tensor dimensions. Will be an array of length nbOutputs.
+ //! \param nbOutputs The number of outputs. Will be a positive integer.
+ //! \param inputTypes The data types selected for the plugin inputs. Will be an array of length nbInputs.
+ //! \param outputTypes The data types selected for the plugin outputs. Will be an array of length nbOutputs.
//! \param inputIsBroadcast True for each input that the plugin must broadcast across the batch.
+ //! Will be an array of length nbInputs.
//! \param outputIsBroadcast True for each output that TensorRT will broadcast across the batch.
+ //! Will be an array of length nbOutputs.
//! \param floatFormat The format selected for the engine for the floating point inputs/outputs.
- //! \param maxBatchSize The maximum batch size.
+ //! \param maxBatchSize The maximum batch size. Will be a positive integer.
//!
//! The dimensions passed here do not include the outermost batch size (i.e. for 2-D image networks, they will be
//! 3-dimensional CHW dimensions). When inputIsBroadcast or outputIsBroadcast is true, the outermost batch size for
- //! that input or output should be treated as if it is one.
+ //! that input or output must be treated as if it is one.
//! Index 'i' of inputIsBroadcast is true only if the input is semantically broadcast across the batch and
//! calling canBroadcastInputAcrossBatch with argument 'i' returns true.
//! Index 'i' of outputIsBroadcast is true only if calling isOutputBroadcastAcrossBatch with argument 'i'
@@ -515,10 +586,12 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//!
//! \brief Attach the plugin object to an execution context and grant the plugin the access to some context
- //! resource.
+ //! resources.
//!
- //! \param cudnn The CUDNN context handle of the execution context
- //! \param cublas The cublas context handle of the execution context
+ //! \param cudnn The cuDNN context handle of the execution context. Will be a valid cuDNN context handle, or
+ //! nullptr if TacticSource::kCUDNN is disabled.
+ //! \param cublas The cuBLAS context handle of the execution context. Will be a valid cuBLAS context handle, or
+ //! nullptr if TacticSource::kCUBLAS is disabled.
//! \param allocator The allocator used by the execution context
//!
//! This function is called automatically for each plugin when a new execution context is created. If the context
@@ -526,10 +599,19 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//! new resources are assigned to the context.
//!
//! If the plugin needs per-context resource, it can be allocated here.
- //! The plugin can also get context-owned CUDNN and CUBLAS context here.
+ //! The plugin can also get context-owned cuDNN and cuBLAS context here.
+ //!
+ //! \note The TacticSource::kCUDNN and TacticSource::kCUBLAS flags are disabled by default.
+ //! The allocator pointer is unique to each building or execution context instance having overlapping lifetimes.
+ //! It can be used as a key to manage resources across plugin instances sharing the same context.
+ //! Plugins attached to different contexts will have different handles as their execution will not overlap.
+ //!
+ //! \see TacticSources
+ //! \see getPluginCudnnHandle(void* executionContextIdentifier)
+ //! \see getPluginCublasHandle(void* executionContextIdentifier)
//!
- //! \note In the automotive safety context, the CUDNN and CUBLAS parameters will be nullptr because CUDNN and CUBLAS
- //! is not used by the safe runtime.
+ //! \note In the automotive safety context, the cuDNN and cuBLAS parameters will be nullptr because cuDNN and cuBLAS
+ //! are not used by the safe runtime.
//!
//! \usage
//! - Allowed context for the API call
@@ -544,7 +626,7 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//!
//! \brief Detach the plugin object from its execution context.
//!
- //! This function is called automatically for each plugin when a execution context is destroyed or the context
+ //! This function is called automatically for each plugin when an execution context is destroyed or the context
//! resources are unassigned from the context.
//!
//! If the plugin owns per-context resource, it can be released here.
@@ -559,10 +641,12 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//!
//! \brief Clone the plugin object. This copies over internal plugin parameters as well and returns a new plugin
//! object with these parameters. If the source plugin is pre-configured with configurePlugin(), the returned object
- //! should also be pre-configured. The returned object should allow attachToContext() with a new execution context.
+ //! must also be pre-configured. The returned object must allow attachToContext() with a new execution context.
//! Cloned plugin objects can share the same per-engine immutable resource (e.g. weights) with the source object
//! (e.g. via ref-counting) to avoid duplication.
//!
+ //! \return A pointer to a cloned plugin object if cloning was successful, otherwise nullptr.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
@@ -582,6 +666,10 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//! \brief Return the API version with which this plugin was built. The
//! upper byte reserved by TensorRT and is used to differentiate this from IPluginV2.
//!
+ //! \return In the lower three bytes, the TensorRT version in the format
+ //! (major * 100 + minor) * 100 + patch.
+ //! In the upper byte, the value 1.
+ //!
//! Do not override this method as it is used by the TensorRT library to maintain backwards-compatibility with
//! plugins.
//!
@@ -596,7 +684,10 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
}
//!
- //! \brief Derived classes should not implement this. In a C++11 API it would be override final.
+ //! \brief Derived classes must not implement this. In a C++11 API it would be override final.
+ //!
+ //! IPluginV2Ext::configureWithFormat() is a no-op for all classes derived from IPluginV2Ext.
+ //! These classes call configurePlugin() instead.
//!
void configureWithFormat(Dims const* /*inputDims*/, int32_t /*nbInputs*/, Dims const* /*outputDims*/,
int32_t /*nbOutputs*/, DataType /*type*/, PluginFormat /*format*/, int32_t /*maxBatchSize*/) noexcept override
@@ -604,6 +695,7 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
}
};
+//!
//! \class IPluginV2IOExt
//!
//! \brief Plugin class for user-implemented layers.
@@ -613,7 +705,9 @@ class TRT_DEPRECATED IPluginV2Ext : public IPluginV2
//!
//! \see IPluginV2Ext
//!
-class IPluginV2IOExt : public IPluginV2Ext
+//! \deprecated Deprecated in TensorRT 10.0.
+//!
+class TRT_DEPRECATED IPluginV2IOExt : public IPluginV2Ext
{
public:
//!
@@ -644,10 +738,10 @@ class IPluginV2IOExt : public IPluginV2Ext
//! Using this numbering, pos is an index into InOut, where 0 <= pos < nbInputs+nbOutputs.
//!
//! TensorRT invokes this method to ask if the input/output indexed by pos supports the format/datatype specified
- //! by inOut[pos].format and inOut[pos].type. The override should return true if that format/datatype at inOut[pos]
+ //! by inOut[pos].format and inOut[pos].type. The override must return true if that format/datatype at inOut[pos]
//! are supported by the plugin. If support is conditional on other input/output formats/datatypes, the plugin can
//! make its result conditional on the formats/datatypes in inOut[0..pos-1], which will be set to values
- //! that the plugin supports. The override should not inspect inOut[pos+1..nbInputs+nbOutputs-1],
+ //! that the plugin supports. The override must not inspect inOut[pos+1..nbInputs+nbOutputs-1],
//! which will have invalid values. In other words, the decision for pos must be based on inOut[0..pos] only.
//!
//! Some examples:
@@ -711,11 +805,17 @@ class IPluginV2IOExt : public IPluginV2Ext
private:
// Following are obsolete base class methods, and must not be implemented or used.
+ //!
+ //! \brief Set plugin configuration.
+ //!
void configurePlugin(Dims const*, int32_t, Dims const*, int32_t, DataType const*, DataType const*, bool const*,
bool const*, PluginFormat, int32_t) noexcept final
{
}
+ //!
+ //! \brief Check if provided data type is supported.
+ //!
bool supportsFormat(DataType, PluginFormat) const noexcept final
{
return false;
@@ -724,9 +824,9 @@ class IPluginV2IOExt : public IPluginV2Ext
//!
//! \enum PluginFieldType
+//!
//! \brief The possible field types for custom layer.
//!
-
enum class PluginFieldType : int32_t
{
//! FP16 field type.
@@ -746,7 +846,13 @@ enum class PluginFieldType : int32_t
//! nvinfer1::Dims field type.
kDIMS = 7,
//! Unknown field type.
- kUNKNOWN = 8
+ kUNKNOWN = 8,
+ //! BF16 field type.
+ kBF16 = 9,
+ //! INT64 field type.
+ kINT64 = 10,
+ //! FP8 field type.
+ kFP8 = 11,
};
//!
@@ -759,22 +865,13 @@ enum class PluginFieldType : int32_t
class PluginField
{
public:
- //!
- //! \brief Plugin field attribute name
- //!
+ //! Plugin field attribute name
AsciiChar const* name;
- //!
- //! \brief Plugin field attribute data
- //!
+ //! Plugin field attribute data
void const* data;
- //!
- //! \brief Plugin field attribute type
- //! \see PluginFieldType
- //!
+ //! Plugin field attribute type
PluginFieldType type;
- //!
- //! \brief Number of data entries in the Plugin attribute
- //!
+ //! Number of data entries in the Plugin attribute
int32_t length;
PluginField(AsciiChar const* const name_ = nullptr, void const* const data_ = nullptr,
@@ -787,7 +884,11 @@ class PluginField
}
};
-//! Plugin field collection struct.
+//!
+//! \struct PluginFieldCollection
+//!
+//! \brief Plugin field collection struct.
+//!
struct PluginFieldCollection
{
//! Number of PluginField entries.
@@ -797,33 +898,56 @@ struct PluginFieldCollection
};
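For context, a hedged sketch of how an application might populate these structures before invoking a plugin creator; the attribute names ("pads", "alpha") are hypothetical, and the header path is assumed:

```cpp
#include <cstdint>
#include <vector>
#include "NvInferRuntimePlugin.h" // assumed location of PluginField/PluginFieldCollection

nvinfer1::PluginFieldCollection makeFieldCollection(std::vector<nvinfer1::PluginField>& storage)
{
    static int32_t const pads[] = {1, 1, 1, 1};
    static float const alpha = 0.2F;

    storage.clear();
    storage.emplace_back("pads", pads, nvinfer1::PluginFieldType::kINT32, 4);
    storage.emplace_back("alpha", &alpha, nvinfer1::PluginFieldType::kFLOAT32, 1);

    // The collection only references the PluginField array; `storage` must outlive it.
    nvinfer1::PluginFieldCollection fc{};
    fc.nbFields = static_cast<int32_t>(storage.size());
    fc.fields = storage.data();
    return fc;
}
```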
//!
-//! \class IPluginCreator
+//! \enum PluginCapabilityType
//!
-//! \brief Plugin creator class for user implemented layers.
+//! \brief Enumerates the different capability types an IPluginV3 object may have
//!
-//! \see IPlugin and IPluginFactory
+enum class PluginCapabilityType : int32_t
+{
+ //! Core capability. Every IPluginV3 object must have this.
+ kCORE = 0,
+ //! Build capability. IPluginV3 objects provided to the TensorRT build phase must have this.
+ kBUILD = 1,
+ //! Runtime capability. IPluginV3 objects provided to the TensorRT build and execution phases must have this.
+ kRUNTIME = 2
+};
+
+//!
+//! \enum TensorRTPhase
//!
+//! \brief Indicates a phase of operation of TensorRT
+//!
+enum class TensorRTPhase : int32_t
+{
+ //! Build phase of TensorRT
+ kBUILD = 0,
+ //! Execution phase of TensorRT
+ kRUNTIME = 1
+};
-class IPluginCreator
+namespace v_1_0
+{
+class IPluginCreatorInterface : public IVersionedInterface
{
public:
- //!
- //! \brief Return the version of the API the plugin creator was compiled with.
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes, the implementation provided here is safe to call from any thread.
- //!
- virtual int32_t getTensorRTVersion() const noexcept
- {
- return NV_TENSORRT_VERSION;
- }
+ ~IPluginCreatorInterface() noexcept override = default;
+
+protected:
+ IPluginCreatorInterface() = default;
+ IPluginCreatorInterface(IPluginCreatorInterface const&) = default;
+ IPluginCreatorInterface(IPluginCreatorInterface&&) = default;
+ IPluginCreatorInterface& operator=(IPluginCreatorInterface const&) & = default;
+ IPluginCreatorInterface& operator=(IPluginCreatorInterface&&) & = default;
+};
+class TRT_DEPRECATED IPluginCreator : public IPluginCreatorInterface
+{
+public:
//!
//! \brief Return the plugin name.
//!
- //! \warning The string returned must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including
+ //! the NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -836,8 +960,8 @@ class IPluginCreator
//!
//! \brief Return the plugin version.
//!
- //! \warning The string returned must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including
+ //! the NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -848,7 +972,8 @@ class IPluginCreator
virtual AsciiChar const* getPluginVersion() const noexcept = 0;
//!
- //! \brief Return a list of fields that needs to be passed to createPlugin.
+ //! \brief Return a list of fields that need to be passed to createPlugin.
+ //!
//! \see PluginFieldCollection
//!
//! \usage
@@ -862,6 +987,9 @@ class IPluginCreator
//!
//! \brief Return a plugin object. Return nullptr in case of error.
//!
+ //! \param name A NULL-terminated name string of 1024 bytes or less, including the NULL terminator.
+ //! \param fc A pointer to a collection of fields needed for constructing the plugin.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
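A hedged sketch of a createPlugin() override that reads a hypothetical "alpha" attribute from the field collection; `MyPlugin` and `MyPluginCreator` are illustrative names, not part of this header:

```cpp
#include <cstring>
#include <new>

nvinfer1::IPluginV2* MyPluginCreator::createPlugin(
    nvinfer1::AsciiChar const* name, nvinfer1::PluginFieldCollection const* fc) noexcept
{
    float alpha = 1.0F;
    for (int32_t i = 0; i < fc->nbFields; ++i)
    {
        if (std::strcmp(fc->fields[i].name, "alpha") == 0
            && fc->fields[i].type == nvinfer1::PluginFieldType::kFLOAT32)
        {
            alpha = *static_cast<float const*>(fc->fields[i].data);
        }
    }
    // Per the contract above, return nullptr in case of error.
    return new (std::nothrow) MyPlugin(name, alpha);
}
```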
@@ -873,6 +1001,12 @@ class IPluginCreator
//!
//! \brief Called during deserialization of plugin layer. Return a plugin object.
//!
+ //! \param name A NULL-terminated name string of 1024 bytes or less, including the NULL terminator.
+ //! \param serialData The start address of a byte array with the serialized plugin representation.
+ //! \param serialLength The length in bytes of the byte array with the serialized plugin representation.
+ //!
+ //! \return A deserialized plugin object.
+ //!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes, this method is required to be thread-safe and may be called from multiple threads
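A hedged sketch of a deserializePlugin() override that mirrors the serialize() layout of a hypothetical `MyPlugin` (the member layout and class names are illustrative only):

```cpp
#include <cstring>
#include <new>

nvinfer1::IPluginV2* MyPluginCreator::deserializePlugin(
    nvinfer1::AsciiChar const* name, void const* serialData, size_t serialLength) noexcept
{
    int32_t channels{0};
    float alpha{1.0F};
    if (serialLength < sizeof(channels) + sizeof(alpha))
    {
        return nullptr; // malformed serialization
    }
    auto const* d = static_cast<char const*>(serialData);
    std::memcpy(&channels, d, sizeof(channels));
    d += sizeof(channels);
    std::memcpy(&alpha, d, sizeof(alpha));
    return new (std::nothrow) MyPlugin(name, channels, alpha);
}
```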
@@ -886,6 +1020,8 @@ class IPluginCreator
//! \brief Set the namespace of the plugin creator based on the plugin
//! library it belongs to. This can be set while registering the plugin creator.
//!
+ //! \param pluginNamespace A NULL-terminated namespace string of 1024 bytes or less, including the NULL terminator.
+ //!
//! \see IPluginRegistry::registerCreator()
//!
//! \usage
@@ -899,8 +1035,8 @@ class IPluginCreator
//!
//! \brief Return the namespace of the plugin creator object.
//!
- //! \warning The string returned must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \warning The string returned must be NULL-terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -911,16 +1047,46 @@ class IPluginCreator
virtual AsciiChar const* getPluginNamespace() const noexcept = 0;
IPluginCreator() = default;
- virtual ~IPluginCreator() = default;
+ ~IPluginCreator() override = default;
protected:
-// @cond SuppressDoxyWarnings
+ // @cond SuppressDoxyWarnings
IPluginCreator(IPluginCreator const&) = default;
IPluginCreator(IPluginCreator&&) = default;
IPluginCreator& operator=(IPluginCreator const&) & = default;
IPluginCreator& operator=(IPluginCreator&&) & = default;
// @endcond
+public:
+ //!
+ //! \brief Return version information associated with this interface. Applications must not override this method.
+ //!
+ InterfaceInfo getInterfaceInfo() const noexcept override
+ {
+ return InterfaceInfo{"PLUGIN CREATOR_V1", 1, 0};
+ }
};
+} // namespace v_1_0
+
+//!
+//! \class IPluginCreatorInterface
+//!
+//! \brief Base class for all plugin creator versions.
+//!
+//! \see IPluginCreator and IPluginRegistry
+//!
+using IPluginCreatorInterface = v_1_0::IPluginCreatorInterface;
+
+//!
+//! \class IPluginCreator
+//!
+//! \brief Plugin creator class for user implemented layers.
+//!
+//! \see IPlugin and IPluginFactory
+//!
+//! \deprecated Deprecated in TensorRT 10.0. Please implement IPluginCreatorV3One along with IPluginV3 plugins
+//! instead.
+//!
+using IPluginCreator = v_1_0::IPluginCreator;
} // namespace nvinfer1
diff --git a/include/NvInferSafeRuntime.h b/include/NvInferSafeRuntime.h
index fbc5a6af..1c322c4e 100644
--- a/include/NvInferSafeRuntime.h
+++ b/include/NvInferSafeRuntime.h
@@ -61,14 +61,18 @@ class IRuntime
{
public:
//!
- //! \brief Deserialize an engine from a stream.
+ //! \brief Deserialize an engine from a byte array.
//!
//! If the serialized engine requires plugins the plugin creator must be registered by calling
- //! IPluginRegistry::registerCreator() before calling deserializeCudaEngine(). Every plugin creator
- //! registered must have a unique combination of namespace, plugin name, and version.
+ //! IPluginRegistry::registerCreator() before calling deserializeCudaEngine().
//!
- //! \param blob The memory that holds the serialized engine.
- //! \param size The size of the memory in bytes.
+ //! \param blob The memory that holds the serialized engine. The content must be a copy of
+ //! the result of calling IHostMemory::data() on a serialized plan that was created via calling
+ //! IBuilder::buildSerializedNetwork() on a network within the supported safety scope.
+ //! Additionally, it must have been validated via IConsistencyChecker::validate().
+ //!
+ //! \param size The size of the memory in bytes. This must be the result of calling IHostMemory::size()
+ //! on the same IHostMemory object that is associated with the blob parameter.
//!
//! \return The engine, or nullptr if it could not be deserialized.
//!
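As a hedged end-to-end sketch: the factory function nvinfer1::safe::createInferRuntime() and the `logger`/`plan` objects are assumed to be available and to satisfy the constraints described in the parameters above.

```cpp
nvinfer1::safe::ICudaEngine* deserializeSafeEngine(nvinfer1::ILogger& logger, nvinfer1::IHostMemory& plan)
{
    nvinfer1::safe::IRuntime* runtime = nvinfer1::safe::createInferRuntime(logger);
    if (runtime == nullptr)
    {
        return nullptr;
    }
    // blob and size must come from the same IHostMemory object, as documented above.
    return runtime->deserializeCudaEngine(plan.data(), plan.size());
}
```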
@@ -83,12 +87,9 @@ class IRuntime
//!
//! \brief Set the GPU allocator.
- //! \param allocator Set the GPU allocator to be used by the runtime. All GPU memory acquired will use this
- //! allocator. If NULL is passed, the default allocator will be used.
- //!
- //! Default: uses cudaMalloc/cudaFree.
//!
- //! If nullptr is passed, the default allocator will be used.
+ //! \param allocator The GPU allocator to be used by the runtime. All GPU memory acquired will use this
+ //! allocator. If nullptr is passed, the default allocator will be used, which calls cudaMalloc and cudaFree.
//!
//! \usage
//! - Allowed context for the API call
@@ -100,12 +101,13 @@ class IRuntime
//! \brief Set the ErrorRecorder for this interface.
//!
//! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
- //! This function will call incRefCount of the registered ErrorRecorder at least once. Setting
- //! recorder to nullptr unregisters the recorder with the interface, resulting in a call to decRefCount if
- //! a recorder has been registered.
+ //! This function will call incRefCount of the registered ErrorRecorder at least once. If the recorder is set to
+ //! nullptr, an error code of ErrorCode::kINVALID_ARGUMENT will be emitted if the recorder has already been
+ //! registered, or ILogger::Severity::kERROR will be logged if the recorder has not yet been registered.
+ //!
+ //! \param recorder The error recorder to register with this interface, or nullptr to deregister the current
+ //! error recorder.
//!
- //! \param recorder The error recorder to register with this interface.
- //
//! \see getErrorRecorder()
//!
//! \usage
@@ -118,9 +120,10 @@ class IRuntime
//! \brief Get the ErrorRecorder assigned to this interface.
//!
//! Retrieves the assigned error recorder object for the given class. A default error recorder does not exist,
- //! so a nullptr will be returned if setErrorRecorder has not been called.
+ //! so a nullptr will be returned if setErrorRecorder has not been called or a previously assigned error recorder
+ //! has been deregistered.
//!
- //! \return A pointer to the IErrorRecorder object that has been registered.
+ //! \return A pointer to the IErrorRecorder object that has been registered, or nullptr if no error recorder is set.
//!
//! \see setErrorRecorder()
//!
@@ -148,119 +151,20 @@ class IRuntime
class ICudaEngine
{
public:
- //!
- //! \brief Get the number of binding indices.
- //!
- //! \return The number of binding indices.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getNbIOTensors.
- //!
- //! \see getBindingIndex()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual std::int32_t getNbBindings() const noexcept = 0;
-
- //!
- //! \brief Retrieve the binding index for a named tensor.
- //!
- //! safe::IExecutionContext::enqueueV2() requires an array of buffers.
- //! Engine bindings map from tensor names to indices in this array.
- //! Binding indices are assigned at engine build time, and take values in the range [0 ... n-1] where n is the total
- //! number of inputs and outputs.
- //!
- //! \warning Strings passed to the runtime must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
- //!
- //! \param name The tensor name.
- //! \return The binding index for the named tensor, or -1 if the name is not found.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by name-based methods. Use them instead of binding-index
- //! based methods.
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual std::int32_t getBindingIndex(AsciiChar const* const name) const noexcept = 0;
-
- //!
- //! \brief Retrieve the name corresponding to a binding index.
- //!
- //! This is the reverse mapping to that provided by getBindingIndex().
- //!
- //! \param bindingIndex The binding index.
- //! \return The name corresponding to the index, or nullptr if the index is out of range.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by name-based methods. Use them instead of binding-index
- //! based methods.
- //!
- //! \see getBindingIndex()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual AsciiChar const* getBindingName(std::int32_t const bindingIndex) const noexcept = 0;
-
- //!
- //! \brief Determine whether a binding is an input binding.
- //!
- //! \param bindingIndex The binding index.
- //! \return True if the index corresponds to an input binding and the index is in range.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by tensorIOMode().
- //!
- //! \see safe::ICudaEngine::tensorIOMode()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual bool bindingIsInput(std::int32_t const bindingIndex) const noexcept = 0;
-
- //!
- //! \brief Get the dimensions of a binding.
- //!
- //! \param bindingIndex The binding index.
- //! \return The dimensions of the binding if the index is in range, otherwise Dims()
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorShape().
- //!
- //! \see safe::ICudaEngine::getTensorShape()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual Dims getBindingDimensions(std::int32_t const bindingIndex) const noexcept = 0;
-
- //!
- //! \brief Determine the required data type for a buffer from its binding index.
- //!
- //! \param bindingIndex The binding index.
- //! \return The type of the data in the buffer.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorDataType().
- //!
- //! \see safe::ICudaEngine::getTensorDataType()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual DataType getBindingDataType(std::int32_t const bindingIndex) const noexcept = 0;
-
//!
//! \brief Create an execution context.
//!
//! \see safe::IExecutionContext.
//!
+ //! \return An execution context object if it can be constructed, or nullptr if the construction fails.
+ //!
+ //! \details Reasons for failure may include, but are not limited to:
+ //! - Heap memory exhaustion
+ //! - Device memory exhaustion
+ //!
//! \usage
//! - Allowed context for the API call
- //! - Thread-safe: Yes; if createExecutionContext fails, users should treat this as a critical
+ //! - Thread-safe: Yes; if createExecutionContext fails, users must treat this as a critical
//! error and not perform any subsequent TensorRT operations apart from outputting
//! the error logs.
//!
@@ -269,13 +173,18 @@ class ICudaEngine
//!
//! \brief Create an execution context without any device memory allocated.
//!
- //! The memory for execution of this device context must be supplied by the application.
+ //! The memory for execution of this device context must be supplied by the application by calling
+ //! safe::IExecutionContext::setDeviceMemory().
//!
//! \see getDeviceMemorySize() safe::IExecutionContext::setDeviceMemory()
//!
+ //! \return An execution context object if it can be constructed, or nullptr if the construction fails.
+ //!
+ //! \details Reasons for failure may include, but are not limited to, heap memory exhaustion.
+ //!
//! \usage
//! - Allowed context for the API call
- //! - Thread-safe: Yes; if createExecutionContext fails, users should treat this as a critical
+ //! - Thread-safe: Yes; if createExecutionContext fails, users must treat this as a critical
//! error and not perform any subsequent TensorRT operations apart from outputting
//! the error logs.
//!
@@ -286,77 +195,15 @@ class ICudaEngine
//!
//! \see safe::IExecutionContext::setDeviceMemory()
//!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- virtual size_t getDeviceMemorySize() const noexcept = 0;
-
- //!
- //! \brief Return the number of bytes per component of an element.
- //!
- //! The vector component size is returned if getBindingVectorizedDim() != -1.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorBytesPerComponent().
- //!
- //! \see safe::ICudaEngine::getTensorBytesPerComponent()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual std::int32_t getBindingBytesPerComponent(std::int32_t const bindingIndex) const noexcept = 0;
-
- //!
- //! \brief Return the number of components included in one element.
- //!
- //! The number of elements in the vectors is returned if getBindingVectorizedDim() != -1.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorComponentsPerElement().
- //!
- //! \see safe::ICudaEngine::getTensorComponentsPerElement()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual std::int32_t getBindingComponentsPerElement(std::int32_t const bindingIndex) const noexcept = 0;
-
- //!
- //! \brief Return the binding format.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorFormat().
- //!
- //! \see safe::ICudaEngine::getTensorFormat()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual TensorFormat getBindingFormat(std::int32_t const bindingIndex) const noexcept = 0;
-
- //!
- //! \brief Return the dimension index that the buffer is vectorized.
- //!
- //! Specifically -1 is returned if scalars per vector is 1.
- //!
- //! \param bindingIndex The binding Index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorVectorizedDim().
- //!
- //! \see safe::ICudaEngine::getTensorVectorizedDim()
+ //! \return Size of a contiguous memory buffer (in bytes) that users need to provide to
+ //! safe::IExecutionContext::setDeviceMemory() if the execution context has been created by calling
+ //! createExecutionContextWithoutDeviceMemory().
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- TRT_DEPRECATED virtual std::int32_t getBindingVectorizedDim(std::int32_t const bindingIndex) const noexcept = 0;
+ virtual size_t getDeviceMemorySize() const noexcept = 0;
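A hedged usage sketch combining the two calls; error handling is reduced to early returns, and the alignment provided by cudaMalloc is assumed to be sufficient:

```cpp
#include <cuda_runtime_api.h>

nvinfer1::safe::IExecutionContext* makeContextWithUserMemory(
    nvinfer1::safe::ICudaEngine& engine, void** scratch)
{
    nvinfer1::safe::IExecutionContext* context = engine.createExecutionContextWithoutDeviceMemory();
    if (context == nullptr)
    {
        return nullptr;
    }
    // Provide a device buffer of at least getDeviceMemorySize() bytes; it must stay
    // valid and untouched while work enqueued on this context is executing.
    if (cudaMalloc(scratch, engine.getDeviceMemorySize()) != cudaSuccess)
    {
        return nullptr;
    }
    context->setDeviceMemory(*scratch);
    return context;
}
```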
//!
//! \brief Returns the name of the network associated with the engine.
@@ -366,7 +213,8 @@ class ICudaEngine
//!
//! \see INetworkDefinition::setName(), INetworkDefinition::getName()
//!
- //! \return A null-terminated C-style string representing the name of the network.
+ //! \return A NULL-terminated C-style string representing the name of the network, which will have a length of
+ //! 1024 bytes or less including the NULL terminator.
//!
//! \usage
//! - Allowed context for the API call
@@ -378,12 +226,12 @@ class ICudaEngine
//! \brief Set the ErrorRecorder for this interface.
//!
//! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
- //! This function will call incRefCount of the registered ErrorRecorder at least once. Setting
- //! recorder to nullptr unregisters the recorder with the interface, resulting in a call to decRefCount if
- //! a recorder has been registered.
+ //! This function will call incRefCount of the registered ErrorRecorder at least once. If the recorder is set to
+ //! nullptr, the error code ErrorCode::kINVALID_ARGUMENT will be emitted if a recorder has already been registered.
+ //!
+ //! \param recorder The error recorder to register with this interface, or nullptr to deregister the current
+ //! error recorder.
//!
- //! \param recorder The error recorder to register with this interface.
- //
//! \see getErrorRecorder()
//!
//! \usage
@@ -399,7 +247,8 @@ class ICudaEngine
//! nullptr will be returned if an error reporter has not been inherited
//! from the IRuntime, and setErrorReporter() has not been called.
//!
- //! \return A pointer to the IErrorRecorder object that has been registered.
+ //! \return A pointer to the IErrorRecorder object that has been registered, or nullptr if none has been
+ //! registered.
//!
//! \see setErrorRecorder()
//!
@@ -417,71 +266,77 @@ class ICudaEngine
ICudaEngine& operator=(ICudaEngine&&) & = delete;
//!
- //! \brief Get extent of an input or output tensor.
+ //! \brief Get the extent of an input or output tensor.
+ //!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
- //! \return Extent of the tensor. Dims{-1, {}} will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit.
+ //! \return Extent of the tensor. The invalid value Dims{-1, {}} will be returned if
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual Dims getTensorShape(AsciiChar const* tensorName) const noexcept = 0;
+ virtual Dims getTensorShape(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Determine the required data type for a buffer from its tensor name.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
- //! \return The type of the data in the buffer. DataType::kFLOAT will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit.
+ //! \return The type of the data in the buffer. The default value DataType::kFLOAT will be returned if
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual DataType getTensorDataType(AsciiChar const* tensorName) const noexcept = 0;
+ virtual DataType getTensorDataType(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Determine whether a tensor is an input or output tensor.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
- //! \return kINPUT if tensorName is an input, kOUTPUT if tensorName is an output, or kNONE if neither.
+ //! \return kINPUT if tensorName is the name of an input tensor, kOUTPUT if tensorName is the name of an output
+ //! tensor. The invalid value kNONE is returned if
+ //! - tensorName exceeds the string length limit, or
+ //! - tensorName is nullptr, or
+ //! - tensorName does not correspond to any input or output tensor.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual TensorIOMode getTensorIOMode(AsciiChar const* tensorName) const noexcept = 0;
+ virtual TensorIOMode getTensorIOMode(AsciiChar const* const tensorName) const noexcept = 0;
//!
- //! \brief Return the number of bytes per component of an element.
+ //! \brief Return the size of the tensor data type in bytes for a vectorized tensor.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
- //! \return The vector component size. 0 will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) the tensor of given name is not vectorized.
+ //! \return The size of the tensor data type in bytes if the tensor is vectorized (4 for float and int32,
+ //! 2 for half, 1 for int8). 0 will be returned if
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit, or
+ //! - the tensor of the given name is not vectorized.
//!
//! \see safe::ICudaEngine::getTensorVectorizedDim()
//!
@@ -489,21 +344,21 @@ class ICudaEngine
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual std::int32_t getTensorBytesPerComponent(AsciiChar const* tensorName) const noexcept = 0;
+ virtual std::int32_t getTensorBytesPerComponent(AsciiChar const* const tensorName) const noexcept = 0;
//!
- //! \brief Return the number of components included in one element.
+ //! \brief Return the number of components included in one element for a vectorized tensor.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
- //! \return The vector component size. -1 will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) the tensor of given name is not vectorized.
+ //! \return The vector length (in scalars) for a vectorized tensor, or 1 for a scalar tensor.
+ //! The invalid value -1 will be returned if
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit.
//!
//! \see safe::ICudaEngine::getTensorVectorizedDim()
//!
@@ -511,48 +366,48 @@ class ICudaEngine
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual std::int32_t getTensorComponentsPerElement(AsciiChar const* tensorName) const noexcept = 0;
+ virtual std::int32_t getTensorComponentsPerElement(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Return the tensor format.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
//! \return The tensor format. TensorFormat::kLINEAR will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit.
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual TensorFormat getTensorFormat(AsciiChar const* tensorName) const noexcept = 0;
+ virtual TensorFormat getTensorFormat(AsciiChar const* const tensorName) const noexcept = 0;
//!
- //! \brief Return the dimension index along which buffer is vectorized.
+ //! \brief Return the dimension index along which the buffer is vectorized.
//!
- //! Specifically -1 is returned if scalars per vector is 1.
+ //! Specifically, -1 is returned if the tensor is not vectorized, i.e. the number of scalars per vector is 1.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less including the
+ //! NULL terminator.
//!
//! \return The dimension index along which the buffer is vectorized. -1 will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) the tensor of given name is not vectorized.
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit (1024 bytes including the NULL terminator), or
+ //! - the tensor of given name is not vectorized.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual std::int32_t getTensorVectorizedDim(AsciiChar const* tensorName) const noexcept = 0;
+ virtual std::int32_t getTensorVectorizedDim(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Return the number of input and output tensors for the network from which the engine was built.
@@ -570,13 +425,14 @@ class ICudaEngine
//!
//! \brief Return the name of an IO tensor.
//!
- //! If the index does not fall between 0 and getNbIOTensors()-1, the function will fail with an error code of ErrorCode::kINVALID_ARGUMENT(3) that is
- //! emitted to the registered IErrorRecorder.
+ //! If the index does not fall between 0 and getNbIOTensors()-1, the function will fail with an error code
+ //! of ErrorCode::kINVALID_ARGUMENT(3) that is emitted to the registered IErrorRecorder.
//!
- //! \param index The value that falls between 0 and getNbIOTensors()-1.
+ //! \param index The IO tensor index.
//!
- //! \return The name of an IO tensor. nullptr will be returned if the index does not fall between 0 and
- //! getNbIOTensors()-1.
+ //! \return The name of an IO tensor, which will be a NULL-terminated string of 1024 bytes or less (including the
+ //! NULL terminator) if the index is in the range (between 0 and getNbIOTensors()-1). nullptr will be returned if
+ //! the index is not in range.
//!
//! \see getNbIOTensors()
//!
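A hedged sketch that enumerates the engine's IO tensors by combining getNbIOTensors(), getIOTensorName(), and getTensorIOMode():

```cpp
void listIOTensors(nvinfer1::safe::ICudaEngine const& engine)
{
    for (int32_t i = 0; i < engine.getNbIOTensors(); ++i)
    {
        // The name is non-null here because i is in the valid range.
        nvinfer1::AsciiChar const* name = engine.getIOTensorName(i);
        bool const isInput = (engine.getTensorIOMode(name) == nvinfer1::TensorIOMode::kINPUT);
        // ... record `name` and `isInput`, e.g. to bind buffers later ...
        (void) isInput;
    }
}
```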
@@ -588,17 +444,35 @@ class ICudaEngine
};
//!
-//! \brief Space to record information about floating point runtime errors
+//! \brief Space to record information about runtime errors.
+//!
+//! kNAN_CONSUMED errors occur when NAN values are stored in an INT8 quantized datatype.
+//! kINF_CONSUMED errors occur when +-INF values are stored in an INT8 quantized datatype.
+//! kGATHER_OOB errors occur when a gather index tensor contains a value that is outside the bounds of the data tensor.
+//! kSCATTER_OOB and kSCATTER_RACE are reserved for future use.
+//!
+//! Records the RuntimeErrorType values that occur during asynchronous kernel execution.
+struct RuntimeErrorInformation
+{
+ //! Each bit represents a RuntimeErrorType that has occurred during kernel execution.
+ uint64_t bitMask;
+};
+
//!
-//! NAN errors occur when NAN values are stored in an INT8 quantized datatype.
-//! INF errors occur when +-INF values are stored in an INT8 quantized datatype.
+//! \brief Enum to represent runtime error types.
//!
-struct FloatingPointErrorInformation
+enum class RuntimeErrorType : uint64_t
{
- //! Total count of errors relating to NAN values (0 if none)
- int32_t nbNanErrors;
- //! Total count of errors relating to INF values (0 if none)
- int32_t nbInfErrors;
+ //! NaN floating-point value was silently consumed
+ kNAN_CONSUMED = 1ULL << 0,
+ //! Inf floating-point value was silently consumed
+ kINF_CONSUMED = 1ULL << 1,
+ //! Out-of-bounds access in gather operation
+ kGATHER_OOB = 1ULL << 2,
+ //! Out-of-bounds access in scatter operation
+ kSCATTER_OOB = 1ULL << 3,
+ //! Race condition in scatter operation
+ kSCATTER_RACE = 1ULL << 4,
};
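A hedged sketch of checking the bitmask on the host after the stream used for inference has been synchronized; `devErrInfo` is assumed to be the device buffer registered via setErrorBuffer(), and the safe namespace is assumed to match the surrounding declarations:

```cpp
#include <cstdint>
#include <cuda_runtime_api.h>

bool consumedNanOrInf(nvinfer1::safe::RuntimeErrorInformation const* devErrInfo)
{
    // Copy the device-side error information to the host for inspection.
    nvinfer1::safe::RuntimeErrorInformation hostInfo{};
    cudaMemcpy(&hostInfo, devErrInfo, sizeof(hostInfo), cudaMemcpyDeviceToHost);

    uint64_t const mask = static_cast<uint64_t>(nvinfer1::safe::RuntimeErrorType::kNAN_CONSUMED)
        | static_cast<uint64_t>(nvinfer1::safe::RuntimeErrorType::kINF_CONSUMED);
    return (hostInfo.bitMask & mask) != 0ULL;
}
```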
//!
@@ -633,8 +507,9 @@ class IExecutionContext
//!
//! This method copies the name string.
//!
- //! \warning Strings passed to the runtime must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning Strings passed to the runtime must be NULL terminated and have a length of 1024 bytes or less
+ //! including the NULL terminator. Otherwise, the operation will not change the execution context name, and
+ //! an error message will be recorded via the error recorder.
//!
//! \see getName()
//!
@@ -647,6 +522,9 @@ class IExecutionContext
//!
//! \brief Return the name of the execution context.
//!
+ //! \return The name that was passed to setName(), as a NULL-terminated string of 1024 bytes or less including
+ //! the NULL terminator. An empty string will be returned as the default value.
+ //!
//! \see setName()
//!
//! \usage
@@ -658,12 +536,18 @@ class IExecutionContext
//!
//! \brief Set the device memory for use by this execution context.
//!
+ //! \param memory The start address of a device memory buffer whose size in bytes must be at least the value
+ //! returned by getEngine().getDeviceMemorySize().
+ //!
//! If using enqueueV2() to run the network, The memory is in use
//! from the invocation of enqueueV2() until network execution is complete.
//! Releasing or otherwise using the memory for other purposes during this time will result in undefined behavior.
//!
//! \warning Do not release or use for other purposes the memory set here during network execution.
//!
+ //! \warning If the execution context has been created by calling createExecutionContext(), this
+ //! function must not be used and will fail with an error message if called.
+ //!
//! \see safe::ICudaEngine::getDeviceMemorySize() safe::ICudaEngine::createExecutionContextWithoutDeviceMemory()
//!
//! \usage
@@ -672,31 +556,17 @@ class IExecutionContext
//!
virtual void setDeviceMemory(void* const memory) noexcept = 0;
- //!
- //! \brief Return the strides of the buffer for the given binding.
- //!
- //! \param bindingIndex The binding index.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by getTensorStrides().
- //!
- //! \see safe::IExecutionContext::getTensorStrides()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: Yes
- //!
- TRT_DEPRECATED virtual Dims getStrides(std::int32_t const bindingIndex) const noexcept = 0;
-
//!
//! \brief Set the ErrorRecorder for this interface.
//!
//! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
- //! This function will call incRefCount of the registered ErrorRecorder at least once. Setting
- //! recorder to nullptr unregisters the recorder with the interface, resulting in a call to decRefCount if
- //! a recorder has been registered.
+ //! This function will call incRefCount of the registered ErrorRecorder at least once. If the recorder is set to
+ //! nullptr, the error code ErrorCode::kINVALID_ARGUMENT will be emitted if a recorder has already been registered. The
+ //! lifetime of the error recorder object must exceed the lifetime of the execution context.
+ //!
+ //! \param recorder Either a pointer to a valid error recorder object to register with this interface,
+ //! or nullptr to deregister the current recorder.
//!
- //! \param recorder The error recorder to register with this interface.
- //
//! \see getErrorRecorder()
//!
//! \usage
@@ -711,40 +581,17 @@ class IExecutionContext
//! Retrieves the assigned error recorder object for the given class. A default error recorder does not exist,
//! so a nullptr will be returned if setErrorRecorder has not been called.
//!
- //! \return A pointer to the IErrorRecorder object that has been registered.
+ //! \return A pointer to the IErrorRecorder object that has been registered, or nullptr if the error recorder
+ //! has been deregistered or not set.
//!
//! \see setErrorRecorder()
//!
//! \usage
//! - Allowed context for the API call
- //! - Thread-safe: No
+ //! - Thread-safe: Yes
//!
virtual IErrorRecorder* getErrorRecorder() const noexcept = 0;
- //!
- //! \brief Enqueue inference of a batch on a stream.
- //!
- //! This method requires an array of input and output buffers. The mapping from tensor names to indices can be
- //! queried using safe::ICudaEngine::getBindingIndex().
- //! This method only works for an execution context built from a network without an implicit batch dimension.
- //! \param bindings An array of pointers to input and output buffers for the network.
- //! \param stream A cuda stream on which the inference kernels will be enqueued.
- //! \param inputConsumed An optional event which will be signaled when the input buffers can be refilled with new
- //! data.
- //!
- //! \return True if the kernels were enqueued successfully.
- //!
- //! \deprecated Deprecated in TensorRT 8.5. Superseded by enqueueV3().
- //!
- //! \see safe::IExecutionContext::enqueueV3()
- //!
- //! \usage
- //! - Allowed context for the API call
- //! - Thread-safe: No
- //!
- TRT_DEPRECATED virtual bool enqueueV2(
- void* const* const bindings, cudaStream_t const stream, cudaEvent_t const* const inputConsumed) noexcept = 0;
-
IExecutionContext() = default;
virtual ~IExecutionContext() noexcept = default;
IExecutionContext(IExecutionContext const&) = delete;
@@ -753,17 +600,18 @@ class IExecutionContext
IExecutionContext& operator=(IExecutionContext&&) & = delete;
//!
- //! \brief Set error buffer output for floating point errors.
+ //! \brief Set error buffer output for runtime errors.
//!
//! The error buffer output must be allocated in device memory and will be used for subsequent
- //! calls to enqueueV2. Checking the contents of the error buffer after inference is the responsibility
- //! of the application. The pointer passed here must have alignment adequate for the FloatingPointErrorInformation
- //! struct.
+ //! calls to enqueueV2() or enqueueV3(). Checking the contents of the error buffer after inference is the
+ //! responsibility of the application. The pointer passed here must have alignment adequate for the
+ //! RuntimeErrorInformation struct.
//!
- //! \warning Do not release or use the contents of the error buffer for any other purpose before synchronizing
- //! on the CUDA stream passed to enqueueV2.
+ //! \warning The buffer is written if reportable errors are encountered during network execution. Releasing the
+ //! buffer before network execution is complete will result in undefined behavior. Accessing the memory before
+ //! network execution is complete may not correctly capture the error state.
//!
- //! \param buffer The device memory to use as floating point error buffer
+ //! \param buffer The device memory address of the runtime error information buffer.
//!
//! \see getErrorBuffer()
//!
@@ -771,12 +619,12 @@ class IExecutionContext
//! - Allowed context for the API call
//! - Thread-safe: No
//!
- virtual void setErrorBuffer(FloatingPointErrorInformation* const buffer) noexcept = 0;
+ virtual void setErrorBuffer(RuntimeErrorInformation* const buffer) noexcept = 0;
//!
- //! \brief Get error buffer output for floating point errors.
+ //! \brief Get error buffer output for runtime errors.
//!
- //! \return Pointer to device memory to use as floating point error buffer or nullptr if not set.
+ //! \return Pointer to device memory to use as runtime error buffer or nullptr if not set.
//!
//! \see setErrorBuffer()
//!
@@ -784,29 +632,30 @@ class IExecutionContext
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual FloatingPointErrorInformation* getErrorBuffer() const noexcept = 0;
+ virtual RuntimeErrorInformation* getErrorBuffer() const noexcept = 0;
//!
//! \brief Return the strides of the buffer for the given tensor name.
//!
//! The strides are in units of elements, not components or bytes.
+ //! Elements are vectors (for a vectorized format) or scalars (for a scalar format).
//! For example, for TensorFormat::kHWC8, a stride of one spans 8 scalars.
//!
//! \param tensorName The name of an input or output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less
+ //! including the NULL terminator.
//!
//! \return The strides of the buffer for the given tensor name. Dims{-1, {}} will be returned if
- //! (1) name is not the name of an input or output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit.
+ //! - name is not the name of an input or output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual Dims getTensorStrides(AsciiChar const* tensorName) const noexcept = 0;
+ virtual Dims getTensorStrides(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Set memory address for given input tensor.
@@ -816,23 +665,27 @@ class IExecutionContext
//! Before calling enqueueV3(), each input must have a non-null address.
//!
//! \param tensorName The name of an input tensor.
- //! \param data The pointer (void const*) to the const data owned by the user.
+ //! \param data The pointer (void const*) to the input tensor data, which is device memory owned by the user.
+ //! Users are responsible for ensuring that the buffer is at least as large as the expected size, which is
+ //! the product of the tensor dimensions (with the vectorized dimension padded to a multiple of the vector length)
+ //! and the data type size.
+ //!
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less
+ //! including the NULL terminator.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
- //! \warning The pointer must have at least 256-byte alignment.
+ //! \warning The data pointer must have 256-byte alignment.
//!
//! \return True on success, false if
- //! (1) name is not the name of an input tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) pointer to the const data is nullptr or not aligned.
+ //! - name is not the name of an input tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit, or
+ //! - pointer to the const data is nullptr or not correctly aligned.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: No
//!
- virtual bool setInputTensorAddress(AsciiChar const* tensorName, void const* data) noexcept = 0;
+ virtual bool setInputTensorAddress(AsciiChar const* const tensorName, void const* const data) noexcept = 0;
//!
//! \brief Set memory address for given output tensor.
@@ -842,43 +695,48 @@ class IExecutionContext
//! Before calling enqueueV3(), each output must have a non-null address.
//!
//! \param tensorName The name of an output tensor.
- //! \param data The pointer (void*) to the data owned by the user.
+ //! \param data The pointer (void*) to the output tensor data, which is device memory owned by the user.
+ //! Users are responsible for ensuring that the buffer size is at least the expected length, which is
+ //! the product of the tensor dimensions (with the vectorized dimension padded to a multiple of the vector length)
+ //! times the data type size.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
- //! \warning The pointer must have at least 256-byte alignment.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less
+ //! including the NULL terminator.
+ //! \warning The data pointer must have 256-byte alignment.
//!
//! \return True on success. Return false if
- //! (1) name is not the name of an output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) pointer to data is nullptr or not aligned.
+ //! - name is not the name of an output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit, or
+ //! - pointer to data is nullptr or not aligned.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: No
//!
- virtual bool setOutputTensorAddress(AsciiChar const* tensorName, void* data) noexcept = 0;
+ virtual bool setOutputTensorAddress(AsciiChar const* const tensorName, void* const data) noexcept = 0;
//!
- //! \brief Mark input as consumed.
+ //! \brief Set the event to mark inputs as consumed.
//!
//! Passing event==nullptr removes whatever event was set, if any.
//!
- //! \param event The cuda event that is triggered after all input tensors have been consumed.
+ //! \param event The CUDA event that is signaled after all input tensors have been consumed, or nullptr to remove
+ //! an event that was previously set.
//!
- //! \return True on success, false if error occurred.
+ //! \return True on success, false if an error occurred.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: No
//!
- virtual bool setInputConsumedEvent(cudaEvent_t event) noexcept = 0;
+ virtual bool setInputConsumedEvent(cudaEvent_t const event) noexcept = 0;
//!
//! \brief Return the event associated with consuming the input.
//!
- //! \return The cuda event, nullptr will be returned if the event is not set yet.
+ //! \return The CUDA event that was passed to setInputConsumedEvent(). nullptr will be returned if the event is
+ //! not set.
//!
//! \usage
//! - Allowed context for the API call
@@ -891,57 +749,61 @@ class IExecutionContext
//!
//! \param tensorName The name of an input tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less
+ //! including the NULL terminator.
//!
- //! \return The memory address for the given input tensor. nullptr will be returned if
- //! (1) name is not the name of an input tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) the memory address for the given input tensor is not set yet.
+ //! \return The device memory address for the given input tensor. nullptr will be returned if
+ //! - name is not the name of an input tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit, or
+ //! - the memory address for the given input tensor is not set.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual void const* getInputTensorAddress(AsciiChar const* tensorName) const noexcept = 0;
+ virtual void const* getInputTensorAddress(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Get memory address for given output tensor.
//!
//! \param tensorName The name of an output tensor.
//!
- //! \warning The string tensorName must be 1024 characters or less including NULL terminator and must be
- //! NULL terminated.
+ //! \warning The string tensorName must be NULL terminated and have a length of 1024 bytes or less
+ //! including the NULL terminator.
//!
- //! \return Raw output data pointer (void*) for given output tensor, return nullptr if
- //! (1) name is not the name of an output tensor, or
- //! (2) name is nullptr, or
- //! (3) name exceeds the string length limit, or
- //! (4) the memory address for the given output tensor is not set yet.
+ //! \return The device memory address for the given output tensor. Return nullptr if
+ //! - name is not the name of an output tensor, or
+ //! - name is nullptr, or
+ //! - name exceeds the string length limit, or
+ //! - the memory address for the given output tensor is not set.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual void* getOutputTensorAddress(AsciiChar const* tensorName) const noexcept = 0;
+ virtual void* getOutputTensorAddress(AsciiChar const* const tensorName) const noexcept = 0;
//!
//! \brief Enqueue inference on a stream.
//!
//! Modifying or releasing memory that has been registered for the tensors before stream
- //! synchronization or the event passed to setInputConsumedEvent has been being triggered results in undefined
+ //! synchronization or the event passed to setInputConsumedEvent has been signaled results in undefined
//! behavior.
//!
- //! \param stream A cuda stream on which the inference kernels will be enqueued.
+ //! \param stream A CUDA stream on which the inference kernels will be enqueued.
//!
//! \return True on success, false if any execution error occurred.
+ //! Errors may include, but are not limited to:
+ //! - Internal errors while executing an engine layer
+ //! - CUDA errors
+ //! - Some input or output tensor addresses have not been set.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
- virtual bool enqueueV3(cudaStream_t stream) noexcept = 0;
+ virtual bool enqueueV3(cudaStream_t const stream) noexcept = 0;
};
//!
@@ -952,7 +814,7 @@ class IExecutionContext
//! Internally, the plugin registry is considered to be a singleton so all
//! plugins in an application are part of the same global registry.
//! Note that the plugin registry is only supported for plugins of type
-//! IPluginV2 and should also have a corresponding IPluginCreator implementation.
+//! IPluginV2 and must also have a corresponding IPluginCreator implementation.
//!
//! \see IPluginV2 and IPluginCreator
//!
@@ -966,11 +828,23 @@ class IPluginRegistry
{
public:
//!
- //! \brief Register a plugin creator. Returns false if one with same type
- //! is already registered.
+ //! \brief Register a plugin creator.
//!
- //! \warning The string pluginNamespace must be 1024 bytes or less including the NULL terminator and must be NULL
- //! terminated.
+ //! \param creator The plugin creator to be registered.
+ //!
+ //! \param pluginNamespace A NULL-terminated namespace string, which must be 1024 bytes or less including the NULL
+ //! terminator. It must be identical to the result of calling
+ //! IPluginCreator::getPluginNamespace() on the creator object.
+ //!
+ //! \return True if the registration succeeded, else false.
+ //!
+ //! \details Registration may fail for any of the following reasons:
+ //! - The pluginNamespace string is nullptr.
+ //! - The pluginNamespace string exceeds the maximum length.
+ //! - The pluginNamespace string does not match the result of creator.getPluginNamespace().
+ //! - There have already been 100 plugin creators registered (maximum number of plugins exceeded).
+ //! - Another plugin creator with the same combination of plugin name, version and namespace has already been
+ //! registered.
//!
//! \usage
//! - Allowed context for the API call
@@ -980,7 +854,12 @@ class IPluginRegistry
//!
//! \brief Return all the registered plugin creators and the number of
- //! registered plugin creators. Returns nullptr if none found.
+ //! registered plugin creators. Returns nullptr if none is found.
+ //!
+ //! \param[out] numCreators If the call completes successfully, the number of registered plugin creators (which
+ //! will be an integer between 0 and 100 inclusive).
+ //! \return The start address of an IPluginCreator* array of length numCreators if at least one plugin creator
+ //! has been registered, or nullptr if there are no registered plugin creators.
//!
//! \usage
//! - Allowed context for the API call
@@ -992,27 +871,37 @@ class IPluginRegistry
//! \brief Return plugin creator based on plugin name, version, and
//! namespace associated with plugin during network creation.
//!
- //! \warning The strings pluginName, pluginVersion, and pluginNamespace must be 1024 bytes or less including the
- //! NULL terminator and must be NULL terminated.
+ //! \warning The strings pluginName, pluginVersion, and pluginNamespace must be NULL terminated and have a length
+ //! of 1024 bytes or less including the NULL terminator.
+ //!
+ //! \param pluginName The plugin name string
+ //! \param pluginVersion The plugin version string
+ //! \param pluginNamespace The plugin namespace (by default empty string)
+ //!
+ //! \return If a plugin creator corresponding to the passed name, version and namespace can be found in the
+ //! registry, it is returned. nullptr is returned in the following situations:
+ //! - Any of the input arguments is nullptr.
+ //! - Any of the input arguments exceeds the string length limit.
+ //! - No plugin creator corresponding to the input arguments can be found in the registry.
+ //! - A plugin creator can be found, but its stored namespace attribute does not match the pluginNamespace.
//!
//! \usage
//! - Allowed context for the API call
//! - Thread-safe: Yes
//!
virtual IPluginCreator* getPluginCreator(AsciiChar const* const pluginName, AsciiChar const* const pluginVersion,
- AsciiChar const* const pluginNamespace = "") noexcept
- = 0;
+ AsciiChar const* const pluginNamespace = "") noexcept = 0;
//!
//! \brief Set the ErrorRecorder for this interface
//!
//! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
- //! This function will call incRefCount of the registered ErrorRecorder at least once. Setting
- //! recorder to nullptr unregisters the recorder with the interface, resulting in a call to decRefCount if
- //! a recorder has been registered.
+ //! This function will call incRefCount of the registered ErrorRecorder at least once. If the recorder is set to
+ //! nullptr, the error code ErrorCode::kINVALID_ARGUMENT will be emitted if the recorder has been registered.
+ //!
+ //! \param recorder The error recorder to register with this interface, or nullptr to deregister the current
+ //! recorder.
//!
- //! \param recorder The error recorder to register with this interface.
- //
//! \see getErrorRecorder()
//!
//! \usage
@@ -1028,7 +917,9 @@ class IPluginRegistry
//! so a nullptr will be returned if setErrorRecorder has not been called, or an ErrorRecorder has not been
//! inherited.
//!
- //! \return A pointer to the IErrorRecorder object that has been registered.
+ //! \return A pointer to the IErrorRecorder object that has been registered, or nullptr if:
+ //! - no error recorder has been set, or
+ //! - the last error recorder has been deregistered via setErrorRecorder(nullptr).
//!
//! \see setErrorRecorder()
//!
@@ -1045,6 +936,8 @@ class IPluginRegistry
//! this function provides a mechanism for removing plugin creators registered in TensorRT.
//! The plugin creator that is specified by \p creator is removed from TensorRT and no longer tracked.
//!
+ //! \param creator The plugin creator to deregister.
+ //!
//! \return True if the plugin creator was deregistered, false if it was not found in the registry or otherwise
//! could not be deregistered.
@@ -1068,7 +961,12 @@ class IPluginRegistry
};
//!
-//! \brief Create an instance of an safe::IRuntime class.
+//! \brief Create an instance of a safe::IRuntime class.
+//!
+//! \param logger A logger object whose lifetime must exceed that of the returned runtime.
+//! Loggers must be thread-safe.
+//!
+//! \return A safe runtime object that can be used for safe plan file deserialization.
//!
//! This class is the logging class for the runtime.
//!
@@ -1093,8 +991,8 @@ extern "C" TENSORRTAPI IPluginRegistry* getSafePluginRegistry() noexcept;
//! loaded. This static object will register all creators available in the
//! library to the registry.
//!
-//! \warning Statically registering plugins should be avoided in the automotive
-//! safety context as the application developer should first register an error recorder
+//! \warning Statically registering plugins must be avoided in the automotive
+//! safety context as the application developer must first register an error recorder
//! with the plugin registry via IPluginRegistry::setErrorRecorder() before using
//! IPluginRegistry::registerCreator() or other methods.
//!
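A minimal usage sketch of the enqueueV3 workflow documented above, for readers tracking these interface changes. This is illustrative only: the function name, tensor names, buffers, event, and stream are placeholders, and it assumes the safe-runtime header and nvinfer1::safe namespace plus 256-byte-aligned device buffers sized to the padded tensor dimensions.

    #include "NvInferSafeRuntime.h" // assumed header for the safe-runtime interfaces above
    #include <cuda_runtime_api.h>

    // Sketch only: the call order follows the documentation above. All names are placeholders.
    bool runInference(nvinfer1::safe::IExecutionContext& context, void const* dInput, void* dOutput,
        cudaEvent_t inputConsumed, cudaStream_t stream)
    {
        bool ok = context.setInputTensorAddress("input", dInput);     // 256-byte-aligned device memory
        ok = ok && context.setOutputTensorAddress("output", dOutput); // 256-byte-aligned device memory
        ok = ok && context.setInputConsumedEvent(inputConsumed);      // signaled once all inputs are consumed
        ok = ok && context.enqueueV3(stream);                         // false indicates an execution error
        // Do not modify or release dInput until inputConsumed has been signaled (see the warning above).
        return ok;
    }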
diff --git a/include/NvInferVersion.h b/include/NvInferVersion.h
index b285fd02..8c99bea7 100644
--- a/include/NvInferVersion.h
+++ b/include/NvInferVersion.h
@@ -23,26 +23,19 @@
#ifndef NV_INFER_VERSION_H
#define NV_INFER_VERSION_H
-#define NV_TENSORRT_MAJOR 8 //!< TensorRT major version.
-#define NV_TENSORRT_MINOR 6 //!< TensorRT minor version.
-#define NV_TENSORRT_PATCH 1 //!< TensorRT patch version.
-#define NV_TENSORRT_BUILD 5 //!< TensorRT build number.
+#define NV_TENSORRT_MAJOR 10 //!< TensorRT major version.
+#define NV_TENSORRT_MINOR 0 //!< TensorRT minor version.
+#define NV_TENSORRT_PATCH 0 //!< TensorRT patch version.
+#define NV_TENSORRT_BUILD 6 //!< TensorRT build number.
#define NV_TENSORRT_LWS_MAJOR 0 //!< TensorRT LWS major version.
#define NV_TENSORRT_LWS_MINOR 0 //!< TensorRT LWS minor version.
#define NV_TENSORRT_LWS_PATCH 0 //!< TensorRT LWS patch version.
-// This #define is deprecated in TensorRT 8.6 and will be removed in 10.0. Use NV_TENSORRT_MAJOR.
-#define NV_TENSORRT_SONAME_MAJOR 8 //!< Shared object library major version number.
-// This #define is deprecated in TensorRT 8.6 and will be removed in 10.0. Use NV_TENSORRT_MINOR.
-#define NV_TENSORRT_SONAME_MINOR 6 //!< Shared object library minor version number.
-// This #define is deprecated in TensorRT 8.6 and will be removed in 10.0. Use NV_TENSORRT_PATCH.
-#define NV_TENSORRT_SONAME_PATCH 1 //!< Shared object library patch version number.
-
#define NV_TENSORRT_RELEASE_TYPE_EARLY_ACCESS 0 //!< An early access release
#define NV_TENSORRT_RELEASE_TYPE_RELEASE_CANDIDATE 1 //!< A release candidate
#define NV_TENSORRT_RELEASE_TYPE_GENERAL_AVAILABILITY 2 //!< A final release
-#define NV_TENSORRT_RELEASE_TYPE NV_TENSORRT_RELEASE_TYPE_GENERAL_AVAILABILITY //!< TensorRT release type
+#define NV_TENSORRT_RELEASE_TYPE NV_TENSORRT_RELEASE_TYPE_EARLY_ACCESS //!< TensorRT release type
#endif // NV_INFER_VERSION_H
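With the deprecated NV_TENSORRT_SONAME_* macros removed above, downstream code should key off the primary version macros instead. A small illustrative sketch, not taken from the repository:

    #include "NvInferVersion.h"

    // Gate code paths on the primary version macros; the SONAME macros no longer exist.
    #if NV_TENSORRT_MAJOR >= 10
    // TensorRT 10.x-specific path
    #else
    // fallback for older releases
    #endif

    static_assert(NV_TENSORRT_MAJOR >= 10, "this project assumes TensorRT 10.0 or newer");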
diff --git a/include/NvOnnxConfig.h b/include/NvOnnxConfig.h
index 28d8a690..8a222aa7 100644
--- a/include/NvOnnxConfig.h
+++ b/include/NvOnnxConfig.h
@@ -49,6 +49,7 @@ class IOnnxConfig
virtual ~IOnnxConfig() noexcept = default;
//!
//! \typedef Verbosity
+ //!
//! \brief Defines Verbosity level.
//!
typedef int32_t Verbosity;
@@ -188,15 +189,6 @@ class IOnnxConfig
//!
virtual void setPrintLayerInfo(bool) noexcept = 0;
- //!
- //! \brief Destroy IOnnxConfig object.
- //!
- //! \deprecated Use `delete` instead. Deprecated in TRT 8.0.
- //!
- //! \warning Calling destroy on a managed pointer will result in a double-free error.
- //!
- TRT_DEPRECATED virtual void destroy() noexcept = 0;
-
}; // class IOnnxConfig
TENSORRTAPI IOnnxConfig* createONNXConfig();
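Since IOnnxConfig::destroy() is removed above, objects returned by createONNXConfig() are now released with plain delete. A hedged sketch, assuming the usual nvonnxparser namespace for these declarations:

    #include "NvOnnxConfig.h"
    #include <memory>

    // Sketch only: with destroy() gone, delete (here via unique_ptr) releases the config.
    std::unique_ptr<nvonnxparser::IOnnxConfig> config{nvonnxparser::createONNXConfig()};
    // ... configure as needed, e.g. config->setPrintLayerInfo(true) from the header above ...
    // The config is deleted automatically when the unique_ptr goes out of scope.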
diff --git a/include/NvUffParser.h b/include/NvUffParser.h
deleted file mode 100644
index 468895c2..00000000
--- a/include/NvUffParser.h
+++ /dev/null
@@ -1,230 +0,0 @@
-/*
- * SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: Apache-2.0
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#ifndef NV_UFF_PARSER_H
-#define NV_UFF_PARSER_H
-
-#include "NvInfer.h"
-
-//!
-//! \file NvUffParser.h
-//!
-//! This is the API for the UFF Parser
-//!
-
-// Current supported Universal Framework Format (UFF) version for the parser.
-#define UFF_REQUIRED_VERSION_MAJOR 0
-#define UFF_REQUIRED_VERSION_MINOR 6
-#define UFF_REQUIRED_VERSION_PATCH 9
-
-//!
-//! \namespace nvuffparser
-//!
-//! \brief The TensorRT UFF parser API namespace.
-//!
-namespace nvuffparser
-{
-
-//!
-//! \enum UffInputOrder
-//! \brief The different possible supported input order.
-//!
-enum class UffInputOrder : int32_t
-{
- kNCHW = 0, //!< NCHW order.
- kNHWC = 1, //!< NHWC order.
- kNC = 2 //!< NC order.
-};
-
-//!
-//! \enum FieldType
-//! \brief The possible field types for custom layer.
-//!
-
-enum class FieldType : int32_t
-{
- kFLOAT = 0, //!< FP32 field type.
- kINT32 = 1, //!< INT32 field type.
- kCHAR = 2, //!< char field type. String for length>1.
- kDIMS = 4, //!< nvinfer1::Dims field type.
- kDATATYPE = 5, //!< nvinfer1::DataType field type.
- kUNKNOWN = 6
-};
-
-//!
-//! \class FieldMap
-//!
-//! \brief An array of field params used as a layer parameter for plugin layers.
-//!
-//! The node fields are passed by the parser to the API through the plugin
-//! constructor. The implementation of the plugin should parse the contents of
-//! the fieldMap as part of the plugin constructor
-//!
-class TENSORRTAPI FieldMap
-{
-public:
- char const* name{};
- void const* data{};
- FieldType type{FieldType::kUNKNOWN};
- int32_t length{1};
-
- //! \deprecated Legacy constructor, retained for ABI compatibility. Deprecated in TensorRT 8.6.
- //! Use the default constructor instead.
- TRT_DEPRECATED FieldMap(char const* name, void const* data, FieldType const type, int32_t length = 1);
-
- //! Default constructor
- FieldMap() = default;
-};
-
-struct FieldCollection
-{
- int32_t nbFields;
- FieldMap const* fields;
-};
-
-//!
-//! \class IUffParser
-//!
-//! \brief Class used for parsing models described using the UFF format.
-//!
-//! \warning Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
-//!
-class IUffParser
-{
-public:
- //!
- //! \brief Register an input name of a UFF network with the associated Dimensions.
- //!
- //! \param inputName Input name.
- //! \param inputDims Input dimensions.
- //! \param inputOrder Input order on which the framework input was originally.
- //!
- virtual bool registerInput(char const* inputName, nvinfer1::Dims inputDims, UffInputOrder inputOrder) noexcept = 0;
-
- //!
- //! \brief Register an output name of a UFF network.
- //!
- //! \param outputName Output name.
- //!
- virtual bool registerOutput(char const* outputName) noexcept = 0;
-
- //!
- //! \brief Parse a UFF file.
- //!
- //! \param file File name of the UFF file.
- //! \param network Network in which the UFFParser will fill the layers.
- //! \param weightsType The type on which the weights will transformed in.
- //!
- virtual bool parse(char const* file, nvinfer1::INetworkDefinition& network,
- nvinfer1::DataType weightsType = nvinfer1::DataType::kFLOAT) noexcept = 0;
-
- //!
- //! \brief Parse a UFF buffer, useful if the file already live in memory.
- //!
- //! \param buffer Buffer of the UFF file.
- //! \param size Size of buffer of the UFF file.
- //! \param network Network in which the UFFParser will fill the layers.
- //! \param weightsType The type on which the weights will transformed in.
- //!
- virtual bool parseBuffer(char const* buffer, std::size_t size, nvinfer1::INetworkDefinition& network,
- nvinfer1::DataType weightsType = nvinfer1::DataType::kFLOAT) noexcept = 0;
-
- //!
- //! \deprecated Use `delete` instead. Deprecated in TRT 8.0.
- //!
- TRT_DEPRECATED virtual void destroy() noexcept = 0;
-
- //!
- //! \brief Return Version Major of the UFF.
- //!
- virtual int32_t getUffRequiredVersionMajor() noexcept = 0;
-
- //!
- //! \brief Return Version Minor of the UFF.
- //!
- virtual int32_t getUffRequiredVersionMinor() noexcept = 0;
-
- //!
- //! \brief Return Patch Version of the UFF.
- //!
- virtual int32_t getUffRequiredVersionPatch() noexcept = 0;
-
- //!
- //! \brief Set the namespace used to lookup and create plugins in the network.
- //!
- virtual void setPluginNamespace(char const* libNamespace) noexcept = 0;
-
- virtual ~IUffParser() noexcept = default;
-
-public:
- //!
- //! \brief Set the ErrorRecorder for this interface
- //!
- //! Assigns the ErrorRecorder to this interface. The ErrorRecorder will track all errors during execution.
- //! This function will call incRefCount of the registered ErrorRecorder at least once. Setting
- //! recorder to nullptr unregisters the recorder with the interface, resulting in a call to decRefCount if
- //! a recorder has been registered.
- //!
- //! If an error recorder is not set, messages will be sent to the global log stream.
- //!
- //! \param recorder The error recorder to register with this interface.
- //
- //! \see getErrorRecorder()
- //!
- virtual void setErrorRecorder(nvinfer1::IErrorRecorder* recorder) noexcept = 0;
-
- //!
- //! \brief get the ErrorRecorder assigned to this interface.
- //!
- //! Retrieves the assigned error recorder object for the given class. A
- //! nullptr will be returned if setErrorRecorder has not been called.
- //!
- //! \return A pointer to the IErrorRecorder object that has been registered.
- //!
- //! \see setErrorRecorder()
- //!
- virtual nvinfer1::IErrorRecorder* getErrorRecorder() const noexcept = 0;
-};
-
-//!
-//! \brief Creates a IUffParser object.
-//!
-//! \return A pointer to the IUffParser object is returned.
-//!
-//! \see nvuffparser::IUffParser
-//!
-//! \deprecated IUffParser will be removed in TensorRT 9.0. Plan to migrate your workflow to
-//! use nvonnxparser::IParser for deployment.
-//!
-TENSORRTAPI IUffParser* createUffParser() noexcept;
-
-//!
-//! \brief Shuts down protocol buffers library.
-//!
-//! \note No part of the protocol buffers library can be used after this function is called.
-//!
-TENSORRTAPI void shutdownProtobufLibrary(void) noexcept;
-
-} // namespace nvuffparser
-
-//!
-//! Internal C entry point for creating IUffParser
-//! @private
-//!
-extern "C" TENSORRTAPI void* createNvUffParser_INTERNAL() noexcept;
-
-#endif /* !NV_UFF_PARSER_H */
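The removed UFF parser's own deprecation note points users at nvonnxparser::IParser. A rough migration sketch, assuming an existing nvinfer1::INetworkDefinition and ILogger (all names are placeholders):

    #include "NvInfer.h"
    #include "NvOnnxParser.h"
    #include <memory>

    // Rough sketch of the ONNX-parser path recommended by the removed header's deprecation note.
    void parseOnnxModel(nvinfer1::INetworkDefinition& network, nvinfer1::ILogger& logger)
    {
        std::unique_ptr<nvonnxparser::IParser> parser{nvonnxparser::createParser(network, logger)};
        if (!parser->parseFromFile("model.onnx", static_cast<int32_t>(nvinfer1::ILogger::Severity::kWARNING)))
        {
            // inspect parser->getNbErrors() / parser->getError(i) to diagnose failures
        }
    }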
diff --git a/include/NvUtils.h b/include/NvUtils.h
deleted file mode 100644
index be879031..00000000
--- a/include/NvUtils.h
+++ /dev/null
@@ -1,151 +0,0 @@
-/*
- * SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: Apache-2.0
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#ifndef NV_UTILS_H
-#define NV_UTILS_H
-
-#include "NvInfer.h"
-
-//!
-//! \file NvUtils.h
-//!
-//! This file includes various utility functions
-//!
-
-namespace nvinfer1
-{
-namespace utils
-{
-
-//!
-//! \param input The input weights to reshape.
-//! \param shape The shape of the weights.
-//! \param shapeOrder The order of the dimensions to process for the output.
-//! \param data The location where the output data is placed.
-//! \param nbDims The number of dimensions to process.
-//!
-//! \brief Reformat the input weights of the given shape based on the new
-//! order of dimensions.
-//!
-//! Take the weights specified by \p input with the dimensions specified by
-//! \p shape and re-order the weights based on the new dimensions specified
-//! by \p shapeOrder. The size of each dimension and the input data is not
-//! modified. The output volume pointed to by \p data must be the same as
-//! he \p input volume.
-//!
-//! Example usage:
-//! float *out = new float[N*C*H*W];
-//! Weights input{DataType::kFLOAT, {0 ... N*C*H*W-1}, N*C*H*W size};
-//! int32_t order[4]{1, 0, 3, 2};
-//! int32_t shape[4]{C, N, W, H};
-//! reshapeWeights(input, shape, order, out, 4);
-//! Weights reshaped{input.type, out, input.count};
-//!
-//! Input Matrix{3, 2, 3, 2}:
-//! { 0 1}, { 2 3}, { 4 5} <-- {0, 0, *, *}
-//! { 6 7}, { 8 9}, {10 11} <-- {0, 1, *, *}
-//! {12 13}, {14 15}, {16 17} <-- {1, 0, *, *}
-//! {18 19}, {20 21}, {22 23} <-- {1, 1, *, *}
-//! {24 25}, {26 27}, {28 29} <-- {2, 0, *, *}
-//! {30 31}, {32 33}, {34 35} <-- {2, 1, *, *}
-//!
-//! Output Matrix{2, 3, 2, 3}:
-//! { 0 2 4}, { 1 3 5} <-- {0, 0, *, *}
-//! {12 14 16}, {13 15 17} <-- {0, 1, *, *}
-//! {24 26 28}, {25 27 29} <-- {0, 2, *, *}
-//! { 6 8 10}, { 7 9 11} <-- {1, 0, *, *}
-//! {18 20 22}, {19 21 23} <-- {1, 1, *, *}
-//! {30 32 34}, {31 33 35} <-- {1, 2, *, *}
-//!
-//! \return True on success, false on failure.
-//!
-//! \deprecated Deprecated in TensorRT 8.0.
-//!
-//! \warning This file will be removed in TensorRT 10.0.
-//!
-TRT_DEPRECATED TENSORRTAPI bool reshapeWeights(
- Weights const& input, int32_t const* shape, int32_t const* shapeOrder, void* data, int32_t nbDims) noexcept;
-
-//!
-//! \param input The input data to re-order.
-//! \param order The new order of the data sub-buffers.
-//! \param num The number of data sub-buffers to re-order.
-//! \param size The size of each data sub-buffer in bytes.
-//!
-//! \brief Takes an input stream and re-orders \p num chunks of the data
-//! given the \p size and \p order.
-//!
-//! In some frameworks, the ordering of the sub-buffers within a dimension
-//! is different than the way that TensorRT expects them.
-//! TensorRT expects the gate/bias sub-buffers for LSTM's to be in fico order.
-//! TensorFlow however formats the sub-buffers in icfo order.
-//! This helper function solves this in a generic fashion.
-//!
-//! Example usage output of reshapeWeights above:
-//! int32_t indir[1]{1, 0}
-//! int32_t stride = W*H;
-//! for (int32_t x = 0, y = N*C; x < y; ++x)
-//! reorderSubBuffers(out + x * stride, indir, H, W);
-//!
-//! Input Matrix{2, 3, 2, 3}:
-//! { 0 2 4}, { 1 3 5} <-- {0, 0, *, *}
-//! {12 14 16}, {13 15 17} <-- {0, 1, *, *}
-//! {24 26 28}, {25 27 29} <-- {0, 2, *, *}
-//! { 6 8 10}, { 7 9 11} <-- {1, 0, *, *}
-//! {18 20 22}, {19 21 23} <-- {1, 1, *, *}
-//! {30 32 34}, {31 33 35} <-- {1, 2, *, *}
-//!
-//! Output Matrix{2, 3, 2, 3}:
-//! { 1 3 5}, { 0 2 4} <-- {0, 0, *, *}
-//! {13 15 17}, {12 14 16} <-- {0, 1, *, *}
-//! {25 27 29}, {24 26 28} <-- {0, 2, *, *}
-//! { 7 9 11}, { 6 8 10} <-- {1, 0, *, *}
-//! {19 21 23}, {18 20 22} <-- {1, 1, *, *}
-//! {31 33 35}, {30 32 34} <-- {1, 2, *, *}
-//!
-//! \return True on success, false on failure.
-//!
-//! \see reshapeWeights()
-//!
-//! \deprecated Deprecated in TensorRT 8.0.
-//!
-//! \warning This file will be removed in TensorRT 10.0.
-//!
-TRT_DEPRECATED TENSORRTAPI bool reorderSubBuffers(
- void* input, int32_t const* order, int32_t num, int32_t size) noexcept;
-
-//!
-//! \param input The input data to transpose.
-//! \param type The type of the data to transpose.
-//! \param num The number of data sub-buffers to transpose.
-//! \param height The size of the height dimension to transpose.
-//! \param width The size of the width dimension to transpose.
-//!
-//! \brief Transpose \p num sub-buffers of \p height * \p width.
-//!
-//! \return True on success, false on failure.
-//!
-//! \deprecated Deprecated in TensorRT 8.0.
-//!
-//! \warning This file will be removed in TensorRT 10.0.
-//!
-TRT_DEPRECATED TENSORRTAPI bool transposeSubBuffers(
- void* input, DataType type, int32_t num, int32_t height, int32_t width) noexcept;
-
-} // namespace utils
-} // namespace nvinfer1
-#endif // NV_UTILS_H
diff --git a/parsers/CMakeLists.txt b/parsers/CMakeLists.txt
index 5dab1c9f..750942e6 100644
--- a/parsers/CMakeLists.txt
+++ b/parsers/CMakeLists.txt
@@ -15,12 +15,9 @@
# limitations under the License.
#
+############################# GENERATE C++ PROTO FILES ###################################
add_custom_target(parsers DEPENDS
- nvcaffeparserlibs
- nvonnxparser
-)
-
-add_subdirectory(caffe)
+ nvonnxparser)
add_definitions("-D_PROTOBUF_INSTALL_DIR=${Protobuf_INSTALL_DIR}")
add_compile_options("-Dgoogle=google_private")
diff --git a/parsers/caffe/CMakeLists.txt b/parsers/caffe/CMakeLists.txt
deleted file mode 100644
index f6abda79..00000000
--- a/parsers/caffe/CMakeLists.txt
+++ /dev/null
@@ -1,144 +0,0 @@
-#
-# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-############################# GENERATE C++ PROTO FILES ###################################
-protobuf_generate_cpp(CAFFE_PROTO_SRC CAFFE_PROTO_HDR proto/trtcaffe.proto)
-add_custom_target(caffe_proto
- DEPENDS
- ${CAFFE_PROTO_SRC} ${CAFFE_PROTO_HDR}
-)
-############################## BUILD CAFFE PARSER ########################################
-add_custom_target(nvcaffeparserlibs)
-
-set(TARGET_NAME nvcaffeparser)
-set(SHARED_TARGET ${TARGET_NAME})
-set(STATIC_TARGET ${TARGET_NAME}_static)
-
-################################# DEFINE SOURCES ########################################
-include(CaffeParserSources.txt)
-#########################################################################################
-
-################################## SHARED LIBRARY #######################################
-
-add_library(${SHARED_TARGET} SHARED
- ${CAFFE_PARSER_SRCS}
-)
-
-add_dependencies(${SHARED_TARGET} caffe_proto)
-
-target_include_directories(${SHARED_TARGET}
- PUBLIC ${PROJECT_SOURCE_DIR}/include
- PRIVATE .
- PRIVATE caffeParser
- PRIVATE caffeParser/opParsers
- PRIVATE caffeWeightFactory
- PRIVATE ../common
- PRIVATE ${Protobuf_INCLUDE_DIR}
- PRIVATE ${CMAKE_CURRENT_BINARY_DIR}/proto
-)
-
-set_target_properties(${SHARED_TARGET}
- PROPERTIES
- CXX_STANDARD 11
- CXX_STANDARD_REQUIRED YES
- CXX_EXTENSIONS NO
- ARCHIVE_OUTPUT_DIRECTORY "${TRT_OUT_DIR}"
- LIBRARY_OUTPUT_DIRECTORY "${TRT_OUT_DIR}"
- RUNTIME_OUTPUT_DIRECTORY "${TRT_OUT_DIR}"
-)
-
-target_link_libraries(${SHARED_TARGET}
- ${Protobuf_LIBRARY}
- nvinfer
-)
-
-# modify google namespace to avoid namespace collision.
-set(GOOGLE google_private)
-target_compile_definitions(${SHARED_TARGET}
- PRIVATE
- "-Dgoogle=${GOOGLE}"
- "-DGOOGLE_PROTOBUF_ARCH_64_BIT"
-)
-
-set_target_properties(${SHARED_TARGET} PROPERTIES LINK_FLAGS "-Wl,--exclude-libs,ALL")
-
-set_target_properties(${SHARED_TARGET} PROPERTIES DEBUG_POSTFIX ${TRT_DEBUG_POSTFIX})
-
-set_target_properties(${SHARED_TARGET} PROPERTIES VERSION ${TRT_VERSION} SOVERSION ${TRT_SOVERSION} )
-
-set_property(TARGET ${SHARED_TARGET} PROPERTY CUDA_STANDARD 11)
-
-################################## STATIC LIBRARY #######################################
-
-add_library(${STATIC_TARGET} STATIC
- ${CAFFE_PARSER_SRCS}
-)
-
-add_dependencies(${STATIC_TARGET} caffe_proto)
-
-target_include_directories(${STATIC_TARGET}
- PUBLIC ${PROJECT_SOURCE_DIR}/include
- PRIVATE .
- PRIVATE caffeParser
- PRIVATE caffeParser/opParsers
- PRIVATE caffeWeightFactory
- PRIVATE ../common
- PRIVATE ${Protobuf_INCLUDE_DIR}
- PRIVATE ${CMAKE_CURRENT_BINARY_DIR}/proto
-)
-
-set_target_properties(${STATIC_TARGET}
- PROPERTIES
- CXX_STANDARD 11
- CXX_STANDARD_REQUIRED YES
- CXX_EXTENSIONS NO
- ARCHIVE_OUTPUT_DIRECTORY "${TRT_OUT_DIR}"
- LIBRARY_OUTPUT_DIRECTORY "${TRT_OUT_DIR}"
- RUNTIME_OUTPUT_DIRECTORY "${TRT_OUT_DIR}"
-)
-
-target_link_libraries(${STATIC_TARGET}
- ${Protobuf_LIBRARY}
-)
-
-# modify google namespace to avoid namespace collision.
-set(GOOGLE google_private)
-target_compile_definitions(${STATIC_TARGET}
- PRIVATE
- "-Dgoogle=${GOOGLE}"
- "-DGOOGLE_PROTOBUF_ARCH_64_BIT"
-)
-
-set_target_properties(${STATIC_TARGET} PROPERTIES LINK_FLAGS "-Wl,--exclude-libs,ALL")
-
-set_target_properties(${STATIC_TARGET} PROPERTIES DEBUG_POSTFIX ${TRT_DEBUG_POSTFIX})
-
-set_target_properties(${STATIC_TARGET} PROPERTIES VERSION ${TRT_VERSION} SOVERSION ${TRT_SOVERSION} )
-
-set_property(TARGET ${STATIC_TARGET} PROPERTY CUDA_STANDARD 11)
-
-#########################################################################################
-
-add_dependencies(nvcaffeparserlibs ${SHARED_TARGET} ${STATIC_TARGET})
-
-################################### INSTALLATION ########################################
-
-install(TARGETS ${TARGET_NAME}
- RUNTIME DESTINATION bin
- LIBRARY DESTINATION lib
- ARCHIVE DESTINATION lib
-)
diff --git a/parsers/caffe/CaffeParserSources.txt b/parsers/caffe/CaffeParserSources.txt
deleted file mode 100644
index b7f69743..00000000
--- a/parsers/caffe/CaffeParserSources.txt
+++ /dev/null
@@ -1,46 +0,0 @@
-#
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-set(CAFFE_PARSER_SRCS
- ${CAFFE_PROTO_SRC}
- caffeParser/opParsers/opParsers.h
- caffeParser/opParsers/parseAbsVal.cpp
- caffeParser/opParsers/parseBatchNorm.cpp
- caffeParser/opParsers/parseBNLL.cpp
- caffeParser/opParsers/parseClip.cpp
- caffeParser/opParsers/parseConcat.cpp
- caffeParser/opParsers/parseConv.cpp
- caffeParser/opParsers/parseCrop.cpp
- caffeParser/opParsers/parseDeconv.cpp
- caffeParser/opParsers/parseEltwise.cpp
- caffeParser/opParsers/parseELU.cpp
- caffeParser/opParsers/parseInnerProduct.cpp
- caffeParser/opParsers/parseLRN.cpp
- caffeParser/opParsers/parsePermute.cpp
- caffeParser/opParsers/parsePooling.cpp
- caffeParser/opParsers/parsePower.cpp
- caffeParser/opParsers/parsePReLU.cpp
- caffeParser/opParsers/parseReduction.cpp
- caffeParser/opParsers/parseReLU.cpp
- caffeParser/opParsers/parseReshape.cpp
- caffeParser/opParsers/parseScale.cpp
- caffeParser/opParsers/parseSigmoid.cpp
- caffeParser/opParsers/parseSoftMax.cpp
- caffeParser/opParsers/parseTanH.cpp
- caffeWeightFactory/caffeWeightFactory.cpp
- caffeParser/caffeParser.cpp
- NvCaffeParser.cpp
-)
diff --git a/parsers/caffe/binaryProtoBlob.h b/parsers/caffe/binaryProtoBlob.h
deleted file mode 100644
index 79ec2976..00000000
--- a/parsers/caffe/binaryProtoBlob.h
+++ /dev/null
@@ -1,67 +0,0 @@
-/*
- * SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: Apache-2.0
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#ifndef TRT_CAFFE_PARSER_BINARY_PROTO_BLOB_H
-#define TRT_CAFFE_PARSER_BINARY_PROTO_BLOB_H
-#include
-
-#include "NvCaffeParser.h"
-#include "NvInfer.h"
-
-namespace nvcaffeparser1
-{
-class BinaryProtoBlob : public IBinaryProtoBlob
-{
-public:
- BinaryProtoBlob(void* memory, nvinfer1::DataType type, nvinfer1::Dims4 dimensions)
- : mMemory(memory)
- , mDataType(type)
- , mDimensions(dimensions)
- {
- }
-
- nvinfer1::Dims4 getDimensions() noexcept override
- {
- return mDimensions;
- }
-
- nvinfer1::DataType getDataType() noexcept override
- {
- return mDataType;
- }
-
- const void* getData() noexcept override
- {
- return mMemory;
- }
-
- void destroy() noexcept override
- {
- delete this;
- }
-
- ~BinaryProtoBlob() noexcept override
- {
- free(mMemory);
- }
-
- void* mMemory;
- nvinfer1::DataType mDataType;
- nvinfer1::Dims4 mDimensions;
-};
-} // namespace nvcaffeparser1
-#endif // TRT_CAFFE_PARSER_BINARY_PROTO_BLOB_H
diff --git a/parsers/caffe/blobNameToTensor.h b/parsers/caffe/blobNameToTensor.h
deleted file mode 100644
index d685cced..00000000
--- a/parsers/caffe/blobNameToTensor.h
+++ /dev/null
@@ -1,72 +0,0 @@
-/*
- * SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: Apache-2.0
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-#ifndef TRT_CAFFE_PARSER_BLOB_NAME_TO_TENSOR_H
-#define TRT_CAFFE_PARSER_BLOB_NAME_TO_TENSOR_H
-
-#include