Releases: NVIDIA-Merlin/HugeCTR
Merlin: HugeCTR 24.06
What's New in Version 24.06
- Sparse Operation Kit (SOK) Updates:
  - A new API, `sok.incremental_dump`, has been added. It lets users dump newly added keys and values into a NumPy array by specifying a time threshold. Currently it only supports `sok.DynamicVariable` instances that use HKV as the backend.
  - Fixed the issue of wgrad using too much GPU memory.
  - Fixed an illegal memory access issue in a CUDA kernel during backward propagation.
  - The documentation and examples for SOK have been updated. For more details, refer to the SOK Documentation.
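The idea behind an incremental dump with a time threshold can be sketched in plain Python. This is a toy model, not the SOK implementation and not the `sok.incremental_dump` signature: the class `ToyDynamicTable`, its methods, and the "written at or after the threshold" rule are all assumptions made for illustration, and plain lists stand in for NumPy arrays.

```python
import time


class ToyDynamicTable:
    """Toy key-value table that remembers when each key was last written."""

    def __init__(self):
        self._rows = {}  # key -> (value, last_update_timestamp)

    def upsert(self, key, value, now=None):
        # Record the write time so that later dumps can filter on it.
        stamp = time.time() if now is None else now
        self._rows[key] = (value, stamp)

    def incremental_dump(self, time_threshold):
        # Assumption: keep only entries written at or after the threshold.
        fresh = sorted(
            (key, value)
            for key, (value, stamp) in self._rows.items()
            if stamp >= time_threshold
        )
        keys = [key for key, _ in fresh]
        values = [value for _, value in fresh]
        return keys, values


table = ToyDynamicTable()
table.upsert(1, [0.1, 0.2], now=100.0)
table.upsert(2, [0.3, 0.4], now=200.0)
table.upsert(3, [0.5, 0.6], now=300.0)

# Only the keys written at or after t=200 are returned.
keys, values = table.incremental_dump(time_threshold=200.0)
```

A dump like this stays small when only a fraction of the table changed since the last checkpoint, which is the motivation for filtering by time.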
Merlin: HugeCTR 24.04
v24.04.00 Remove some internal files (#447)
Merlin: HugeCTR 23.12
What's New in Version 23.12
- Lock-free Inference Cache in HPS
  - We have added a new lock-free GPU embedding cache for the hierarchical parameter server (HPS), which can further improve the performance of embedding table lookup in inference. It does not lead to data inconsistency even when concurrent model updates or missing-key insertions are in use, because cache consistency is ensured through an asynchronous stream synchronization mechanism. To enable the lock-free GPU embedding cache, set `"embedding_cache_type"` to `dynamic` and `"use_hctr_cache_implementation"` to `false`.
- Official SOK Release
  - SOK is no longer an `experiment` package; it is now officially supported by HugeCTR. Use `import sparse_operation_kit as sok` instead of `from sparse_operation_kit import experiment as sok`.
  - `sok.DynamicVariable` supports Merlin-HKV as its backend.
  - Parallel dump and load functions have been added.
- Code Cleaning and Deprecation
  - Deprecated the `Model::export_predictions` function. Use the `Model::check_out_tensor` function instead.
  - Deprecated the `Norm` and legacy `Raw` DataReaders. Use `hugectr.DataReaderType_t.RawAsync` or `hugectr.DataReaderType_t.Parquet` as alternatives.
- Issues Fixed:
  - Improved the performance of the HKV lookup via SOK.
  - Fixed an illegal memory access issue in the SOK backward pass that occurred in a corner case.
  - Resolved an issue with the mean combiner when the pooling factor is zero, which could make the SOK lookup return NaN.
  - Fixed some dependency-related build issues.
  - Optimized the performance of the dynamic embedding table (DET) in SOK.
  - Fixed the crash that occurred when a user specified negative keys while using the DET via SOK.
  - Resolved an occasional correctness issue that became visible during the backward propagation phase of SOK when handling thousands of embedding tables.
  - Fixed the runtime errors that occurred with TensorFlow >= 2.13.
- Known Issues:
  - If `max_eval_batches` and `batchsize_eval` are set to large values such as 5000 and 12000 respectively, training fails with an illegal memory access error. The issue originates in CUB and is fixed in its latest version. However, that version is only included in CUDA 12.3, which our NGC container does not use yet. Until the NGC container is updated to that CUDA version, rebuild HugeCTR with the newest CUB as a workaround, or avoid such large values of `max_eval_batches` and `batchsize_eval`.
  - HugeCTR can cause a runtime error if client code calls RMM's `rmm::mr::set_current_device_resource()`, because HugeCTR's Parquet Data Reader also calls `rmm::mr::set_current_device_resource()`, and the change becomes visible to other libraries in the same process. Refer to [this issue](#356). As a workaround, if you know that `rmm::mr::set_current_device_resource()` is called outside HugeCTR, set the environment variable `HCTR_RMM_SETTABLE` to 0 to prevent HugeCTR from setting a custom RMM device resource. Be cautious, because this can affect the performance of Parquet reading.
  - HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also this NCCL known issue and [this GitHub issue](#243).
  - `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
  - The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
  - Joint loss training with a regularizer is not supported.
  - Dumping Adam optimizer states to AWS S3 is not supported.
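The two settings named under "Lock-free Inference Cache in HPS" for 23.12 could be written into an HPS JSON configuration roughly as follows. This is a minimal sketch: only the keys `embedding_cache_type` and `use_hctr_cache_implementation` and their values come from the release note. The surrounding `models` list, the `model` key, and the model name are assumptions about the file layout, so check the HPS configuration documentation for the exact placement.

```python
import json

# Hypothetical model entry; only the two cache keys come from the release note.
model_entry = {
    "model": "example_model",                # hypothetical model name
    "embedding_cache_type": "dynamic",       # selects the dynamic GPU embedding cache
    "use_hctr_cache_implementation": False,  # false selects the lock-free cache
}

# Assumed top-level layout of the HPS configuration file.
hps_config = {"models": [model_entry]}

# Serialize to the JSON text that would be saved to the HPS config file.
config_text = json.dumps(hps_config, indent=2)
print(config_text)
```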
Merlin: HugeCTR 23.09
What's New in Version 23.09
- Code Cleaning and Deprecation
  - The offline inference has been deprecated from our documentation, notebook suite, and code. Please check out the HPS plugin for TensorFlow and TensorRT. The multi-GPU inference is not illustrated in this HPS TRT notebook.
  - We are working on deprecating the Embedding Training Cache (ETC). If you try to use that feature, it still works but emits a deprecation warning message. In a near-future release, it will be removed from the API and the code. Please refer to NVIDIA HierarchicalKV as an alternative.
  - In this release, we have also cleaned up our C++ code and CMakeLists.txt to improve their maintainability and fix minor but potential issues. There will be more code cleanup in several future releases.
- General Updates:
  - Enabled support for the static CUDA runtime. You can experimentally enable the feature by specifying `-DUSE_CUDART_STATIC=ON` when configuring the code with CMake; the dynamic CUDA runtime is still used by default.
  - Added HPS as a custom extension for TorchScript. A user can leverage the HPS embedding lookup during inference of a scripted torch module.
- Issues Fixed:
  - Resolved a couple of performance regressions when SOK is used together with HKV, related to the unique operation and unified memory.
  - Reduced the unnecessary memory consumption of intermediate buffers when loading and dumping the SOK embedding.
  - Fixed the Interaction Layer to support large `num_slots`.
  - Resolved the occasional runtime error when using multiple H800 GPUs.
- Known Issues:
  - If `max_eval_batches` and `batchsize_eval` are set to large values such as 5000 and 12000 respectively, training fails with an illegal memory access error. The issue originates in CUB and is fixed in its latest version. However, that version is only included in CUDA 12.3, which our NGC container does not use yet. Until the NGC container is updated to that CUDA version, rebuild HugeCTR with the newest CUB as a workaround, or avoid such large values of `max_eval_batches` and `batchsize_eval`.
  - HugeCTR can cause a runtime error if client code calls RMM's `rmm::mr::set_current_device_resource()`, because HugeCTR's Parquet Data Reader also calls `rmm::mr::set_current_device_resource()`, and the change becomes visible to other libraries in the same process. Refer to [this issue](#356). As a workaround, if you know that `rmm::mr::set_current_device_resource()` is called outside HugeCTR, set the environment variable `HCTR_RMM_SETTABLE` to 0 to prevent HugeCTR from setting a custom RMM device resource. Be cautious, because this can affect the performance of Parquet reading.
  - HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also this NCCL known issue and [this GitHub issue](#243).
  - `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
  - The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
  - Joint loss training with a regularizer is not supported.
  - Dumping Adam optimizer states to AWS S3 is not supported.
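Two of the known issues above lend themselves to a quick check before training starts: the file-count rule for data reader workers and the `HCTR_RMM_SETTABLE` workaround. The following is a minimal sketch. The function name `preflight` and its parameters are hypothetical, and setting the variable before HugeCTR is imported or initialized is an assumption about when it is read.

```python
import os


def preflight(file_list, num_workers, rmm_set_externally):
    """Return warnings for two of the known issues (hypothetical helper)."""
    warnings = []

    # Known issue: fewer data files than workers maps several workers
    # to the same file, and data loading does not progress as expected.
    if len(file_list) < num_workers:
        warnings.append(
            f"{len(file_list)} data files for {num_workers} data reader workers; "
            "use at least as many files as workers"
        )

    # Known issue: if the client code sets the RMM device resource itself,
    # ask HugeCTR not to set its own. Assumption: this must be done before
    # HugeCTR is imported or initialized in the process.
    if rmm_set_externally:
        os.environ["HCTR_RMM_SETTABLE"] = "0"

    return warnings


issues = preflight(["part0.parquet", "part1.parquet"], num_workers=4,
                   rmm_set_externally=True)
for message in issues:
    print("WARNING:", message)
```

Note that disabling the custom RMM resource can slow down Parquet reading, as the known issue states, so the flag is only set when the caller asks for it.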
Merlin: HugeCTR 23.08
What's New in Version 23.08
- Hierarchical Parameter Server:
  - Support static EC fp8 quantization
    We support fp8 quantization in the static embedding cache. When the `fp8_quant` configuration is enabled, HPS performs fp8 quantization on the embedding vectors when reading the embedding table, and performs fp32 dequantization on the embedding vectors corresponding to the queried embedding keys in the static embedding cache, so as to ensure the accuracy of the dense part prediction.
  - Large model deployment demo based on HPS TensorRT-plugin
    This demo shows how to use the HPS TRT-plugin to build a complete TRT engine for deploying a 147GB embedding table based on a 1TB Criteo dataset. We also provide a static embedding implementation for fully offloading embedding tables to host page-locked memory for benchmarks on x86 and Grace Hopper Superchip.
  - Issues Fixed
    - Resolved a Kafka update ingestion error. There was an error that prevented handing over online parameter updates coming from Kafka message queues to Redis database backends.
    - Fixed an issue where the HPS Triton backend re-initialized the embedding cache because of an undefined null when getting the embedding cache on the corresponding device.
- HugeCTR Training & SOK:
  - Dense Embedding Support in Embedding Collection
    We have added dense embedding to the embedding collection. To use the dense embedding, a user just needs to specify `_concat_` as the combiner. For more information, refer to dense_embedding.py.
  - Refined the sequence mask layer and attention softmax layer to support cross-attention.
  - Introduced a more generalized reshape layer that allows users to reshape a source tensor to a destination tensor without dimension restrictions. Refer to the Reshape Layer API for more details.
  - Issues Fixed
    - Fixed an error when using Localized Variable in Sparse Operation Kit.
    - Fixed a bug in the Sparse Operation Kit backward computation.
    - Fixed some SOK performance bugs by replacing the calls to `DeviceSegmentedSort` with `DeviceSegmentedRadixSort`.
    - Fixed a bug on the SOK Python API side that led to duplicate calls to the model's forward function and thus degraded performance.
    - Reduced the CPU launch overhead.
    - Removed dynamic vector allocation in `DataDistributor`.
    - Removed the use of the checkout value tensor from the DataReader. The data reader generated a nested `std::vector` on the fly and returned the vector to the embedding collection, which incurred a lot of host overhead. We have made it a class member so that the overhead is eliminated.
    - Aligned with the latest Parquet update. We have fixed a bug due to the `parquet_reader_options::set_num_rows()` update in cudf 23.06: PR.
    - Fixed a core23 assertion in debug mode. We have fixed an assertion bug that occurred when the new core library is enabled and HugeCTR is built in debug mode.
- General Updates:
  - Cleaned up logging code. Added compile-time format-string validation. Fixed an issue where `HCTR_PRINT` did not interpret format strings properly.
  - Enabled experimental support for the static CUDA runtime. Use `-DUSE_CUDART_STATIC=ON` when running CMake.
  - Modified the data preprocessing documentation to clarify the correct commands to use in different situations. Fixed errors in the descriptions of arguments.
- Known Issues:
  - HugeCTR can cause a runtime error if client code calls RMM's `rmm::mr::set_current_device_resource()`, because HugeCTR's Parquet Data Reader also calls `rmm::mr::set_current_device_resource()`, and the change becomes visible to other libraries in the same process. Refer to [this issue](#356). As a workaround, if you know that `rmm::mr::set_current_device_resource()` is called outside HugeCTR, set the environment variable `HCTR_RMM_SETTABLE` to 0 to prevent HugeCTR from setting a custom RMM device resource. Be cautious, because this can affect the performance of Parquet reading.
  - HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also this NCCL known issue and [this GitHub issue](#243).
  - `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
  - The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
  - Joint loss training with a regularizer is not supported.
  - Dumping Adam optimizer states to AWS S3 is not supported.
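The dense embedding feature in 23.08 is selected with the `_concat_` combiner. The difference between a pooling combiner and a concatenating one can be sketched in plain Python. This is a conceptual illustration only, not HugeCTR code: the helper names are hypothetical, and the assumption is that a concat-style combiner keeps every looked-up vector instead of reducing them to one.

```python
def lookup(table, keys):
    """Fetch the embedding vector for each key, in order."""
    return [table[key] for key in keys]


def sum_combiner(vectors):
    """Pooling combiner: element-wise sum, output has one vector's width."""
    return [sum(column) for column in zip(*vectors)]


def concat_combiner(vectors):
    """Concat-style combiner: keep all vectors, laid out back to back."""
    return [x for vector in vectors for x in vector]


# Toy embedding table with vector size 2.
table = {7: [1.0, 2.0], 9: [3.0, 4.0]}
vectors = lookup(table, [7, 9])

pooled = sum_combiner(vectors)    # width 2
dense = concat_combiner(vectors)  # width 4: one slot per looked-up key
```

The pooled output has a fixed width regardless of how many keys are looked up, while the concatenated output grows with the number of keys, which is why it suits sequence-style inputs such as those fed to attention layers.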
Merlin: HugeCTR 23.06.01
Merge branch 'reworking-cleaning-mlperf-junzhang' into 'main' Remove reader checkout value tensor introduced by core23 reworking See merge request dl/hugectr/hugectr!1398
Merlin: HugeCTR 23.06
Merge branch 'update_hugectr_version_23.6.0' into 'main' Update new version: 23.6.0 See merge request dl/hugectr/hugectr!1388
Merlin: HugeCTR 23.05
What's New in Version 23.05
In this release, we have fixed issues and enhanced the code.
- 3G Embedding Updates:
  - Refactored the `DataDistributor` related code.
  - New SOK `load()` and `dump()` APIs are usable in TensorFlow 2. To use the APIs, specify `sok_vars` in addition to `path`.
    - `sok_vars` is a list of `sok.variable` and/or `sok.dynamic_variable`.
    - If you want to store optimizer states such as `m` and `v` of `Adam`, the `optimizer` must be specified as well.
    - The `optimizer` must be a `tf.keras.optimizers.Optimizer` or `sok.OptimizerWrapper`, and its underlying type must be `SGD`, `Adamax`, `Adadelta`, `Adagrad`, or `Ftrl`.

        import sparse_operation_kit as sok
        sok.load(path, sok_vars, optimizer=None)
        sok.dump(path, sok_vars, optimizer=None)

    These APIs are independent of the number of GPUs in use and the sharding strategy. For instance, a distributed embedding table trained and dumped with 8 GPUs can be loaded to train on a 4-GPU machine.
- Issues Fixed:
  - Fixed the segmentation fault and wrong initialization that occurred when embedding table fusion is enabled with the HPS UVM implementation.
  - `cudaDeviceSynchronize()` is removed when building HugeCTR in debug mode, so you can enable CUDA Graph even in debug mode.
  - Modified some notebooks to use the most recent version of the NGC container.
  - Fixed the `EmbeddingTableCollection` utest to run correctly with multiple GPUs.
- Known Issues:
  - HugeCTR can cause a runtime error if client code calls RMM's `rmm::mr::set_current_device_resource()`, because HugeCTR's Parquet Data Reader also calls `rmm::mr::set_current_device_resource()`, and the change becomes visible to other libraries in the same process. Refer to [this issue](#356). As a workaround, if you know that `rmm::mr::set_current_device_resource()` is called outside HugeCTR, set the environment variable `HCTR_RMM_SETTABLE` to 0 to prevent HugeCTR from setting a custom RMM device resource. Be cautious, because this can affect the performance of Parquet reading.
  - HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also this NCCL known issue and this GitHub issue.
  - `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
  - The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
  - Joint loss training with a regularizer is not supported.
  - Dumping Adam optimizer states to AWS S3 is not supported.
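The 23.05 notes state that the SOK `load()` and `dump()` APIs are independent of the GPU count and sharding strategy. One way to picture why is that a dump can be stored as a flat key-to-vector mapping that any later sharding can consume. The sketch below is a toy model in plain Python, not SOK's implementation or file format: the helper names and the `key % num_gpus` sharding rule are assumptions made for illustration.

```python
def shard(table, num_gpus):
    """Split a key -> vector table across GPUs (assumed rule: key % num_gpus)."""
    shards = [dict() for _ in range(num_gpus)]
    for key, vector in table.items():
        shards[key % num_gpus][key] = vector
    return shards


def dump(shards):
    """Merge all per-GPU shards into one flat, sharding-agnostic mapping."""
    merged = {}
    for part in shards:
        merged.update(part)
    return merged


def load(dumped, num_gpus):
    """Re-shard a flat dump for a possibly different number of GPUs."""
    return shard(dumped, num_gpus)


# Train on 8 "GPUs", dump, then load onto 4 "GPUs".
full_table = {key: [float(key)] for key in range(16)}
dumped = dump(shard(full_table, 8))
restored = load(dumped, 4)
```

Because the dumped form carries no trace of the original 8-way split, loading onto 4 GPUs needs no knowledge of how the table was trained.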
Merlin: HugeCTR 23.05.01
What's New in Version 23.05
In this release, we have fixed issues and enhanced the code.
- 3G Embedding Updates:
  - Refactored the `DataDistributor` related code.
  - New SOK `load()` and `dump()` APIs are usable in TensorFlow 2. To use the APIs, specify `sok_vars` in addition to `path`.
    - `sok_vars` is a list of `sok.variable` and/or `sok.dynamic_variable`.
    - If you want to store optimizer states such as `m` and `v` of `Adam`, the `optimizer` must be specified as well.
    - The `optimizer` must be a `tf.keras.optimizers.Optimizer` or `sok.OptimizerWrapper`, and its underlying type must be `SGD`, `Adamax`, `Adadelta`, `Adagrad`, or `Ftrl`.

        import sparse_operation_kit as sok
        sok.load(path, sok_vars, optimizer=None)
        sok.dump(path, sok_vars, optimizer=None)

    These APIs are independent of the number of GPUs in use and the sharding strategy. For instance, a distributed embedding table trained and dumped with 8 GPUs can be loaded to train on a 4-GPU machine.
- Issues Fixed:
  - Fixed the segmentation fault and wrong initialization that occurred when embedding table fusion is enabled with the HPS UVM implementation.
  - `cudaDeviceSynchronize()` is removed when building HugeCTR in debug mode, so you can enable CUDA Graph even in debug mode.
  - Modified some notebooks to use the most recent version of the NGC container.
  - Fixed the `EmbeddingTableCollection` utest to run correctly with multiple GPUs.
- Known Issues:
  - HugeCTR can cause a runtime error if client code calls RMM's `rmm::mr::set_current_device_resource()`, because HugeCTR's Parquet Data Reader also calls `rmm::mr::set_current_device_resource()`, and the change becomes visible to other libraries in the same process. Refer to [this issue](#356). As a workaround, if you know that `rmm::mr::set_current_device_resource()` is called outside HugeCTR, set the environment variable `HCTR_RMM_SETTABLE` to 0 to prevent HugeCTR from setting a custom RMM device resource. Be cautious, because this can affect the performance of Parquet reading.
  - HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also this NCCL known issue and this GitHub issue.
  - `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
  - The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
  - Joint loss training with a regularizer is not supported.
  - Dumping Adam optimizer states to AWS S3 is not supported.
Merlin: HugeCTR 23.04
What's New in Version 23.04
- Hierarchical Parameter Server Enhancements:
  - HPS Table Fusion: From this release, you can fuse tables of the same embedding vector size in HPS. We support this feature in the HPS plugin for TensorFlow and the Triton backend for HPS. To turn on table fusion, set `fuse_embedding_table` to `true` in the HPS JSON file. This feature requires that the key values in different tables do not overlap and that the embedding lookup layers do not depend on each other in the model graph. For more information, refer to HPS configuration and the HPS table fusion demo notebook. This feature can reduce the embedding lookup latency significantly when there are multiple tables and the GPU embedding cache is employed. About 3x speedup is achieved on V100 for the fused case demonstrated in the notebook compared to the unfused one.
  - UVM Support: We have upgraded the static embedding solution. For embedding tables whose size exceeds the device memory, we save high-frequency embeddings in the HBM as an embedding cache and offload the remaining embeddings to the UVM. Compared with the dynamic cache solution that offloads the remaining embeddings to the Volatile DB, the UVM solution has higher CPU lookup throughput. We will support online updating of the UVM solution in a future release. Users can switch between different embedding cache solutions through the `embedding_cache_type` configuration parameter.
  - Triton Perf Analyzer's Request Generator: We have added an inference request generator to generate the JSON request format required by Triton Perf Analyzer. By using this request generator together with the model generator, you can use the Triton Perf Analyzer to profile the HPS performance and do stress testing. For API documentation and demo usage, please refer to the README.
- General Updates:
  - DenseLayerComputeConfig: MLP and CrossLayer support asynchronous weight gradient computations with data gradient backpropagation when training. We have added a new member `hugectr.DenseLayerComputeConfig` to `hugectr.DenseLayer` for configuring the computing behavior. The knob for enabling asynchronous weight gradient computations has been moved from `hugectr.CreateSolver` to `hugectr.DenseLayerComputeConfig.async_wgrad`. The knob for controlling the fusion mode of weight gradients and bias gradients has been moved from `hugectr.DenseLayerSwitchs` to `hugectr.DenseLayerComputeConfig.fuse_wb`.
  - Hopper Architecture Support: Users can build HugeCTR from scratch with the compute capability 9.0 (`-DSM=90`), so that it can run on Hopper architectures. Note that our NGC container does not support the compute capability yet. Users who are unfamiliar with how to build HugeCTR can refer to the HugeCTR Contribution Guide.
  - RoCE Support for Hybrid Embedding: With the parameter `CommunicationType.IB_NVLink_Hier` in HybridEmbeddingParams, RoCE is supported. We have also added two environment variables, `HUGECTR_ROCE_GID` and `HUGECTR_ROCE_TC`, so that a user can control the RoCE NIC's GID and traffic class. See https://nvidia-merlin.github.io/HugeCTR/main/api/python_interface.html#hybridembeddingparam-class
- Documentation Updates:
  - Data Reader: We have enhanced our Raw data reader to read multi-hot input data, connecting with an embedding collection seamlessly. The Raw dataset format is strengthened as well. Refer to our online documentation for more details. We have refined the description of the Norm dataset as well.
  - Embedding Collection: We have added the knob `is_exclusive_keys` to enable potential acceleration if a user has already preprocessed the input of the embedding collection to make the resulting tables exclusive of one another. We have also added the knob `comm_strategy` to the embedding collection so that users can configure an optimized communication strategy in multi-node training.
  - HPS Plugin: We have fixed the unit of measurement for DLRM inference benchmark results that leverage the HPS plugin. We have updated the user guide for the HPS plugin for TensorFlow and the HPS plugin for TensorRT.
  - Embedding Cache: We have updated the usage and the descriptions of the three types of embedding cache.
- Issues Fixed:
  - We added a slots emptiness check to prevent `SparseParam` from being misused.
  - We revised the MPI lifetime service to become the MPI init service, with slightly greater scope and a clearer interface. In this effort, we also fixed a rare bug that could lead to access violations during the MPI shutdown procedure.
  - We fixed a segmentation fault that occurs when a GPU has no embedding wgrad to update.
  - SOK build & runtime error related to TF version: We made the [SOK Experiment](https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/sparse_operation_kit/experiment) compatible with TensorFlow >= v2.11.0. The legacy SOK does not support that version or newer versions of TensorFlow.
  - HPS previously required CPU memory to be at least 2.5x larger than the model size during its initialization. From this release, we parse the model embedding files in chunks, which reduces the required memory to 1.3x the model size.
- Known Issues:
  - HugeCTR can cause a runtime error if client code calls RMM's `rmm::mr::set_current_device_resource()`, because HugeCTR's Parquet Data Reader also calls `rmm::mr::set_current_device_resource()`, and the change becomes visible to other libraries in the same process. Refer to [this issue](#356). As a workaround, if you know that `rmm::mr::set_current_device_resource()` is called outside HugeCTR, set the environment variable `HCTR_RMM_SETTABLE` to 0 to prevent HugeCTR from setting a custom RMM device resource. Be cautious, because this can affect the performance of Parquet reading.
  - HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also this NCCL known issue and [this GitHub issue](#243).
  - `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
  - The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
  - Joint loss training with a regularizer is not supported.
  - Dumping Adam optimizer states to AWS S3 is not supported.
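The HPS table fusion notes for 23.04 name two data-level preconditions: the fused tables must share one embedding vector size, and their keys must not overlap. A check for these two conditions could be sketched as follows. The function `can_fuse` and the table description format are hypothetical, and the third requirement (lookup layers that do not depend on each other in the model graph) is not checked here.

```python
def can_fuse(tables):
    """Check the two data-level preconditions for table fusion.

    `tables` is a list of dicts with 'vec_size' (int) and 'keys' (iterable).
    This layout is an assumption for the sketch, not an HPS data structure.
    """
    # All tables must share one embedding vector size.
    if len({t["vec_size"] for t in tables}) != 1:
        return False

    # Keys must not overlap between any two tables.
    seen = set()
    for t in tables:
        keys = set(t["keys"])
        if seen & keys:
            return False
        seen |= keys
    return True


user_table = {"vec_size": 16, "keys": range(0, 1000)}
item_table = {"vec_size": 16, "keys": range(1000, 5000)}
wide_table = {"vec_size": 32, "keys": range(5000, 6000)}

fusable = can_fuse([user_table, item_table])        # same size, disjoint keys
mixed_sizes = can_fuse([user_table, wide_table])    # vector sizes differ
overlapping = can_fuse([user_table, user_table])    # identical key ranges

# Setting from the release note; where it sits in the HPS JSON file is not shown here.
fusion_setting = {"fuse_embedding_table": True}
```

If keys do overlap, one common remedy is to add a per-table offset to the keys during preprocessing so that each table occupies its own key range.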