Merlin: HugeCTR 23.05.01
What's New in Version 23.05
In this release, we have fixed issues and enhanced the code.
- 3G Embedding Updates:
  - Refactored the `DataDistributor` related code.
  - New SOK `load()` and `dump()` APIs are usable in TensorFlow 2. To use these APIs, specify `sok_vars` in addition to `path`. `sok_vars` is a list of `sok.variable` and/or `sok.dynamic_variable`.
    - If you want to store optimizer states such as `m` and `v` of `Adam`, the `optimizer` must be specified as well.
    - The `optimizer` must be a `tf.keras.optimizers.Optimizer` or `sok.OptimizerWrapper`, and its underlying type must be `SGD`, `Adamax`, `Adadelta`, `Adagrad`, or `Ftrl`.
    ```python
    import sparse_operation_kit as sok

    sok.load(path, sok_vars, optimizer=None)
    sok.dump(path, sok_vars, optimizer=None)
    ```
    These APIs are independent of the number of GPUs in use and the sharding strategy. For instance, a distributed embedding table trained and dumped with 8 GPUs can be loaded to train on a 4-GPU machine.
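    For illustration only, the sketch below wraps the two calls in a hypothetical helper; the helper name and the assumption that the same `sok_vars` list is reused for dumping and loading are not part of the SOK API described above.

    ```python
    import sparse_operation_kit as sok

    def checkpoint_round_trip(path, sok_vars, optimizer=None):
        """Dump SOK embedding variables to `path`, then load them back.

        `sok_vars` is a list of sok.variable and/or sok.dynamic_variable
        objects. To also store optimizer states such as `m` and `v` of Adam,
        pass `optimizer`: a tf.keras.optimizers.Optimizer or
        sok.OptimizerWrapper whose underlying type is SGD, Adamax, Adadelta,
        Adagrad, or Ftrl.
        """
        sok.dump(path, sok_vars, optimizer=optimizer)
        # The dumped table is independent of GPU count and sharding strategy,
        # so it could equally be loaded on a machine with fewer or more GPUs.
        sok.load(path, sok_vars, optimizer=optimizer)
    ```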
- Issues Fixed:
  - Fixed the segmentation fault and wrong initialization that occurred when the embedding table fusion is enabled while using the HPS UVM implementation.
  - `cudaDeviceSynchronize()` is removed when building HugeCTR in debug mode, so you can enable CUDA Graphs even in debug mode.
  - Modified some notebooks to use the most recent version of the NGC container.
  - Fixed the `EmbeddingTableCollection` utest so that it runs correctly with multiple GPUs.
- Known Issues:
  - HugeCTR can lead to a runtime error if client code calls RMM's `rmm::mr::set_current_device_resource()`, because HugeCTR's Parquet Data Reader also calls `rmm::mr::set_current_device_resource()`, which then becomes visible to other libraries in the same process. Refer to [this issue](#356). As a workaround, set the environment variable `HCTR_RMM_SETTABLE` to 0 to prevent HugeCTR from setting a custom RMM device resource, if you know that `rmm::mr::set_current_device_resource()` is called outside HugeCTR. Be cautious, however, as this can affect the performance of Parquet reading.
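    A minimal sketch of this workaround, assuming `HCTR_RMM_SETTABLE` must be present in the process environment before HugeCTR is imported and initialized (the import placement is an assumption, not documented behavior):

    ```python
    import os

    # Keep HugeCTR from installing a custom RMM device resource because
    # rmm::mr::set_current_device_resource() is already called elsewhere in
    # this process. Note: this may reduce Parquet reading performance.
    os.environ["HCTR_RMM_SETTABLE"] = "0"

    import hugectr  # noqa: E402  -- imported after the variable is set (assumption)
    ```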
  - HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources. If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container: `--shm-size=1g --ulimit memlock=-1`. See also this NCCL known issue and this GitHub issue.
  - `KafkaProducers` startup succeeds even if the target Kafka broker is unresponsive. To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.
  - The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.
  - Joint loss training with a regularizer is not supported.
  - Dumping Adam optimizer states to AWS S3 is not supported.