Merlin: HugeCTR 23.05

@minseokl released this 18 May 04:35

What's New in Version 23.05

In this release, we have fixed issues and enhanced the code.

  • 3G Embedding Updates:

    • Refactored the DataDistributor-related code.
    • New SOK load() and dump() APIs are usable in TensorFlow 2. To use the APIs, specify sok_vars in addition to path.
    • sok_vars is a list of sok.variable and/or sok.dynamic_variable objects.
    • If you want to store optimizer states such as the m and v of Adam, the optimizer must be specified as well.
    • The optimizer must be a tf.keras.optimizers.Optimizer or sok.OptimizerWrapper, and its underlying type must be SGD, Adamax, Adadelta, Adagrad, or Ftrl.
    import sparse_operation_kit as sok
    
    sok.load(path, sok_vars, optimizer=None)
    
    sok.dump(path, sok_vars, optimizer=None)

    These APIs are independent of the number of GPUs in use and of the sharding strategy. For instance, a distributed embedding table trained and dumped with 8 GPUs can be loaded to train on a 4-GPU machine.
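
    The following is a minimal sketch of how the new APIs fit together. Only sok.load(), sok.dump(), the sok_vars list, and the optimizer argument come from this release; the sok.init() call, the variable constructors, and the checkpoint path are assumptions based on the SOK TensorFlow 2 API and may differ in your SOK version.

      import tensorflow as tf
      import sparse_operation_kit as sok

      # Assumed SOK initialization call for TensorFlow 2; a real run typically
      # also requires distributed setup (for example, Horovod).
      sok.init()

      # Assumed constructors: one static and one dynamic embedding table.
      emb_static = sok.Variable(tf.random.uniform([1024, 16]))
      emb_dynamic = sok.DynamicVariable(dimension=16)
      sok_vars = [emb_static, emb_dynamic]

      # Wrap a supported dense optimizer (for example, Adagrad) so that its
      # per-row slot states are dumped and loaded together with the weights.
      optimizer = sok.OptimizerWrapper(tf.keras.optimizers.Adagrad(learning_rate=0.1))

      # ... training loop ...

      # Dump embedding weights and optimizer states to a hypothetical path.
      sok.dump("/tmp/sok_ckpt", sok_vars, optimizer=optimizer)

      # Load them back later, possibly on a machine with a different GPU count
      # or a different sharding strategy.
      sok.load("/tmp/sok_ckpt", sok_vars, optimizer=optimizer)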

  • Issues Fixed:

    • Fixed a segmentation fault and incorrect initialization that occurred when embedding table fusion is enabled while using the HPS UVM implementation.
    • Removed cudaDeviceSynchronize() when building HugeCTR in debug mode, so you can enable CUDA Graphs even in debug mode.
    • Modified some notebooks to use the most recent version of the NGC container.
    • Fixed the EmbeddingTableCollection unit test so that it runs correctly with multiple GPUs.
  • Known Issues:

    • HugeCTR can lead to a runtime error if client code calls RMM’s rmm::mr::set_current_device_resource(), because HugeCTR’s Parquet data reader also calls rmm::mr::set_current_device_resource() and the device resource it sets becomes visible to other libraries in the same process. Refer to this issue (#356). As a workaround, if you know that rmm::mr::set_current_device_resource() is called outside HugeCTR, set the environment variable HCTR_RMM_SETTABLE to 0 to prevent HugeCTR from setting a custom RMM device resource. Be cautious, though, as this could affect the performance of Parquet reading.
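
      For example, assuming a Bash shell, the workaround is set before launching the HugeCTR process:

        export HCTR_RMM_SETTABLE=0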

    • HugeCTR uses NCCL to share data between ranks, and NCCL can require shared system memory for IPC and pinned (page-locked) system memory resources.
      If you use NCCL inside a container, increase these resources by specifying the following arguments when you start the container:

        --shm-size=1g --ulimit memlock=-1

      See also this NCCL known issue and this GitHub issue.
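
      For example, a complete container start command might look like the following; the image name and tag are placeholders for whichever HugeCTR NGC container you actually use.

        docker run --gpus=all -it --shm-size=1g --ulimit memlock=-1 nvcr.io/nvidia/merlin/merlin-hugectr:23.05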

    • The KafkaProducers startup succeeds even if the target Kafka broker is unresponsive.
      To avoid data loss in conjunction with streaming-model updates from Kafka, make sure that a sufficient number of Kafka brokers are running, operating properly, and reachable from the node where you run HugeCTR.

    • The number of data files in the file list should be greater than or equal to the number of data reader workers. Otherwise, different workers are mapped to the same file and data loading does not progress as expected.

    • Joint loss training with a regularizer is not supported.

    • Dumping Adam optimizer states to AWS S3 is not supported.