- Support tf.int32 dtype using feature_column API tf.feature_column.categorical_column_with_embedding.
- Make the rules of export frequencies and versions the same as the rule of export keys.
- Optimize CUDA kernel implementation in GroupEmbedding.
- Support reading embedding files with mmap and madvise, and with direct IO.
- Add double check in find_wait_free of lockless dense hashmap.
- Change Embedding init value of version in EV from 0 to -1.
- Keep the 'GetSnapshot()' interface backward compatible.
- Implement CPU GroupEmbedding lookup sparse Op.
- Make GroupEmbedding compatible with sequence feature_column interface.
- Fix sp_weights indices calculation error in GroupEmbedding.
- Add group_strategy to control parallelism of group_embedding.
- Support SparseTensor as placeholder in Sample-awared Graph Compression.
- Add Dice fusion grappler and ops.
- Enable MKL Matmul + Bias + LeakyRelu fusion.
- Avoid unnecessary polling in EventMgr.
- Reduce lock cost and memory usage in EventMgr when using multi-stream.
- Register GPU implementation of int64 type for Prod.
- Register GPU implementation of string type for Shape, ShapeN and ExpandDims.
- Optimize list of GPU SegmentReductionOps.
- Optimize zeros_like_impl by reducing calls to convert_to_tensor.
- Implement GPU version of SparseSlice Op.
- Delay Reshape when rank > 2 in keras.layers.Dense so that post op can be fused with MatMul.
- Implement setting max_num_threads hint to oneDNN at compile time.
- Implement TensorPackTransH2DOp to improve SmartStage performance on GPU.
- Add tensor shape meta-data support for ParquetDataset.
- Add arrow BINARY type support for ParquetDataset.
- Add Dice fusion to inference mode.
- Enable INFERENCE_MODE in processor.
- Support TensorRT 8.x in Inference.
- Add configuration field to control whether TensorRT is enabled.
- Add flag for device_placement_optimization.
- Avoid clustering feature column related nodes when TensorRT is enabled.
- Optimize inference latency when loading incremental checkpoint.
- Optimize performance by placing only TensorRT ops on the GPU device.
- Support CUDA 12.
- Update DEFAULT_CUDA_VERSION and DEFAULT_CUDNN_VERSION in configure.py.
- Move third-party dependencies from WORKSPACE to workspace.bzl.
- Update URLs corresponding to colm, ragel, aliyun-oss-sdk and uuid.
- Fix constant op placing bug for device placement optimization.
- Fix NaN issue in group_embedding API.
- Fix the issue that SOK is not compatible with variable.
- Fix memory leak when updating full model in serving.
- Fix 'cols_to_output_tensors' not being set in GroupEmbedding.
- Fix core dump issue when saving GPU EmbeddingVariable.
- Fix CUDA resource issue in KvResourceImportV3 kernel.
- Fix bug in loading signature_def with coo_sparse and add UT.
- Fix the bug that the training ends early when the workqueue is enabled.
- Fix the control edge connection issue in device placement optimization.
- Modify GroupEmbedding related function usage.
- Update masknet example with layernorm.
- Add tool for removing filtered features from checkpoint.
- Add Arm Compute Library (ACL) user documents.
- Update Embedding Variable document to fix initializer config example.
- Update GroupEmbedding document.
- Update processor documents.
- Add user documents for Intel AMX.
- Add TensorRT usage documents.
- Update documents for ParquetDataset.
More details of features: https://deeprec.readthedocs.io/zh/latest/
alideeprec/deeprec-release:deeprec2304-cpu-py38-ubuntu20.04
alideeprec/deeprec-release:deeprec2304-gpu-py38-cu116-ubuntu20.04
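The image names listed in these notes appear to share one tag layout: release (optionally with an update suffix such as u1), device, Python version, CUDA version for GPU images only, and OS. A minimal sketch of decoding that layout is shown here. The pattern is inferred only from the tags in these notes, not from an official specification, and parse_image and ImageTag are hypothetical helper names.

```python
import re
from typing import NamedTuple, Optional


class ImageTag(NamedTuple):
    """Fields decoded from a tag (hypothetical helper type)."""
    release: str         # e.g. "2304" or "2212u1"
    device: str          # "cpu" or "gpu"
    python: str          # e.g. "py38"
    cuda: Optional[str]  # e.g. "cu116"; None when the tag has no CUDA part
    os: str              # e.g. "ubuntu20.04"


# Assumption: layout inferred from the tags listed in these release notes.
_TAG_RE = re.compile(
    r"^deeprec(?P<release>\d{4}(?:u\d+)?)"
    r"-(?P<device>cpu|gpu)"
    r"-(?P<python>py\d+)"
    r"(?:-(?P<cuda>cu\d+))?"
    r"-(?P<os>ubuntu\d+\.\d+)$"
)


def parse_image(image: str) -> ImageTag:
    """Split 'repository:tag' and decode the tag fields.

    Raises ValueError if there is no tag or the tag does not match
    the inferred layout.
    """
    _, sep, tag = image.rpartition(":")
    if not sep:
        raise ValueError(f"no tag in image name: {image!r}")
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag layout: {tag!r}")
    return ImageTag(**match.groupdict())


if __name__ == "__main__":
    print(parse_image(
        "alideeprec/deeprec-release:deeprec2304-gpu-py38-cu116-ubuntu20.04"))
```

Such a helper could be used, for example, to check that a deployment script selects a GPU image whose CUDA version matches the host driver before pulling it.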
- Support same saver graph for EmbeddingVariable on GPU/CPU devices.
- Support save and restore parameters in HBM storage of EmbeddingVariable.
- Add GPU apply ops of Adam, AdamAsync, AdamW for multi-tier storage of EmbeddingVariable.
- Place output of KvResourceIsInitializedOp on CPU.
- Support GroupEmbedding to pack multiple feature columns lookup/apply.
- Optimize HBM-DRAM storage of EmbeddingVariable with intra parallelism and fine-grained synchronization.
- Support not saving filtered features when saving checkpoint.
- Support localized mode fusion in GroupEmbedding.
- Support preventing preloaded IDs from being eliminated in multi-tier embedding's cache.
- Support COMPACT layout to reduce memory cost in EmbeddingVariable.
- Support ignoring version when restoring Embedding Variable with TF_EV_RESET_VERSION.
- Support restoring custom dimension of Embedding Variable.
- Support merging and deleting checkpoint files of SSDHash storage.
- Optimize SmartStage by prefetching LookupID op.
- Decouple SmartStage and forward backward joint optimization.
- Support Sample-awared Graph Compression.
- Support CUDA multi-stream for Stage.
- Improve Device Placement Optimization performance.
- Add TensorBufferPutGpuOp to improve SmartStage performance on GPU device.
- Enable EVAllocator by default.
- Optimize executor to eliminate sort latency and reduce memory.
- Add list of GPU Ops for forward backward joint optimization.
- Optimize FusedBatchNormGrad on CPU device.
- Support NCHW format input for FusedBatchNormOp.
- Use new asynchronous evaluation in Eigen for FusedBatchNorm.
- Add exponential_avg_factor attribute to FusedBatchNorm* kernels.
- Change AliUniqueGPU kernel implementation to AsyncOpKernel.
- Support computing exponential running mean and variance in fused_batch_norm.
- Upgrade oneDNN to 2.7 and ACL to 22.08.
- Use global cache for MKL primitives for ARM.
- Disable optimizing batch norm as sequence of post ops on AArch64.
- Restore re-mapper and fix BatchMatmul and FactoryKeyCreator under AArch64 + ACL.
- Speed up SOK with GroupEmbedding, which fuses multiple feature columns together.
- Support setting up GPU config in SessionGroup.
- Support using multiple GPUs in SessionGroup.
- Support processor to set multi-stream option.
- Add flag to disable per_session_host_allocator.
- Run init_op on all sessions in session_group.
- Skip invalid request and return error msg to client.
- Use graph signature as the key to get runtime executor.
- Optimize compile time for kv_variable_ops module.
- Add dataset headers for custom op compilation.
- Add docker images for ARM based on ubuntu22.04.
- Upgrade Bazel version to 3.7.2.
- Do not call cudaSetDevice on invisible GPU in CreateDevices.
- Fix concurrency issue caused by not referencing the same lock in multi-tier storage.
- Fix bug in parsing input request.
- Fix the bug when saving empty GPU EmbeddingVariable.
- Fix the concurrency issue between feature eviction and embedding lookup in asynchronous training.
- Support Parquet Dataset in list of models.
- Add GPU benchmark in Modelzoo.
- Unify the usage of price column in Taobao dataset.
- Add DeepFM model with int64 categorical id input.
- Update dataset URL in Modelzoo.
- Add checkpoint meta transformer tool.
- Add list of user documents in English.
More details of features: https://deeprec.readthedocs.io/zh/latest/
alideeprec/deeprec-release:deeprec2302-cpu-py38-ubuntu20.04
alideeprec/deeprec-release:deeprec2302-gpu-py38-cu116-ubuntu20.04
- Add flag to disable per_session_host_allocator.
- Fix bug of saving EmbeddingVariable with int32 type.
- Revert "Support fused batchnorm with any ndims and axis".
alideeprec/deeprec-release:deeprec2212u1-cpu-py38-ubuntu20.04
alideeprec/deeprec-release:deeprec2212u1-gpu-py38-cu116-ubuntu20.04
- Refactor GPU Embedding Variable storage layer.
- Remove TENSORFLOW_USE_GPU_EV macro from embedding storage layer.
- Refactor KvResourceGather GPU Op.
- Add embedding memory pool for HBM storage of EmbeddingVariable.
- Refine the code of HBM storage of EmbeddingVariable.
- Reuse the embedding files on SSD generated by EmbeddingVariable when saving and restoring checkpoint.
- Integrate single HBM EV into multi_tier EmbeddingVariable.
- Filter out the 'stream_id' attribute in arithmetic optimizer.
- Add SimplifyEmbeddingLookupStage optimizer.
- Add ForwardBackwardJointOptimizationPass to eliminate duplicate hash in Gather and Apply ops for Embedding Variable.
- Add allocators for each stream_executor in multi-context mode.
- Set multi-gpu devices in session_group mode.
- Add blacklist and whitelist to JitCugraph.
- Optimize CPU EVAllocator to speed up EmbeddingVariable performance.
- Support independent GPU host allocator for each session.
- Add GPU EVAllocator to speed up EmbeddingVariable on GPU.
- Add GPU implementation for Unique.
- Support indices type with DT_INT64 in sparse segment ops.
- Add gradient implementations for the following ops: SplitV, ConcatV2, BroadcastTo, Tile, GatherV2, Cumsum, Cast.
- Add C++ gradient op for Select.
- Add gradient implementation for SelectV2.
- Add C++ gradient op for Atan2.
- Add C++ gradients for UnsortedSegmentMin/Max/Sum.
- Refactor KvSparseApplyAdagrad GPU Op.
- Merge NV-TF r1.15.5+22.12.
- Update seastar to control SDT by macro HAVE_SDT.
- Update WORKER_DEFAULT_CORE_NUM(8) and PS_DEFAULT_CORE_NUM(2) default values.
- Support multi-model deployment in SessionGroup.
- Support user-configured cpusets for each session_group.
- Support processor to load multiple models.
- Support GPU compilation in processor.
- Optimize independent GPU host allocator for each session.
- Update systemtap to a valid source address.
- Support making DeepRec's ABI compatible with TensorFlow 1.15 by configuring TF_API_COMPATIBLE_1150.
- Upgrade base docker images based on ubuntu20.04 and python3.8.10.
- Update pcre-8.44 URLs.
- Remove systemtap from third party and related dependency.
- Enable gcc optimization option -O3 by default.
- Fix function definition issue in processor.
- Fix the hang when inserting an item into lockless hash map.
- Fix EmbeddingVariable hang/coredump in GPU mode.
- Fix memory leak in CUDA multi-stream when merging compute and copy streams.
- Fix wrong session devices order.
- Fix hwloc build error on alinux3.
- Fix double clear resource_mgr bug when using SessionGroup.
- Fix wrong Shrink that causes unit tests to fail randomly.
- Fix the conflict when the EmbeddingVariable and embedding fusion are enabled simultaneously.
- Fix EmbeddingVarGPU coredump in destructor.
More details of features: https://deeprec.readthedocs.io/zh/latest/
alideeprec/deeprec-release:deeprec2212-cpu-py38-ubuntu20.04
alideeprec/deeprec-release:deeprec2212-gpu-py38-cu116-ubuntu20.04
- Support HBM-DRAM-SSD storage in EmbeddingVariable multi-tier storage.
- Support multi-tier EmbeddingVariable initialized based on frequency when restoring model.
- Support looking up the location of IDs of EmbeddingVariable.
- Support kv_initialized_op for GPU Embedding Variable.
- Support restore compatibility of EmbeddingVariable using init_from_proto.
- Improve performance of apply/gather ops for EmbeddingVariable.
- Add Eviction Manager in EmbeddingVariable Multi-tier storage.
- Add unified thread pool for cache of Multi-tier storage in EmbeddingVariable.
- Save frequencies and versions of features in SSDHash and LevelDB storage of EmbeddingVariable.
- Avoid invalid eviction when using HBM-DRAM storage of EmbeddingVariable.
- Prevent access to uninitialized data when using EmbeddingVariable.
- Optimize Async EmbeddingLookup by placement optimization.
- Place VarHandlerOp to Compute main graph for SmartStage.
- Support independent thread pool for stage subgraph to avoid thread contention.
- Implement device placement optimization.
- Support CUDA Graph execution by adding CUDA Graph mode session.
- Support CUDA Graph execution in JIT mode.
- Support intra task cost estimate in CostModel in Executor.
- Support tf.stream and tf.colocate python API for CUDA multi-stream.
- Support embedding subgraphs partition policy when using CUDA multi-stream.
- Optimize CUDA multi-stream by merging copy stream into compute stream.
- Add a list of Quantized* and _MklQuantized* ops.
- Implement GPU version of SparseFillEmptyRows.
- Implement C version of spin_lock to support multiple architectures.
- Upgrade the oneDNN version to v2.7.
- Support distributed training using SOK based on EmbeddingVariable.
- Add NETWORK_MAX_CONNECTION_TIMEOUT to make connection timeout configurable in StarServer.
- Upgrade the SOK version to v4.2.
- Add TF_NEED_PARQUET_DATASET to enable ParquetDataset.
- Optimize embedding lookup performance by disabling feature filter when serving.
- Optimize error code for user when parsing request or response fails.
- Support independent update model threadpool to avoid performance jitter.
- Add MaskNet Model.
- Add PLE Model.
- Support variable type BF16 in DCN model.
- Fix tf.nn.embedding_lookup interface bug and session hang bug when enabling async embedding.
- Fix warmup failure when user sets warmup file path.
- Fix build failure in ev_allocator.cc and hash.cc on ARM.
- Fix build failure in arrow when building on ARM.
- Fix redefined error in NEON header file for ARM.
- Fix _mm_malloc build failure in sparsehash on ARM.
- Fix warmup failure when using session_group.
- Fix build save graph bug when creating partitioned EmbeddingVariable in feature_column API.
- Fix the colocation error when using EmbeddingVariable in distribution.
- Fix HostNameToIp failure by replacing gethostbyname with getaddrinfo in StarServer.
More details of features: https://deeprec.readthedocs.io/zh/latest/
alideeprec/deeprec-release:deeprec2210-cpu-py36-ubuntu18.04
alideeprec/deeprec-release:deeprec2210-gpu-py36-cu116-ubuntu18.04
Duyi-Wang, Locke, shijieliu, Honglin Zhu, chenxujun, GosTraight2020, LALBJ, Nanno
- Fix the issue that a list of Quantized* and _MklQuantized* ops are not found.
- Fix build save graph bug when creating partitioned EmbeddingVariable in feature_column API.
- Fix warmup failure when user sets warmup file path.
- Fix warmup failure when using session_group.
alideeprec/deeprec-release:deeprec2208u1-cpu-py36-ubuntu18.04
alideeprec/deeprec-release:deeprec2208u1-gpu-py36-cu116-ubuntu18.04
- Multi-tier storage of EmbeddingVariable supports HBM; add async compactor in SSDHashKV.
- Support tf.feature_column.shard_embedding_columns, SequenceCategoricalColumn and WeightedCategoricalColumn API for EmbeddingVariable.
- Support save and restore checkpoint of GPU EmbeddingVariable.
- Support EmbeddingVariable OpKernel with REAL_NUMBER_TYPES.
- Support user defined default_value for feature filter.
- Support feature column API for MultiHash.
- Add FP32 fused l2 normalize op and grad op and tf.nn.fused_layer_normalize API.
- Add Concat+Cast fusion ops.
- Optimize SmartStage performance on GPU.
- Add macro to control to optimize mkl_layout_pass.
- Support asynchronous embedding lookup.
- CPUAllocator, avoid multiple threads cleaning up at the same time.
- Support independent intra threadpool for each session and pinning the intra threadpool to a cpuset.
- Support multi-stream with virtual device.
- Implement ApplyFtrl, ResourceApplyFtrl, ApplyFtrlV2 and ResourceApplyFtrlV2 GPU kernels.
- Optimize BatchMatmul GPU kernel.
- Integrate cuBLASlt into backend and use BlasLtMatmul in batch_matmul_op.
- Support GPU fusion of matmul+bias+(activation).
- Merge NV-TF r1.15.5+22.06.
- Support AdamW optimizer for EmbeddingVariable.
- Support asynchronously restoring EmbeddingVariable from checkpoint.
- Support EmbeddingVariable in init_from_checkpoint.
- Add go/java/python client SDK and demo.
- Support GPU multi-streams in SessionGroup.
- Support independent inter thread pool for each session in SessionGroup.
- Support multi-tiered Embedding.
- Support immutable EmbeddingVariable.
- Add low precision optimization tool, support BF16, FP16, INT8 for savedmodel and checkpoint.
- Add embedding variable quantization.
- Optimize DIN's BF16 performance.
- Add DCN & DCNv2 models and MLPerf recommendation benchmark.
- Add detail information for RecvTensor in timeline.
- Add ubuntu 22.04 dockerfile and images with gcc11.2 and python3.8.6.
- Add cuda11.2, cuda11.4, cuda11.6, cuda11.7 docker images and use cuda 11.6 as default GPU image.
- Update default TF_CUDA_COMPUTE_CAPABILITIES to 6.0,6.1,7.0,7.5,8.0.
- Upgrade bazel version to 0.26.1.
- Support for building DeepRec on ROCm2.10.0.
- Fix build failures with gcc11 & gcc12.
- StarServer, remove user packet split to avoid multiple user packet out-of-order issue.
- Fix the 'NodeIsInGpu is not declare' issue.
- Fix the placement bug of worker devices during distributed training in Modelzoo.
- Fix out of range issue for BiasAddGrad op when AVX512 is enabled.
- Avoid loading invalid model when updating model in serving.
More details of features: https://deeprec.readthedocs.io/zh/latest/
alideeprec/deeprec-release:deeprec2208-cpu-py36-ubuntu18.04
alideeprec/deeprec-release:deeprec2208-gpu-py36-cu116-ubuntu18.04
- Multi-tier storage of EmbeddingVariable, add SSD_HashKV, which performs better than LevelDB.
- Support GPU EmbeddingVariable, whose gather/apply ops are placed on GPU.
- Add user API to record frequency and version for EmbeddingVariable.
- Add Embedding Fusion ops for CPU/GPU.
- Optimize SmartStage performance on GPU.
- Executor, support cost-based and critical path ops first.
- GPUAllocator, support CUDA malloc async allocator. (requires CUDA >= 11.2)
- CPUAllocator, automatic memory allocation policy generation.
- PMEMAllocator, optimize allocator and add statistics.
- Implement SparseReshape, SparseApplyAdam, SparseApplyAdagrad, SparseApplyFtrl, ApplyAdamAsync, SparseApplyAdamAsync, KvSparseApplyAdamAsync GPU kernels.
- Optimize UnSortedSegment on CPU.
- Upgrade oneDNN to v2.6.
- ParquetDataset, add parquet dataset which can reduce storage and improve performance.
- Asynchronous restore EmbeddingVariable from checkpoint.
- SessionGroup, greatly improve QPS and RT in inference.
- Add models SimpleMultiTask, ESMM, DBMTL, MMoE, BST.
- Support for mapping of operators and real thread ids in timeline.
- Fix EmbeddingVariable core dump when EmbeddingVariable only has primary embedding value.
- Fix abnormal behavior in L2-norm calculation.
- Fix save checkpoint issue when using LevelDB in EmbeddingVariable.
- Fix failure to delete old checkpoint when using incremental checkpoint.
- Fix build failure with CUDA 11.6.
More details of features: https://deeprec.readthedocs.io/zh/latest/
alideeprec/deeprec-release:deeprec2206-cpu-py36-ubuntu18.04
alideeprec/deeprec-release:deeprec2206-gpu-py36-cu110-ubuntu18.04
- Fix saving checkpoint issue when using EmbeddingVariable. (DeepRec-AI#167)
- Fix inputs from different frames issue when using auto graph fusion. (DeepRec-AI#144)
- Fix embedding_lookup_sparse graph issue.
alideeprec/deeprec-release:deeprec2204u1-cpu-py36-ubuntu18.04
alideeprec/deeprec-release:deeprec2204u1-gpu-py36-cu110-ubuntu18.04
- Support hybrid storage of EmbeddingVariable (DRAM, PMEM, LevelDB).
- Support memory-continuous storage of multi-slot EmbeddingVariable.
- Optimize beta1_power and beta2_power slots of EmbeddingVariable.
- Support restoring frequency of features in EmbeddingVariable.
- Integrate SOK in DeepRec.
- Auto Graph Fusion, support float32/int32/int64 type for select fusion.
- SmartStage, fix the bug that the graph contains a cycle when SmartStage optimization is enabled.
- GPUTensorPoolAllocator, which reduces GPU memory usage and improves performance.
- PMEMAllocator, support allocation in persistent memory.
- Optimize AdamOptimizer performance.
- Change fused MatMul layout type and number of threads for small size inputs.
- KafkaGroupIODataset, support consumer rebalance.
- Support dump incremental graph info.
- Add serving module (ODL processor), which supports Online Deep Learning (ODL).
More details of features: https://deeprec.readthedocs.io/zh/latest/
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2204-cpu-py36-ubuntu18.04
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2204-gpu-py36-cu110-ubuntu18.04
Some users reported issues when using Embedding Variable, such as DeepRec-AI#167. The bug is fixed in r1.15.5-deeprec2204u1.
This is the first release of DeepRec. DeepRec has super large-scale distributed training capability, supporting trillion-sample model training and 100-billion-scale embedding processing. For sparse model scenarios, in-depth performance optimization has been conducted across CPU and GPU platforms.
- Embedding Variable (including feature eviction and feature filter)
- Dynamic Dimension Embedding Variable
- Adaptive Embedding
- Multi-Hash Variable
- GRPC++
- StarServer
- Synchronous Training - SOK
- Auto Micro Batch
- Auto Graph Fusion
- Embedding Fusion
- Smart Stage
- CPU Memory Optimization
- GPU Memory Optimization
- GPU Virtual Memory
- Incremental Checkpoint
- AdamAsync Optimizer
- AdagradDecay Optimizer
- Operators Optimization: CPU/GPU optimization of tens of ops, including Unique, Gather, DynamicStitch, BiasAdd, Select, Transpose, SparseSegmentReduction, where, DynamicPartition and SparseConcat.
- Support oneDNN & BFloat16 (BF16) & Advanced Matrix Extensions (AMX)
- Support TensorFloat-32 (TF32)
- WorkQueue
- KafkaDataset
- KafkaGroupIODataset
More details of features: DeepRec Document
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2201-cpu-py36-ubuntu18.04
registry.cn-shanghai.aliyuncs.com/pai-dlc-share/deeprec-training:deeprec2201-gpu-py36-cu110-ubuntu18.04