Data Processing

Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement (ATC 2024) [Paper] [Code]
- ETH & Google
Disaggregating ML Input Data Processing at Scale (SoCC 2023)
- Google & ETH
GoldMiner: Elastic Scaling of Training Data Pre-Processing Pipelines for Deep Learning (SIGMOD 2023) [Paper]
- Alibaba & PKU
A case for disaggregation of ML data processing (arXiv 2210.14826) [Paper]
- Google & ETH
- tf.data service: disaggregates data preprocessing from ML computation.
Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training (ISCA 2022) [Paper]
- Meta
- Industry track
- DSI: Meta's data storage and ingestion pipeline
