description
#deep_learning_training_workloads #cluster_scheduler #system_interpretability #ML_for_System #decision_tree #generalized_additive_model

Lucid: A non-intrusive, scalable and interpretable scheduler for deep learning training jobs

Meta Info

Presented at ASPLOS 2023.

Authors: Qinghao Hu (NTU & Shanghai AI Lab), Meng Zhang (NTU), Peng Sun (SenseTime), Yonggang Wen, Tianwei Zhang (NTU).

Code: https://github.com/S-Lab-System-Group/Lucid

Understanding the paper

TL;DRs

This paper presents Lucid, a non-intrusive DL scheduler based on interpretable models.

Lucid has three parts: a two-dimensional optimized profiler that collects job metrics efficiently and gives timely feedback on jobs that need debugging; a packing strategy that avoids interference between packed jobs; and a resource allocator that orders jobs by estimated priority values and sharing scores.

Interpretable Models

Decision Tree (DT) for Packing Analyze Model
Generalized additive model with pairwise interactions (GA$$^2$$M) for Throughput Predict Model & Workload Estimate Model
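
To make the "interpretable" part concrete, here is a minimal sketch of what the two model families look like once trained. It is not Lucid's implementation: the feature names (`gpu_util`, `gpu_mem_util`, `batch_size`), the thresholds, the bin edges and every numeric value are made up for illustration. The point is only the structure: a decision tree is a short chain of readable threshold tests, and a GA$$^2$$M prediction is an intercept plus one lookup per feature plus a few pairwise terms, so each feature's contribution can be read off directly.

```python
from bisect import bisect_right


# Decision tree: a readable chain of threshold tests.
# Features and thresholds are hypothetical, not taken from the paper.
def can_pack(gpu_util: float, gpu_mem_util: float) -> bool:
    """Toy packing rule: share a GPU only if the job leaves headroom."""
    if gpu_util > 0.60:       # compute-heavy job: keep it exclusive
        return False
    if gpu_mem_util > 0.50:   # memory-heavy job: keep it exclusive
        return False
    return True


# GA2M-style model: intercept + sum of f_i(x_i) + sum of f_ij(x_i, x_j).
def binned(edges, values):
    """Piecewise-constant shape function; needs len(values) == len(edges) + 1.

    A value equal to an edge falls into the upper bin.
    """
    def f(x):
        return values[bisect_right(edges, x)]
    return f


INTERCEPT = 1.0  # made-up baseline prediction
SHAPES = {       # one made-up shape function per feature
    "gpu_util": binned([0.3, 0.6], [0.0, -0.1, -0.3]),
    "batch_size": binned([32, 128], [0.0, 0.05, 0.1]),
}


def pair_term(gpu_util: float, batch_size: float) -> float:
    """Made-up pairwise interaction between the two features."""
    return -0.05 if gpu_util > 0.6 and batch_size > 128 else 0.0


def explain(x: dict) -> dict:
    """Per-term contributions; this breakdown is what makes the model readable."""
    contrib = {name: f(x[name]) for name, f in SHAPES.items()}
    contrib["gpu_util x batch_size"] = pair_term(x["gpu_util"], x["batch_size"])
    return contrib


def predict(x: dict) -> float:
    """Prediction is the intercept plus the sum of all term contributions."""
    return INTERCEPT + sum(explain(x).values())


if __name__ == "__main__":
    job = {"gpu_util": 0.7, "batch_size": 256}
    print(can_pack(job["gpu_util"], gpu_mem_util=0.2))
    print(explain(job))
    print(predict(job))
```

Calling `explain(job)` returns the contribution of each term separately, which is the kind of breakdown an operator can inspect when a prediction looks wrong. A black-box model only gives the final number.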
