Presented in arxiv:2210.00093. (Accepted in NSDI 2023)
Authors: Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, Aditya Akella, University of Wisconsin-Madison, The University of Texas at Austin
Code: https://github.com/uw-mad-dash/shockwave
This paper presents Shockwave, an efficient and fair scheduler for DL training jobs with elastic resource requirements.
It extends classic market theory from static settings to dynamic settings to co-optimize efficiency and fairness; utilizes stochastic dynamic programming to handle dynamic changes.
- Dynamically adjusting model structure or hyper-parameters (e.g., batch size) can significantly accelerate training without sacrificing accuracy.
- Existing schedulers fail to provide fairness and degrade system efficiency when the training throughput changes over time under dynamic adaptation.
- Treat dynamic adaptation as a part of the user’s program that cannot be modified.
- Based on the measurement: automatically perform dynamic adaptation can hurt training accuracy.
- The market needs to know utility values in the future to compute market equilibrium; but it is hard to predict utility in the future since dynamic adaptations in jobs are non-deterministically triggered.
- Solution: Use Bayesian statistics to predict the patterns.
- Solving the dynamic market equilibrium for an (infinitely) long time horizon is difficult and impractical.
- Solution: Only plan the schedule for a finite length window (e.g., 30-60 mins).
- Baselines
- OSSP (Open Shop Scheduling)
- AlloX
- Themis
- MSS (Max-Sum-Throughput)
- GandivaFair
- Pollux
- Improve makespan by 1.3x and fairness by 2x with existing fair schedulers.