Presented at SoCC 2017.
Authors: Haoyu Zhang, Logan Stafman, Andrew Or, Michael J. Freedman, Princeton University
This paper presents SLAQ, a cluster scheduling framework that hosts multi-tenant approximate ML training jobs running on shared resources.
It is a fine-grained job-level scheduler: it focuses on the allocation of cluster resources between competing ML jobs, but does so at short time intervals (i.e., hundreds of milliseconds to a few seconds).
- Leverage the iterative nature of ML training algorithms.
- Collect quality and resource usage information from concurrent jobs.
- Generate quality-improvement predictions for future iterations.
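
The three steps above can be sketched as a toy scheduling loop. This is a minimal illustration under stated assumptions, not SLAQ's actual algorithm: the `Job` class, `predicted_gain`, and `allocate` are hypothetical names, the predictor simply assumes the most recent loss drop repeats in the next interval, and each extra worker is assumed to contribute half as much as the previous one.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    """Hypothetical record of one training job's loss after each iteration."""
    name: str
    losses: List[float] = field(default_factory=list)

    def record(self, loss: float) -> None:
        # Step 2: collect quality information as the job iterates.
        self.losses.append(loss)

    def predicted_gain(self, workers: int) -> float:
        """Step 3: predict the loss reduction for the next interval.

        Assumption (not from the paper): the last observed loss drop repeats,
        scaled by a diminishing-returns factor in the number of workers.
        """
        if workers <= 0 or len(self.losses) < 2:
            return 0.0
        last_drop = max(self.losses[-2] - self.losses[-1], 0.0)
        return last_drop * (1.0 - 0.5 ** workers)


def allocate(jobs: List[Job], total_workers: int) -> Dict[str, int]:
    """Greedily hand out workers one at a time to the job whose predicted
    quality improvement grows the most from receiving one more worker."""
    shares = {job.name: 0 for job in jobs}

    def marginal(job: Job) -> float:
        n = shares[job.name]
        return job.predicted_gain(n + 1) - job.predicted_gain(n)

    for _ in range(total_workers):
        best = max(jobs, key=marginal)
        shares[best.name] += 1
    return shares


if __name__ == "__main__":
    fast = Job("fast", [1.0, 0.5])      # still improving quickly
    slow = Job("slow", [0.30, 0.29])    # close to convergence
    print(allocate([fast, slow], total_workers=4))
```

With these made-up loss values, all four workers go to the job that is still improving quickly, whereas a fairness-based split would give each job two. A real scheduler would rerun such an allocation every scheduling interval as new loss values arrive.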
Existing job-level schedulers (YARN, Mesos, Apollo, Hadoop Capacity, Quincy, etc.) mostly allocate resources based on resource fairness or priorities.
For ML training jobs, these schedulers often make suboptimal scheduling decisions because they are agnostic to the progress (quality improvement) within each job.
The system is implemented within the Apache Spark framework and utilizes Spark MLlib.