Presented in OSDI 2018.
Authors: Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica (University of California, Berkeley)
Code: https://github.com/ray-project/ray
This paper presents Ray, a distributed system tailored for RL applications. It provides both an actor-based (stateful) and a task-parallel (stateless) programming abstraction, and it employs a distributed scheduler and a distributed, fault-tolerant store to manage the system's control state.
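To make the two abstractions concrete, here is a minimal sketch using Ray's public Python API; the `square` function and `Counter` actor are illustrative names, not from the paper:

```python
import ray

ray.init()

# Task-parallel (stateless): a remote function returns a future (ObjectRef).
@ray.remote
def square(x):
    return x * x

# Actor-based (stateful): a remote class whose methods run in a dedicated worker.
@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))            # 1
```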
- A framework for RL applications must efficiently support training, serving, and simulation, since all three workloads are tightly coupled within a single application.
- Fault tolerance matters for AI applications: 1) it eases development; 2) it aids reproducibility; 3) it allows the use of cheap resources such as spot instances.
- Global control store (GCS) (a conceptual sketch follows this list)
  - Maintains the entire control state of the system.
  - Fault tolerant and low latency.
- Two-level hierarchical (bottom-up) scheduler (sketched after this list)
  - Consists of a global scheduler and per-node local schedulers.
  - Tasks created at a node are submitted first to the node's local scheduler; only if the node is overloaded or cannot satisfy the task's requirements is the task forwarded to the global scheduler.
  - The global scheduler considers each node's load and the task's constraints.
- In-memory distributed storage system (example after this list)
  - Stores the inputs and outputs of every task.
  - Objects are shared via shared memory and serialized with Apache Arrow.
  - Lost objects are recovered through lineage re-execution.
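To illustrate the role of the GCS, here is a hypothetical sketch of a sharded key-value store with pub/sub notifications, which is the function the GCS serves so that schedulers and object stores can stay stateless; this is not Ray's actual implementation:

```python
from collections import defaultdict

class GlobalControlStore:
    """Hypothetical sketch: sharded key-value store with pub/sub."""

    def __init__(self, num_shards=3):
        self.shards = [dict() for _ in range(num_shards)]
        self.subscribers = defaultdict(list)

    def _shard(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value):
        self._shard(key)[key] = value
        for callback in self.subscribers[key]:
            callback(value)          # notify e.g. a scheduler waiting on this key

    def get(self, key):
        return self._shard(key).get(key)

    def subscribe(self, key, callback):
        self.subscribers[key].append(callback)
```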
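The bottom-up scheduling decision can be sketched as follows; the classes and fields (`Task`, `Node`, `capacity`, `free_cpus`) are hypothetical and only illustrate the control flow described in the paper, not Ray's code:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    cpus: int

@dataclass
class Node:
    name: str
    free_cpus: int
    capacity: int = 4
    queue: list = field(default_factory=list)

def schedule_locally(node, task):
    node.queue.append(task)
    node.free_cpus -= task.cpus

def submit(task, node, cluster):
    # Fast path: the local scheduler keeps the task if the node is not
    # overloaded and has the resources the task asks for.
    if len(node.queue) < node.capacity and task.cpus <= node.free_cpus:
        schedule_locally(node, task)
    else:
        # Spill over: the global scheduler picks the least-loaded node
        # that satisfies the task's constraints.
        candidates = [n for n in cluster if task.cpus <= n.free_cpus]
        target = min(candidates, key=lambda n: len(n.queue))
        schedule_locally(target, task)

cluster = [Node("node-0", free_cpus=1), Node("node-1", free_cpus=8)]
submit(Task("train", cpus=4), cluster[0], cluster)   # spills over to node-1
```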
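And here is a small example of the object store from the application's point of view, using Ray's public `ray.put`/`ray.get` API; a large array placed in the store is read from shared memory by tasks on the same node rather than copied:

```python
import numpy as np
import ray

ray.init()

array = np.zeros(10**6)
ref = ray.put(array)        # serialize into the node's shared-memory object store

@ray.remote
def total(x):
    # On the same node, x is backed by the shared-memory object, not a copy.
    return float(x.sum())

print(ray.get(total.remote(ref)))  # 0.0
```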
All experiments were run on AWS.
- The authors believe that centralizing control state will be a key design component of future distributed systems.
- From Ion Stoica — Spark, Ray, and Enterprise Open Source
- Spark abstracts away parallelism; Ray exposes parallelism.
- Ray is fundamentally an RPC framework, plus an actor framework, plus an object store that lets you efficiently pass data between different functions and actors by reference (see the sketch below).
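A minimal sketch of the pass-by-reference point: an ObjectRef returned by one task can be handed directly to another task (or actor method) without fetching the value first, and Ray resolves the reference wherever the downstream task runs. The `load` and `process` functions are illustrative names:

```python
import ray

ray.init()

@ray.remote
def load():
    return list(range(1000))

@ray.remote
def process(data):
    return sum(data)

ref = load.remote()           # an ObjectRef, not the data itself
result = process.remote(ref)  # pass the reference; Ray fetches the value for process
print(ray.get(result))        # 499500
```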