Skip to content

Latest commit

 

History

History
45 lines (30 loc) · 2.29 KB

File metadata and controls

45 lines (30 loc) · 2.29 KB

Ray: A distributed framework for emerging AI applications

Metadata

Presented in OSDI 2018.

Authors: Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica (University of California, Berkeley)

Code: https://github.com/ray-project/ray

Understanding the paper

TL;DR

This paper presents Ray, a distributed system tailed for RL applications. It provides both an actor-based (stateful) and a task-parallel (stateless) programming abstraction. It employs a distributed scheduler and a distributed fault-tolerant store to manage the system's control state.

Motivation

  • A framework for RL applications must provide efficient support for training, serving, and simulation. All three of these workloads are tightly coupled in a single application.
  • Fault tolerance really matters with AI applications. 1) Easy development; 2) Reproducibility; 3) Allow cheap resources like spot instances.

Technical details

  • Global control store (GCS)
    • Maintain the entire control state of the system.
    • Maintain fault tolerance and low latency.
  • Two-level hierarchical scheduler (bottom-up)
    • Consist of a global scheduler and per-node local schedulers.
    • Tasks created at a node are submitted first to the node's local scheduler.
    • Global scheduler considers each node's load and task's constraints.
  • In-memory distributed storage system
    • Store the inputs and outputs of every task.
    • Via shared memory; Apache Arrow.
    • Recover any objects through lineage re-execution.

Evaluation

All experiments were run on AWS.

Experience

  • The authors believe that centralizing control state will be a key design component of future distributed systems.
  • From Ion Stoica — Spark, Ray, and Enterprise Open Source
    • Spark abstracts away parallels; Ray exposes parallels.
    • Ray fundamentally is an RPC framework, plus an actor framework, plus an object store which allows you to efficiently pass the data between different functions and actors by reference.