UCC Virtual F2F Meeting Information

UCC Virtual F2F Meeting (May 11-13th and May 18-19th)

Meeting artifacts
Registration
Topics
Agenda
Telecon Info

Artifacts

UCC API slides (Manjunath Gorentla Venkata)

Registration

Please fill in the form here

Agenda

Day1

Meeting Notes

Monday, May 11th, 2020

Time	Topic	Telecon
7:00 - 7:30 AM PT	Kickoff and Opening Remarks (Gilad Shainer)
7:30 - 8:15 AM PT	Highlights of UCC API (Review) (Manju)
8:15 - 8:30 AM PT	Break
8:30 - 9:30 AM PT	Teams API (Manju; All/Discussion)
9:30 - 9:45 AM PT	Break
9:45 - 11:00 AM PT	Endpoints / Collective Operations (Manju; All/Discussion)

Day_1_Notes

Participants

Manjunath Gorentla Venkata
Alex Margolin
Sergey Lebedev
Valentin Petrov
Rami Nudelman
Baker, Matthew
Tony
Gilad Shainer
James S Dinan
Chambreau, Chris
Gil Bloch
Dmitry Gladkov
Arturo
Pavel Shamis
Ravi, Naveen
Raffenetti, Kenneth J.
Akshay Venkatesh

Discussion

Initialization
- Have a flexible infrastructure for initialization and selection of library functionality
- Discuss final options during component arch discussion
- UCC config interface to follow UCS config.
- Rename ucc_config to ucc_params to reflect UCX style
Context
- Do we need sync model config on context create?
  - Yes for enabling RDMA based implementations
  - The drawback - might have to create more contexts (sync and non-sync)
    - Yes, might require multiple objects but not necessarily multiple resources
    - Explore an explicit device abstraction and the ability to express affinity, and propose it to the WG
Team Creation
- Need to revisit endpoints (as this seems to be implementation specific) after presentation from Alex
- Can we hide endpoint from interface and enable agnostic way of creating teams
Collective Operations
- Need to define the mapping of programming model (src, dst) to UCC (src, dst) for cases like MPI broadcast, which has only one set of buffers.
- Is there a need for multiple outstanding persistent collective operations of the same type? No use case yet.
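
The (src, dst) mapping question can be made concrete with a small sketch. This is only one candidate convention, not something the meeting decided: the root's buffer becomes the source and every other rank's buffer becomes the destination. The type `sketch_coll_bufs_t` and the function `sketch_map_bcast` are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: the (src, dst) pair a UCC-style collective might take. */
typedef struct {
    const void *src; /* data this rank contributes, or NULL */
    void       *dst; /* where this rank receives data, or NULL */
} sketch_coll_bufs_t;

/*
 * One candidate mapping (an assumption, not a WG decision) for a broadcast
 * in a programming model that exposes a single buffer: the root sends from
 * the buffer, every other rank receives into it.
 */
sketch_coll_bufs_t sketch_map_bcast(void *buffer, int rank, int root)
{
    sketch_coll_bufs_t bufs = { NULL, NULL };

    assert(buffer != NULL);
    if (rank == root) {
        bufs.src = buffer;
    } else {
        bufs.dst = buffer;
    }
    return bufs;
}
```

An alternative convention would set both src and dst to the same buffer on every rank (an in-place style); the sketch only shows that the choice has to be written down somewhere.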

Day2

Time	Topic	Telecon
7:00 - 7:45 AM PT	Topology Aware Collectives (Sameh)
7:45 - 8:00 AM PT	Break
8:00 - 8:45 AM PT	Collectives API - the Reactive alternative (Alex)
8:45 - 9:00 AM PT	Break
9:00 - 11:00 AM PT	Task and Plan API Discussion

Day_2_Notes

Manjunath Gorentla Venkata
Richard Graham
Sameh
Gil Bloch
Ravi, Naveen
Alex Margolin
Tony
Raffenetti, Kenneth J.
Sergey Lebedev
Rami Nudelman
Arturo
James Dinan
Pavel Shamis
Geoffroy
Valentin Petrov

Topology aware collectives

WG to sync with Sameh (IBM) about topology definition as we abstract topology, device, and affinity

Multiple-level API?

Option 1: Standardize ucc and ucc_mpi interfaces
Option 2: Standardize only ucc interfaces
Discussion on UCC base, UCC MPI

For now focus on UCC base and continue the discussion on UCC MPI in the working group
Option for UCC MPI (driver) - provide as a part of UCC project (example contrib directory)
(Alex correct this if needed)

Task API

Task API is useful (feedback from the WG)

To be considered for a later version of API (not the first version)
It is useful to address the use-cases that include
- computation + communication
- Pipelined protocols
- provide a use case for bundled collectives
- Propose Task API to the working group

Topology Information

What topology information to abstract and what to pass?

Capture distance between the various processes/threads that form the team/group
Capture distance between context (resource) and devices (GPU/CPU)
Where to pass this information: team creation or init?
AI for the working group: Propose an API that covers the above requirements

Endpoints

Endpoint in UCC is member_index in UCG
Move the endpoint to the team_config structure
Make endpoint an input
If no input is provided, the library will create the endpoints and make them available via the get_attrib interface
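
A minimal sketch of how the endpoint decisions above could look in C, assuming a UCX-style field mask to mark optional inputs. Every name here (`sketch_ep_team_config_t`, `sketch_ep_team_create`, `sketch_ep_team_get_attrib`, and the `library_ep` parameter) is hypothetical and exists only for illustration; none of them come from the UCC proposals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; not part of any published UCC API. */
enum { SKETCH_EP_TEAM_CONFIG_FIELD_EP = 1 << 0 };

typedef struct {
    uint64_t field_mask; /* which optional fields below are valid */
    uint64_t ep;         /* caller-chosen endpoint (member index) */
} sketch_ep_team_config_t;

typedef struct {
    uint64_t ep;           /* endpoint actually used by the team */
    int      ep_from_user; /* 1 if the caller supplied it */
} sketch_ep_team_t;

typedef struct {
    uint64_t ep;
} sketch_ep_team_attrib_t;

/*
 * library_ep stands in for whatever value a real library would derive
 * internally when the caller does not provide an endpoint.
 */
int sketch_ep_team_create(const sketch_ep_team_config_t *cfg,
                          uint64_t library_ep, sketch_ep_team_t *team)
{
    assert(team != NULL);
    if (cfg != NULL && (cfg->field_mask & SKETCH_EP_TEAM_CONFIG_FIELD_EP)) {
        team->ep           = cfg->ep; /* endpoint is an input */
        team->ep_from_user = 1;
    } else {
        team->ep           = library_ep; /* library creates the endpoint */
        team->ep_from_user = 0;
    }
    return 0;
}

/* Either way, the caller can read the endpoint back through the attributes. */
int sketch_ep_team_get_attrib(const sketch_ep_team_t *team,
                              sketch_ep_team_attrib_t *attrib)
{
    assert(team != NULL && attrib != NULL);
    attrib->ep = team->ep;
    return 0;
}
```

With the mask bit set, the caller's endpoint is used unchanged; with it cleared, the query returns the value the library picked.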

Day3

Wednesday, May 13th, 2020

Time	Topic	Telecon
7:00 - 8:00 AM PT	GPUs/DL (NVIDIA/IBM/All)
8:00 - 8:45 AM PT	Multirail Discussion (Sergey; All)
8:45 - 9:00 AM PT	Break
9:00 - 9:30 AM PT	Algorithm Selection Models (All)
9:30 - 10:00 AM PT	Memory registration and Global Symmetric Memory (All)
10:00 - 11:00 AM PT	Document on differences and plan to converge

Day_3_Notes

Manjunath Gorentla Venkata
Sameh
Arturo
Valentin Petrov
Devendar Bureddy
Sergey Lebedev
Rami Nudelman
Alex Margolin
James Dinan
Sreeram Potluri
Pavel Shamis
Raffenetti, Kenneth
Geoffroy Vallee
Gil Bloch

UCC and GPUs / DL/AI (NVIDIA/IBM/All)

Goals
- UCC should support GPU-aware MPI collectives
- UCC should be cognizant of DL/AI requirements and should design interfaces for them
  - (participants were in consensus)
Relevant use cases/interfaces besides MPI and OpenSHMEM
- Single process/thread utilizing multiple GPUs
- Aggregate or bundled collectives - the motivation is to reduce the launch overhead.
  - A series of collectives launched
  - NCCL addresses this with ncclGroupStart/End interfaces
Missing abstractions from the UCC interface proposals
- Memory type: The library should know the type of memory passed to the collective operation.
  - Host memory, device memory
  - Where to abstract this information?
    - Passing this information to the team creation operation should be enough. The user might have to create a team that is specific to memory type.
  - Passing this information to each invocation is useful, but there is no use case yet.
  - The abstraction should support other accelerators and memory types (CUDA, ROCM, Smart NIC, DRAM, HBM)
- Device abstraction and affinity
- How do you handle the GPU device context?
  - Can this be abstracted onto the UCC context?
- How do you handle CUDA streams?
Next steps / Questions
- Design for missing abstractions
- Ping AMD and IBM
- Error handling / Managing asynchronous errors
  - More details required
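
The team-level memory type option discussed above can be sketched as follows. This is an illustration under stated assumptions, not a proposed interface: the enum values follow the examples in the notes, and the names `sketch_mem_type_t`, `sketch_mem_team_create`, and `sketch_pick_team` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical memory types, following the examples listed in the notes. */
typedef enum {
    SKETCH_MEM_HOST,
    SKETCH_MEM_CUDA,
    SKETCH_MEM_ROCM,
    SKETCH_MEM_LAST
} sketch_mem_type_t;

typedef struct {
    sketch_mem_type_t mem_type; /* memory type of all buffers used on the team */
} sketch_mem_team_params_t;

typedef struct {
    sketch_mem_type_t mem_type;
} sketch_mem_team_t;

/* The memory type is fixed once, at team creation. */
int sketch_mem_team_create(const sketch_mem_team_params_t *params,
                           sketch_mem_team_t *team)
{
    assert(params != NULL && team != NULL);
    if ((unsigned)params->mem_type >= (unsigned)SKETCH_MEM_LAST) {
        return -1; /* unknown memory type */
    }
    team->mem_type = params->mem_type;
    return 0;
}

/*
 * Because each team is tied to one memory type, the caller keeps one team
 * per type and picks the one that matches the buffers of a given collective.
 */
const sketch_mem_team_t *sketch_pick_team(const sketch_mem_team_t *teams,
                                          size_t n, sketch_mem_type_t buf_type)
{
    for (size_t i = 0; i < n; i++) {
        if (teams[i].mem_type == buf_type) {
            return &teams[i];
        }
    }
    return NULL;
}
```

The sketch makes the trade-off visible: the collective call itself stays free of memory type arguments, at the cost of the user managing one team per memory type.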

Multirail support

Goal
- UCC should support multirail collectives (participants were in consensus)
Lessons from Sergey’s implementation

Multirail support can be implemented “easily” if we have basic collectives expressed as components and these components can be composed to implement the UCC API.
Hierarchical collectives are implemented like this in XCCL

Missing abstractions
- The team create operation should accept multiple UCC contexts (resources)
- The information about the distance between the contexts (assuming contexts are mapped as one context per HCA)

Topology Infrastructure

The topology information is needed for multirail, UCG’s group create operation, and GPU-aware collectives
What topology information is needed?
- Distance between the participants of the team in the team create operation
- Distance between the network resources (HCAs) and thread invoking the team create operation
- Distance between the GPUs and thread invoking the team create operation
Who should implement it? UCC or an external library?
- Can we pass this information from the external libraries (hwloc, ompi)? If so, how to abstract it?
- Can the library implement it?
  - This is an expensive operation and a huge undertaking.
Next steps:
- Prototype interfaces and work with IBM to understand the pitfalls.
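
One way to prototype the "pass it in from an external library" option is a distance callback that the collectives library calls without knowing where the numbers come from. The sketch below assumes such a callback; the names `sketch_topo_t`, `sketch_distance_fn`, `sketch_matrix_distance`, and `sketch_nearest_peer` are hypothetical, and the matrix-backed provider merely stands in for a real source such as hwloc.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical callback: distance between team members a and b. */
typedef unsigned (*sketch_distance_fn)(const void *provider, int a, int b);

typedef struct {
    sketch_distance_fn distance; /* supplied by the external library */
    const void        *provider; /* opaque state owned by that library */
    int                size;     /* number of team members */
} sketch_topo_t;

/* Example provider: a dense n x n matrix, standing in for a real source. */
typedef struct {
    const unsigned *dist;
    int             n;
} sketch_matrix_t;

unsigned sketch_matrix_distance(const void *provider, int a, int b)
{
    const sketch_matrix_t *m = provider;
    return m->dist[(size_t)a * (size_t)m->n + (size_t)b];
}

/* Returns the closest other member, or -1 if there is none. Ties go to the lowest index. */
int sketch_nearest_peer(const sketch_topo_t *topo, int rank)
{
    unsigned best = UINT_MAX;
    int      peer = -1;

    assert(topo != NULL && rank >= 0 && rank < topo->size);
    for (int i = 0; i < topo->size; i++) {
        if (i == rank) {
            continue;
        }
        unsigned d = topo->distance(topo->provider, rank, i);
        if (d < best) {
            best = d;
            peer = i;
        }
    }
    return peer;
}
```

The same callback shape could also answer the HCA-to-thread and GPU-to-thread questions if the indices referred to devices instead of team members; that extension is not shown here.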

Algorithm Selection

HCOLL model
libcoll/Intel model
User query model
Adaptive model

A common thread for all the models is the selection attributes. The selection attributes can include algorithm type, message range, collective implementation type (XCCL, XUCG, hardware), and more.

Next steps:
- Define the selection attributes.
- In version 1.0, keep the selection interfaces internal rather than external. Gather experience, then make them public.
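
As a starting point for defining the selection attributes, the sketch below encodes the ones named above (collective, message range, implementation type, algorithm) as a table entry and selects the first entry that matches a call. All names (`sketch_select_entry_t`, `sketch_select`, the enum values) are hypothetical, and first-match is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SKETCH_COLL_ALLREDUCE, SKETCH_COLL_ALLTOALL } sketch_coll_t;
typedef enum { SKETCH_IMPL_XCCL, SKETCH_IMPL_XUCG, SKETCH_IMPL_HW } sketch_impl_t;

/* Hypothetical selection attributes for one table entry. */
typedef struct {
    sketch_coll_t coll;      /* collective the entry applies to */
    size_t        msg_min;   /* smallest message size covered, inclusive */
    size_t        msg_max;   /* largest message size covered, inclusive */
    sketch_impl_t impl;      /* collective implementation type */
    int           algorithm; /* implementation-defined algorithm id */
} sketch_select_entry_t;

/* Returns the first entry matching the collective and message size, or NULL. */
const sketch_select_entry_t *sketch_select(const sketch_select_entry_t *table,
                                           size_t n, sketch_coll_t coll,
                                           size_t msg_size)
{
    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (table[i].coll == coll && msg_size >= table[i].msg_min &&
            msg_size <= table[i].msg_max) {
            return &table[i];
        }
    }
    return NULL;
}
```

A table like this could be filled from a configuration file, from a user query, or by an adaptive tuner, which is why the attributes are the part the four models have in common.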

Day4

Monday, May 18th, 2020

Time	Topic	Telecon
7:00 - 7:45 AM PT	OMPI-X / ADAPT (George Bosilca/Talk)
7:45 - 8:00 AM PT	Break
8:00 - 9:00 AM PT	Component Architecture (Review for non-WG participants) (Alex/Val/Discussion)
9:00 - 9:30 AM PT	Memory registration and symmetric memory API (Manju; All; Discussion)
9:30 - 9:45 AM PT	Break
9:45 - 10:30 AM PT	Library initialization parameters
10:30 - 11:00 AM PT	Documentation / Code Structure

Day_4_Notes

Manjunath Gorentla Venkata
George
Arturo
Valentin Petrov
Sergey Lebedev
Rami Nudelman
Alex Margolin
Pavel Shamis
Raffenetti, Kenneth
Geoffroy Vallee
Tony

Version 1.0 of the component architecture (from Val’s presentation)

Component Architecture Overview
- Abstractions
  - Collective layer with multiple collective implementations (XCCL, XUCG, Hardware)
  - Basic collective layer with primitive collectives (p2p_collectives, SHARP)
  - P2P layer
  - Services layer
  - Resolves:
    - Addresses most of the component architecture requirements identified in the previous iteration, such as
      - Avoiding circular dependencies
      - Ability to provide a thin layer over hardware collectives
  - To address
    - Ability to share resources between multiple implementations. For example, sharing p2p (or SHARP) resources between XCCL and XUCG
    - Ability to choose multiple collective components (e.g., allreduce from XCCL and a2a from XUCG). Add a selection component that encompasses multiple collective implementations.
    - Ability to share and reuse code at the fine-grained level.
Next Steps
- Develop fine-grained component architecture for XCCL and XUCG
- Identify the components that can be shared
- Identify a way to share resources between different implementations
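
The selection component mentioned under "To address" can be sketched as a per-collective dispatch table filled from several components. The sketch assumes each component advertises a score per collective and that the highest score wins; this scoring scheme, the stub functions, and all names (`sketch_component_t`, `sketch_select_components`, `sketch_dispatch`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SKETCH_OP_ALLREDUCE, SKETCH_OP_ALLTOALL, SKETCH_OP_LAST } sketch_op_t;
typedef int (*sketch_op_fn)(void *buf, size_t len);

/* Hypothetical collective component: one optional entry point per collective. */
typedef struct {
    const char  *name;
    sketch_op_fn ops[SKETCH_OP_LAST];   /* NULL when not implemented */
    int          score[SKETCH_OP_LAST]; /* higher score wins */
} sketch_component_t;

/* Result of selection: which component owns each collective. */
typedef struct {
    const sketch_component_t *owner[SKETCH_OP_LAST];
} sketch_selection_t;

/* For every collective, pick the highest-scoring component that implements it. */
void sketch_select_components(const sketch_component_t *const *comps, size_t n,
                              sketch_selection_t *sel)
{
    assert(sel != NULL && (comps != NULL || n == 0));
    for (int op = 0; op < SKETCH_OP_LAST; op++) {
        sel->owner[op] = NULL;
        for (size_t i = 0; i < n; i++) {
            const sketch_component_t *c = comps[i];
            if (c->ops[op] == NULL) {
                continue;
            }
            if (sel->owner[op] == NULL ||
                c->score[op] > sel->owner[op]->score[op]) {
                sel->owner[op] = c;
            }
        }
    }
}

/* Route a collective to its selected component; -1 if nobody implements it. */
int sketch_dispatch(const sketch_selection_t *sel, sketch_op_t op,
                    void *buf, size_t len)
{
    assert(sel != NULL && (int)op >= 0 && (int)op < SKETCH_OP_LAST);
    if (sel->owner[op] == NULL) {
        return -1;
    }
    return sel->owner[op]->ops[op](buf, len);
}

/* Stubs standing in for real implementations; the return value tags the callee. */
int sketch_xccl_allreduce(void *buf, size_t len) { (void)buf; (void)len; return 1; }
int sketch_xucg_allreduce(void *buf, size_t len) { (void)buf; (void)len; return 2; }
int sketch_xucg_alltoall(void *buf, size_t len)  { (void)buf; (void)len; return 3; }
```

Registering an XCCL-like component that scores higher on allreduce next to an XUCG-like component that also offers alltoall yields the mixed selection described above: allreduce from one, a2a from the other.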

Day5

Tuesday, May 19th, 2020 (Hackathon Mode)

Time	Topic	Telecon
7:00 - 8:30 AM PT	Flesh out the component architecture
8:30 - 8:45 AM PT
8:45 - 10:30 AM PT	Review and flesh out the spec document
10:30 - 11:00 AM PT	Next Steps

Topics

(Laundry List)

Kickoff (Gilad)
Highlights of UCC API (Review for non-WG participants) (Manju)
OMPI-X / ADAPT (George Bosilca/Talk)
Requirements from the AI Users/Deep Learning/GPUs (NVIDIA; All)
API Discussion (in case not completed in WG)
- Library Initialization
- Resource Abstraction (Contexts)
- Teams API (Manju; All/Discussion)
- Endpoints (Manju; All/Discussion)
- Collective Operations (Manju; All/Discussion)
- Task API (Manju; All/Discussion)
- Alternative Control-path API (Initialization and communicator creation) (Alex; All/Discussion)
- Alternative Data-path API (Starting and progressing collectives) (Alex; All/Discussion)
Component Architecture (Review for non-WG participants)(Alex/Val/Discussion)
Flesh out UCC.H Header (All)
Unit tests and CI infrastructure (?)
Documentation (doxygen ?)(?)
Multirail Support (Sergey)
Topology-aware collectives (Sameh/Talk)
Memory registration (Discussion)
Algorithm selection (Discussion)
