Homepage: https://www.usenix.org/conference/nsdi24
Paper list: https://www.usenix.org/conference/nsdi24/technical-sessions
- Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer [Paper]
- Experience in designing and operating the software infrastructure that keeps TPUv4 supercomputers running at scale.
- Autothrottle: A Practical Bi-Level Approach to Resource Management for SLO-Targeted Microservices [Paper] [Slides] [Code]
- USTC & ETH & MSR
- Minimize the CPU allocation of microservice applications while meeting their latency SLOs.
- Service-level (low overhead & fast reaction) vs. Application-level (global visibility)
- Captains (service-level): control based on throttle ratio target; collect data every 100ms, adjust allocation every 1s.
- Tower (application-level): determine the best throttle targets for Captains to achieve; online learning (contextual bandit algorithm); one step per minute, each step runs in ~100ms.
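- A minimal sketch of the two-level loop, with invented names and a random stand-in for the bandit (not the authors' code):
```python
# Hedged sketch: per-service Captains track a throttle-ratio target by nudging the CPU
# limit; a Tower periodically picks those targets (a contextual bandit in the paper).
import random

def read_throttle_ratio(service):
    """Placeholder for reading the fraction of throttled CPU periods (e.g., cgroup cpu.stat)."""
    return random.random()

class Captain:
    """Service-level controller: adjusts the CPU limit every second so the observed
    throttle ratio stays near the target set by the Tower."""
    def __init__(self, service, cpu_limit, step=0.1):
        self.service, self.cpu_limit, self.step = service, cpu_limit, step
        self.target = 0.1  # throttle-ratio target, overwritten by the Tower

    def control_step(self):
        ratio = read_throttle_ratio(self.service)
        if ratio > self.target:      # throttled too often -> grant more CPU
            self.cpu_limit += self.step
        else:                        # headroom available -> reclaim CPU
            self.cpu_limit = max(0.1, self.cpu_limit - self.step)
        return self.cpu_limit

class Tower:
    """Application-level controller: once per minute, picks per-service throttle targets.
    The paper uses a contextual bandit driven by SLO feedback; random choice here."""
    def choose_targets(self, captains, slo_met):
        for c in captains:
            c.target = random.choice([0.0, 0.05, 0.1, 0.2])

captains = [Captain("frontend", 2.0), Captain("cart", 1.0)]
Tower().choose_targets(captains, slo_met=True)
for c in captains:
    print(c.service, round(c.control_step(), 2))
```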
- CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters [Paper]
- MIT & UT-Austin
- Consider the communication patterns of different jobs when placing them on network links, so that jobs sharing a link interleave their communication phases rather than collide.
- LLM characterization
- LLM training
- Can't Be Late: Optimizing Spot Instance Savings under Deadlines [Paper] [Trace]
- UC Berkeley
- Outstanding Paper
- Characterization (e.g., availability, pricing, duration) of three-month-long spot availability traces on AWS.
- Uniform Progress: a policy that makes uniform progress towards the deadline by spreading the job's computation evenly over the time remaining until the deadline.
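- A toy decision rule in the spirit of Uniform Progress (my paraphrase, not the paper's exact algorithm; it ignores changeover overheads):
```python
# Keep completed work on or above the straight line from zero to the total work at the
# deadline; use spot capacity whenever it exists, pay for on-demand only when lagging.

def choose_instance(elapsed_hours, deadline_hours, done_work, total_work, spot_available):
    """Return 'spot', 'on-demand', or 'wait' for the next time slot."""
    required_by_now = total_work * (elapsed_hours / deadline_hours)
    behind = done_work < required_by_now
    if spot_available:
        return "spot"          # cheap capacity is available: always make progress on it
    if behind:
        return "on-demand"     # lagging the uniform-progress line: pay to catch up
    return "wait"              # ahead of schedule: wait for spot capacity to return

# 10 h of work, 24 h deadline, 6 h elapsed, 2 h done (< 2.5 h required), no spot capacity:
print(choose_instance(6, 24, 2, 10, spot_available=False))  # -> "on-demand"
```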
- Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances [Paper] [Slides] [Code]
- CUHK & ByteDance & CMU & UCLA & Microsoft
- Proactively adjust the parallelization strategy of a DNN training job in anticipation of future preemptions, so as to maximize preemption-aware throughput (i.e., liveput).
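- A back-of-the-envelope illustration of the liveput objective described above (all numbers invented):
```python
# Liveput ~= expected throughput of a parallelization plan over the predicted
# preemption outcomes for the next interval, rather than its current throughput.

def liveput(plan_throughput, survival_dist):
    """plan_throughput: {surviving instances: samples/s after the plan adapts}
    survival_dist: {surviving instances: predicted probability}"""
    return sum(p * plan_throughput.get(n, 0.0) for n, p in survival_dist.items())

plan_a = {32: 1000, 24: 700, 16: 450}   # degrades gracefully under preemptions
plan_b = {32: 1100, 24: 500, 16: 200}   # faster at full scale, but fragile
forecast = {32: 0.6, 24: 0.3, 16: 0.1}

print("plan A:", liveput(plan_a, forecast))  # ~855 -> preferred despite lower peak throughput
print("plan B:", liveput(plan_b, forecast))  # ~830
```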
- DISTMM: Accelerating Distributed Multimodal Model Training [Paper]
- Ohio State University & AWS
- Partition and parallelize the submodules of a multimodal model based on their modalities and redistribute the training data.
- Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models [Paper] [Slides]
- Adobe Research & UIUC
- Approximate caching: skip a number of initial denoising steps by reusing intermediate noise states created during prior image generations for similar prompts.
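- A minimal sketch of the retrieve-and-resume idea, with a toy prompt embedding and a stand-in denoiser (invented helpers, not the paper's system):
```python
import numpy as np

NUM_STEPS = 50        # full denoising schedule
SIM_THRESHOLD = 0.9   # only reuse a cached state for sufficiently similar prompts

def embed_prompt(prompt):
    """Toy stand-in for a real text encoder: a normalized character histogram."""
    v = np.zeros(128)
    for ch in prompt.lower():
        v[min(ord(ch), 127)] += 1
    return v / (np.linalg.norm(v) or 1.0)

cache = []  # (prompt embedding, intermediate latent, step to resume from)

def generate(prompt):
    q = embed_prompt(prompt)
    best = max(cache, key=lambda e: float(q @ e[0]), default=None)
    if best is not None and float(q @ best[0]) >= SIM_THRESHOLD:
        latent, start = best[1], best[2]                     # resume from the cached state
    else:
        latent, start = np.random.normal(size=(64, 64)), 0   # start from pure noise
    for step in range(start, NUM_STEPS):
        latent = latent * 0.98                               # stand-in for one denoising step
        if start == 0 and step == NUM_STEPS // 2:
            cache.append((q, latent.copy(), step + 1))       # save an intermediate state
    return latent

generate("a cat wearing a spacesuit")                        # cold: runs all 50 steps
generate("a cat wearing a space suit")                       # similar prompt: skips ~half of them
```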
- Accelerating Neural Recommendation Training with Embedding Scheduling [Paper] [Slides] [Code]
- HKUST
- Herald: an adaptive location-aware input allocator that determines where each embedding should be trained, plus an optimal communication plan generator that determines which embeddings need to be synchronized.
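- A toy rendering of the two decisions (input placement and synchronization), with invented structures; Herald's actual algorithms are more involved:
```python
from collections import defaultdict

def allocate(samples, worker_caches):
    """Send each sample to the worker whose embedding cache covers most of the IDs it touches.
    samples: list of sets of embedding IDs; worker_caches: {worker: set of cached IDs}."""
    return {i: max(worker_caches, key=lambda w: len(ids & worker_caches[w]))
            for i, ids in enumerate(samples)}

def comm_plan(samples, assignment):
    """Only embeddings read on more than one worker need to be synchronized this iteration."""
    users = defaultdict(set)
    for i, ids in enumerate(samples):
        for e in ids:
            users[e].add(assignment[i])
    return {e for e, workers in users.items() if len(workers) > 1}

samples = [{1, 2, 3}, {3, 7}, {7, 8}]
caches = {"gpu0": {1, 2, 3}, "gpu1": {4, 7, 8}}
placement = allocate(samples, caches)
print(placement, comm_plan(samples, placement))  # embedding 7 is the only one that must sync
```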
- Solving Max-Min Fair Resource Allocations Quickly on Large Graphs [Paper] [Slides] [Code]
- Microsoft & USC & Rice
- Soroush: Single-Shot Max-Min Fair Allocator.
- Deployed in Microsoft's WAN.
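- To pin down the objective Soroush optimizes, here is textbook water-filling for max-min fairness on a single shared link; the paper's contribution is computing such allocations on large multi-path graphs in one shot rather than by this kind of iteration:
```python
def max_min_fair(demands, capacity):
    """Max-min fair allocation of one link's capacity among flows with given demands."""
    alloc = {f: 0.0 for f in demands}
    unsatisfied = set(demands)
    remaining = capacity
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)        # split leftover capacity equally
        for f in list(unsatisfied):
            give = min(share, demands[f] - alloc[f])
            alloc[f] += give
            remaining -= give
            if alloc[f] >= demands[f] - 1e-9:       # demand met: stop growing this flow
                unsatisfied.remove(f)
    return alloc

print(max_min_fair({"a": 2, "b": 5, "c": 10}, capacity=12))
# {'a': 2.0, 'b': 5.0, 'c': 5.0}: no flow can grow without taking capacity
# from a flow with an equal or smaller share.
```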
- Crescent: Emulating Heterogeneous Production Network at Scale [Paper] [Slides]
- ByteDance & Cornell
- Crescent: ByteDance’s network emulation platform for preventing change-induced network incidents.
- Harmonic: Hardware-assisted RDMA Performance Isolation for Public Clouds [Paper]
- UIUC & Duke & Microsoft
- Harmonic: microarchitecture-resource-aware RDMA performance isolation, consisting of a programmable intelligent PCIe switch (prototyped on an FPGA) and an RDMA-friendly rate limiter.
- Understanding Routable PCIe Performance for Composable Infrastructures [Paper]
- UW-Madison & ZJU
- rPCIeBench: a software-hardware co-designed benchmarking framework to systematically characterize the routable PCIe fabric.