Oobleck
Resilient Distributed Training Framework

Oobleck is a large-model training framework that supports fast fault recovery using pipeline templates.

It is the first training framework that realizes:

  • Dynamic reconfiguration: Oobleck can reconfigure the distributed training configuration after failures without restarting.
  • Pipeline template instantiation: Oobleck pre-generates a set of pipeline templates, then combines pipelines instantiated from them to form a distributed execution plan. After a failure, the same set of templates is reused and a different combination of pipelines is instantiated.
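
To illustrate the idea of reusing a fixed set of templates, here is a minimal sketch in plain Python. It is not Oobleck's planner: it assumes each template is characterized only by the number of nodes it spans, and the function name `candidate_plans` and the template sizes are hypothetical. It enumerates which combinations of templates can cover the available nodes before and after one node is lost.

```python
from itertools import combinations_with_replacement


def candidate_plans(template_sizes, num_nodes):
    """Return every multiset of template sizes whose node counts sum to num_nodes.

    Hypothetical helper for illustration only; a real planner would also
    weigh throughput, memory, and batch distribution.
    """
    plans = []
    max_pipelines = num_nodes // min(template_sizes)
    for count in range(1, max_pipelines + 1):
        for combo in combinations_with_replacement(sorted(template_sizes), count):
            if sum(combo) == num_nodes:
                plans.append(combo)
    return plans


# Assumed template set: pipelines spanning 2, 3, or 4 nodes.
template_sizes = [2, 3, 4]

# Plans that cover 7 healthy nodes.
before_failure = candidate_plans(template_sizes, 7)

# After one node fails, the same templates are recombined for 6 nodes.
after_failure = candidate_plans(template_sizes, 6)

print(before_failure)
print(after_failure)
```

The template set stays fixed across the failure; only the combination of instantiated pipelines changes.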

Getting Started

Install

Use pip to install Oobleck:

pip install oobleck

Oobleck relies on cornstarch for pipeline templates and on Colossal-AI as the training backend. Optionally, install apex, xformers, and flash-attn to boost throughput (follow the instructions in each project's README).

Run

Please refer to this README.

Cluster Management

Oobleck provides a command line interface (CLI) that manages the cluster. Use oobleck to access the master agent:

$ oobleck --ip <master_ip> --port <master_port> <command> <command_options>

The master port is printed to stdout by the running master service:

| INFO     | __main__:serve:430 - Running master service on port 45145

Currently, you can list the agents and request the graceful termination of an agent:

$ oobleck --ip <master_ip> --port <master_port> get_agent_list
=== Agents ===
[0] IP: node1:10000 Status: up (device indices: 0,1)
[1] IP: node1:10000 Status: up (device indices: 2,3)
[2] IP: node2:10000 Status: up (device indices: 0,1)
[3] IP: node2:10000 Status: up (device indices: 2,3)
==============
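
If you want to script against this output, the listing can be parsed with a few lines of Python. This is a sketch that assumes the text format shown above stays exactly as printed; the `Agent` class and `parse_agent_list` function are hypothetical helpers, not part of Oobleck.

```python
import re
from dataclasses import dataclass

# Matches lines such as:
# [0] IP: node1:10000 Status: up (device indices: 0,1)
AGENT_LINE = re.compile(
    r"\[(?P<index>\d+)\] IP: (?P<address>\S+) Status: (?P<status>\S+) "
    r"\(device indices: (?P<devices>[\d,]+)\)"
)


@dataclass
class Agent:
    index: int
    address: str
    status: str
    devices: list


def parse_agent_list(output):
    """Parse the text printed by get_agent_list into Agent records."""
    agents = []
    for line in output.splitlines():
        match = AGENT_LINE.search(line)
        if match is None:
            continue  # skip the "=== Agents ===" header and the footer
        agents.append(
            Agent(
                index=int(match["index"]),
                address=match["address"],
                status=match["status"],
                devices=[int(d) for d in match["devices"].split(",")],
            )
        )
    return agents


sample = """=== Agents ===
[0] IP: node1:10000 Status: up (device indices: 0,1)
[1] IP: node1:10000 Status: up (device indices: 2,3)
[2] IP: node2:10000 Status: up (device indices: 0,1)
[3] IP: node2:10000 Status: up (device indices: 2,3)
=============="""

agents = parse_agent_list(sample)
for agent in agents:
    print(agent.index, agent.address, agent.status, agent.devices)
```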

$ oobleck --ip <master_ip> --port <master_port> kill_agent --agent_index 2
| INFO     | __main__:KillAgent:340 - Terminating agent 2 on node2:10000

Citation

@inproceedings{oobleck-sosp23,
    title     = {Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates},
    author    = {Jang, Insu and Yang, Zhenning and Zhang, Zhen and Jin, Xin and Chowdhury, Mosharaf},
    booktitle = {ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP '23)},
    year      = {2023},
}
