Add initial support for neuron devices #3049

Merged
merged 10 commits into from
Feb 29, 2024

bfontain
Contributor

What does this PR do?

Add initial support for Neuron devices to Composer using the PyTorch/XLA backend.

This PR fixes a few issues:

  • Renames is_tpu_installed to is_xla_installed and changes it to store its state in a global (instead of importing and then deleting torch_xla, which could delete a needed import).
  • Adds a new device type 'neuron' for training on Neuron instances.
  • Fixes support for AMP on XLA.
  • Removes the use of xm.optimizer_step, which introduces a second gradient reduction because the model is already auto-wrapped in DDP/FSDP. xm.optimizer_step does three things: 1) gradient reduction, 2) applying the optimizer, and 3) xm.mark_step. We already call xm.mark_step in the microbatching loop, so removing the one here improves performance slightly.
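The reasoning behind dropping xm.optimizer_step can be illustrated with a toy model. The sketch below does not use torch_xla or Composer; `RecordingOptimizer`, `xla_helper_step`, `plain_step`, and `simulate_batch` are hypothetical names invented for illustration, and the event log only mirrors the three actions this PR description attributes to xm.optimizer_step. It also assumes, for simplicity, that the DDP/FSDP wrapper reduces gradients on every microbatch, which may not match the real synchronization schedule.

```python
from typing import Callable, List


class RecordingOptimizer:
    """Hypothetical stand-in for a torch optimizer that logs each call."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def step(self) -> None:
        self.log.append("optimizer.step")


def xla_helper_step(optimizer: RecordingOptimizer,
                    reduce_gradients: Callable[[], None],
                    mark_step: Callable[[], None]) -> None:
    """Toy model of the three actions the PR attributes to xm.optimizer_step."""
    reduce_gradients()  # 1) gradient reduction across replicas
    optimizer.step()    # 2) apply the optimizer
    mark_step()         # 3) xm.mark_step


def plain_step(optimizer: RecordingOptimizer) -> None:
    """After the change: only apply the optimizer."""
    optimizer.step()


def simulate_batch(use_xla_helper: bool, num_microbatches: int = 2) -> List[str]:
    """Return the ordered list of events for one simulated training batch."""
    log: List[str] = []
    optimizer = RecordingOptimizer(log)
    for _ in range(num_microbatches):
        # Assumption: the DDP/FSDP wrapper reduces gradients during backward.
        log.append("wrapper.reduce_gradients")
        # The microbatching loop already marks a step.
        log.append("mark_step")
    if use_xla_helper:
        xla_helper_step(
            optimizer,
            reduce_gradients=lambda: log.append("helper.reduce_gradients"),
            mark_step=lambda: log.append("mark_step"),
        )
    else:
        plain_step(optimizer)
    return log


if __name__ == "__main__":
    print("with helper:   ", simulate_batch(use_xla_helper=True))
    print("without helper:", simulate_batch(use_xla_helper=False))
```

In this model the helper path logs one extra gradient reduction and one extra mark_step per batch compared with the plain path, which is the redundancy the last bullet describes.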

Tested using a small model on 2 Neuron cores via torchrun. Support for the Composer CLI is to come.

What issue(s) does this change relate to?

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@bfontain bfontain requested a review from j316chuck February 22, 2024 21:14
Contributor

@j316chuck j316chuck left a comment

Left a few comments. Thanks for this! Overall LGTM barring nits

composer/callbacks/speed_monitor.py (resolved)
composer/devices/device_neuron.py (outdated, resolved)
composer/trainer/trainer.py (resolved)
@bfontain bfontain changed the title from "add initial support for neuron devices" to "Add initial support for neuron devices" Feb 23, 2024
@bfontain bfontain requested a review from j316chuck February 23, 2024 16:44
Contributor

@j316chuck j316chuck left a comment

Thanks for adding this device @bfontain

@bfontain bfontain enabled auto-merge (squash) February 29, 2024 18:29
@bfontain bfontain merged commit cee4523 into mosaicml:dev Feb 29, 2024
14 checks passed
j316chuck pushed a commit that referenced this pull request Mar 29, 2024
* add initial support for neuron devices

* add missing device_neuron.py

* remove unused imports

* update neuron flops

* formatting

* formatting

* formatting

* documentation

* formatting
j316chuck pushed a commit that referenced this pull request May 16, 2024
2 participants