Disable checkpoint conversion inside AutoResume #10645

hemildesai · 2024-09-26T22:46:05Z

What does this PR do?

Converting checkpoints inside AutoResume is not reliable in multi-process or multi-node scenarios, so it is not feasible to support there. This PR disables checkpoint conversion inside AutoResume and requires users to pass a NeMo-based checkpoint.
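
For illustration only, the behavior described above could be sketched as a guard that refuses to convert and accepts only a checkpoint directory that already exists. The function name `resolve_restore_path` and the `hf://` prefix are hypothetical and are not taken from `nemo/lightning/resume.py`; the actual check in this PR may look quite different.

```python
from pathlib import Path


def resolve_restore_path(path: str) -> Path:
    """Hypothetical guard: return an existing checkpoint directory, never convert."""
    # Assumption: sources that would need conversion carry an "hf://" prefix.
    if path.startswith("hf://"):
        raise ValueError(
            f"{path!r} needs conversion, which is not done while resuming. "
            "Convert it once in a separate step and pass the converted checkpoint."
        )
    ckpt_dir = Path(path)
    # Only a directory that is already on disk is accepted.
    if not ckpt_dir.is_dir():
        raise FileNotFoundError(f"No checkpoint directory at {ckpt_dir}")
    return ckpt_dir
```

With a guard like this, every process either finds the same ready-made checkpoint or fails immediately, and no process writes a converted checkpoint during resume.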

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Related to # (issue)

hemildesai · 2024-10-02T17:45:13Z

nemo/lightning/megatron_parallel.py

- )
+ # Mcore DistributedDataParallel has to be called with grad. Normally this call is redundant, but for
+ # PEFT with num_sanity_val_steps > 0 this is necessary.
+ with torch.enable_grad():


Should there be a check to only use this if it's PEFT?

thanks for the comment, revised!
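
A minimal sketch of the mechanism the code comment above relies on, using only `torch.no_grad` and `torch.enable_grad`: the inner context re-enables gradient tracking inside an outer block that disabled it. The outer `no_grad` block stands in for a validation or sanity-check loop; that framing is an assumption, and the tensors are toy values rather than NeMo or Mcore objects.

```python
import torch

x = torch.ones(2, requires_grad=True)

# Outer block: gradient tracking is switched off, as an evaluation loop might do.
with torch.no_grad():
    y = x * 2  # computed without building an autograd graph

    # Inner block: tracking is switched back on for this call only.
    with torch.enable_grad():
        z = x * 2  # computed with an autograd graph

print(y.requires_grad, z.requires_grad)
```

The PEFT-only check asked for in the review would then decide whether the inner `enable_grad` block is entered at all.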

akoumpa · 2024-10-02T19:32:39Z

@hemildesai this is great. Have you run a two-node job to test resuming from an HF checkpoint? What's the process for the user? I imagine this will be a frequent request, so I want to make sure it's as frictionless as possible.

I did a quick pass but will revisit later today. Thanks.

hemildesai requested review from cuichenx and akoumpa September 26, 2024 22:46

github-advanced-security bot found potential problems Sep 26, 2024

nemo/lightning/resume.py: Fixed

hemildesai force-pushed the hemil/revamp-finetuning-recipes branch from 4fc4d89 to 8f92c11 Compare September 30, 2024 21:56

hemildesai marked this pull request as ready for review October 1, 2024 15:57

hemildesai and others added 4 commits October 1, 2024 16:30

Disable checkpoint conversion inside AutoResume

8b29797

Signed-off-by: Hemil Desai <hemild@nvidia.com>

Apply isort and black reformatting

b8dfe04

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

Update resume docstrings

ee4af52

Signed-off-by: Hemil Desai <hemild@nvidia.com>

fix

a896007

Signed-off-by: Hemil Desai <hemild@nvidia.com>

hemildesai force-pushed the hemil/revamp-finetuning-recipes branch from 756d6f3 to a896007 Compare October 1, 2024 23:30

cuichenx and others added 2 commits October 2, 2024 13:40

add default finetuning recipe and refactor llama3 8b recipe

57c1e60

Signed-off-by: Chen Cui <chcui@nvidia.com>

Apply isort and black reformatting

51c0b5f

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

hemildesai commented Oct 2, 2024

address comment

d9bd80d

Signed-off-by: Chen Cui <chcui@nvidia.com>
