fix: upload non-conflicting files for sharded checkpointing [MD-298] #9598

Merged
merged 2 commits into from
Jul 15, 2024

azhou-determined
Contributor

@azhou-determined azhou-determined commented Jul 1, 2024

Ticket

Description

Make sharded checkpoint uploads via store_path check for file conflicts across workers before uploading.
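
As a rough illustration of what a cross-worker conflict check involves, here is a minimal sketch. It is not the code in this PR: `find_conflicts`, the `files_by_rank` mapping, and the trailing-slash convention for directories are all hypothetical names and assumptions made for the example. It assumes each worker has already reported the relative paths it intends to upload.

```python
from collections import defaultdict
from typing import Dict, List


def find_conflicts(files_by_rank: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Map each file path claimed by more than one rank to the ranks claiming it.

    Hypothetical helper for illustration only; not part of the Determined API.
    """
    ranks_by_path: Dict[str, List[int]] = defaultdict(list)
    for rank in sorted(files_by_rank):
        # De-duplicate within a rank so a repeated entry is not a false conflict.
        for path in set(files_by_rank[rank]):
            # Assumption: directories end with "/" and may be shared freely.
            if not path.endswith("/"):
                ranks_by_path[path].append(rank)
    return {path: ranks for path, ranks in ranks_by_path.items() if len(ranks) > 1}


if __name__ == "__main__":
    listing = {
        0: ["shared/", "shared/rank0.pt", "metadata.json"],
        1: ["shared/", "shared/rank1.pt", "metadata.json"],
    }
    # Only metadata.json is written by both ranks.
    print(find_conflicts(listing))
```

In a real distributed job the per-rank listings would have to be gathered onto one worker first; how the conflicting paths are then resolved (error out, or pick a single uploader) is a policy choice this sketch leaves open.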

Test Plan

Run a distributed Core API example that writes sharded checkpoints from all workers:

import logging
import pathlib

import determined as det


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format=det.LOG_FORMAT)
    with det.core.init(
        distributed=det.core.DistributedContext.from_torch_distributed()
    ) as core_context:
        for i in range(100):
            if i % 10 != 0:
                continue
            with core_context.checkpoint.store_path(
                metadata={"steps_completed": i}, shard=True
            ) as (ckpt_path, uuid):
                # Each worker touches an identically named file in the checkpoint.
                ckpt_file = ckpt_path / f"test_file-{i}"
                ckpt_file.touch()
                logging.info(f"Saving {ckpt_file} to checkpoint {uuid} at {ckpt_path}")

Experiment config:

name: core-api-test
entrypoint: python3 -m determined.launch.torch_distributed python3 core_api.py

searcher:
   name: single
   metric: x
   max_length: 1

resources:
   slots_per_trial: 2
max_restarts: 0

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@azhou-determined azhou-determined requested a review from a team as a code owner July 1, 2024 23:59
@cla-bot cla-bot bot added the cla-signed label Jul 1, 2024

netlify bot commented Jul 1, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 93f1863
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/669080630b56410008c86ad4
😎 Deploy Preview https://deploy-preview-9598--determined-ui.netlify.app

codecov bot commented Jul 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.02%. Comparing base (7fab87b) to head (93f1863).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9598   +/-   ##
=======================================
  Coverage   53.01%   53.02%           
=======================================
  Files        1255     1255           
  Lines      152884   152960   +76     
  Branches     3233     3234    +1     
=======================================
+ Hits        81053    81106   +53     
- Misses      71680    71703   +23     
  Partials      151      151           
Flag Coverage Δ
backend 44.19% <ø> (-0.05%) ⬇️
harness 72.85% <100.00%> (+0.08%) ⬆️
web 51.37% <ø> (ø)

Files Coverage Δ
harness/determined/common/storage/base.py 93.38% <100.00%> (ø)
harness/determined/common/storage/cloud.py 100.00% <100.00%> (ø)
harness/determined/common/storage/shared.py 82.22% <100.00%> (ø)
harness/determined/core/_checkpoint.py 95.45% <100.00%> (+0.99%) ⬆️
harness/tests/core/test_checkpoint.py 100.00% <100.00%> (ø)

... and 3 files with indirect coverage changes

Contributor

@ioga ioga left a comment

lgtm

@azhou-determined azhou-determined merged commit 3663c5b into main Jul 15, 2024
81 of 94 checks passed
@azhou-determined azhou-determined deleted the ckpt-conflicts branch July 15, 2024 21:09