Tensorboard FileExistsError on SLURM when training multi-node with 1 process per GPU #570

Open
Lauler opened this issue Apr 30, 2024 · 5 comments

Lauler commented Apr 30, 2024

Levanter's documentation recommends one process per GPU rather than one process per node when training on multiple GPUs. The (untested) example script for launching a multi-node SLURM job likewise sets --ntasks-per-node equal to the number of GPUs per node.

Training with 1 process per GPU while using TensorBoard logging, however, crashes the run because multiple processes try to write to the same log directory (truncated traceback):

Traceback (most recent call last):
  File "/leonardo_scratch/fast/EUHPC_D07_027/scandinavian-lm-leonardo/faton/levanter/venvs/levanter/lib/python3.10/site-packages/tensorboardX/record_writer.py", line 47, in directory_check
    factory = REGISTERED_FACTORIES[prefix]
KeyError: '../logs/avxbg33f'

[...]

  File "/leonardo/prod/spack/03/ccsdeploy/spack_deploy/envs/cineca-ai-3/._view/i3cjegj2k2ukaywc3gz4zqdjxxwynnw2/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '../logs/avxbg33f'
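
For context, this looks like the usual create-the-same-directory race between processes: each process checks for and then creates the same run directory, and every process that loses the race hits FileExistsError from os.makedirs. A minimal sketch of the same collision outside Levanter (the path is made up, purely for illustration):

import os
import multiprocessing as mp

RUN_DIR = "../logs/example_run_id"  # made-up path standing in for logdir/run_id

def create_run_dir(path):
    # Every launched process tries to create the same run directory.
    # os.makedirs without exist_ok=True raises FileExistsError for every
    # process after the first one (the same failure as in the traceback above).
    os.makedirs(path)

if __name__ == "__main__":
    procs = [mp.Process(target=create_run_dir, args=(RUN_DIR,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()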

I'm not sure whether training with 1 process per GPU works with wandb, as our compute nodes do not have internet access (a fairly common situation for HPC compute nodes). For this reason we use TensorBoard.

Training with 1 process per node works fine, however.

Thought I'd open an issue in case anyone else wants to train with Levanter on GPUs under SLURM: set --ntasks-per-node=1 and it should work.

dlwh (Member) commented May 1, 2024

Good to know. Probably we just need to set the paths more carefully? Our internal SLURM cluster does have wandb access, so we don't test TensorBoard there as thoroughly as we should. Could you try just adding jax.process_index() to the path here?

dir_to_write = os.path.join(dir_to_write, run_id)

Lauler (Author) commented May 2, 2024

Adding the process index to the path allows training to run without errors with --ntasks-per-node equal to the number of GPUs per node (though it writes one log directory per task):

dir_to_write = os.path.join(dir_to_write, run_id, str(jax.process_index()))

I'm not sure why the same error doesn't occur for multi-node training with --ntasks-per-node=1. Is run_id somehow only set on one of the nodes in that scenario?

One solution could be

if jax.process_index() == 0:
    dir_to_write = os.path.join(dir_to_write, run_id)

though I'm not sure whether it interferes with the other launch configuration (--ntasks-per-node=1). I'll check again tomorrow.

dlwh (Member) commented May 28, 2024

@Lauler did you ever check on this?

Lauler (Author) commented Jun 6, 2024

I didn't manage to get it to work with

if jax.process_index() == 0:
    dir_to_write = os.path.join(dir_to_write, run_id)

It only results in one process writing to the run_id dir, while the rest of the processes try to write directly to the original dir_to_write logdir before it is modified.

Moving more of the method's logic into the if-statement, so that the other processes are prevented from writing logs, instead results in the Tracker abstract class complaining about missing methods / NoneType on some processes.

I switched over to wandb because I realized it can be run in offline mode. I didn't spend any more time troubleshooting TensorBoard, and have no need for it anymore.
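
For reference, this is just standard wandb offline mode rather than anything Levanter-specific. A minimal sketch (the project name is made up):

import os

# Force offline mode for every wandb run started from this process.
os.environ["WANDB_MODE"] = "offline"

import wandb

# Or pass the mode explicitly if you control the wandb.init call yourself.
run = wandb.init(project="levanter-test", mode="offline")  # made-up project name
run.log({"train/loss": 1.23})
run.finish()

# Offline runs are written under ./wandb/offline-run-*; they can be uploaded
# later from a machine with internet access via `wandb sync wandb/offline-run-*`.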

Lauler (Author) commented Jun 6, 2024

I did a quick test and it works when adding a process_index check to the CompositeTracker class, though I have a low degree of confidence that my quick hack doesn't break something else. There should be a cleaner solution.

I changed the init of the TensorboardConfig class to this:

    def init(self, run_id: Optional[str]) -> TensorboardTracker:
        dir_to_write = self.logdir
        if run_id is not None:
            # Only rank 0 creates the run directory and the SummaryWriter.
            if jax.process_index() == 0:
                dir_to_write = os.path.join(dir_to_write, run_id)

                pylogger.info(f"Writing Tensorboard logs to {dir_to_write}")

                from tensorboardX import SummaryWriter  # noqa: F811

                writer = SummaryWriter(
                    dir_to_write,
                    comment=self.comment,
                    purge_step=self.purge_step,
                    max_queue=self.max_queue,
                    flush_secs=self.flush_secs,
                    filename_suffix=self.filename_suffix,
                    write_to_disk=self.write_to_disk,
                )

                return TensorboardTracker(writer)
        # Note: non-zero ranks (and run_id=None) fall through and implicitly return None.

and CompositeTracker to

class CompositeTracker(Tracker):
    # Quick hack: every logging call below is gated on jax.process_index() == 0
    # so that only one process ever writes through the underlying trackers.
    def __init__(self, loggers: List[Tracker]):
        self.loggers = loggers

    def log_hyperparameters(self, hparams: dict[str, Any]):
        for tracker in self.loggers:
            if jax.process_index() == 0:
                tracker.log_hyperparameters(hparams)

    def log(self, metrics: dict[str, Any], *, step, commit=None):
        for tracker in self.loggers:
            if jax.process_index() == 0:
                tracker.log(metrics, step=step, commit=commit)

    def log_summary(self, metrics: dict[str, Any]):
        for tracker in self.loggers:
            if jax.process_index() == 0:
                tracker.log_summary(metrics)

    def log_artifact(
        self, artifact_path, *, name: Optional[str] = None, type: Optional[str] = None
    ):
        for tracker in self.loggers:
            if jax.process_index() == 0:
                tracker.log_artifact(artifact_path, name=name, type=type)
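
A cleaner direction might be to return a do-nothing tracker on non-zero ranks instead of returning None, so every process still ends up with a valid Tracker. This is only a sketch against the four methods visible above; I haven't checked whether Levanter's actual Tracker interface requires more than these:

class NoopTracker(Tracker):
    """Tracker that silently discards everything; could be returned on non-zero ranks."""

    def log_hyperparameters(self, hparams: dict[str, Any]):
        pass

    def log(self, metrics: dict[str, Any], *, step, commit=None):
        pass

    def log_summary(self, metrics: dict[str, Any]):
        pass

    def log_artifact(
        self, artifact_path, *, name: Optional[str] = None, type: Optional[str] = None
    ):
        pass

# In TensorboardConfig.init, something along these lines (sketch):
#     if jax.process_index() != 0:
#         return NoopTracker()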
