Tensorboard FileExistsError on SLURM when training multi-node with 1 process per GPU #570

Open
Lauler opened this issue Apr 30, 2024 · 5 comments

Lauler commented Apr 30, 2024

Levanter's documentation recommends one process per GPU rather than one process per node when training on multiple GPUs. The (untested) example script for launching a multi-node SLURM job likewise sets --ntasks-per-node equal to the number of GPUs per node.

Training with 1 process per GPU while using TensorBoard logging, however, crashes the run because multiple processes try to write to the same log directory (truncated traceback):

Traceback (most recent call last):
  File "/leonardo_scratch/fast/EUHPC_D07_027/scandinavian-lm-leonardo/faton/levanter/venvs/levanter/lib/python3.10/site-packages/tensorboardX/record_writer.py", line 47, in directory_check
    factory = REGISTERED_FACTORIES[prefix]
KeyError: '../logs/avxbg33f'

[...]

  File "/leonardo/prod/spack/03/ccsdeploy/spack_deploy/envs/cineca-ai-3/._view/i3cjegj2k2ukaywc3gz4zqdjxxwynnw2/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '../logs/avxbg33f'
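
For context, this looks like the usual create-the-same-directory race between processes: each process checks for and then creates the same run directory, and every process that loses the race hits FileExistsError from os.makedirs. A minimal sketch of the same collision outside Levanter (the path is made up, purely for illustration):

import os
import multiprocessing as mp

RUN_DIR = "../logs/example_run_id"  # made-up path standing in for logdir/run_id

def create_run_dir(path):
    # Every launched process tries to create the same run directory.
    # os.makedirs without exist_ok=True raises FileExistsError for every
    # process after the first one (the same failure as in the traceback above).
    os.makedirs(path)

if __name__ == "__main__":
    procs = [mp.Process(target=create_run_dir, args=(RUN_DIR,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()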

I'm not sure whether training with 1 process per GPU works with wandb, as our compute nodes do not have internet access (a fairly common situation for HPC compute nodes). For this reason we use TensorBoard.

Training with 1 process per node works fine, however.

Thought I'd open an issue in case anyone else wants to train with Levanter on GPUs under SLURM: set --ntasks-per-node=1 and it should work.

dlwh (Member) commented May 1, 2024

Good to know. Probably we just need to set the paths more carefully? Our internal SLURM cluster does have wandb access, so we don't test TensorBoard there as thoroughly as we should. Could you try just adding jax.process_index() to the path here?

dir_to_write = os.path.join(dir_to_write, run_id)

Lauler (Author) commented May 2, 2024

Adding the process index to the path allows training to run without errors with --ntasks-per-node equal to the number of GPUs per node (though it writes one log directory per task):

dir_to_write = os.path.join(dir_to_write, run_id, str(jax.process_index()))

I'm not sure why the same error doesn't occur for multi-node training with --ntasks-per-node=1. Is run_id somehow only set on one of the nodes in that scenario?

One solution could be

if jax.process_index() == 0:
    dir_to_write = os.path.join(dir_to_write, run_id)

though I'm not sure whether it interferes with the other launch configuration (--ntasks-per-node=1). I'll check again tomorrow.

dlwh (Member) commented May 28, 2024

@Lauler did you ever check on this?

Lauler (Author) commented Jun 6, 2024

I didn't manage to get it to work with

if jax.process_index() == 0:
    dir_to_write = os.path.join(dir_to_write, run_id)

It only results in one process writing to the run_id dir, while the rest of the processes try to write directly to the original dir_to_write logdir before it is modified.

Moving more of the method's logic into the if-statement, so that the other processes are prevented from writing logs, instead results in the Tracker abstract class complaining about missing methods / NoneType on some processes.

I switched over to wandb because I realized it can be run in offline mode. I didn't spend any more time troubleshooting TensorBoard, and have no need for it anymore.
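
For reference, this is just standard wandb offline mode rather than anything Levanter-specific. A minimal sketch (the project name is made up):

import os

# Force offline mode for every wandb run started from this process.
os.environ["WANDB_MODE"] = "offline"

import wandb

# Or pass the mode explicitly if you control the wandb.init call yourself.
run = wandb.init(project="levanter-test", mode="offline")  # made-up project name
run.log({"train/loss": 1.23})
run.finish()

# Offline runs are written under ./wandb/offline-run-*; they can be uploaded
# later from a machine with internet access via `wandb sync wandb/offline-run-*`.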

Lauler (Author) commented Jun 6, 2024

I did a quick test and it works when adding a process_index check to the CompositeTracker class, though I have a low degree of confidence that my quick hack doesn't break something else. There should be a cleaner solution.

I changed the init of the TensorboardConfig class to this:

    def init(self, run_id: Optional[str]) -> TensorboardTracker:
        dir_to_write = self.logdir
        if run_id is not None:
            # Only rank 0 creates the run directory and the SummaryWriter.
            if jax.process_index() == 0:
                dir_to_write = os.path.join(dir_to_write, run_id)

                pylogger.info(f"Writing Tensorboard logs to {dir_to_write}")

                from tensorboardX import SummaryWriter  # noqa: F811

                writer = SummaryWriter(
                    dir_to_write,
                    comment=self.comment,
                    purge_step=self.purge_step,
                    max_queue=self.max_queue,
                    flush_secs=self.flush_secs,
                    filename_suffix=self.filename_suffix,
                    write_to_disk=self.write_to_disk,
                )

                return TensorboardTracker(writer)
        # Note: non-zero ranks (and run_id=None) fall through and implicitly return None.

and CompositeTracker to

class CompositeTracker(Tracker):
    # Quick hack: every logging call below is gated on jax.process_index() == 0
    # so that only one process ever writes through the underlying trackers.
    def __init__(self, loggers: List[Tracker]):
        self.loggers = loggers

    def log_hyperparameters(self, hparams: dict[str, Any]):
        for tracker in self.loggers:
            if jax.process_index() == 0:
                tracker.log_hyperparameters(hparams)

    def log(self, metrics: dict[str, Any], *, step, commit=None):
        for tracker in self.loggers:
            if jax.process_index() == 0:
                tracker.log(metrics, step=step, commit=commit)

    def log_summary(self, metrics: dict[str, Any]):
        for tracker in self.loggers:
            if jax.process_index() == 0:
                tracker.log_summary(metrics)

    def log_artifact(
        self, artifact_path, *, name: Optional[str] = None, type: Optional[str] = None
    ):
        for tracker in self.loggers:
            if jax.process_index() == 0:
                tracker.log_artifact(artifact_path, name=name, type=type)
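
A cleaner direction might be to return a do-nothing tracker on non-zero ranks instead of returning None, so every process still ends up with a valid Tracker. This is only a sketch against the four methods visible above; I haven't checked whether Levanter's actual Tracker interface requires more than these:

class NoopTracker(Tracker):
    """Tracker that silently discards everything; could be returned on non-zero ranks."""

    def log_hyperparameters(self, hparams: dict[str, Any]):
        pass

    def log(self, metrics: dict[str, Any], *, step, commit=None):
        pass

    def log_summary(self, metrics: dict[str, Any]):
        pass

    def log_artifact(
        self, artifact_path, *, name: Optional[str] = None, type: Optional[str] = None
    ):
        pass

# In TensorboardConfig.init, something along these lines (sketch):
#     if jax.process_index() != 0:
#         return NoopTracker()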
