fixes #131, module 'eole.utils' has no attribute 'distributed' error when training multi-gpu #132

Merged (7 commits) on Oct 25, 2024
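The underlying error is most likely a plain Python import issue: `import eole.utils` does not load the `eole.utils.distributed` submodule unless the package's `__init__` imports it, so the attribute lookup fails at call time. A minimal reproduction sketch (assuming eole is installed and `eole/utils/__init__.py` does not pull in `distributed`, which is what issue #131 reports):

import eole.utils

try:
    eole.utils.distributed  # submodule was never imported -> attribute missing
except AttributeError as err:
    print(err)  # module 'eole.utils' has no attribute 'distributed'

# Importing the names directly forces Python to load the submodule,
# which is what this PR switches trainer.py to do.
from eole.utils.distributed import all_gather_list, all_reduce_and_rescale_tensors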
eole/trainer.py: 9 changes (3 additions, 6 deletions)
@@ -14,6 +14,7 @@
 import torch
 import traceback
 import eole.utils
+from eole.utils.distributed import all_gather_list, all_reduce_and_rescale_tensors
 from eole.utils.loss import LossCompute
 from eole.utils.logging import logger
 from eole.utils.misc import clear_gpu_cache, get_autocast
@@ -333,9 +334,7 @@ def train(
     self._maybe_update_estim_lambda(step)

     if self.n_gpu > 1 and self.parallel_mode == "data_parallel":
-        normalization = sum(
-            eole.utils.distributed.all_gather_list(normalization)
-        )
+        normalization = sum(all_gather_list(normalization))

     self._gradient_accumulation(
         batches, normalization, total_stats, report_stats
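For context on the call above: in data-parallel training each rank sees its own shard of the batch, so the per-rank normalization term (typically a token count) has to be summed across all ranks before it is used to scale the loss. A simplified stand-in for all_gather_list, written against plain torch.distributed (eole's actual helper may serialize and transfer objects differently), could look like this:

import torch.distributed as dist

def all_gather_list(obj):
    # Gather one picklable Python object from every rank into a list.
    # Hypothetical stand-in; requires torch >= 1.8 for all_gather_object.
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, obj)
    return gathered

# Example: with 2 GPUs where rank 0 counted 1800 tokens and rank 1 counted 2200,
# sum(all_gather_list(normalization)) yields 4000 on every rank.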
@@ -570,9 +569,7 @@ def _gradient_accumulation(
             for p in self.model.parameters()
             if p.requires_grad and p.grad is not None
         ]
-        eole.utils.distributed.all_reduce_and_rescale_tensors(
-            grads, float(self.n_gpu)
-        )
+        all_reduce_and_rescale_tensors(grads, float(self.n_gpu))

     self.optim.step()

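Similarly, the gradient reduction above sums each parameter's gradient across ranks and rescales it by the number of GPUs, so every replica steps with the same averaged gradient. A minimal stand-in, assuming plain torch.distributed (eole's real implementation may bucket tensors into a flat buffer to reduce the number of communication calls):

import torch.distributed as dist

def all_reduce_and_rescale_tensors(tensors, rescale_denom):
    # Sum each tensor in-place across all ranks, then divide by rescale_denom
    # (e.g. n_gpu) so the result is the cross-rank average.
    for t in tensors:
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        t.div_(rescale_denom)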