The current `log` function looks like:

```python
def log(self, logs: Dict[str, float]) -> None:
    """
    Log `logs` on the various objects watching training, including stored metrics.

    Args:
        logs (`Dict[str, float]`): The values to log.
    """
    # logs either has 'loss' or 'eval_loss'
    train_eval = "train" if "loss" in logs else "eval"
    # Add averaged stored metrics to logs
    for key, metrics in self._stored_metrics[train_eval].items():
        logs[key] = torch.tensor(metrics).mean().item()
    del self._stored_metrics[train_eval]
    return super().log(logs)
```
It would have this feature if it looked like this:

```python
def log(self, logs: Dict[str, float]) -> None:
    """
    Log `logs` on the various objects watching training, including stored metrics.

    Args:
        logs (`Dict[str, float]`): The values to log.
    """
    # logs either has 'loss' or 'eval_loss'
    train_eval = "train" if "loss" in logs else "eval"
    # Add averaged stored metrics to logs
    for key, metrics in self._stored_metrics[train_eval].items():
        if isinstance(metrics[0], torch.Tensor):
            gathered = self._nested_gather([m.cuda() for m in metrics])
            metrics = [g.mean() for g in gathered]
        meaned = torch.tensor(metrics).mean()
        logs[key] = meaned.item()
    del self._stored_metrics[train_eval]
    return super().log(logs)
```
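For intuition, here is a pure-Python sketch of what the proposed change computes (this is not the actual TRL/Trainer code; `all_ranks_metrics` is a hypothetical stand-in for what `self._nested_gather` would return): gather each rank's stored per-batch values, take a per-rank mean, then average the means over ranks.

```python
# Pure-Python sketch, no torch and no real process group.
# `all_ranks_metrics` stands in for the gathered result: one list of
# stored per-batch metric values from each rank.
def synced_log_value(all_ranks_metrics):
    # per-rank means, as in `[g.mean() for g in gathered]`
    per_rank_means = [sum(m) / len(m) for m in all_ranks_metrics]
    # final mean over ranks, as in `torch.tensor(metrics).mean().item()`
    return sum(per_rank_means) / len(per_rank_means)

# Rank 0 alone would log 0.3; syncing logs the global average instead.
metrics = [[0.2, 0.4], [0.8, 0.6], [0.1, 0.3], [0.9, 0.7]]
print(synced_log_value(metrics))  # ≈ 0.5
```

Without the sync, only rank 0's list (`[0.2, 0.4]` here) would reach the logs, which is exactly the discrepancy the fix addresses.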
I'm happy to submit a PR.
That's a good point! Feel free to open a PR to fix this. I don't think adding a unit test for this is relevant. If possible, add before/after plots (e.g., with wandb) to ensure that we aren't introducing a regression.
Of course!
Here's a graph of the same training run with and without the modification. You can see the pink line is much smoother, especially in the accuracy graph. My per_device_batch_size is 2, so the accuracy per device can only be 1, 0.5, or 0.
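To make that quantization concrete, here is a small pure-Python illustration with hypothetical numbers (not from the actual run): with a per-device batch size of 2, each rank's accuracy at a step is 0, 0.5, or 1, so logging only rank 0 produces a jagged curve, while the cross-rank mean takes finer-grained values.

```python
# Hypothetical per-rank results for one logging step: number of correct
# predictions out of a per-device batch of 2, across 8 ranks.
per_rank_correct = [2, 1, 1, 0, 2, 1, 2, 1]
per_rank_acc = [c / 2 for c in per_rank_correct]  # each is 0.0, 0.5, or 1.0

rank0_only = per_rank_acc[0]                    # logged without syncing
synced = sum(per_rank_acc) / len(per_rank_acc)  # averaged across ranks

print(rank0_only, synced)  # 1.0 0.625
```

The synced value can land anywhere on a much finer grid (multiples of 1/16 here), which is why the averaged curve looks smoother.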
zhc7 added a commit to zhc7/trl that referenced this issue on Dec 13, 2024.
### Feature request

Synchronize and average metrics across ranks.

### Motivation

The metrics currently reported are only the numbers from rank 0; none of them are synced across ranks.

### Your contribution

The current `log` function and the proposed version are shown above. I'm happy to submit a PR.