Rollout Scheduling #148

HCookie · 2024-11-18T09:13:38Z

HCookie
Nov 18, 2024
Maintainer

Our current rollout implementation is very focused on sequential epoch increments. It would be good to generalise this by providing schedulers to control rollout.

Work was done in aifs-mono to enable this (here).
I think this can be generalised to make it more widely applicable.

Features

Below is a list of features and requirements as I see them:

Epoch Step rollout
Static Rollout
Upon hitting a threshold begin another strategy
Random selection of rollout between bounds
Dynamic selection of increments
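
To make the list above concrete, here is a minimal sketch of how such schedulers might look. All class and parameter names (RolloutScheduler, Static, EpochStepped, RandomRange, SwitchAt) are hypothetical and not taken from aifs-mono or any existing code; dynamic selection of increments is left out because its rule is not specified here.

```python
import random
from dataclasses import dataclass


class RolloutScheduler:
    """Hypothetical base class: maps training progress to a rollout length."""

    def rollout(self, epoch: int, step: int) -> int:
        raise NotImplementedError


@dataclass
class Static(RolloutScheduler):
    """Static rollout: always the same value."""
    value: int

    def rollout(self, epoch: int, step: int) -> int:
        return self.value


@dataclass
class EpochStepped(RolloutScheduler):
    """Epoch step rollout: grow by `increment` every `every_n_epochs`, capped at `maximum`."""
    minimum: int
    maximum: int
    every_n_epochs: int = 1
    increment: int = 1

    def rollout(self, epoch: int, step: int) -> int:
        grown = self.minimum + self.increment * (epoch // self.every_n_epochs)
        return min(grown, self.maximum)


@dataclass
class RandomRange(RolloutScheduler):
    """Random selection of rollout between bounds (inclusive)."""
    minimum: int
    maximum: int
    seed: int = 0

    def rollout(self, epoch: int, step: int) -> int:
        # Seeding from (seed, epoch) keeps the draw reproducible across restarts.
        rng = random.Random(self.seed * 1_000_003 + epoch)
        return rng.randint(self.minimum, self.maximum)


@dataclass
class SwitchAt(RolloutScheduler):
    """Use `before` until `threshold_epoch`, then begin the `after` strategy."""
    before: RolloutScheduler
    after: RolloutScheduler
    threshold_epoch: int

    def rollout(self, epoch: int, step: int) -> int:
        if epoch < self.threshold_epoch:
            return self.before.rollout(epoch, step)
        # The second strategy counts epochs from the moment it takes over.
        return self.after.rollout(epoch - self.threshold_epoch, step)
```

Composing small schedulers, as SwitchAt does, is one possible design; a single configurable class would work too. Passing both epoch and step leaves room for either answer to the question of whether rollout may change within an epoch.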

Questions

Will the rollout only change between epochs? Could it change within an epoch?

What other features may be needed?

Discussion of #145

Points raised in Issue

Step based dynamic rollout
Useful for auto training and changing rollout at 200,000 steps
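
A step-based schedule of this kind could be sketched as a lookup from global step to rollout. The function name rollout_at_step, the milestone table, and the rollout values are assumptions for illustration; only the 200,000 step threshold comes from the point above.

```python
from bisect import bisect_right


def rollout_at_step(step: int, milestones: dict[int, int], default: int = 1) -> int:
    """Return the rollout of the last milestone at or before `step`.

    `milestones` maps a global step to the rollout that starts at that step.
    Before the first milestone, `default` is used.
    """
    boundaries = sorted(milestones)
    idx = bisect_right(boundaries, step)
    if idx == 0:
        return default
    return milestones[boundaries[idx - 1]]


# Illustrative values: switch to rollout 2 at 200,000 steps, then 4 later on.
schedule = {200_000: 2, 250_000: 4}
```

Keying on the global step rather than the epoch means the schedule does not depend on how many samples an epoch contains.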

ssmmnn11 · 2024-11-20T09:16:40Z

ssmmnn11
Nov 20, 2024
Collaborator

We currently also have the issue that if a rollout experiment has to restart, and one restarts from the latest checkpoint within an epoch, we go through all samples again instead of only the "remaining samples" / iterations within that epoch.

"Remaining samples" is of course not well defined because we shuffle; it would only be well defined if we saved the random states and restored them.
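
One way to make "remaining samples" well defined, sketched here as an alternative to saving the full random state, is to derive the shuffle from a base seed plus the epoch number and to record how many samples were consumed. The function names are hypothetical, and the sketch ignores multiple dataloader workers and distributed sharding.

```python
import random


def epoch_order(num_samples: int, seed: int, epoch: int) -> list[int]:
    """Shuffle sample indices with a generator seeded from (seed, epoch)."""
    order = list(range(num_samples))
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order


def remaining_samples(num_samples: int, seed: int, epoch: int, consumed: int) -> list[int]:
    """Indices still to be visited in `epoch` after `consumed` samples."""
    # Re-creating the same permutation and skipping the consumed prefix
    # reproduces the tail of the interrupted epoch.
    return epoch_order(num_samples, seed, epoch)[consumed:]
```

With this scheme a checkpoint only needs to store the seed, the epoch, and the consumed count.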

1 reply

anaprietonem Nov 22, 2024
Collaborator

I think that PTL datamodules also support 'state_dict' and 'load_state_dict' (https://lightning.ai/docs/pytorch/stable/data/datamodule.html), so maybe that could help with this issue?
Something like what they do here: https://lightning.ai/docs/pytorch/stable/data/datamodule.html#save-datamodule-state
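
As a rough sketch of that idea, the class below mirrors the 'state_dict' / 'load_state_dict' hook names in plain Python, without importing Lightning. In a real setup these methods would presumably live on a LightningDataModule subclass so that they are saved with the checkpoint; the class name and the fields stored are assumptions.

```python
from typing import Any


class ResumableDataState:
    """Stand-in for a datamodule that can report and restore its progress."""

    def __init__(self, shuffle_seed: int = 0) -> None:
        self.shuffle_seed = shuffle_seed
        self.epoch = 0
        self.samples_consumed = 0

    def state_dict(self) -> dict[str, Any]:
        # Everything needed to rebuild the shuffle and skip consumed samples.
        return {
            "shuffle_seed": self.shuffle_seed,
            "epoch": self.epoch,
            "samples_consumed": self.samples_consumed,
        }

    def load_state_dict(self, state_dict: dict[str, Any]) -> None:
        self.shuffle_seed = state_dict["shuffle_seed"]
        self.epoch = state_dict["epoch"]
        self.samples_consumed = state_dict["samples_consumed"]
```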

mchantry · 2024-11-22T22:21:48Z

mchantry
Nov 22, 2024
Collaborator

It would be nice to think about how the dataloader object can be adapted when the rollout changes. For example, the max rollout currently determines how much data the dataloader fetches, meaning excess data is loaded.
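
To illustrate the excess, here is a small sketch of the time-step window one sample needs at a given rollout. The function and its parameters (sample_window, multistep_input, timestep_stride) are hypothetical and do not describe the actual dataset indexing.

```python
def sample_window(start: int, multistep_input: int, rollout: int,
                  timestep_stride: int = 1) -> list[int]:
    """Time-step indices needed for one sample: the inputs plus `rollout` targets."""
    length = multistep_input + rollout
    return [start + i * timestep_stride for i in range(length)]


# With 2 input steps, a current rollout of 1 and a max rollout of 12:
needed = sample_window(0, multistep_input=2, rollout=1)
fetched_at_max = sample_window(0, multistep_input=2, rollout=12)
excess_steps = len(fetched_at_max) - len(needed)
```

If the dataloader could be told the current rollout rather than the maximum, only the `needed` window would have to be read.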

0 replies
