Closes #894.
This is an attempt at implementing the Prodigy optimizer. It's based on a PyTorch implementation from here.
That implementation is similar to Adam, so I started from the Adam files. After the PyTorch->dfdx translation, I gave the CUDA kernel a try; it is probably naive in many places and incorrect in some.
As for fp16, I couldn't compile dfdx with it even without the Prodigy changes, so I'm not sure whether the fp16 parts of these changes are correct.
For the reasons above and some below, I'm leaving this as a draft.
If I find good use for it in later experiments, or find bugs in it, I hope to update it here.
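To illustrate the similarity to Adam mentioned above, here is a minimal sketch of a Prodigy-style step in plain Rust. It is not the dfdx code from this PR and not a faithful port of the PyTorch reference. The type `ProdigySketch`, its field names and the default values are hypothetical, and bias correction, weight decay, the growth-rate cap and the warmup safeguard are left out. The two moment buffers are the Adam-like part; the distance estimate `d` and its running numerator and denominator are what Prodigy adds on top.

```rust
/// Hypothetical, simplified Prodigy-style optimizer over a flat `f64` vector.
/// Not the dfdx code from this PR: bias correction, weight decay, the
/// growth-rate cap and the warmup safeguard are all left out.
pub struct ProdigySketch {
    pub lr: f64,
    pub beta1: f64,
    pub beta2: f64,
    pub eps: f64,
    pub d0: f64,              // initial distance estimate
    pub d: f64,               // current distance estimate (never decreases here)
    pub d_numerator: f64,     // running sum of d-weighted <grad, p0 - p>
    pub p0: Vec<f64>,         // parameters at initialization
    pub s: Vec<f64>,          // running d-weighted sum of gradients
    pub exp_avg: Vec<f64>,    // first moment, as in Adam but scaled by d
    pub exp_avg_sq: Vec<f64>, // second moment, as in Adam but scaled by d^2
}

impl ProdigySketch {
    /// Assumed defaults, chosen for illustration only.
    pub fn new(params: &[f64]) -> Self {
        let n = params.len();
        let d0 = 1e-6;
        Self {
            lr: 1.0,
            beta1: 0.9,
            beta2: 0.999,
            eps: 1e-8,
            d0,
            d: d0,
            d_numerator: 0.0,
            p0: params.to_vec(),
            s: vec![0.0; n],
            exp_avg: vec![0.0; n],
            exp_avg_sq: vec![0.0; n],
        }
    }

    pub fn step(&mut self, params: &mut [f64], grads: &[f64]) {
        assert_eq!(params.len(), grads.len());
        assert_eq!(params.len(), self.p0.len());
        let beta3 = self.beta2.sqrt();
        let d = self.d; // estimate used for this step's accumulations
        let dlr = d * self.lr;
        let w = (d / self.d0) * dlr;

        self.d_numerator *= beta3;
        let mut d_denom = 0.0;
        for i in 0..params.len() {
            let g = grads[i];
            // Distance statistics (the part Adam does not have).
            self.d_numerator += w * g * (self.p0[i] - params[i]);
            self.s[i] = beta3 * self.s[i] + w * g;
            d_denom += self.s[i].abs();
            // Adam-like moments, scaled by d and d^2.
            self.exp_avg[i] = self.beta1 * self.exp_avg[i] + d * (1.0 - self.beta1) * g;
            self.exp_avg_sq[i] =
                self.beta2 * self.exp_avg_sq[i] + d * d * (1.0 - self.beta2) * g * g;
        }
        if d_denom == 0.0 {
            return; // all gradients so far were zero; nothing to update
        }

        // Raise the distance estimate if the new ratio is larger.
        self.d = self.d.max(self.d_numerator / d_denom);

        // Adam-like parameter update, using the step size computed before
        // the estimate was raised.
        for i in 0..params.len() {
            let denom = self.exp_avg_sq[i].sqrt() + self.d * self.eps;
            params[i] -= dlr * self.exp_avg[i] / denom;
        }
    }
}
```

For instance, calling `step` repeatedly with the gradient of `0.5 * x^2` (that is, `x` itself), starting from `x = 1.0`, moves `x` below `1.0` while `d` grows from `d0`.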
Testing
I've added some very basic Rust tests and compared them with the equivalent PyTorch tests.
Each PyTorch test looks roughly like this:
Where the only change between each test happens on the "# this is the optimizer settings" line.
dfdx/dfdx/src/nn/optim/prodigy.rs, lines 180 to 182 in 65cfd6a
dfdx/dfdx/src/nn/optim/prodigy.rs, lines 225 to 241 in 65cfd6a
dfdx/dfdx/src/nn/optim/prodigy.rs, lines 285 to 295 in 65cfd6a
dfdx/dfdx/src/nn/optim/prodigy.rs, lines 337 to 347 in 65cfd6a
Besides this, I tried a comparison on a U-Net experiment, and Adam seemed to perform much better, so I may have implemented something incorrectly. This is another reason why this PR is still a draft.