
Checkpoint Refactor #65

Merged
merged 21 commits into from
Aug 4, 2023

Conversation

loliverhennigh
Collaborator

@loliverhennigh loliverhennigh commented Jul 10, 2023

Description

This feature adds improved saving and loading of models by storing constructor arguments in a JSON file, which allows loading a model without separately keeping track of its parameters. This merge also includes several related side features: from_torch lets users bring in PyTorch models seamlessly, provided they have JSON-serializable inputs, and models are now exposed through entry points, following discussions about expanding the usability of Modulus.
Here is a list of high-level features:

  • Saved checkpoints now contain additional information, such as the version and model params, so you can instantiate a model directly from a checkpoint with from_checkpoint
  • Bring in a model from PyTorch and use it seamlessly as a Modulus module with from_torch
  • Register custom models dynamically, which allows, for example, loading models from checkpoints
  • Factory that takes the name of a Module and returns its class
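The checkpoint-with-JSON-args idea above can be sketched in a few lines of pure Python. This is a minimal illustration of the pattern, not the actual Modulus implementation; all class and field names here are hypothetical stand-ins:

```python
import json
from pathlib import Path

class Module:
    """Toy base class illustrating the pattern: capture constructor
    arguments so a model can be rebuilt from its checkpoint alone."""

    _registry = {}  # name -> class, filled automatically on subclassing

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Module._registry[cls.__name__] = cls

    def __init__(self, **params):
        self.params = params  # must be JSON-serializable
        self.state = {}       # stand-in for model weights

    def save(self, path):
        # Store both the weights and the metadata needed to rebuild the model.
        Path(path).write_text(json.dumps({
            "version": "0.1.0",
            "class": type(self).__name__,
            "params": self.params,
            "state": self.state,
        }))

    @classmethod
    def from_checkpoint(cls, path):
        # Read the metadata, look up the class, and re-instantiate it
        # without the caller having to remember the model's parameters.
        ckpt = json.loads(Path(path).read_text())
        model_cls = Module._registry[ckpt["class"]]
        model = model_cls(**ckpt["params"])
        model.state = ckpt["state"]
        return model

class MLP(Module):
    def __init__(self, width=64, depth=4):
        super().__init__(width=width, depth=depth)

m = MLP(width=128)
m.save("ckpt.json")
restored = Module.from_checkpoint("ckpt.json")
```

The key point is that save writes the constructor arguments next to the weights, so from_checkpoint can rebuild the exact model with no extra bookkeeping on the user's side.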

@loliverhennigh
Collaborator Author

This should be ready for review now. I had to mess around with my fork to change it to a public PR, so hopefully nothing weird happened.

Collaborator

@NickGeneva NickGeneva left a comment

Initial comments

Review comments (resolved) on: modulus/models/layers/activations.py, modulus/models/module.py, setup.py
@loliverhennigh
Collaborator Author

Hey @akshaysubr, @NickGeneva and I talked, and I will refactor some of the registry and entrypoints stuff to be a bit more in line with fsspec. You can look at other parts of the code, but I wouldn't bother reviewing too much until I make these changes.

@loliverhennigh
Collaborator Author

Linking with Modulus launch PR that should be merged at the same time, NVIDIA/modulus-launch#42

@loliverhennigh
Collaborator Author

Hey @akshaysubr and @NickGeneva, this is ready for another round of review. I think everything here is in a pretty good place; however, there is one key design decision that I will explain below.

So for the entrypoint exposure, I followed the example here to the letter: https://amir.rachum.com/python-entry-points/. In particular, I followed their final revision and exposed all of our models through entry points, treating internal and external models the same. You can see where I get the list of models here: https://github.com/loliverhennigh/modulus/blob/fea-checkpoint_refactor/modulus/models/registry.py#L20.

This goes somewhat counter to how fsspec does things. Instead of exposing their protocols through entry points and discovering them from there, they write them manually into a dict here: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/registry.py#L62C2-L62C2. Functionally this leads to the same design and interface for users, and they have good documentation on how to add protocols through entry points. The advantage of their approach is that they can easily add protocols from outside packages, and you can see a few examples of this in the dict. If we did the same thing, we could, for example, bring in Hugging Face models.

I am open to setting all this up following the fsspec example, but at this point I don't think there is a benefit to doing so. I would say we go with the design in this PR, and then, if we want, we can switch to the fsspec setup at any time. This should be easy to do given how I structured the code; it would basically just mean adding a dict to the registry.py file. It would also not affect the user interface at all.
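For reference, a discovery-based registry along these lines can be sketched with importlib.metadata. This is only an illustration of the mechanism, not Modulus's actual code; the group name "modulus.models" and the helper names are assumptions:

```python
from importlib.metadata import entry_points

# Manually registered models, in the spirit of fsspec's known_implementations dict.
_model_registry = {}

def register_model(name, cls):
    """Register a model class under a string name (e.g. when loading a checkpoint)."""
    _model_registry[name] = cls

def discover_entrypoint_models(group="modulus.models"):
    """Merge models advertised by installed packages into the registry."""
    eps = entry_points()
    # entry_points() returns a selectable object on Python >= 3.10
    # and a dict-like mapping on older versions; handle both.
    group_eps = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
    for ep in group_eps:
        # setdefault: manual registrations take precedence over discovered ones.
        _model_registry.setdefault(ep.name, ep.load())
    return _model_registry

def get_model(name):
    """Factory: return the class registered under `name`."""
    return _model_registry[name]
```

With this structure, switching between the two designs discussed above really is just a matter of whether the dict starts out pre-populated or is filled purely from entry-point discovery.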

Also, just to make it a bit more concrete how a user can expose a model to Modulus via entry points, I included a little example here: https://github.com/loliverhennigh/modulus/blob/fea-checkpoint_refactor/examples/ExternalPackage/example/load_model.py. I'll delete this afterwards, but it was really helpful for wrapping my head around this stuff.
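To sketch what the external-package side could look like: a third-party package might advertise a model through an entry point in its packaging metadata, roughly like this (the group name modulus.models and all package/module names here are illustrative, not necessarily what Modulus actually uses):

```toml
# pyproject.toml of a hypothetical third-party package
[project]
name = "my-modulus-models"
version = "0.1.0"

[project.entry-points."modulus.models"]
MyCustomModel = "my_modulus_models.models:MyCustomModel"
```

Once installed, the consuming library can discover MyCustomModel by name without importing the package up front, which is the same mechanism fsspec documents for third-party protocols.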

Collaborator

@akshaysubr akshaysubr left a comment

@loliverhennigh This is great work! Left some comments with suggested changes. It would also be good to change the PR title and description to something more verbose and descriptive.

Review comments (resolved) on: modulus/models/layers/activations.py, modulus/models/module.py, modulus/models/registry.py, setup.py
@akshaysubr
Collaborator

> Also, just to make it a bit more concrete how a user can expose a model to Modulus via entry points, I included a little example here: https://github.com/loliverhennigh/modulus/blob/fea-checkpoint_refactor/examples/ExternalPackage/example/load_model.py. I'll delete this afterwards, but it was really helpful for wrapping my head around this stuff.

I think it is good to keep this example. We just need to make sure it gets run every so often so that it doesn't go stale.

@NickGeneva NickGeneva added enhancement New feature or request 2 - In Progress Currently a work in progress labels Jul 18, 2023
@loliverhennigh loliverhennigh changed the title Fea checkpoint refactor Checkpoint Refactor Jul 18, 2023
@loliverhennigh
Collaborator Author

Hey @akshaysubr and @NickGeneva, ready for another round of reviews! I think I addressed everything. Could I get another look when you have time? Thanks!

@NickGeneva NickGeneva added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Jul 25, 2023
@NickGeneva NickGeneva removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Aug 1, 2023
@NickGeneva
Collaborator

@loliverhennigh I tested this now and it works great! Really liking how this turned out!

@NickGeneva
Collaborator

/blossom-ci

@loliverhennigh

/blossom-ci

@loliverhennigh loliverhennigh merged commit 656b15e into NVIDIA:main Aug 4, 2023