auto save model in lightning #613

dberenbaum · 2023-07-03T19:25:12Z

This PR auto logs models with dvclive.lightning.DVCLiveLogger(log_model=True):

trainer = pl.Trainer(logger=[DVCLiveLogger(log_model=True)])

log_model follows the conventions in mlflow and wandb:

False saves no models (this is the default).
True saves all model checkpoints at the end of training.
"all" saves all model checkpoints whenever a model checkpoint is saved.

If log_model is True or "all", dvclive caches the entire checkpoints folder.
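
As a rough illustration of the three log_model modes listed above, the sketch below models when the checkpoints folder would be cached. The function name and the two hook labels ("checkpoint_saved" and "train_end") are hypothetical and exist only for this example; this is not the code in src/dvclive/lightning.py.

```python
from typing import Union


def should_cache_checkpoints(log_model: Union[bool, str], hook: str) -> bool:
    """Decide whether the checkpoints folder is cached at a given point.

    `hook` is a hypothetical label: "checkpoint_saved" (a checkpoint was
    just written) or "train_end" (training finished).
    """
    if log_model is True:
        # True: save all checkpoints once, at the end of training.
        return hook == "train_end"
    if log_model == "all":
        # "all": save every time a checkpoint is written.
        return hook == "checkpoint_saved"
    # False (the default): never save models.
    return False
```

Under these assumptions, log_model=True only triggers at the end of training, while log_model="all" triggers on every checkpoint save.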

dvclive will also add a model artifact named "best" at the end of training that references the best model checkpoint. (Edit: this resembles the "best" alias in wandb.)

Edit: example dvclive/dvc.yaml output:

artifacts:
  best:
    path: ../DvcLiveLogger/dvclive_run/checkpoints/epoch=1-step=4-v1.ckpt
    type: model

To support this, log_artifact was also changed:

Artifacts will only be added to dvc.yaml:artifacts if some metadata is provided (type, name, desc, labels, or meta). This is a breaking change, but I can't see how anyone is making use of this without any metadata since it won't be used by the model registry.
Added cache kwarg to log_artifact (defaults to True) so that it's possible to add the artifact metadata without caching the object.
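
To make the two log_artifact changes concrete, here is a minimal sketch of the decision they describe. The helper plan_log_artifact and its return format are hypothetical and only mirror the parameter names mentioned above (type, name, desc, labels, meta, cache). Treating empty values as "not provided" is an assumption, and this is not dvclive's implementation.

```python
from typing import Any, Dict, List, Optional


def plan_log_artifact(
    path: str,
    type: Optional[str] = None,  # shadows the builtin on purpose, to mirror the kwarg name
    name: Optional[str] = None,
    desc: Optional[str] = None,
    labels: Optional[List[str]] = None,
    meta: Optional[Dict[str, Any]] = None,
    cache: bool = True,
) -> Dict[str, Any]:
    """Return what would happen for a log_artifact call (hypothetical model)."""
    metadata = {"type": type, "name": name, "desc": desc, "labels": labels, "meta": meta}
    # Assumption: empty values (None, "", [], {}) count as "no metadata".
    has_metadata = any(bool(value) for value in metadata.values())
    return {
        "path": path,
        "cache": cache,  # whether the object itself would be cached
        "write_to_dvc_yaml": has_metadata,  # whether dvc.yaml:artifacts gets an entry
    }
```

For example, plan_log_artifact("model.ckpt", type="model", name="best", cache=False) describes the case used for the "best" artifact: metadata is recorded, but the file is not cached a second time.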

Tests
Docs (dvclive: lightning log_model dvc.org#4714)
VS Code snippets (cc @mattseddon)
Studio snippets (cc @yathomasi)

dberenbaum · 2023-07-03T19:26:49Z

Overall, I felt opening a PR with the desired behavior would be more effective than explaining and discussing in an issue. I hope this will help resolve some of the rough edges around saving models and that we can work through the other framework callbacks to implement similar functionality that works with the existing framework conventions and resembles mlflow, wandb, etc.

dberenbaum · 2023-07-03T19:56:46Z

One thing to note: lightning will not overwrite existing files or clean up between runs. Instead, it will append a version number, so if you run the same code repeatedly, you will end up with a directory that tracks your entire history of model checkpoints instead of only the latest run:

$ tree DvcLiveLogger
DvcLiveLogger
└── dvclive_run
    ├── checkpoints
    │   ├── epoch=0-step=2-v1.ckpt
    │   ├── epoch=0-step=2-v2.ckpt
    │   ├── epoch=0-step=2.ckpt
    │   ├── epoch=1-step=32-v1.ckpt
    │   ├── epoch=1-step=32-v2.ckpt
    │   ├── epoch=1-step=32.ckpt
    │   ├── epoch=1-step=4.ckpt
    │   ├── epoch=2-step=6.ckpt
    │   ├── epoch=4-step=10-v1.ckpt
    │   ├── epoch=4-step=10-v2.ckpt
    │   ├── epoch=4-step=10-v3.ckpt
    │   ├── epoch=4-step=10-v4.ckpt
    │   ├── epoch=4-step=10-v5.ckpt
    │   ├── epoch=4-step=10-v6.ckpt
    │   └── epoch=4-step=10.ckpt
    └── checkpoints.dvc

If you are running a pipeline, this is probably fine since you can control if you want to delete that directory each time. We might also want to consider dropping the existing checkpoints directory in the dvclive callback if resume=False.
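
The version-suffix behavior shown in the tree above can be approximated with the standard library. This is only a sketch of the observed naming pattern ("-v1", "-v2", and so on when a file already exists), not Lightning's actual code; the helper names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def next_checkpoint_path(dirpath: Path, stem: str, ext: str = ".ckpt") -> Path:
    """Return a file path in dirpath that does not collide with existing files."""
    candidate = dirpath / f"{stem}{ext}"
    version = 1
    # Keep bumping the suffix until the name is free.
    while candidate.exists():
        candidate = dirpath / f"{stem}-v{version}{ext}"
        version += 1
    return candidate


def simulate_runs(n_runs: int, stem: str = "epoch=0-step=2") -> List[str]:
    """Write the same checkpoint name n_runs times without cleaning up."""
    with tempfile.TemporaryDirectory() as tmp:
        dirpath = Path(tmp)
        for _ in range(n_runs):
            next_checkpoint_path(dirpath, stem).touch()
        return sorted(p.name for p in dirpath.iterdir())
```

Calling simulate_runs(3) leaves three files for the same epoch and step, which is how repeated runs accumulate checkpoints in one directory.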

dberenbaum · 2023-07-04T19:15:38Z

I think this gets us to a good place with logging models in lightning. In fact, comparing to other trackers, it feels a bit easier to manage the models this way in dvc. On a different machine, you can pull the lightning checkpoints dir and keep using lightning methods to load those checkpoints. With other trackers, once you are on a different machine, the only way to load models is using the experiment tracker's api.

daavoo · 2023-07-05T14:53:18Z

src/dvclive/lightning.py

+        # Save model checkpoints.
+        if self._log_model is True:
+            self.experiment.log_artifact(checkpoint_callback.dirpath)
+        # Log best model.


WDYT of creating a copy in "dvclive" folder (or in the checkpoints folder itself), at least for the best?

It seems that we would be changing the path of the registered model between experiments in the current behavior.

dberenbaum · 2023-07-05

Yes, which I guess is also what other trackers do AFAICT? Do you think the path matters? Maybe it makes it easier to dvc get later, although we could make that work by the artifact name. No strong opinion from me.

daavoo · 2023-07-05T14:58:35Z

> I think this gets us to a good place with logging models in lightning. In fact, comparing to other trackers, it feels a bit easier to manage the models this way in dvc. On a different machine, you can pull the lightning checkpoints dir and keep using lightning methods to load those checkpoints. With other trackers, once you are on a different machine, the only way to load models is using the experiment tracker's api.

I am ok with moving forward in this direction and prioritizing similar behavior in the other (most used) frameworks.

We should invest some time in properly documenting the behavior and the expected workflow (how to use the dvc-tracked artifacts later), though.

daavoo · 2023-07-05T15:03:42Z

> If you are running a pipeline, this is probably fine since you can control if you want to delete that directory each time. We might also want to consider dropping the existing checkpoints directory in the dvclive callback if resume=False.

Didn't look in detail, but it seems like the other loggers use _scan_checkpoints to only track the checkpoints related to the current experiment.

dberenbaum · 2023-07-05T17:48:22Z

> Didn't look in detail, but it seems like the other loggers use _scan_checkpoints to only track the checkpoints related to the current experiment.

Great idea. I'll look into it.

dberenbaum · 2023-07-11T18:35:35Z

Lightning warns if the directory is not empty:

UserWarning: Checkpoint directory /Users/dave/Code/lstm_seq2seq/model exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")

_scan_checkpoints only returns the latest checkpoint even if you use something like ModelCheckpoint(save_top_k=-1), so I had to save the checkpoint from each scan and then drop the rest of the files in the directory (unless resume=True).

Removing the files only happens after the checkpoint is saved, so sometimes the first checkpoint will still get a version number like:

$ tree model
model
├── epoch=0-step=2-v1.ckpt # saved this checkpoint before previous one was dropped
├── epoch=1-step=4.ckpt
├── epoch=2-step=6.ckpt
├── epoch=3-step=8.ckpt
└── epoch=4-step=10.ckpt
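
The cleanup step described above can be sketched as follows. This is a simplified, standalone approximation assuming a set of tracked checkpoint paths and a resume flag; the function names are hypothetical and it is not the exact code in src/dvclive/lightning.py.

```python
import tempfile
from pathlib import Path
from typing import List, Set, Tuple


def drop_untracked(dirpath: Path, keep: Set[str], resume: bool = False) -> List[str]:
    """Delete files in dirpath that are not in `keep`; return the removed names."""
    if resume:
        # When resuming, earlier checkpoints are left untouched.
        return []
    removed = []
    for p in dirpath.iterdir():
        if p.is_file() and str(p) not in keep:
            p.unlink(missing_ok=True)
            removed.append(p.name)
    return sorted(removed)


def demo() -> Tuple[List[str], List[str]]:
    """Show why the first checkpoint of a new run can keep a "-v1" suffix."""
    with tempfile.TemporaryDirectory() as tmp:
        dirpath = Path(tmp)
        stale = dirpath / "epoch=0-step=2.ckpt"  # left over from a previous run
        current = dirpath / "epoch=0-step=2-v1.ckpt"  # saved before cleanup ran
        stale.touch()
        current.touch()
        removed = drop_untracked(dirpath, keep={str(current)})
        remaining = sorted(p.name for p in dirpath.iterdir())
        return removed, remaining
```

Because the stale file is only removed after the current checkpoint has been written, the current file keeps the versioned name it was given while the stale one still existed.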

Overall, this works and probably meets most users' expectations, so I think we should keep it, but I don't feel strongly that it outweighs the added complexity or the potential surprise that dvclive is deleting model checkpoints.

dberenbaum · 2023-07-21T17:01:48Z

ping @daavoo

codecov-commenter · 2023-07-21T17:25:26Z

Codecov Report

Patch coverage: 3.63% and project coverage change: -1.42 ⚠️

Comparison is base (469e39e) 89.47% compared to head (a9e4587) 88.06%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #613      +/-   ##
==========================================
- Coverage   89.47%   88.06%   -1.42%     
==========================================
  Files          44       43       -1     
  Lines        2994     3042      +48     
  Branches      250      260      +10     
==========================================
  Hits         2679     2679              
- Misses        276      324      +48     
  Partials       39       39

Impacted Files                             Coverage Δ
src/dvclive/lightning.py                   0.00% <0.00%> (ø)
tests/test_frameworks/test_lightning.py    6.09% <9.09%> (-0.80%) ⬇️

... and 1 file with indirect coverage changes

daavoo

On a high level, the code and description make sense to me.

I have not actually tried the different options in a project, but the tests look reasonable, so I am trusting that.

daavoo · 2023-07-21T17:55:59Z

src/dvclive/lightning.py

+                if str(p) not in self._all_checkpoint_paths:
+                    p.unlink(missing_ok=True)


Let's be clear about this in the docs

daavoo · 2023-07-21T18:01:08Z

I think I would like to have a best (or best_only) option for log_model, but we can discuss that separately.

auto save model in lightning

74f0609

dberenbaum requested review from daavoo, shcheklein and aguschin July 3, 2023 19:25

lightning: save model at each checkpoint if save_top_k == -1

dbd354b

add type: model to lightning artifact

15932a8

daavoo reviewed Jul 5, 2023

dberenbaum mentioned this pull request Jul 5, 2023

log_artifact: add cache option, only write to dvc.yaml if metadata ex… #620

Merged

dberenbaum added 2 commits July 11, 2023 12:30

Merge branch 'main' into lightning-model

3c77b75

lightning: drop unused checkpoints

110b9aa

lightning: add tests for log_model

48309f5

dberenbaum marked this pull request as ready for review July 12, 2023 22:31

dberenbaum requested a review from daavoo July 14, 2023 13:15

merge

a9e4587

daavoo approved these changes Jul 21, 2023

daavoo reviewed Jul 21, 2023

dberenbaum mentioned this pull request Jul 21, 2023

dvclive: lightning log_model iterative/dvc.org#4714

Merged

Merge branch 'main' into lightning-model

09d9f80

dberenbaum closed this Jul 25, 2023

dberenbaum reopened this Jul 25, 2023

dberenbaum closed this Jul 25, 2023

dberenbaum deleted the lightning-model branch July 25, 2023 20:34

dberenbaum restored the lightning-model branch July 25, 2023 20:35
