
Serialization #733

Merged
merged 18 commits on Mar 25, 2022

Conversation

olivierlabayle
Collaborator

@olivierlabayle olivierlabayle commented Jan 28, 2022

Context: JuliaAI/MLJSerialization.jl#15

This PR contains:

  • Routines for the serialization of machines/composite machines (see the usage sketch at the end of this description)
  • A small refactoring of Base.replace to reuse some code for the save of Composite models

Unfortunately I don't think we have CI enabled since this does not target dev.

Happy to take your comments.
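
For context, here is a minimal usage sketch of the workflow these routines are meant to enable, assuming the serializable/restore! names discussed below and Julia's standard Serialization library (the exact API is whatever the PR defines):

```julia
using MLJBase, Serialization

# Train a machine as usual (model, X, y assumed to be defined elsewhere).
mach = machine(model, X, y)
fit!(mach)

# Strip training data and other heavyweight fields before writing to disk.
smach = serializable(mach)
serialize("mach.jls", smach)

# Later, possibly in a fresh session: reload and restore before using.
mach2 = deserialize("mach.jls")
restore!(mach2)
predict(mach2, Xnew)   # Xnew: new input data, assumed available
```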

@ablaom ablaom closed this Jan 30, 2022
@ablaom ablaom reopened this Jan 30, 2022
@codecov-commenter

codecov-commenter commented Jan 30, 2022

Codecov Report

❗ No coverage uploaded for pull request base (for-a-0-point-20-release@c131aa8).
The diff coverage is n/a.


@@                     Coverage Diff                     @@
##             for-a-0-point-20-release     #733   +/-   ##
===========================================================
  Coverage                            ?   85.41%           
===========================================================
  Files                               ?       36           
  Lines                               ?     3470           
  Branches                            ?        0           
===========================================================
  Hits                                ?     2964           
  Misses                              ?      506           
  Partials                            ?        0           


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@ablaom ablaom left a comment


Mammoth effort, thanks.

I'm sorry for not catching this earlier but, unless I am missing something, there is no point in including mach.cache in machines to be serialised. In particular, serializable can return a machine with cache undefined. Once a machine has been trained, the only time cache is ever used is when we update the machine (instead of training ab initio). Once the original training data is gone a machine "update" no longer makes sense, it seems to me, and any attempt to fit! the machine should start from scratch (and hence not need cache).

If you're not convinced, let's have another call to clarify.
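
To illustrate the retrain-versus-update distinction described above, a rough sketch (the hyper-parameter name is hypothetical, and the dispatch to update is an MLJ internal):

```julia
mach = machine(model, X, y)
fit!(mach)            # full training; mach.cache is populated for later warm restarts

model.lambda = 0.1    # hypothetical hyper-parameter tweak
fit!(mach)            # warm restart: update(...) is called and may consume mach.cache

# After serialization the training data is gone, so any further fit! would have
# to retrain from scratch and would therefore never need mach.cache.
```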

@ablaom
Member

ablaom commented Jan 31, 2022

What happens if the user attempts to:

  • call fit! on a machine returned by serializable?
  • call an operation, such as predict, on such an object?
  • call serializable on a machine that has a Node as an argument? (I wouldn't think it would ever make sense to do that.)

@ablaom
Member

ablaom commented Jan 31, 2022

And I'm thinking that we should restore state=0 when we restore! a machine, to ensure fit!(mach) performs training from scratch; see

if mach.state == 0 || # condition (i)

And in the code just referenced we could catch state==-1 and throw an error saying the machine needs restoring, yes?
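
A minimal sketch of the check being suggested here (the function name and placement are assumptions, not the PR's actual code):

```julia
function check_state_for_fit(mach)
    if mach.state == -1
        error("This machine was deserialised with `serializable`; " *
              "run `restore!(mach)` before using or retraining it.")
    elseif mach.state == 0
        # condition (i) in the existing fit! logic: train from scratch
        return :train_from_scratch
    end
    return :update
end
```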

@olivierlabayle
Collaborator Author

Mammoth effort, thanks.

I'm sorry for not catching this earlier but, unless I am missing something, there is no point in including mach.cache in machines to be serialised. In particular, serializable can return a machine with cache undefined. Once a machine has been trained, the only time cache is ever used is when we update the machine (instead of training ab initio). Once the original training data is gone a machine "update" no longer makes sense, it seems to me, and any attempt to fit! the machine should start from scratch (and hence not need cache).

If you're not convinced, let's have another call to clarify.

So, I have tried it, and it turns out the cache is important for Composite models at least (I also thought it would be important for TunedModels, but it turns out not to be the case). The main problem is network_model_names, which lives in the cache and is used for report and fitted_params. I agree, however, that it makes intuitive sense to delete the cache for serialization. It means we must move the network_model_names field, though (perhaps to CompositeFitresult), or construct it again if it is not in the cache. Waiting for your thoughts.
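
For concreteness, a hypothetical sketch of what moving the field could look like (the real CompositeFitresult lives in MLJBase and its fields may differ):

```julia
mutable struct CompositeFitresult
    signature                             # the learning-network signature, as before
    network_model_names::Vector{Symbol}   # moved here from the cache (assumption)
end
```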

Member

@ablaom ablaom left a comment


Thanks for these latest changes and continued patience.

@olivierlabayle
Collaborator Author

Thank you for those in-depth explanations! Here is the current state of the code:

  • network_model_names: I have moved it to CompositeFitresult as decided. I also took the liberty of removing the report_additions field from CompositeFitresult since it plays no role in any operation. I just call the report node when building the report of the Surrogate. The only problem is that the following test fails. It still throws an error, but I can't figure out why it's not the original one. It would be great if you could have a look; I can't even find the definition of the CompositeException which seems to be thrown.

  • cache: With the previous change I have now made the cache from a serializable machine undefined. So yes, the serializable procedure removes all data whatsoever from any field that could contain some, since all those fields are now undefined.

  • machine(filename): I removed the possibility of providing additional arguments, as I agree it makes very little sense at the present time.

  • state: Thanks for the detailed description, I now understand the purpose of this field much better! However, it makes me feel we are trying to get too much out of it. The original rationale behind setting its value to -1 was to know whether or not a machine has been built by serializable. From an internal point of view, why would we want to know that? Is it to provide an informative warning for operations called with data (otherwise they will fail fast anyway) that might otherwise fail with an uninformative message because the fitresult is in serializable form? It is the only corner case I can think of. If we want to tackle this, I think we should:

    • Reset the state back to 1 for all machines when restoring
    • Add a warning in operations when they are called with state=-1. In most cases it should still work fine, I believe, since the default save is just to return the original fitresult. By the way, if we set the XGBoost case aside, is there any current model that requires the non-default save method? (A rough sketch of these two points follows below.)

What do you think?
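
A rough sketch of the two bullet points above, with assumed internals:

```julia
restore(model, fitresult) = fitresult   # hypothetical model-specific hook undoing `save`; default is identity

function restore!(mach)
    mach.fitresult = restore(mach.model, mach.fitresult)
    mach.state = 1                      # behave like a freshly trained machine again
    return mach
end
```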

@ablaom
Member

ablaom commented Feb 2, 2022

I have moved it to CompositeFitresult as decided.

👍🏾

I also took the liberty to remove the report_additions field from the CompositeFitresult since it plays no role in any operation.

True, it does not play a role in any operation. But it needs to be available to update(model::Union{Composite,Surrogate}, ...) so it can be updated in a warm restart:

update!(fitresult) # updates report_additions
I suppose logically it would belong to cache, but then you need to change update! (the method you should put back) to see both cache and fitresult, as it needs the signature.

BTW, CompositeException is a Julia type for collecting multiple exceptions thrown by a Task. It's not defined by MLJ.

cache: With the previous change I have now made the cache from a serializable machine undefined. So yes the serializable procedure removes all data whatsoever from any field that could contain some since all those fields are now undefined.

👍🏾

machine(filename): I removed the possibility to provide additional arguments as I agree it makes very little sense at the present time.

Yeah. Doing machine(mach.model, X, y) is not hard and cognitively clearer.

  • Reset the state back to 1 for all machines when restoring
  • Add a warning in operations when it is called with state=-1. In most cases it should still work fine I believe since the default save is just to return the original fitresult. By the way, if we set the XGBoost case aside, is there any current model that requires

An excellent proposal. Warning could be something like "The operation $operation has been called on a deserialised machine mach whose learned parameters may be unusable. To be sure, first run restore!(mach)."??
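
In code, such a warning might be wired in roughly like this (a sketch only; the real operation definitions in MLJBase are more involved):

```julia
function warn_if_not_restored(operation::Symbol, mach)
    if mach.state == -1
        @warn "The operation $operation has been called on a deserialised machine " *
              "whose learned parameters may be unusable. To be sure, first run restore!(mach)."
    end
end

# e.g. at the top of predict(mach::Machine, Xnew):
#     warn_if_not_restored(:predict, mach)
```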

@ablaom
Member

ablaom commented Feb 2, 2022

By the way, if we set the XGBoost case aside, is there any current model that requires the non default save method?

I'm pretty sure not, but I will locally test the new API on all registered models when the PRs are ready.

@olivierlabayle
Collaborator Author

I suppose logically it would belong to cache but then you need to change update! (the method you should put back) to see both cache and fitresult, as it needs the signature.

I haven't put it into the cache either. I have actually removed this method too and updated the update function for composite models instead.

An excellent proposal. Warning could be something like "The operation $operation has been called on a deserialised machine mach whose learned parameters may be unusable. To be sure, first run restore!(mach)."??

Yes that sounds good!

@ablaom
Member

ablaom commented Feb 2, 2022

I haven't put it into the cache either. I actually have removed this method too and updated the update function for composite models instead.

Oh, right. I do see that. I'll have a closer look at this and the failing test then.

@olivierlabayle
Collaborator Author

Oh, right. I do see that. I'll have a closer look at this and the failing test then.

I finally put my finger on the problem. The reason the test did not pass is that the network_model_names function needs to be called before fit! in the return! function, as it is now. However, unless I am missing something, I believe the problem is the test itself, which raises anyway because y is not a table.

In any case, I have also added the restoration of state=1 and the warning we were talking about.

@ablaom
Member

ablaom commented Feb 3, 2022

I finally put my finger on the problem.

🚀 ❤️ I'm glad I was busy with something else yesterday 😄. This must have been a doozy.

I believe the problem is the test itself which raises anyway because y is not a table.

You are right! Great catch. This slipped by because the Standardizer in the local test suite only handles tables, whereas the most recent version in MLJModels handles vectors as well. I should have been more careful.

I'm working on a fix for this test now. As this is orthogonal to the current PR, I will make a PR on a separate branch and rebase for-a-0-point-20-release.

Given the complexity of the current PR, I should like to go over this one more time before merge. However, this is definitely in good shape and I will move on to review the supporting PRs for now.

Again many thanks for your perseverance.

@ablaom
Member

ablaom commented Feb 3, 2022

Okay, I have made that fix and rebased for-a-0-point-20-release. It looks like this created no merge conflicts for you, so no action on your part should be needed.
