
Serialization #733

Merged
merged 18 commits on Mar 25, 2022

Conversation

olivierlabayle
Collaborator

@olivierlabayle olivierlabayle commented Jan 28, 2022

Context: JuliaAI/MLJSerialization.jl#15

This PR contains:

  • Routines for the serialization of machines/composite machines (see the usage sketch at the end of this description)
  • A small refactoring of Base.replace to reuse some code for the save of Composite models

Unfortunately I don't think we have CI enabled since this does not target dev.

Happy to take your comments.
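
For context, here is a minimal usage sketch of the workflow these routines are meant to enable, assuming the serializable/restore! names discussed below and Julia's standard Serialization library (the exact API is whatever the PR defines):

```julia
using MLJBase, Serialization

# Train a machine as usual (model, X, y assumed to be defined elsewhere).
mach = machine(model, X, y)
fit!(mach)

# Strip training data and other heavyweight fields before writing to disk.
smach = serializable(mach)
serialize("mach.jls", smach)

# Later, possibly in a fresh session: reload and restore before using.
mach2 = deserialize("mach.jls")
restore!(mach2)
predict(mach2, Xnew)   # Xnew: new input data, assumed available
```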

@ablaom ablaom closed this Jan 30, 2022
@ablaom ablaom reopened this Jan 30, 2022
@codecov-commenter

codecov-commenter commented Jan 30, 2022

Codecov Report

❗ No coverage uploaded for pull request base (for-a-0-point-20-release@c131aa8).
The diff coverage is n/a.


@@                     Coverage Diff                     @@
##             for-a-0-point-20-release     #733   +/-   ##
===========================================================
  Coverage                            ?   85.41%           
===========================================================
  Files                               ?       36           
  Lines                               ?     3470           
  Branches                            ?        0           
===========================================================
  Hits                                ?     2964           
  Misses                              ?      506           
  Partials                            ?        0           


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@ablaom ablaom left a comment


Mammoth effort, thanks.

I'm sorry for not catching this earlier but, unless I am missing something, there is no point in including mach.cache in machines to be serialised. In particular, serializable can return a machine with cache undefined. Once a machine has been trained, the only time cache is ever used is when we update the machine (instead of training ab initio). Once the original training data is gone a machine "update" no longer makes sense, it seems to me, and any attempt to fit! the machine should start from scratch (and hence not need cache).

If you're not convinced, let's have another call to clarify.
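
To illustrate the retrain-versus-update distinction described above, a rough sketch (the hyper-parameter name is hypothetical, and the dispatch to update is an MLJ internal):

```julia
mach = machine(model, X, y)
fit!(mach)            # full training; mach.cache is populated for later warm restarts

model.lambda = 0.1    # hypothetical hyper-parameter tweak
fit!(mach)            # warm restart: update(...) is called and may consume mach.cache

# After serialization the training data is gone, so any further fit! would have
# to retrain from scratch and would therefore never need mach.cache.
```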

@ablaom
Member

ablaom commented Jan 31, 2022

What happens if the user attempts to:

  • call fit! on a machine returned by serializable?
  • call an operation, such as predict, on such an object?
  • call serializable on a machine that has a Node as an argument? (I wouldn't think it would ever make sense to do that.)

@ablaom
Member

ablaom commented Jan 31, 2022

And I'm thinking that we should restore state=0 when we restore! a machine, to ensure fit!(mach) performs training from scratch; see

if mach.state == 0 || # condition (i)

And in the code just referenced we could catch state==-1 and throw an error saying the machine needs restoring, yes?
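
A minimal sketch of the check being suggested here (the function name and placement are assumptions, not the PR's actual code):

```julia
function check_state_for_fit(mach)
    if mach.state == -1
        error("This machine was deserialised with `serializable`; " *
              "run `restore!(mach)` before using or retraining it.")
    elseif mach.state == 0
        # condition (i) in the existing fit! logic: train from scratch
        return :train_from_scratch
    end
    return :update
end
```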

@olivierlabayle
Collaborator Author

Mammoth effort, thanks.

I'm sorry for not catching this earlier but, unless I am missing something, there is no point in including mach.cache in machines to be serialised. In particular, serializable can return a machine with cache undefined. Once a machine has been trained, the only time cache is ever used is when we update the machine (instead of training ab initio). Once the original training data is gone a machine "update" no longer makes sense, it seems to me, and any attempt to fit! the machine should start from scratch (and hence not need cache).

If you're not convinced, let's have another call to clarify.

So, I have tried it, and it turns out the cache is important for Composite models at least (I also thought it would be important for TunedModels, but it turns out not to be the case). The main problem is network_model_names, which lives in the cache and is used for report and fitted_params. I agree, however, that it makes intuitive sense to delete the cache for serialization. It means we must move the network_model_names field, though (perhaps to CompositeFitresult), or construct it again if it is not in the cache. Waiting for your thoughts.
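
For concreteness, a hypothetical sketch of what moving the field could look like (the real CompositeFitresult lives in MLJBase and its fields may differ):

```julia
mutable struct CompositeFitresult
    signature                             # the learning-network signature, as before
    network_model_names::Vector{Symbol}   # moved here from the cache (assumption)
end
```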

Member

@ablaom ablaom left a comment


Thanks for these latest changes and continued patience.

@olivierlabayle
Collaborator Author

Thank you for those in-depth explanations! Here is the current state of the code:

  • network_model_names: I have moved it to CompositeFitresult as decided. I also took the liberty of removing the report_additions field from CompositeFitresult since it plays no role in any operation. I just call the report node when building the report of the Surrogate. The only problem is that the following test fails. It still throws an error, but I can't figure out why it's not the original one. It would be great if you could have a look; I can't even find the definition of the CompositeException which seems to be thrown.

  • cache: With the previous change I have now made the cache from a serializable machine undefined. So yes, the serializable procedure removes all data whatsoever from any field that could contain some, since all those fields are now undefined.

  • machine(filename): I removed the possibility of providing additional arguments, as I agree it makes very little sense at the present time.

  • state: Thanks for the detailed description, I now understand the purpose of this field much better! However, it makes me feel we are trying to get too much out of it. The original rationale behind setting its value to -1 was to know whether or not a machine has been built by serializable. From an internal point of view, why would we want to know that? Is it to provide an informative warning for operations called with data (otherwise they will fail fast anyway) that might otherwise fail with an uninformative message because the fitresult is in serializable form? It is the only corner case I can think of. If we want to tackle this, I think we should:

    • Reset the state back to 1 for all machines when restoring
    • Add a warning in operations when they are called with state=-1. In most cases it should still work fine, I believe, since the default save is just to return the original fitresult. By the way, if we set the XGBoost case aside, is there any current model that requires the non-default save method? (A rough sketch of these two points follows below.)

What do you think?
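
A rough sketch of the two bullet points above, with assumed internals:

```julia
restore(model, fitresult) = fitresult   # hypothetical model-specific hook undoing `save`; default is identity

function restore!(mach)
    mach.fitresult = restore(mach.model, mach.fitresult)
    mach.state = 1                      # behave like a freshly trained machine again
    return mach
end
```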

@ablaom
Member

ablaom commented Feb 2, 2022

I have moved it to CompositeFitresult as decided.

👍🏾

I also took the liberty to remove the report_additions field from the CompositeFitresult since it plays no role in any operation.

True, it does not play a role in any operation. But it needs to be available to update(model::Union{Composite,Surrogate}, ...) so it can be updated in a warm restart:

update!(fitresult) # updates report_additions
I suppose logically it would belong to cache, but then you need to change update! (the method you should put back) to see both cache and fitresult, as it needs the signature.

BTW, CompositeException is a Julia type for collecting multiple exceptions thrown by a Task. It's not defined by MLJ.

cache: With the previous change I have now made the cache from a serializable machine undefined. So yes the serializable procedure removes all data whatsoever from any field that could contain some since all those fields are now undefined.

👍🏾

machine(filename): I removed the possibility to provide additional arguments as I agree it makes very little sense at the present time.

Yeah. Doing machine(mach.model, X, y) is not hard and cognitively clearer.

  • Reset the state back to 1 for all machines when restoring
  • Add a warning in operations when it is called with state=-1. In most cases it should still work fine I believe since the default save is just to return the original fitresult. By the way, if we set the XGBoost case aside, is there any current model that requires

An excellent proposal. Warning could be something like "The operation $operation has been called on a deserialised machine mach whose learned parameters may be unusable. To be sure, first run restore!(mach)."??
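
In code, such a warning might be wired in roughly like this (a sketch only; the real operation definitions in MLJBase are more involved):

```julia
function warn_if_not_restored(operation::Symbol, mach)
    if mach.state == -1
        @warn "The operation $operation has been called on a deserialised machine " *
              "whose learned parameters may be unusable. To be sure, first run restore!(mach)."
    end
end

# e.g. at the top of predict(mach::Machine, Xnew):
#     warn_if_not_restored(:predict, mach)
```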

@ablaom
Member

ablaom commented Feb 2, 2022

By the way, if we set the XGBoost case aside, is there any current model that requires the non default save method?

I'm pretty sure not, but I will locally test the new API on all registered models when the PRs are ready.

@olivierlabayle
Collaborator Author

I suppose logically it would belong to cache but then you need to change update! (the method you should put back) to see both cache and fitresult, as it needs the signature.

I haven't put it into the cache either. I have actually removed this method too and updated the update function for composite models instead.

An excellent proposal. Warning could be something like "The operation $operation has been called on a deserialised machine mach whose learned parameters may be unusable. To be sure, first run restore!(mach)."??

Yes that sounds good!

@ablaom
Member

ablaom commented Feb 2, 2022

I haven't put it into the cache either. I actually have removed this method too and updated the update function for composite models instead.

Oh, right. I do see that. I'll have a closer look at this and the failing test then.

@olivierlabayle
Collaborator Author

Oh, right. I do see that. I'll have a closer look at this and the failing test then.

I finally put my finger on the problem. The reason the test did not pass is that the network_model_names function needs to be called before fit! in the return! function, as it is now. However, unless I am missing something, I believe the problem is the test itself, which raises anyway because y is not a table.

In any case, I have also added the restoration of state=1 and the warning we were talking about.

@ablaom
Member

ablaom commented Feb 3, 2022

I finally put my finger on the problem.

🚀 ❤️ I'm glad I was busy with something else yesterday 😄. This must have been a doozy.

I believe the problem is the test itself which raises anyway because y is not a table.

You are right! Great catch. This slipped by because the Standardizer in the local test suite only handles tables, whereas the most recent version in MLJModels handles vectors as well. I should have been more careful.

I'm working on a fix for this test now. As this is orthogonal to the current PR, I will make a PR on a separate branch and rebase for-a-0-point-20-release.

Given the complexity of the current PR, I should like to go over this one more time before merge. However, this is definitely in good shape and I will move on to review the supporting PRs for now.

Again many thanks for your perseverance.

@ablaom
Member

ablaom commented Feb 3, 2022

Okay, I have made that fix and rebased for-a-0-point-20-release. It looks like this created no merge conflicts for you, so no action on your part should be needed.
