
Integrating online and active learning models #60

Open
KnutJaegersberg opened this issue Feb 3, 2019 · 12 comments
Labels: design discussion, enhancement

Comments

@KnutJaegersberg

Integrating OnlineStats (its online learning algorithms) and giving it an easy-to-use hyperparameter-tuning context would make Julia even more useful for quick ML on really big data.

@ablaom
Member

ablaom commented Feb 3, 2019

Sorry, but is this a comment or feature request?

@fkiraly
Collaborator

fkiraly commented Feb 4, 2019

I believe it's both?

Generally, on-line learning is quite a relevant and important area. For a package to support on-line learning properly, it needs to support:
(i) sequential data streams (where data may be i.i.d.)
(ii) an on-line update, i.e., updating the model when new data comes in.

Parallelization and distributed computation are separate features that are nice on their own, but they are quite synergistic with on-line learning.

As far as I can see, OnlineStats supports both (i) sequential data streams and (ii) updating through its fit! method, as well as some simple parallelism, through its interface design, which is very nice.
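
For reference, the core OnlineStats usage pattern looks roughly like the following (quoting the interface from memory, so treat the details as approximate):

using OnlineStats

o = Mean()
fit!(o, randn(1000))   # stream in a first batch
fit!(o, randn(500))    # update with a second batch; no old data needed
value(o)               # current estimate

# simple parallelism: fit independent stats on separate chunks, then combine
o1, o2 = Mean(), Mean()
fit!(o1, randn(1000))
fit!(o2, randn(1000))
merge!(o1, o2)         # o1 now reflects both chunks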

I see two main blockers for interfacing:

  1. there's no explicit hyper-parameter interface

  2. MLJ has no explicit design for the on-line task, which is more complicated than the simple supervised task.

Point 1 is straightforward to solve, though obviously it's work (and maybe best done by the onlinestats folks?).

Regarding point 2, this is more subtle: for interface hygiene, I don't like the design decision of OnlineStats that fitting is always updating. I'd rather separate "fit" and "update", clearly distinguishing "first-time fitting" from "updating". This would, i.m.o., also make a lot of sense with Bayesian models, for the Bayesian update - Bayesian models are often automatically on-line (though not necessarily on sequential data streams, as in the stylized ML on-line setting).
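
To make the distinction concrete, here is a tiny, purely illustrative sketch (nothing to do with any existing package) of separate "first-time fit" and "update" entry points for a conjugate Bayesian model:

# Illustrative only: a Beta-Bernoulli model where first-time fitting and
# updating are distinct entry points, even though the arithmetic coincides.
struct BetaBernoulli
    α::Float64   # pseudo-count of successes
    β::Float64   # pseudo-count of failures
end

# first-time fit: start from a prior and absorb the first batch
fit(prior::BetaBernoulli, x::AbstractVector{Bool}) =
    BetaBernoulli(prior.α + count(x), prior.β + count(!, x))

# update: absorb a new batch into an existing posterior
update(posterior::BetaBernoulli, xnew::AbstractVector{Bool}) =
    BetaBernoulli(posterior.α + count(xnew), posterior.β + count(!, xnew))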

Any thoughts?

Though generally, I wouldn't see supporting the on-line modelling task as a priority above "getting MLJ core working", obviously.

ablaom added the enhancement label Feb 5, 2019
ablaom changed the title from "Integrate OnlineStats models" to "Integrating online and active learning models" Jun 4, 2019
ablaom added the design discussion label Sep 19, 2019
@jpsamaroo
Collaborator

As mentioned in #71, I'm interested in adding support for active learning to MLJ. My use case is training models on real-time data like microphone or camera input (and outputting the model's reaction to actuators/devices in real time).

@fkiraly can you give a concrete example of how you would split OnlineStats' fit method into two components? I'm not clear on how or why that's beneficial from your comment alone, since OnlineStats "models" usually do very little during their fit call.

@jpsamaroo
Collaborator

Bump. Can someone provide me an example of what they'd like the online learning API to look like, so that I can build out the needed code/interfaces to support this feature?

@ablaom
Member

ablaom commented Nov 4, 2019

Thanks @jpsamaroo for re-pinging this discussion and for the offer to
help.

For clarity, here's my understanding of basic online learning: A
supervised or unsupervised machine learning algorithm that has already
been trained on some data X is supplied with new data Xnew and is
retrained:

(i) as if the training data were X and Xnew combined, but without
the algorithm needing access to the previous training data X; and

(ii) in a time approximating the time required to train on Xnew alone.

In some cases the learned state based on "train with X and update
with Xnew" is not actually the same as the state based on "train
with X and Xnew together", but it is a useful approximation.

Not all machine learning algorithms directly support online learning.

Basic work-flow

Here's how I see the basic work-flow for training and updating an MLJ
learner. For concreteness, I will suppose the learner is unsupervised,
in this case a PCA model for dimension reduction.

X = MLJ.table(rand(1000, 17))

# initialize and train on first batch:
model = @load PCA
mach = machine(model, X)
fit!(mach)

# fit on second batch of data:
Xnew = MLJ.table(rand(10, 17))
inject!(mach, Xnew)
fit!(mach)

When new data is injected into a machine, the machine updates an
internal count of the number of injections. When this count is one or
more, the next call to fit! calls update_data(model, ...) instead of
fit(model, ...) or update(model, ...) (the latter being for updates
triggered by hyperparameter changes, such as increasing an iteration
count).
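
To make the intended dispatch concrete, here is a rough, self-contained
sketch using toy types (none of these names are actual MLJ internals;
the real logic would live in MLJBase):

# Toy sketch of the proposed fit!/inject! dispatch (illustrative names only):

# a trivial "model" that tracks running column sums
struct OnlineMean end

fit(::OnlineMean, X) = (total = vec(sum(X, dims=1)), n = size(X, 1))
update_data(::OnlineMean, fitresult, Xnew) =
    (total = fitresult.total .+ vec(sum(Xnew, dims=1)), n = fitresult.n + size(Xnew, 1))

mutable struct ToyMachine
    model
    data
    n_injections::Int
    fitresult
end
ToyMachine(model, data) = ToyMachine(model, data, 0, nothing)

function inject!(mach::ToyMachine, Xnew)
    mach.data = Xnew              # only the new batch is retained
    mach.n_injections += 1
    return mach
end

function fit!(mach::ToyMachine)
    if mach.fitresult === nothing
        mach.fitresult = fit(mach.model, mach.data)           # first-time fit
    elseif mach.n_injections > 0
        mach.fitresult = update_data(mach.model, mach.fitresult, mach.data)
        mach.n_injections = 0
    end  # (dispatch to `update` on hyperparameter changes is omitted here)
    return mach
end

# usage, mirroring the work-flow above:
mach = ToyMachine(OnlineMean(), rand(1000, 17))
fit!(mach)
inject!(mach, rand(10, 17))
fit!(mach)                        # calls update_data, not fit
mach.fitresult.n                  # == 1010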

Composing online learners

If a learner does not support online learning, then I suggest the
effect of the update be "leave machine unchanged" and "issue warning,
if this is the first update". In that way, if a learning network contains both
online and non-online models, then the overall "online" learning
network continues to have utility, and can be exported (blueprinted)
to generate a new online model type.

An alternative is that updating a non-online learner with new data,
and fitting, actually retrains the learner from scratch on just the
new data. This is more complicated to deal with because, in the common
use case (train on the first batch of data and then leave it alone), we
would need extra interface points for freezing non-online components
once trained. The advantage would be that we could, in principle, also
unfreeze these components later to "re-calibrate" the non-online
elements on new data. Is there a substantial use case for this?

We will need syntax for the learning networks. It would look like
this:

Xs = source(X)

mach = machine(model, Xs)
Xout = transform(mach, Xs)

# fit on first batch of data:
fit!(Xout) 

# add data and update:
inject!(Xs, Xnew)
fit!(Xout)

Implementation

In brief, implementing the above just requires:

  • Add a method stub for online_update(model::Model, ...) to
    MLJBase. This method supplements the existing fit and update
    methods for models.

  • Add a new model trait supports_online to MLJBase

  • Give Machine and NodalMachine an n_injections field

  • Add logic to fit! to determine when to call online_update

  • Add new inject! methods

The more difficult design decisions revolve around deployment, tuning
and control. Unlike control of, say, a neural network ("train until
the error stops decreasing" or whatever), control of an online learner
in deployment is driven by events outside of MLJ. What's the best way
to do this in Julia?

That said, the framework should be similar to that suggested in
"Model wrapper for controlling iterative models", or a single wrapper
could be used for both, as @fkiraly has suggested.

The pragmatic way to move forward, which I would advocate given
current resources, would be to implement the basics outlined above and
test on some examples, fleshing out the other design issues later.

Thoughts anyone?

In terms of implementing the basics, I expect it is best that I take
this up. However, help with implementing online/iterative method
control would be greatly appreciated. In addition to the design
outlined in the issue, I have more detailed sketches for the iterative
control wrapper that I can share.

@Oblynx

Oblynx commented Jan 9, 2020

I'm developing an online unsupervised learning model for time series, which can do prediction / anomaly detection when coupled with a supervised model. As I'm looking for a standardized interface, I'm thinking of experimenting with MLJ. This could be a use case coupling this issue with #303 and #51.
I mention it just as food for thought at the moment.

@ablaom
Member

ablaom commented Jan 9, 2020

Thanks for that. It might be a challenge to introduce time series and online learning to MLJ simultaneously, but all help and input are welcome.

On the time series front, see also #303 (continuing time-series related discussion there) and JuliaAI/ScientificTypes.jl#14 .

@cscherrer

To generalize this a bit from a discussion with @ablaom on Slack, it seems like there are at least four different cases to consider:

  1. Change the model itself, for example warm restart after changing a hyperparameter
  2. Update model fit, with no change to the data
  3. Update model fit based on a change to the observations
  4. Update model fit based on a change to the features

For (4), lots of statistical models can be fit in terms of sufficient statistics. If we add or remove features, there are often ways to efficiently update those sufficient statistics without starting from scratch.

For example, say we have a linear model with squared loss (and maybe some arbitrary regularization). This can be fit using a Cholesky decomposition of X' * X. If we add a feature, we may have a way to update the Cholesky factor rather than recomputing the decomposition.
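
A rough sketch of that feature-addition update (the helper name is made up; this just illustrates the block-Cholesky idea):

using LinearAlgebra

# Extend the lower-triangular Cholesky factor L of X'X when a new feature
# column xnew is appended to X, without refactorizing from scratch.
function extend_cholesky(L::LowerTriangular, X::AbstractMatrix, xnew::AbstractVector)
    b = X' * xnew                  # cross-products with the existing features
    c = dot(xnew, xnew)            # new diagonal entry of the Gram matrix
    w = L \ b                      # forward substitution: solve L w = b
    d = sqrt(c - dot(w, w))        # new pivot; assumes [X xnew] has full column rank
    return LowerTriangular([Matrix(L) zeros(size(L, 1)); w' d])
end

X = rand(100, 3)
L = cholesky(Symmetric(X' * X)).L
xnew = rand(100)
Lnew = extend_cholesky(L, X, xnew)
# check: Lnew * Lnew' ≈ [X xnew]' * [X xnew]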

In addition, in this situation we'd want to be able to use a previous model fit as a starting point, maybe just starting the weight for the new feature at zero.

@ExpandingMan
Contributor

I've recently come up with a workaround for this feature in which I update an xgboost model by defining

MLJBase.fit!(m::Machine, X, y)

and I've spent a bit of time considering whether this can be generalized.

For the cases that @cscherrer laid out above, I think 1, 2, and 3 should be relatively easy (for models where they are possible at all), while 4 is likely to be very hard.

I'll summarize some of the thoughts I've had about a fit!(m, X, y) pattern:

  • We'd have to drop any guarantee that all training data is stored in the machine. You could append it in principle, but in practice if somebody is doing online learning it's likely because they couldn't fit the entire dataset in memory. The most you could do is record how many data points have been used in training.
  • In principle any hyperparameters governing the updates could be put into the model objects, but this might not always be great for the underlying model's interface, particularly if that interface expects arguments with each training batch.
  • Every Node in a network would have to decide what to do on repeated calls to fit!. I think by default new calls to fit! would need to be a no-op, with the model interface providing methods to opt in. I don't think there's anything too tricky here, since presumably on repeated calls each Node would get data in exactly the same format it got during initial training.

Something like this seems like it would be easier than @ablaom's inject! above, since we wouldn't have to worry about what the machine does with the injected data (i.e., it would have to store it between calls to inject! and fit!).

Thoughts?

@ablaom
Member

ablaom commented Sep 13, 2022

The syntax fit!(mach, X, y) sounds like a good suggestion - we probably don't need to separately attach new data to the machine and then train. However, I can't see how it is possible to implement incremental learning purely at the machine level. Don't we need a method in the model API that tells us how to add data (without discarding learned parameters)? After all, not all models can do this. (Perhaps there is some confusion about MLJModelInterface.update. This is not a method to add data, only to respond to changes in hyper-parameters (eg, iteration parameter) that needn't trigger a cold restart.)

@ExpandingMan
Contributor

Don't we need a method in the model API that tells us how to add data (without discarding learned parameters)?

That's why I think the Machine interface makes this a lot more complicated than it is for most of the models themselves. Most models already implement something like fit!(model, X, y)... it seems a pretty safe bet that in the vast majority of cases you will just have something like

fit!(mach::Machine, X, y) = fit!(mach.fitresult, X, y)

I'm not entirely sure what you mean, but I think your concern is that the existing definition of Machine is basically model plus data. Adding the ability to do fit!(mach, X, y) means the machine is just a wrapper of the model, not necessarily the data. Of course models would have to define some kind of fit!(model, X, y) method for this to work; I was not implying that it would not be a new method.

I don't really see any way around this: it's not realistic to always require that all the data is kept. If you have an entire network, you can have fit!(mach, X, y) recursively call the same thing on all the nodes, with the ones that don't implement it defaulting to a no-op (though I haven't fully thought this through; it might be dangerous if some models should update but don't).

So, TL;DR, my suggestion was that models would be required to implement something like fit!(model, X, y) to get online updates, and that this is the method that would update model parameters without completely resetting. This would have the virtue of being very easy to implement for most models that can support it.
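
To illustrate (purely hypothetical, not an existing MLJ or model-package API), a model-level fit! that folds each new batch into running sufficient statistics might look like:

using LinearAlgebra

# A toy ridge-regression "model" whose fit!(model, X, y) updates running
# sufficient statistics, so repeated calls never reset learned state and
# never need the earlier batches.
mutable struct StreamingRidge
    lambda::Float64                          # ridge penalty (a hyperparameter)
    XtX::Union{Nothing, Matrix{Float64}}     # running X'X
    Xty::Union{Nothing, Vector{Float64}}     # running X'y
end
StreamingRidge(lambda = 1.0) = StreamingRidge(lambda, nothing, nothing)

function fit!(model::StreamingRidge, X::AbstractMatrix, y::AbstractVector)
    if model.XtX === nothing                 # first call: initialize the statistics
        p = size(X, 2)
        model.XtX = zeros(p, p)
        model.Xty = zeros(p)
    end
    model.XtX .+= X' * X                     # incremental update, no old data needed
    model.Xty .+= X' * y
    return model
end

# current coefficients, recoverable at any point between batches
coefs(model::StreamingRidge) = (model.XtX + model.lambda * I) \ model.Xty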

@ablaom
Member

ablaom commented Sep 14, 2022

So TL;DR my suggestion was that models would be required to implement something like fit!(model, X, y)

Yeah, we already have the stub (see the above comment):

MLJModelInterface.online_update(model::Model, fitresult, verbosity, new_data...) -> (fitresult, state, report)

We just don't have any models that implement it. (And I don't like the name anymore - I'm using ingest! in a planned revamp of the interface, and allowing it to optionally mutate fitresult.)

We could additionally:

  • Add an n_ingestions field to machines, to count the number of new data injections (the user needs to know if learned parameters are based on more than the data currently attached to the machine)
  • Extend the signature of fit! to fit!(mach::Machine, newdata...)
  • fit!(mach, newdata...) replaces the data attached to mach with newdata, increments n_ingestions, and dispatches the model training method ingest (instead of fit or update) whenever newdata is non-empty.

How's that sound?

One question is whether this could play nicely with model composition. That might be quite tricky, and I will have to think about it some more.
