[design discussion] Handling non-iid data pt1: time series #303

Open
tlienart opened this issue Oct 30, 2019 · 7 comments
Labels
design discussion Discussing design issues time series

Comments

@tlienart
Collaborator

tlienart commented Oct 30, 2019

As was discussed on Slack, there may be design decisions to make so that MLJ can support non-iid tabular data (time series or other).
To get the ball rolling, here are some thoughts on time series and what would need to be done to support such data effectively in MLJ. Please add comments in line with this (let's discuss use cases other than time series in another issue).

Time series

We'd need:

  • interfaces to specific models that are adapted to TS (say ARIMA or similar)
  • adapted tuning/resampling strategies (e.g. Holdout could be done differently to take chronology into account)

fit-predict-evaluate

On temporal data, evaluation on a test set is less meaningful (it doesn't offer meaningful guarantees), but it may still give an idea of how a model performs. A workflow that could be expected is something like:

  • split the data chronologically into the first 80% (train) and the last 20% (test)
  • re-split the train part again, say into its first 90% and last 10%
  • train a bunch of models on the 90%, evaluate on the last 10%, pick the best or aggregate
  • report how things work on the held-out set

As far as I'm aware this requires little work to get going (assuming a static dataset): we just need a "temporal-holdout" that respects ordering.
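The workflow above can be illustrated with a minimal sketch. It is written in Python rather than Julia and is not MLJ API: the function name `temporal_holdout` and its `fraction_train` parameter are hypothetical, and the rows are assumed to be already sorted by time.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_holdout(rows: Sequence[T], fraction_train: float) -> Tuple[Sequence[T], Sequence[T]]:
    """Split rows (assumed sorted by time) into a leading and a trailing part.

    Unlike a shuffled holdout, the order is preserved, so every training row
    precedes every test row.
    """
    if not 0.0 < fraction_train < 1.0:
        raise ValueError("fraction_train must be strictly between 0 and 1")
    cut = round(len(rows) * fraction_train)
    return rows[:cut], rows[cut:]


# Toy series: the values double as time stamps.
series = list(range(100))

# First 80% for training, last 20% held out.
train, test = temporal_holdout(series, 0.8)

# Re-split the training part: first 90% to fit, last 10% to validate.
fit_part, validation = temporal_holdout(train, 0.9)
```

The same function is applied twice, which mirrors the nested 80/20 then 90/10 split described in the list.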

There is the question, though, that predict would be semantically different here (there is no input data per se); maybe we could introduce a forecast operation instead.
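To make the semantic difference concrete, here is a rough sketch of what a forecast-style interface could look like, again in Python and not MLJ API. The classes `NaiveForecaster` and `MeanForecaster`, the `forecast(horizon)` method and the helper `mean_absolute_error` are all hypothetical names chosen for illustration. The point is that `forecast` takes only a number of future steps, not a feature matrix.

```python
from typing import List, Sequence


class NaiveForecaster:
    """Hypothetical model: repeats the last observed value."""

    def fit(self, y: Sequence[float]) -> "NaiveForecaster":
        if len(y) == 0:
            raise ValueError("cannot fit on an empty series")
        self.value = float(y[-1])
        return self

    def forecast(self, horizon: int) -> List[float]:
        # No input features: only the number of future steps is needed.
        return [self.value] * horizon


class MeanForecaster:
    """Hypothetical model: repeats the mean of the observed values."""

    def fit(self, y: Sequence[float]) -> "MeanForecaster":
        if len(y) == 0:
            raise ValueError("cannot fit on an empty series")
        self.value = sum(y) / len(y)
        return self

    def forecast(self, horizon: int) -> List[float]:
        return [self.value] * horizon


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Fit on the leading part, score on the trailing part, keep the best model.
y = [float(v) for v in range(1, 11)]
fit_part, validation = y[:8], y[8:]

scores = {}
for name, model in [("naive", NaiveForecaster()), ("mean", MeanForecaster())]:
    predictions = model.fit(fit_part).forecast(len(validation))
    scores[name] = mean_absolute_error(predictions, validation)

best = min(scores, key=scores.get)
```

On this toy upward trend the naive forecaster scores better on the validation slice, which is the "pick best" step of the workflow.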

classification / transformation

A separate task would be to identify similarity between time series, e.g. to cluster time series or classify them as a whole. As far as I know, this does not require anything specific other than appropriate packages that allow a TS to be represented in a numerical space (e.g. this could be RNN-based).

other tasks

There are probably other tasks than forecasting / clustering / classification with time series. One that I can think of is training something able to detect change points. This is probably an unsupervised task that would learn from training data how to pick change points, with a sensitivity / penalty on how many change points it finds; the result could then be used on new data.
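As an illustration of the penalty idea only, here is a small Python sketch of one classical approach, binary segmentation with a squared-error cost. This is a swapped-in technique for illustration, not something proposed in this thread, and the penalty is a fixed input here rather than learned from training data. The names `sse` and `change_points` are hypothetical.

```python
from typing import List, Sequence


def sse(values: Sequence[float]) -> float:
    """Cost of modelling a segment by its mean: sum of squared deviations."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def change_points(values: Sequence[float], penalty: float, offset: int = 0) -> List[int]:
    """Binary segmentation.

    Find the split that reduces the cost the most. Keep it only if the
    reduction exceeds the penalty, then recurse on both halves. A larger
    penalty therefore yields fewer change points.
    """
    total = sse(values)
    best_gain, best_split = 0.0, None
    for split in range(1, len(values)):
        gain = total - sse(values[:split]) - sse(values[split:])
        if gain > best_gain:
            best_gain, best_split = gain, split
    if best_split is None or best_gain <= penalty:
        return []
    left = change_points(values[:best_split], penalty, offset)
    right = change_points(values[best_split:], penalty, offset + best_split)
    return left + [offset + best_split] + right


# Piecewise-constant toy signal with level shifts at indices 5 and 10.
signal = [0.0] * 5 + [10.0] * 5 + [0.0] * 5

sensitive = change_points(signal, penalty=1.0)
conservative = change_points(signal, penalty=1000.0)
```

With the small penalty both level shifts are reported; with the large one none are, which is the sensitivity trade-off mentioned above.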

Things to do // comments

  • Add a temporal holdout
  • Consider existing packages in the Julia ecosystem that handle temporal data (e.g. TimeSeries.jl) and try interfacing with simple things
  • Consider how predict would work (no input data per se, rather just a set of future times)

Comments

  • I'm not sure we'd need a specific scientific type, or maybe just one for DateTime. Assuming there's something like ARIMA.jl, a user would just feed fit the data in the form ARIMA expects, and ARIMA would internally treat the data as temporally ordered.
  • There should be a choice as to how the column representing time is passed. One way would be to have an MLJBase function (like, say, MLJBase.time_matrix) that tries to detect a column with a date type in the feature matrix and uses it as a guide.
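The column-detection idea in the last bullet could look roughly like the following Python sketch. It is not MLJBase API: `MLJBase.time_matrix` is only a name floated above, and `find_time_column` and `sort_by_time` are hypothetical helpers. The table is assumed to be a plain mapping from column names to equal-length lists.

```python
from datetime import datetime
from typing import Dict, List, Optional


def find_time_column(table: Dict[str, List[object]]) -> Optional[str]:
    """Return the name of the first column whose values are all datetimes."""
    for name, column in table.items():
        if len(column) > 0 and all(isinstance(v, datetime) for v in column):
            return name
    return None


def sort_by_time(table: Dict[str, List[object]]) -> Dict[str, List[object]]:
    """Reorder every column so that rows follow the detected time column."""
    time_column = find_time_column(table)
    if time_column is None:
        raise ValueError("no datetime column found; pass the time column explicitly")
    times = table[time_column]
    order = sorted(range(len(times)), key=lambda i: times[i])
    return {name: [column[i] for i in order] for name, column in table.items()}


table = {
    "value": [3.0, 1.0, 2.0],
    "when": [datetime(2019, 10, 30), datetime(2019, 10, 28), datetime(2019, 10, 29)],
}

detected = find_time_column(table)
ordered = sort_by_time(table)
```

Raising an error when no column is detected leaves room for the "choice" mentioned above: a caller could still name the time column explicitly instead of relying on detection.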
@juliohm
Contributor

juliohm commented Oct 30, 2019 via email

@tlienart
Collaborator Author

I don't really understand why you're saying that and it seems non-constructive.

There can be small modifications in the MLJ environment to make it possible to deal with time series (as highlighted above). There may be limitations as to how far this can go, fine (I was hoping people would mention them here). People who don't see this working for them may want to develop their own packages independently of MLJ (or use other existing packages like OnlineStats); that's fine, and I don't see why it precludes us from trying to make it easier to handle TS data if we can.

I think it should be clear that we're not "putting things inside MLJ"; rather, since MLJ is just a way to interface with things, people may choose to use it or not. So let's please focus on what's currently not there and could easily be added. If in the future there is scope for more modularity and package separation, fine; for now we're already spending a lot of energy trying to manage the multiple repos we have, and we won't start new repos unless we have a clear idea of the advantages.

To be honest it looks like you don't like how we're doing things atm, I understand this and we welcome criticism, however please understand that while you have made specific suggestions in the past which I believe have been addressed, it's not super useful to us to just get feedback like "you're doing this wrong".

So I'd suggest you either open a separate issue where you lay out a full design plan that improves on the current status quo and addresses the comments made on your past suggestions, or work with us on modifications like the ones suggested here.

@tlienart tlienart added the design discussion Discussing design issues label Oct 30, 2019
@juliohm
Contributor

juliohm commented Oct 30, 2019 via email

@mloning

mloning commented Oct 30, 2019

In case you're not aware of it, we've been working on a scikit-learn compatible Python package for machine learning with time series data, take a look at our repo here, with @fkiraly as one of the core developers. We also have a short paper out in which we describe different data formats and learning tasks that arise in a temporal/sequential data context. Hope this helps!

If you decide to implement time series functionality, we'd be more than happy to collaborate further, perhaps in form of another development sprint or so.

@tlienart
Collaborator Author

tlienart commented Oct 30, 2019

Thanks @mloning, I follow sktime and did intend to take a look; the pointer to the paper is very useful.

@juliohm I think we feel a similar frustration on our side. I apologise for this, as we do care about feedback and integrating comments. With respect to @load, it was addressed clearly: MLJBase is effectively to be seen primarily as a door to MLJ, and moving @load is not conducive to this (please consider that there is a registry in the mix and it's not trivial to decouple the two). I understand that you'd like this not to be the case (i.e. MLJBase effectively being a modern and maintained MLBase); it may be that one day we actually do this, but at the moment it seems to us to be a distraction from what we're trying to do well (i.e. serve "standard" ML use cases).
A criticism could be that we need to get the design right early on to avoid things biting us later, which is the reason for discussions such as this one. Deciding which part of the code goes where is not really what we'd like to focus on now, even though it may well be that in the medium term we end up with something that resembles what you had in mind all along.
In short, we want to consolidate MLJ first (considering MLJ+MLJBase more or less as a unit). Once users can actually do standard things and compose, which we said was the main goal of MLJ, we can potentially consider moving mature and fixed things to more abstract packages. At least that's my opinion.

@juliohm
Contributor

juliohm commented Oct 30, 2019

Thank you @tlienart. I disagree with this approach, as it goes against the usual design of package ecosystems in Julia. Having strong Base packages is much more important than the actual umbrella that puts functionality together. We saw this in many successful projects, including the DifferentialEquations.jl umbrella, the Makie.jl umbrella, and other umbrellas that just load sub-packages and reexport.

If the plan is to write code in the MLJ.jl umbrella, that is unfortunate. It will certainly limit our collaboration opportunities.

@mloning

mloning commented Jun 10, 2020

FYI we've started working on MLJTime.jl - a time series extension package for MLJ, together with @sjvollmer and @aa25desh
