[design discussion] Handling non-iid data pt1: time series #303
Comments
Thanks for getting the ball rolling. I believe that these efforts on time
series, spatial, etc. deserve separate packages and shouldn't be
implemented as if they were inside the MLJ.jl umbrella. What was discussed
on Slack was a common set of Base packages to interface general operations
like resampling, etc., that other projects developed by people who actually
research time series can build upon. Putting time series inside MLJ.jl is
not optimal.
On Wed, Oct 30, 2019, 08:48 Thibaut Lienart wrote:
As was discussed on Slack, there may be design decisions to take so that
MLJ can support non-iid tabular data (time series or other).
To get the ball started, here are some thoughts on time series and what
would need to be done to support such data effectively in MLJ.
Time series
We'd need:
- interface with specific models that are adapted to TS (say Arima or
whatever)
- offer adapted tuning/resampling strategies (e.g. Holdout could be
done differently to take into account notion of chronology)
fit-predict-evaluate
on temporal data, the notion of evaluation on a test set is less
meaningful (doesn't offer meaningful guarantees) but may still be a way to
get an idea for how a model performs, so a workflow that could be expected
is something like
- slice time in first 80% (train) - last 20% (test)
- re-slice train again say first 90%, last 10%
- train a bunch of models on the 90%, evaluate on last 10%, pick best
or aggregate,
- report how things work on the held-out set
As far as I'm aware this requires little work to get working (assuming a
static dataset); just have a "temporal-holdout" which respects ordering
classification / transformation
A separate task would be to identify similarity between time series; e.g.
to cluster time series or classify them as a whole; this does not require
anything specific as far as I know other than appropriate packages that
would allow the representation of a TS in a numerical space (e.g. could be
RNN-based)
other tasks
There are probably other tasks than forecasting / clustering /
classification with time series, one that I can think of is to train
something able to detect change points, probably an unsupervised task that
would learn from training data how to pick changepoints with a sensitivity
// penalty over how many change points it finds; then that could be used on
new data.
Things to do // comments
- Add a temporal holdout
- Consider existing packages in the julia ecosystem that do some
temporal stuff and try
*Comments*
- I'm not sure we'd need a specific scientific type; or maybe just one
for DateTime; but then assuming there's something like ARIMA.jl, a
user would just feed data to fit adapted to ARIMA and ARIMA would
internally consider data as temporally ordered.
- There should be a choice as to how the column representing time is
passed; one way would be to have a MLJBase function that does this (like,
say, MLJBase.time_matrix) and tries to detect a column that has a
datetype out of the feature matrix and use it as a guide
I don't really understand why you're saying that and it seems non-constructive.
There can be small modifications in the MLJ environment to make it possible to deal with Time Series (as highlighted above); there may be limitations as to how far this can go, fine (I was hoping for people mentioning this here). People who don't see this working for them may want to develop their own packages independently of MLJ (or use other existing packages like OnlineStats), that's fine, I don't see why that precludes us from trying to make it easier to handle TS data if we can.
I think it should be clear that we're not "putting things inside MLJ" rather in light of MLJ being just a way to do interface with things, people may choose to use it or not... So let's please focus on what's currently not there and could easily be added and if in some future there is scope for more modularity and package separation fine; for now we're already spending a lot of energy trying to manage the multiple repos we have and we won't just start new repos unless we have a clear idea of the advantages.
To be honest it looks like you don't like how we're doing things atm, I understand this and we welcome criticism, however please understand that while you have made specific suggestions in the past which I believe have been addressed, it's not super useful to us to just get feedback like "you're doing this wrong".
So I'd suggest you open a separate issue where you discuss a full design plan which would improve over the current status quo and addresses past comments that were made to your past suggestions or work with us to try to make modifications like the ones suggested here.
In previous issues I discussed how exporting the name @load from MLJBase
would be beneficial to me. For some reason this trivial change was rejected
without clear reasons. That is why I lost interest in spending too much
time writing long comments here. In the task design discussion I spent a
great amount of text and received good feedback from users (see the likes,
hearts in the comments). However the MLJ devs decided to leave this
discussion aside and continue with a limiting workflow that builds in a lot
of assumptions that don't serve my research: machines, tabular data,
etc.
I will try to be more constructive next time, but I confess that I don't
feel that my feedback is being incorporated in any way. The issues are still
open without any action to remedy the design problems I raised.
In case you're not aware of it, we've been working on a scikit-learn compatible Python package for machine learning with time series data; take a look at our repo here, with @fkiraly as one of the core developers. We also have a short paper out in which we describe different data formats and learning tasks that arise in a temporal/sequential data context. Hope this helps! If you decide to implement time series functionality, we'd be more than happy to collaborate further, perhaps in the form of another development sprint.
Thanks @mloning, I follow
@juliohm I think we feel a similar frustration on our side; I apologise for this, as we do care about feedback and integrating comments; with respect to
Thank you @tlienart, I disagree with this approach as it goes against the usual design of package ecosystems in Julia. Having strong Base packages is much more important than the actual umbrella that puts functionality together. We saw this in many successful projects including the DifferentialEquations.jl umbrella, the Makie.jl umbrella, and other umbrellas that just load subpackages and reexport. If the plan is to write code in the MLJ.jl umbrella, that is unfortunate. It will certainly limit our collaboration opportunities.
FYI we've started working on MLJTime.jl - a time series extension package for MLJ, together with @sjvollmer and @aa25desh |
As was discussed on Slack, there may be design decisions to take so that MLJ can support non-iid tabular data (time series or other).
To get the ball started, here are some thoughts on time series and what would need to be done to support such data effectively in MLJ. Please add comments in line with this (let's discuss other possible use cases different than time series in another issue)
Time series
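We'd need:
- interface with specific models that are adapted to TS (say Arima or whatever)
- offer adapted tuning/resampling strategies (e.g. Holdout could be done differently to take into account notion of chronology)
fit-predict-evaluate
on temporal data, the notion of evaluation on a test set is less meaningful (doesn't offer meaningful guarantees) but may still be a way to get an idea for how a model performs, so a workflow that could be expected is something like
- slice time in first 80% (train) - last 20% (test)
- re-slice train again say first 90%, last 10%
- train a bunch of models on the 90%, evaluate on last 10%, pick best or aggregate,
- report how things work on the held-out set
As far as I'm aware this requires little work to get working (assuming a static dataset); just have a "temporal-holdout" which respects ordering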
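To make the slicing concrete, here is a minimal sketch of what such a temporal holdout could look like. The name temporal_holdout is hypothetical and not an existing MLJ function; the sketch assumes the rows are already sorted chronologically and only splits row indices, leaving model fitting aside.

```julia
# Hypothetical helper, not part of MLJ: split row indices in chronological
# order, the first `fraction` of rows for training and the remainder for testing.
function temporal_holdout(rows::AbstractVector{<:Integer}, fraction::Real)
    0 < fraction < 1 || throw(ArgumentError("fraction must lie strictly between 0 and 1"))
    n_train = round(Int, fraction * length(rows))
    return rows[1:n_train], rows[n_train+1:end]
end

rows = collect(1:100)                                 # assumed sorted by time
train, test = temporal_holdout(rows, 0.8)             # first 80% / last 20%
fit_rows, valid_rows = temporal_holdout(train, 0.9)   # first 90% / last 10% of train
```

Candidate models would be fitted on fit_rows, compared on valid_rows, and the chosen one reported on test; no shuffling happens at any step, so every evaluation row comes after every row the model was trained on.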
There is the question though that the predict would be semantically different (no input data per se), maybe we could introduce a forecast instead
classification / transformation
A separate task would be to identify similarity between time series; e.g. to cluster time series or classify them as a whole; this does not require anything specific as far as I know other than appropriate packages that would allow the representation of a TS in a numerical space (e.g. could be RNN-based)
other tasks
There are probably other tasks than forecasting / clustering / classification with time series, one that I can think of is to train something able to detect change points, probably an unsupervised task that would learn from training data how to pick changepoints with a sensitivity // penalty over how many change points it finds; then that could be used on new data.
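As a rough illustration of the sensitivity/penalty idea, the sketch below looks for a single shift in the mean of a series and only reports it when the split lowers the squared error by more than a chosen penalty. This is a swapped-in, very simple least-squares criterion, not a method proposed in this thread; the names segment_cost and detect_changepoint are made up for the example, and nothing here is learned from training data.

```julia
# Sum of squared deviations from the mean over x[lo:hi].
function segment_cost(x::AbstractVector{<:Real}, lo::Int, hi::Int)
    seg = view(x, lo:hi)
    m = sum(seg) / length(seg)
    return sum(v -> (v - m)^2, seg)
end

# Hypothetical single change-point detector: returns the index of the last
# point before the shift, or `nothing` when no split lowers the cost by more
# than `penalty`. Assumes a 1-based vector.
function detect_changepoint(x::AbstractVector{<:Real}; penalty::Real)
    n = length(x)
    n >= 2 || return nothing
    total = segment_cost(x, 1, n)
    best_k, best_cost = 0, Inf
    for k in 1:n-1
        c = segment_cost(x, 1, k) + segment_cost(x, k + 1, n)
        if c < best_cost
            best_k, best_cost = k, c
        end
    end
    return total - best_cost > penalty ? best_k : nothing
end

x = [0.1, -0.2, 0.0, 0.1, 5.0, 5.2, 4.9, 5.1]
low_penalty = detect_changepoint(x; penalty = 1.0)       # split after the 4th point
high_penalty = detect_changepoint(x; penalty = 1000.0)   # penalty too high: nothing
```

A larger penalty makes the detector less sensitive, which is the trade-off mentioned above; handling several change points would need something like repeated splitting, which is left out here.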
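Things to do // comments
- Add a temporal holdout
- Consider existing packages in the julia ecosystem that do some temporal stuff and try
- … predict would happen (no input data per se, rather just a set of future times)
Comments
- I'm not sure we'd need a specific scientific type; or maybe just one for DateTime; but then assuming there's something like ARIMA.jl, a user would just feed data to fit adapted to ARIMA and ARIMA would internally consider data as temporally ordered.
- There should be a choice as to how the column representing time is passed; one way would be to have a MLJBase function that does this (like, say, MLJBase.time_matrix) and tries to detect a column that has a datetype out of the feature matrix and use it as a guide;
- … TimeArray type and possibly generalise it to something like a TimeTable type // see also integration TimeArray <> Tables.jl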
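The idea of detecting a date-typed column and using it as a guide could be sketched roughly as follows. MLJBase.time_matrix is only a name floated in this issue, so the helpers find_time_column and sort_by_time below are hypothetical stand-ins; the table is assumed to be a plain NamedTuple of equal-length column vectors, and only Julia's Dates standard library is used.

```julia
using Dates

# Hypothetical helper: return the name of the first column whose element type
# is a Date, DateTime or Time, or `nothing` if there is none.
function find_time_column(table::NamedTuple)
    for name in keys(table)
        if eltype(table[name]) <: Dates.TimeType
            return name
        end
    end
    return nothing
end

# Hypothetical helper: reorder every column so rows follow the time column.
function sort_by_time(table::NamedTuple)
    name = find_time_column(table)
    name === nothing && throw(ArgumentError("no Date/DateTime column found"))
    order = sortperm(table[name])
    return map(col -> col[order], table)
end

table = (y = [3.0, 1.0, 2.0],
         t = [Date(2019, 10, 30), Date(2019, 10, 28), Date(2019, 10, 29)])
sorted = sort_by_time(table)   # rows now in chronological order
```

Automatic detection is convenient but ambiguous when a table has several date columns, which is one reason to also let the user name the time column explicitly.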