Hey guys,
The title is a bit clickbait, but it will do the trick.
I want to propose several directions in which ETNA could gradually move.
Make ETNA easier to deploy
ETNA is really hard to deploy. Ideally, users would want to use some existing solution to deploy a time series pipeline (BentoML/Seldon/etc. and/or Airflow/Kubeflow/Argo). This doesn't mean that ETNA has to support all of these out of the box, but it should be possible to make it work in such environments.
Allow for more flexible Pipelines
ETNA is bound to a specific pipeline logic.
However, one small change would allow more flexible deployments: stop expecting the Pipeline to have the model at the end. That would let us split a Pipeline into two (or more) pipelines and run them in parallel or in different environments (for example, do data preparation on a CPU machine while training an NN model on a GPU machine).
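Roughly what I have in mind, as a minimal sketch (the function names and DataFrame layout here are made up for illustration; this is not the current ETNA API):

```python
import pandas as pd

# Stage 1 (CPU machine): transforms only, no model attached.
def prepare_features(raw: pd.DataFrame) -> pd.DataFrame:
    features = raw.copy()
    features["lag_1"] = features.groupby("segment")["target"].shift(1)
    return features.dropna()

# Stage 2 (GPU machine): fit/predict on the already prepared features.
def train_and_forecast(features: pd.DataFrame) -> pd.DataFrame:
    forecast = features[["timestamp", "segment"]].copy()
    forecast["prediction"] = features["lag_1"]  # placeholder for an actual NN model
    return forecast

# The two stages can run as separate jobs (e.g. separate Airflow/Argo tasks),
# passing the intermediate table between them however the orchestrator likes.
```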
An additional benefit is that we can split our data into chunks, set up several copies of the model (or of any other pipeline, strictly speaking) and process them in parallel.
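Something along these lines, using nothing but the standard library (the chunking key and the per-chunk function are placeholders):

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def run_pipeline_on_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # stand-in for a full transform/model pipeline applied to one chunk
    return chunk.assign(rolling_mean=chunk["target"].rolling(3).mean())

def process_in_parallel(df: pd.DataFrame, n_workers: int = 4) -> pd.DataFrame:
    chunks = [group for _, group in df.groupby("segment")]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(run_pipeline_on_chunk, chunks))
    return pd.concat(results).sort_index()
```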
(Optional) Going one step further: the model could be treated as a transformation step, so users could put it in the middle of a pipeline.
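Purely as an illustration of the idea (this is not an existing ETNA interface): a model can expose the same fit/transform methods as a transform, with its predictions becoming just another column that later steps may use.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

class ModelAsTransform:
    """Hypothetical wrapper that makes a model look like a transform."""

    def __init__(self, feature_cols):
        self.feature_cols = feature_cols
        self.model = LinearRegression()

    def fit(self, df: pd.DataFrame) -> "ModelAsTransform":
        self.model.fit(df[self.feature_cols], df["target"])
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        out["model_prediction"] = self.model.predict(df[self.feature_cols])
        return out
```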
(Optional 2) We could still save it as one Pipeline, but a `task_name` argument would have to be added to each transform.
Instead of CSV use Parquet files
It's just faster to load and uses less storage :)
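With pandas the switch is basically one call (assuming pyarrow or fastparquet is installed; file names are illustrative):

```python
import pandas as pd

df = pd.read_csv("dataset.csv", parse_dates=["timestamp"])

df.to_parquet("dataset.parquet")         # keeps dtypes, compresses well
df = pd.read_parquet("dataset.parquet")  # faster to load, no parse_dates dance
```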
Add different backends
After point 1 we unlock the possibility to set up several identical pipelines to process requests in parallel; after point 2, the possibility to parallelize steps within the Pipeline itself. What else are we missing? Lazy loading and out-of-core computation. We don't always need them, but when we do, it is a pain. Different backends could be added to ETNA. [Dask](https://www.dask.org/) seems like the most promising and easiest-to-integrate solution, so I would propose trying to create a Dask backend in addition to the Pandas backend ETNA already has. This would require changes to TSDataset, Models, Transforms, and Pipeline (?).
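Very roughly, the kind of usage a Dask backend would enable (column and file names are illustrative):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("dataset.parquet")         # lazy, nothing is loaded yet
means = ddf.groupby("segment")["target"].mean()  # still lazy
result = means.compute()                         # materializes (as pandas) only here
```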
To make the change gradual, I guess it would be okay to fall back to Pandas if the Dask backend is not available.
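For example, a simple pattern of that kind (sketch only, not a concrete API proposal):

```python
try:
    import dask.dataframe as dd

    def read_table(path):
        return dd.read_parquet(path)   # lazy, out-of-core capable
except ImportError:
    import pandas as pd

    def read_table(path):
        return pd.read_parquet(path)   # eager, in-memory fallback
```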
And about MultiIndex: it is just slow. Storing data in wide format makes sense, but storing it with a MultiIndex adds extra complexity. Plus, not all backends support a MultiIndex (basically only Pandas does).
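Keeping the wide format but with flat column names is one way around this; for example, a (segment, feature) MultiIndex could be flattened into plain "segment__feature" strings that any backend understands (the separator is just an example):

```python
import pandas as pd

def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out.columns = [f"{segment}__{feature}" for segment, feature in df.columns]
    return out
```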
(Optional) Instead of Pandas, ETNA could use [Polars](https://www.pola.rs/) by default, since it is faster and uses less memory.
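For a taste of what that could look like (illustrative only):

```python
import polars as pl

df = pl.read_parquet("dataset.parquet")
df = df.with_columns(pl.col("target").shift(1).over("segment").alias("lag_1"))
```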