Hey guys,
The title is a bit clickbait, but it will do the trick.
I want to propose several directions in which ETNA could gradually move.
Make ETNA easier to deploy
ETNA is really hard to deploy. Ideally, users would want to use some existing solution to deploy a time series pipeline (BentoML/Seldon/etc. and/or Airflow/Kubeflow/Argo). This doesn't mean that ETNA has to support all of these out of the box, but it should be possible to make it work in such environments.
Allow for more flexible Pipelines
ETNA is bound to a specific pipeline logic.
However, one small change would allow more flexible deployments: stop expecting the Pipeline to have the model at the end. That would let us split a Pipeline into two (or more) pipelines and run them in parallel or in different environments (for example, do data preparation on a CPU machine while training an NN model on a GPU machine).
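Roughly what I have in mind, as a minimal sketch (the function names and DataFrame layout here are made up for illustration; this is not the current ETNA API):

```python
import pandas as pd

# Stage 1 (CPU machine): transforms only, no model attached.
def prepare_features(raw: pd.DataFrame) -> pd.DataFrame:
    features = raw.copy()
    features["lag_1"] = features.groupby("segment")["target"].shift(1)
    return features.dropna()

# Stage 2 (GPU machine): fit/predict on the already prepared features.
def train_and_forecast(features: pd.DataFrame) -> pd.DataFrame:
    forecast = features[["timestamp", "segment"]].copy()
    forecast["prediction"] = features["lag_1"]  # placeholder for an actual NN model
    return forecast

# The two stages can run as separate jobs (e.g. separate Airflow/Argo tasks),
# passing the intermediate table between them however the orchestrator likes.
```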
An additional benefit is that we can split our data into chunks, set up several copies of the model (or of any other pipeline, strictly speaking) and process them in parallel.
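Something along these lines, using nothing but the standard library (the chunking key and the per-chunk function are placeholders):

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def run_pipeline_on_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # stand-in for a full transform/model pipeline applied to one chunk
    return chunk.assign(rolling_mean=chunk["target"].rolling(3).mean())

def process_in_parallel(df: pd.DataFrame, n_workers: int = 4) -> pd.DataFrame:
    chunks = [group for _, group in df.groupby("segment")]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(run_pipeline_on_chunk, chunks))
    return pd.concat(results).sort_index()
```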
(Optional) Going one step further: the model could be treated as a transformation step, so users could put it in the middle of a pipeline.
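Purely as an illustration of the idea (this is not an existing ETNA interface): a model can expose the same fit/transform methods as a transform, with its predictions becoming just another column that later steps may use.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

class ModelAsTransform:
    """Hypothetical wrapper that makes a model look like a transform."""

    def __init__(self, feature_cols):
        self.feature_cols = feature_cols
        self.model = LinearRegression()

    def fit(self, df: pd.DataFrame) -> "ModelAsTransform":
        self.model.fit(df[self.feature_cols], df["target"])
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        out["model_prediction"] = self.model.predict(df[self.feature_cols])
        return out
```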
(Optional 2) We could still save it as one Pipeline, but a `task_name` argument would have to be added to each transform.
Instead of CSV use Parquet files
It's just faster to load and uses less storage :)
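With pandas the switch is basically one call (assuming pyarrow or fastparquet is installed; file names are illustrative):

```python
import pandas as pd

df = pd.read_csv("dataset.csv", parse_dates=["timestamp"])

df.to_parquet("dataset.parquet")         # keeps dtypes, compresses well
df = pd.read_parquet("dataset.parquet")  # faster to load, no parse_dates dance
```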
Add different backends
After point 1 we unlock the possibility to set up several identical pipelines to process requests in parallel; after point 2, the possibility to parallelize steps within the Pipeline itself. What else are we missing? Lazy loading and out-of-core computation. We don't always need them, but when we do, it is a pain. Different backends could be added to ETNA. [Dask](https://www.dask.org/) seems like the most promising and easiest-to-integrate solution, so I would propose trying to create a Dask backend in addition to the Pandas backend ETNA already has. This would require changes to TSDataset, Models, Transforms, and Pipeline (?).
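Very roughly, the kind of usage a Dask backend would enable (column and file names are illustrative):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("dataset.parquet")         # lazy, nothing is loaded yet
means = ddf.groupby("segment")["target"].mean()  # still lazy
result = means.compute()                         # materializes (as pandas) only here
```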
To make the change gradual, I guess it would be okay to fall back to Pandas if the Dask backend is not available.
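For example, a simple pattern of that kind (sketch only, not a concrete API proposal):

```python
try:
    import dask.dataframe as dd

    def read_table(path):
        return dd.read_parquet(path)   # lazy, out-of-core capable
except ImportError:
    import pandas as pd

    def read_table(path):
        return pd.read_parquet(path)   # eager, in-memory fallback
```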
And about MultiIndex: it is just slow. Storing data in wide format makes sense, but storing it with a MultiIndex adds extra complexity. Plus, not all backends support a MultiIndex (basically only Pandas does).
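Keeping the wide format but with flat column names is one way around this; for example, a (segment, feature) MultiIndex could be flattened into plain "segment__feature" strings that any backend understands (the separator is just an example):

```python
import pandas as pd

def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out.columns = [f"{segment}__{feature}" for segment, feature in df.columns]
    return out
```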
(Optional) Instead of Pandas, ETNA could use [Polars](https://www.pola.rs/) by default, since it is faster and uses less memory.
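For a taste of what that could look like (illustrative only):

```python
import polars as pl

df = pl.read_parquet("dataset.parquet")
df = df.with_columns(pl.col("target").shift(1).over("segment").alias("lag_1"))
```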