ImputeBench implements over 15 advanced imputation techniques for missing blocks in time series. It evaluates their precision and runtime on various real-world time series datasets using different recovery scenarios. Technical details can be found in our PVLDB 2020 paper: Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series. The benchmark can be easily extended with new algorithms (C/C++, Python, or Matlab), datasets, and scenarios.
- Original Imputation Algorithms: The original benchmark implements the following algorithms (in C++):
    - CDRec: Scalable Recovery of Missing Blocks in Time Series with High and Low Cross-Correlations, KAIS'20
    - DynaMMo: DynaMMo: mining and summarization of coevolving sequences with missing values, KDD'09
    - GROUSE: Global Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation, PMLR'16
    - ROSL: Robust Orthonormal Subspace Learning: Efficient Recovery of Corrupted Low-Rank Matrices, CVPR'14
    - SoftImpute: Spectral Regularization Algorithms for Learning Large Incomplete Matrices, JMLR'10
    - SPIRIT*: Streaming pattern discovery in multiple time-series, VLDB'05
    - STMVL: ST-MVL: Filling Missing Values in Geo-Sensory Time Series Data, IJCAI'16
    - SVDImpute: Missing value estimation methods for DNA microarrays, BIOINFORMATICS'01
    - SVT: A Singular Value Thresholding Algorithm for Matrix Completion, SIAM J. OPTIM'10
    - TeNMF: Nonnegative Matrix Factorization for Time Series Recovery From a Few Temporal Aggregates, PMLR'17
    - TRMF: Temporal Regularized Matrix Factorization for High-dimensional Time Series Prediction, NIPS'16
    - TKCM*: Continuous Imputation of Missing Values in Streams of Pattern-Determining Time Series, EDBT'17
- Additional Imputation Algorithms: We recently expanded the original benchmark with new algorithms (in their original implementation):
    - DeepMVI: Missing Value Imputation on Multidimensional Time Series, PVLDB'21
    - MPIN: Missing Value Imputation for Multi-attribute Sensor Data Streams via Message Propagation, PVLDB'24
    - IIM*: Learning Individual Models for Imputation, ICDE'19
    - PriSTI: PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation, ICDE'23
    - MRNN*: Estimating Missing Data in Temporal Data Streams Using Multi-Directional Recurrent Neural Networks, Trans. On Bio Eng.'19
    - BRITS: BRITS: Bidirectional Recurrent Imputation for Time Series, NeurIPS'18
    - SSA*: Model Agnostic Time Series Analysis via Matrix Estimation, Meas. Anal. Comput. Syst'18
- Algorithms under Integration:
    - DAMR: Dynamic Adjacency Matrix Representation Learning for Multivariate Time Series Imputation, SIGMOD'23
    - EDIT: Efficient and Effective Data Imputation with Influence Functions, PVLDB'23
    - HKMF-T: HKMF-T: Recover From Blackouts in Tagged Time Series With Hankel Matrix Factorization, TKDE'21
    - NAOMI: NAOMI: Non-Autoregressive Multiresolution Sequence Imputation, NeurIPS'19
    - E2EGAN: E²GAN: End-to-End Generative Adversarial Network for Multivariate Time Series Imputation, IJCAI'19
- Datasets: All the datasets used in this benchmark can be found here.
- Missingness Patterns: The full list of recovery scenarios can be found here.
- Notes: The algorithms marked with * cannot handle multiple incomplete time series. They produce results only for the following scenarios: `miss_perc`, `ts_length`, and `ts_nbr`.
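The restriction in the note can be expressed as a small lookup, for example to skip unsupported combinations before launching a run. This is a minimal Python sketch, not part of the benchmark: the function name `produces_results` is hypothetical, and the command-line names of the starred algorithms (`spirit`, `tkcm`, `iim`, `m-rnn`, `ssa`) are assumed to match the `-alg` values listed in the execution table.

```python
# Command-line names of the algorithms marked with * (SPIRIT, TKCM, IIM, MRNN, SSA).
# Assumption: these match the values accepted by -alg.
SINGLE_SERIES_ALGS = {"spirit", "tkcm", "iim", "m-rnn", "ssa"}

# Scenarios for which the starred algorithms produce results.
SINGLE_SERIES_SCENARIOS = {"miss_perc", "ts_length", "ts_nbr"}


def produces_results(alg: str, scenario: str) -> bool:
    """Return True if `alg` is expected to produce results for `scenario`."""
    if alg in SINGLE_SERIES_ALGS:
        return scenario in SINGLE_SERIES_SCENARIOS
    # Algorithms without * are not restricted by the note.
    return True
```

For instance, `produces_results("spirit", "blackout")` is `False`, while `produces_results("cdrec", "blackout")` is `True`.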
Prerequisites | Build | Execution | Extension | Contributors | Award | Citation
- Ubuntu 20 or Ubuntu 22 (including Ubuntu derivatives, e.g., Xubuntu) or the same distribution under WSL.
- Clone this repository
- Install mono from https://www.mono-project.com/download/stable/ and restart your terminal.
- Build the Testing Framework using the installation script located in the root folder:
$ sh install_linux.sh
$ cd TestingFramework/bin/Debug/
$ mono TestingFramework.exe [arguments]
| -alg | -d | -scen |
|---|---|---|
| cdrec | airq | miss_perc |
| dynammo | bafu | ts_length |
| grouse | chlorine | ts_nbr |
| rosl | climate | miss_disj |
| softimp | drift10 | miss_over |
| svdimp | electricity | mcar |
| svt | meteo | blackout |
| stmvl | temp | all |
| spirit | bafu_red | |
| tenmf | drift10_red | |
| tkcm | all | |
| trmf | | |
| all | | |
| -------- | -------- | -------- |
| New algs | | |
| -------- | -------- | -------- |
| ssa | | |
| m-rnn | | |
| brits | | |
| deepmvi | | |
| mpin | | |
| pristi | | |
| iim | | |
All results and plots will be added to the `Results` folder. The accuracy results of all algorithms will be sequentially added for each scenario and dataset to `Results/.../.../error/`. The runtime results of all algorithms will be added to `Results/.../.../runtime/`. The plots of the recovered blocks will be added to the folder `Results/.../.../recovery/plots/`.
- Run a single algorithm (cdrec) on a single dataset (drift10) using one scenario (missing percentage)
$ mono TestingFramework.exe -alg cdrec -d drift10 -scen miss_perc
- Run two algorithms (cdrec, spirit) on a single dataset (drift10) using one scenario (missing percentage)
$ mono TestingFramework.exe -alg cdrec,spirit -d drift10 -scen miss_perc
- Run the previous example (cdrec and spirit on drift10) without runtime results
$ mono TestingFramework.exe -alg cdrec,spirit -d drift10 -scen miss_perc -nort
- Run the whole VLDB'20 benchmark (all algorithms, all datasets, all scenarios, precision and runtime)
$ mono TestingFramework.exe -alg all -d all -scen all
Warning: Running the whole benchmark takes a sizeable amount of time (up to 4 days, depending on the hardware) and produces up to 15GB of output files with all recovered data and plots unless stopped early.
- Create patterns of missing blocks on one complete dataset (airq) using one scenario (missing percentage)
$ mono TestingFramework.exe -alg mvexport -d airq -scen miss_perc
Note: You must run each scenario separately on one or multiple datasets. Each time you execute one scenario, the `Results` folder will be overwritten with the new files.
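Because each run overwrites the Results folder, one option is to run the scenarios one at a time and copy the folder aside after each run. The Python sketch below illustrates the idea under stated assumptions: the function name `run_and_archive`, its arguments, and the archive naming scheme are hypothetical, the location of the Results folder must be supplied by you, and the script is assumed to be started from the directory that contains `TestingFramework.exe`.

```python
import shutil
import subprocess
from pathlib import Path


def run_and_archive(algs, dataset, scenarios, results_dir, archive_dir,
                    runner=subprocess.run):
    """Run one scenario at a time and copy the Results folder after each run.

    `results_dir` is where the benchmark writes its output (an assumption to
    adjust), `archive_dir` is where the copies are kept.
    """
    results_dir, archive_dir = Path(results_dir), Path(archive_dir)
    archived = []
    for scen in scenarios:
        # Same command shape as the examples above: -alg, -d, -scen.
        cmd = ["mono", "TestingFramework.exe",
               "-alg", ",".join(algs), "-d", dataset, "-scen", scen]
        runner(cmd, check=True)
        # Copy the results before the next scenario overwrites them.
        target = archive_dir / f"{dataset}_{scen}"
        if results_dir.exists():
            shutil.copytree(results_dir, target, dirs_exist_ok=True)
            archived.append(target)
    return archived
```

The `runner` argument defaults to `subprocess.run` and exists only so the loop can be tried out with a stand-in function before launching long benchmark runs. `dirs_exist_ok` requires Python 3.8 or later.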
- Additional command-line parameters
$ mono TestingFramework.exe --help
- You can parametrize each algorithm using the command `-algx`. For example, you can run the svdimp algorithm with a reduction value of 4 on the drift10 dataset while varying the number of time series (`ts_nbr`) as follows:
$ mono TestingFramework.exe -algx svdimp 4 -d drift10 -scen ts_nbr
- If you want to run some algorithms with default parameters and some with customized ones, you can use `-alg` and `-algx` together. For example, you can run the stmvl algorithm with its default parameter and the cdrec algorithm with a reduction value of 4 on the airq dataset while varying the number of time series (`ts_nbr`) as follows:
$ mono TestingFramework.exe -alg stmvl -algx cdrec 4 -d airq -scen ts_nbr
Remark: The `-algx` command cannot be applied to a group of algorithms and thus must precede the name of each algorithm it parametrizes.
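The rules for combining `-alg` and `-algx` can be sketched as a small argument builder. This is an illustrative Python helper, not part of the benchmark: the name `build_args` and its parameters are hypothetical, and only the flags shown in the examples above (`-alg`, `-algx`, `-d`, `-scen`, `-nort`) are used.

```python
def build_args(default_algs=(), param_algs=(), dataset="drift10",
               scenario="miss_perc", no_runtime=False):
    """Build the argument list for one TestingFramework run.

    default_algs: algorithm names run with default parameters (grouped in -alg).
    param_algs:   (name, value) pairs; each pair gets its own -algx.
    """
    args = ["mono", "TestingFramework.exe"]
    if default_algs:
        # -alg accepts a comma-separated group of algorithms.
        args += ["-alg", ",".join(default_algs)]
    for name, value in param_algs:
        # -algx cannot be grouped, so it is repeated for every algorithm.
        args += ["-algx", name, str(value)]
    args += ["-d", dataset, "-scen", scenario]
    if no_runtime:
        args.append("-nort")
    return args


# Reproduces the stmvl/cdrec example above.
print(" ".join(build_args(["stmvl"], [("cdrec", 4)], "airq", "ts_nbr")))
```

The returned list can be passed directly to `subprocess.run` from the directory that contains `TestingFramework.exe`.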
- To evaluate the newly integrated algorithms, install the required Python packages with the provided script (takes several minutes):
$ sh install_extra.sh
- Activate the virtual environment and execute the new algorithms from the table above
$ source bench-env/bin/activate
$ mono TestingFramework.exe [arguments]
- To add new algorithms:
- To add new datasets:
    - import the file to `TestingFramework/bin/Debug/data/{name}/{name}_normal.txt` (`name` is the name of your dataset).
    - Requirements: rows >= 1,000, columns >= 10, column separator: empty space, row separator: newline
    - Note: the benchmark can also run with rows >= 100 and columns >= 5 but with a limited number of scenarios and algorithms.
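Before importing a new dataset, its shape and separators can be checked against the requirements above. The following Python sketch is one way to do so; the function name `check_dataset` is hypothetical, and it assumes that all values are numeric and that every row has the same number of columns, which the requirements do not state explicitly.

```python
def check_dataset(path, min_rows=1000, min_cols=10):
    """Check a text matrix against the dataset size requirements.

    Returns (n_rows, n_cols) and raises ValueError if a check fails.
    """
    n_rows, n_cols = 0, None
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            # split() accepts any whitespace, which is slightly more
            # lenient than the single-space separator named above.
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            for field in fields:
                float(field)  # raises ValueError for non-numeric values
            if n_cols is None:
                n_cols = len(fields)
            elif len(fields) != n_cols:
                raise ValueError(
                    f"line {line_no}: expected {n_cols} columns, "
                    f"found {len(fields)}")
            n_rows += 1
    if n_rows < min_rows or (n_cols or 0) < min_cols:
        raise ValueError(
            f"need at least {min_rows} rows and {min_cols} columns, "
            f"found {n_rows} rows and {n_cols or 0} columns")
    return n_rows, n_cols
```

Calling `check_dataset(path, min_rows=100, min_cols=5)` applies the relaxed limits from the note, under which only a limited number of scenarios and algorithms are available.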
Mourad Khayati (mkhayati@exascale.info) and Zakhar Tymchenko (zakhar.tymchenko@unifr.ch).
ImputeBench has received the VLDB 2020 Most Reproducible Paper Award.
@inproceedings{imputebench2020vldb,
author = {Mourad Khayati and Alberto Lerner and Zakhar Tymchenko and Philippe Cudr{\'{e}}{-}Mauroux},
title = {Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series},
booktitle = {Proceedings of the VLDB Endowment},
volume = {13},
number = {5},
year = {2020}
}