seme-tsmall

SEME in Bordeaux on 1D-signal regression in a small-data framework.

Work done at the Institut de Mathématiques de Bordeaux, organized by AMIES, in collaboration with FieldBox.ai.

The report can be found on Hal Archives: https://hal.archives-ouvertes.fr/hal-03211100

Link to the event: http://seme-bordeaux.sciencesconf.org/

Defining the problem

We are given a regression problem with a dataframe consisting of d features X_1, X_2, ..., X_d and n observations. Each feature corresponds to a 1D-signal (e.g. a time-series): the i-th observation collects the values of the d signals at a certain time t_i.

Thus, the dataframe is an n x d matrix with entry (i,j) given by X_j(t_i) for i=1,...,n and j=1,...,d.

We are interested in predicting a variable y which depends on the values of the 1D-signals X_1, X_2, ..., X_d. Namely, we suppose that y = f(X_1, ..., X_d). Note that we do not suppose y to be an explicit function of time.

Suppose that n is small (small-data problem). We try to answer the following questions:

Does the time-signal nature of the features give us more information than the sole observations X_j(t_i)?

Is it possible to infer new data and augment the dataset size? Does this help in predicting y (e.g. by reducing overfitting)?

Experimental method

For the sake of analysis, we consider a dataframe data_A with N observations, where N >> n. It is split into train_A and test_A. We mainly focus on classical (uniform) random sampling and on random block sampling (random blocks of consecutive observations), depending on the dataset.

Sampling procedure

Through a sampling procedure on train_A, we derive a smaller dataframe called train_B; similarly, we obtain test_B from test_A. Together, train_B and test_B form data_B: the small dataframe with n observations. We refer to the Documentation section for details on the sampling procedure.
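
As a rough illustration, random block sampling could look like the following minimal sketch (assuming a pandas dataframe; the actual implementation is block_sampling in tsmall/preprocessing.py and may differ in signature and details):

    import numpy as np
    import pandas as pd

    def block_sample(df: pd.DataFrame, n: int, block_size: int, seed: int = 0) -> pd.DataFrame:
        """Draw random blocks of consecutive rows until about n observations are kept.
        Classical (uniform) random sampling would instead be df.sample(n)."""
        rng = np.random.default_rng(seed)
        blocks = []
        while sum(len(b) for b in blocks) < n:
            start = rng.integers(0, len(df) - block_size + 1)  # random block start
            blocks.append(df.iloc[start : start + block_size])
        return pd.concat(blocks).iloc[:n]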

Data augmentation

From train_B, we perform data augmentation to obtain a larger number of observations, yielding a bigger dataframe called data_C whose size is up to 8 times that of train_B. The synthetic features and labels are inferred using different techniques; we refer to the data augmentation section in the report.
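
As a hedged illustration (the actual routines are dfaug and mdfaug in tsmall/augment.py; aug_fn below is a placeholder for any of the transformations discussed in the Data augmentation section of the Documentation), reaching the 8x figure amounts to stacking augmented copies of train_B:

    import pandas as pd

    def augment_k_times(train_b: pd.DataFrame, aug_fn, k: int = 8) -> pd.DataFrame:
        """Stack train_B with k-1 augmented copies of itself (k=8 gives the 8x above)."""
        copies = [train_b] + [aug_fn(train_b) for _ in range(k - 1)]
        return pd.concat(copies, ignore_index=True)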

Evaluation with RMSE and R2 score

The same machine-learning algorithm is then trained on the different train_X splits, where X = A, B or C: this yields the three models model_A, model_B and model_C. We evaluate each model on test_A (and test_B) to understand whether the augmentation technique improves the stability and/or the score of model_C with respect to model_B. The metrics under consideration are the (root) mean square error and the R2 score.
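
A minimal sketch of the evaluation loop, assuming scikit-learn (the regressor and the split names are placeholders, not necessarily those used in the notebooks):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, r2_score

    def evaluate(model, X_test, y_test):
        """Return (RMSE, R2) of a fitted model on a held-out test set."""
        y_pred = model.predict(X_test)
        return np.sqrt(mean_squared_error(y_test, y_pred)), r2_score(y_test, y_pred)

    # hypothetical splits: train_sets = {"A": (X_A, y_A), "B": (X_B, y_B), "C": (X_C, y_C)}
    # for name, (X, y) in train_sets.items():
    #     model = RandomForestRegressor(random_state=0).fit(X, y)
    #     print(name, evaluate(model, X_test_A, y_test_A))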

Dataset examples

We focus on the following open dataset available at the UCI repository: Appliances energy prediction Data Set

Current module version

To work with our functions, download the tsmall directory and launch python from the directory containing tsmall. It then suffices to type from tsmall import * to retrieve all the functionalities.
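
For instance, a session could start as follows (the function names come from the module listing below; their exact signatures are documented in the source files):

    # run from the directory that contains tsmall/
    from tsmall import *

    # e.g., subsample a dataframe and then augment it (signatures omitted here):
    # train_b = block_sampling(...)
    # data_c = dfaug(...)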

Here is a list of the relevant notebooks in the repository:

  • quantile_based_augmentation.ipynb : tests the quantile-based augmentation developed during the week
  • DL_aug.ipynb : tests data augmentation using an LSTM-VAE
  • signal_distortion.ipynb : contains information about the discrete Fourier and wavelet transforms; it uses the submodule tsmall.augment.

Architecture of the tsmall module:

tsmall/
    augment.py          # contains signal_distortion, dfaug and mdfaug
    dl_method.py        # functions for the LSTM-VAE
    preprocessing.py    # contains block_sampling and min_max_normalization
    utils.py            # useful functions used in the notebooks

Dependencies

If you have pipenv installed, the environment is specified in the Pipfile; you can load it by simply executing pipenv install inside the git repository.

Otherwise, you need the following libraries: numpy, pandas, scikit-learn and pywt (wavelet package) to run the quantile-based augmentation; if you want to test the DL methods, you also need keras and tensorflow < 2 (the code runs on Python 3.7). The notebooks require matplotlib.

Last updates

  • added utils.py with useful functions for the notebooks
  • pipenv environment, code cleaning and refactoring with basic type hints
  • TS_aug.ipynb merged from the deep-learning branch
  • old scripts moved to the old folder

Overall progress

  • check the mathematical bibliography and python libraries on 1D-signal data augmentation
  • discuss the train/test split in data_A as well as the possible subsampling techniques to obtain data_B; implement the sampling strategies in preprocessing.py
  • discuss the augmentation techniques to obtain data_C and the assumptions on the underlying signal (continuity? quasi-stationarity?); implement the techniques in augment.py
  • test data augmentation for k-NN in aug_knn.ipynb with different score metrics and plot the results
  • add docstrings and useful comments in all python scripts
  • test LinearRegression and decision trees (xgboost, adaboost, randomforest)
  • prepare the presentation for AMIES
  • clean signal_distortion.py and test removing high frequencies
  • test the PM2.5 dataframe
  • write the report for AMIES, plus some code cleaning and formatting

Documentation

We refer to the report available on Hal.

Bibliography

See the report on Hal for related research papers.

Observe that most of the cited bibliography focuses on classification problems solved with NN algorithms. In that framework, the time-series itself is the input, and the augmentation technique generates synthetic time-series on which the model can be trained. For this reason, many of the aforementioned strategies are not suitable for our framework; however, a few ideas (e.g. frequency-domain transforms, decomposition methods, etc.) could be developed and put into practice for standard ML algorithms.

Python libraries:

  • tsaug : useful to modify time-series in the time domain, but not necessarily suited to our scope | see the documentation
  • sigment : data augmentation for audio signals | see the documentation

Data augmentation

Data augmentation is the process of generating artificial data in order to reduce the variance of the predictor and thus avoid overfitting.

Within our framework, we can try to exploit the time-signal nature of the observations to infer new values. Depending on the hypotheses one makes on y, different techniques are available:

  • Under stationarity assumptions, one can use classical bootstrap techniques or model-based methods (ARIMA, ARCH, etc.).
  • Under continuity assumptions on the signals, one can use interpolation techniques (however, this does not seem to substantially improve the results).
  • Fourier/wavelet transforms: a sketch follows this list; see also signal_distortion.ipynb.
  • Quantile-based augmentation in feature space (related to data augmentation in classification problems): group observations by quantiles of y, then apply transformations in feature space.
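
For the frequency-domain option, here is a minimal sketch of distorting a single 1D signal through its Fourier coefficients (illustrative only: the repository's signal_distortion in tsmall/augment.py is the actual implementation, and the wavelet counterpart relies on pywt; keep_ratio and noise_scale are made-up parameters):

    import numpy as np

    def fourier_distort(signal: np.ndarray, keep_ratio: float = 0.9,
                        noise_scale: float = 0.02, seed: int = 0) -> np.ndarray:
        """Jitter the kept Fourier coefficients and zero out the highest frequencies."""
        rng = np.random.default_rng(seed)
        coeffs = np.fft.rfft(signal)
        cutoff = int(len(coeffs) * keep_ratio)
        coeffs[cutoff:] = 0.0                                   # drop high frequencies
        coeffs[:cutoff] *= 1.0 + noise_scale * rng.standard_normal(cutoff)
        return np.fft.irfft(coeffs, n=len(signal))              # back to the time domain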

Since we do not want to assume any additional hypothesis on y, we exclude the first two possibilities and focus on the last two. It turns out that quantile-based augmentation works rather well with k-NN. We also test an LSTM variational autoencoder (LSTM-VAE).
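
Similarly, a minimal sketch of the quantile-based idea, assuming a pandas dataframe with the target in a column named y (a hypothetical quantile_augment, not the actual dfaug/mdfaug from tsmall.augment; n_bins and noise_scale are illustrative):

    import numpy as np
    import pandas as pd

    def quantile_augment(df: pd.DataFrame, target: str = "y", n_bins: int = 10,
                         noise_scale: float = 0.05, seed: int = 0) -> pd.DataFrame:
        """Group rows by quantile bins of the target, jitter features within each bin,
        and stack the jittered copy (with unchanged labels) on the original dataframe."""
        rng = np.random.default_rng(seed)
        bins = pd.qcut(df[target], q=n_bins, labels=False, duplicates="drop")
        features = df.columns.drop(target)
        synthetic = df.copy()
        synthetic[features] = synthetic[features].astype(float)
        for b in np.unique(bins):
            idx = bins == b
            scale = df.loc[idx, features].std().fillna(0.0).values  # per-feature spread in the bin
            noise = rng.normal(0.0, noise_scale, (int(idx.sum()), len(features)))
            synthetic.loc[idx, features] += noise * scale
        return pd.concat([df, synthetic], ignore_index=True)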
