Work done at the Institut de Mathématiques de Bordeaux, organized by AMIES, in collaboration with FieldBox.ai.
The report can be found on Hal Archives: https://hal.archives-ouvertes.fr/hal-03211100
Link to the event: http://seme-bordeaux.sciencesconf.org/
We are given a regression problem with a dataframe consisting of d features X_1, X_2, ..., X_d and n observations. Each feature corresponds to a 1D-signal (e.g. a time-series): the i-th observation is the value of the d signals at a certain time t_i.
Thus, the dataframe is an n x d matrix whose (i,j) coefficient is X_j(t_i), for i=1,...,n and j=1,...,d.
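As an illustration, here is a minimal sketch of this layout with toy signals (the names, times and target function below are made up for the example, not taken from the actual dataset):

```python
# Toy illustration of the data layout: n observations (rows, one per time t_i)
# of d features (columns, one per 1D-signal X_j).
import numpy as np
import pandas as pd

n, d = 100, 3                        # n observations, d signals
t = np.linspace(0.0, 10.0, n)        # sampling times t_1, ..., t_n

# Toy signals standing in for X_1, ..., X_d (any 1D-signals would do).
X = np.column_stack([np.sin(t), np.cos(2 * t), t ** 2])

df = pd.DataFrame(X, index=t, columns=[f"X_{j}" for j in range(1, d + 1)])
# df.iloc[i, j] is the (i, j) coefficient X_j(t_i).

# The target y = f(X_1, ..., X_d) depends only on the signal values, not on t.
y = pd.Series(df["X_1"] + df["X_2"] ** 2, index=t, name="y")
```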
We are interested in predicting a variable y which depends on the values of the 1D-signals X_1, X_2, ..., X_d. Namely, we suppose that y = f(X_1, ..., X_d). Note that we do not suppose y to be an explicit function of time.
Suppose that n is small (small-data problem). We try to answer the following questions:

- Does the time-signal nature of the features give us more information than the sole observations X_j(t_i)?
- Is it possible to infer new data and augment the dataset size? Does this help in predicting y (e.g., by reducing overfitting)?
For the sake of analysis, we consider a dataframe `data_A` with N observations, where N >> n. It is split into `train_A` and `test_A`. We mainly focus on classical (uniform) random sampling and on random block sampling (random blocks of consecutive observations), depending on the dataset.
Through a sampling procedure on `train_A`, we derive a smaller dataframe called `train_B`; similarly, we obtain `test_B` from `test_A`. The two dataframes `train_B` and `test_B` form `data_B`: the small dataframe with n observations. We refer to the Documentation section for the sampling procedure.
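The actual sampling routines live in `tsmall/preprocessing.py`; the following is only an independent, minimal sketch of the random block sampling idea (not the API of `block_sampling`):

```python
# Minimal sketch of random block sampling (illustrative, not the actual
# tsmall.preprocessing.block_sampling implementation): draw random blocks of
# consecutive rows from train_A until roughly n rows are collected.
import numpy as np
import pandas as pd

def block_sample(df: pd.DataFrame, n: int, block_size: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    blocks = []
    collected = 0
    while collected < n:
        start = rng.integers(0, len(df) - block_size + 1)
        blocks.append(df.iloc[start:start + block_size])
        collected += block_size
    return pd.concat(blocks).iloc[:n]

# Classical (uniform) random sampling is simply:
# train_B = train_A.sample(n=n, random_state=0).sort_index()
```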
From `train_B`, we perform data augmentation in order to have a larger number of observations at our disposal, obtaining a bigger dataframe called `data_C`; the size of `data_C` is up to 8 times that of `train_B`. The synthetic features and labels are inferred using different techniques; we refer to the data augmentation section in the report.
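A hypothetical sketch of how `data_C` could be assembled from `train_B` (here `augment_once` is a placeholder for any of the augmentation techniques discussed below and in the report; the factor of 8 matches the upper bound mentioned above):

```python
# Hypothetical assembly of data_C: concatenate train_B with several
# independently augmented copies of itself. `augment_once` is a placeholder
# for any augmentation technique returning a dataframe of the same shape.
import pandas as pd

def build_data_C(train_B: pd.DataFrame, augment_once, factor: int = 8) -> pd.DataFrame:
    copies = [train_B] + [augment_once(train_B) for _ in range(factor - 1)]
    return pd.concat(copies, ignore_index=True)
```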
The same machine-learning algorithm is then trained on the different `train_X` splits, where X = A, B or C: this yields three models, `model_A`, `model_B` and `model_C`. We evaluate each model on `test_A` (and `test_B`) to understand whether the imputation technique improves the stability and/or the score of `model_C` with respect to `model_B`. The metrics under consideration are the (root) mean squared error and the R2 score.
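A minimal sketch of this evaluation protocol, assuming a k-NN regressor as the common algorithm and placeholder names for the splits and target column:

```python
# Sketch of the evaluation protocol: the same algorithm (here a k-NN regressor,
# as an example) is fit on each train split and scored on test_A with RMSE and R2.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

def fit_and_score(train, test, target="y"):
    model = KNeighborsRegressor(n_neighbors=5)
    model.fit(train.drop(columns=target), train[target])
    pred = model.predict(test.drop(columns=target))
    rmse = np.sqrt(mean_squared_error(test[target], pred))
    return rmse, r2_score(test[target], pred)

# scores = {name: fit_and_score(split, test_A)
#           for name, split in [("A", train_A), ("B", train_B), ("C", data_C)]}
```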
We focus on the following open data set available at the UCI repository: Appliances energy prediction Data Set
To work with our functions, just download the `tsmall` directory and launch Python from the root directory containing `tsmall`. It then suffices to type `from tsmall import *` to retrieve all the functionalities.
Here is a list of the relevant notebooks present in the repository:

- `quantile_based_augmentation.ipynb`: tests the quantile-based augmentation developed during this week
- `DL_aug.ipynb`: tests data augmentation using an LSTM-VAE
- `signal_distortion.ipynb`: contains information about the discrete Fourier and wavelet transforms; it uses the submodule `tsmall.augment`.
Architecture of the `tsmall` module:

```
tsmall/
    augment.py        # contains signal_distortion, dfaug and mdfaug
    dl_method.py      # functions for the LSTM-VAE
    preprocessing.py  # contains block_sampling and min_max_normalization
    utils.py          # useful functions used in the notebooks
```
If you have `pipenv` installed, the environment is set in the `Pipfile`. You can load it by simply executing `pipenv install` inside the git repository.
Otherwise, you need the following libraries: `numpy`, `pandas`, `scikit-learn` and `pywt` (wavelet package) to run the quantile-based augmentation; if you want to test the DL methods, you also need `keras` and `tensorflow < 2` (the code runs on Python 3.7). The notebooks require `matplotlib`.
- added `utils.py` with useful functions for the notebooks
- pipenv environment, code cleaning and refactoring with basic type hinting
- `TS_aug.ipynb` merged from the deep-learning branch
- old scripts moved to the `old` folder
- check the mathematical bibliography and Python libraries on 1D-signal data augmentation
- discuss the train/test split in `data_A` as well as the possible subsampling techniques to obtain `data_B`. Implement the sampling strategies in `preprocessing.py`
- discuss the augmentation techniques to obtain `data_C` and the assumptions on the underlying signal (continuity? quasi-stationarity?). Implement the techniques in `augment.py`
- test data augmentation for k-NN in `aug_knn.ipynb` with different score metrics and plot the results
- add docstrings and useful comments in all Python scripts
- test linear regression and decision trees (xgboost, adaboost, random forest)
- prepare the presentation for AMIES
- clean `signal_distortion.py` and test removing high frequencies
- test the PM2.5 dataframe
- write the report for AMIES and some code cleaning + formatting
We refer to the report available on Hal.
See the report on Hal for related research papers.
Observe that most of the cited bibliography focuses on classification problems for NN algorithms. In that framework, the time-series is seen as an input and the augmentation technique allows one to generate synthetic time-series on which the model can be trained. For these reasons, many of the aforementioned strategies are not suitable for our framework; however, a few ideas (e.g. frequency-domain transforms, decomposition methods, etc.) could be developed and put into practice for standard ML algorithms.
Python libraries:

- `tsaug`: useful to modify time-series in the time domain, but not necessarily for our scope | see the documentation
- `sigment`: data augmentation for audio signals | see the documentation
Data augmentation is the process of generating artificial data in order to reduce the variance of the predictor and thus avoid overfitting.
Within our framework, we can try to exploit the time-signal nature of the observations to infer new values. Depending on the hypotheses one makes on y, different techniques are available:

- Under stationarity assumptions, one can use classical bootstrap techniques or model-based methods (ARIMA, ARCH, etc.).
- Under continuity assumptions on the signals, one can use interpolation techniques (however, this does not seem to substantially improve the results).
- Fourier/wavelet transforms, see the next subsection (and the sketch after this list).
- Quantile-based augmentation in feature space (related to data augmentation in classification problems): group observations by their y-labels + transformation in feature space.
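A minimal sketch of the frequency-domain idea from the third bullet (this mirrors the spirit of `tsmall.augment.signal_distortion`, but it is not its actual implementation or API):

```python
# Illustrative frequency-domain distortion: leave the low-frequency Fourier
# coefficients of a 1D-signal untouched, jitter the high-frequency ones, and
# transform back. Parameters (cutoff, noise) are arbitrary example values.
import numpy as np

def fourier_distort(x: np.ndarray, cutoff: float = 0.5, noise: float = 0.05,
                    seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    coeffs = np.fft.rfft(x)
    k = int(cutoff * len(coeffs))
    coeffs[k:] *= 1.0 + noise * rng.standard_normal(len(coeffs) - k)
    return np.fft.irfft(coeffs, n=len(x))
```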
Since we do not want to assume any additional hypothesis on y, we exclude the first two possibilities and focus on the last two. It turns out that quantile-based augmentation works pretty well with k-NN; one possible reading of it is sketched below. We also test an LSTM-Variational Autoencoder.
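The following is an assumption-laden sketch of the quantile-based idea, not the exact `dfaug`/`mdfaug` implementation from `tsmall.augment` (see the report for the actual procedure): bin the observations by quantiles of y, then jitter the features within each bin so the synthetic rows keep locally consistent labels.

```python
# Illustrative quantile-based augmentation: group rows by quantile bins of y,
# then add small Gaussian noise to the features, scaled by the within-bin
# standard deviation, while keeping the original labels.
import numpy as np
import pandas as pd

def quantile_augment(df: pd.DataFrame, target: str = "y", n_bins: int = 10,
                     factor: int = 2, noise: float = 0.1, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    bins = pd.qcut(df[target], q=n_bins, labels=False, duplicates="drop")
    features = df.drop(columns=target)
    # Within-bin standard deviation of each feature, broadcast back to the rows.
    scale = features.groupby(bins).transform("std").fillna(0.0)
    copies = [df]
    for _ in range(factor - 1):
        jittered = features + noise * scale * rng.standard_normal(features.shape)
        copies.append(pd.concat([jittered, df[target]], axis=1))
    return pd.concat(copies, ignore_index=True)
```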