Why do tree-based models still outperform deep learning on typical tabular data?

Abstract

  • The superiority of deep learning on tabular data is not clear.
  • Tree-based models remain state-of-the-art on medium-sized data (~10K samples) even without accounting for their superior speed.
    ➔ To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and neural networks.
  • Guidelines for building tabular-specific neural networks:
  1. Be robust to uninformative features
  2. Preserve the orientation of the data
  3. Be able to easily learn irregular functions

Introduction

  • Deep learning architectures have been crafted to create inductive biases matching the invariances and spatial dependencies of the data. Finding corresponding invariances is hard in tabular data, which is made of heterogeneous features, small sample sizes, and extreme values.

  • One motivation for developing tabular-specific deep learning is that tree-based models are not differentiable, and thus cannot be easily composed and jointly trained with other deep learning blocks.

  • Most tabular datasets available are small.

  • Tree-based models remain state-of-the-art on medium-sized tabular datasets, even without accounting for the slower training of deep learning algorithms.

  • The goal is to understand which inductive biases make tree-based models well-suited for these data. By transforming tabular datasets to modify the performance of different models, we uncover differing biases of tree-based models and deep learning algorithms which partly explain their performance gap: neural networks struggle to learn irregular patterns of the target function, and their rotation invariance hurts their performance, in particular when handling the numerous uninformative features present in tabular data.

A benchmark for tabular learning

  • Heterogeneous columns: Columns should correspond to features of different natures; this excludes image or signal datasets where each column corresponds to the same signal on different sensors.
  • Not high dimensional: Keep datasets with a d/n ratio below 1/10, and with d below 500.
  • Undocumented datasets: Remove datasets where too little information is available.
  • I.I.D data: Remove stream-like datasets or time series.
  • Real-world data: Remove artificial datasets but keep some simulated datasets.
  • Not too small: Remove datasets with too few features and too few samples.
  • Not too easy: Remove datasets which are too easy.
  • Not deterministic: Remove datasets where the target is a deterministic function of the data.
  • Medium-sized training dataset: Truncate the training set to 10,000 samples for bigger datasets.
  • No missing data: Remove all missing data from the datasets.
  • Balanced classes
  • Low cardinality categorical features: Remove categorical features with more than 20 unique values.
  • High cardinality numerical features: Remove numerical features with fewer than 10 unique values (a filtering sketch follows this list).
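
A minimal sketch (assuming NumPy arrays of numerical features) of how a few of these criteria could be applied. The helper names are hypothetical and only mirror the thresholds listed above; they are not the paper's actual preprocessing code.

```python
import numpy as np

def drop_low_cardinality_numeric(X, min_unique=10):
    """Keep only numerical columns with at least `min_unique` distinct values."""
    keep = [j for j in range(X.shape[1]) if len(np.unique(X[:, j])) >= min_unique]
    return X[:, keep]

def meets_size_criteria(X, max_ratio=0.1, max_features=500):
    """Dataset-level filter: not high dimensional (d/n <= 1/10 and d <= 500)."""
    n, d = X.shape
    return d / n <= max_ratio and d <= max_features

def truncate_to_medium_size(X, y, max_train=10_000, seed=0):
    """Truncate larger datasets to a medium-sized training set of 10,000 samples."""
    if len(X) > max_train:
        idx = np.random.default_rng(seed).choice(len(X), size=max_train, replace=False)
        return X[idx], y[idx]
    return X, y
```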

A procedure to benchmark models with hyper-parameter selection

  • Hyperparameter tuning leads to uncontrolled variance on a benchmark especially with a small budget of model evaluations.
  • To study performance as a function of the number n of random search iterations, for each budget n we select the hyper-parameter combination with the best validation score among the first n iterations and report its test score, as sketched below.
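
A minimal sketch of this "best so far" bookkeeping, assuming parallel arrays of validation and test scores, one per random-search iteration; the function name is hypothetical.

```python
import numpy as np

def test_score_vs_budget(val_scores, test_scores):
    """For each budget n, report the test score of the configuration with the
    best validation score among the first n random-search iterations."""
    val_scores = np.asarray(val_scores, dtype=float)
    test_scores = np.asarray(test_scores, dtype=float)
    best_so_far = [int(np.argmax(val_scores[: n + 1])) for n in range(len(val_scores))]
    return test_scores[best_so_far]

# Example: scores from 5 random-search iterations
curve = test_score_vs_budget([0.71, 0.75, 0.73, 0.80, 0.78],
                             [0.70, 0.74, 0.72, 0.79, 0.77])
# curve -> [0.70, 0.74, 0.74, 0.79, 0.79]
```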

Aggregating results across datasets

  • We use test-set accuracy (classification) and R2 score (regression) to measure model performance, and aggregate results across datasets with a metric similar to the distance to the minimum, which renormalizes each model's score between the worst and best scores achieved on each dataset (sketched below).
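
A sketch of an affine per-dataset renormalization in the spirit of the distance to the minimum; the paper's exact aggregation (e.g. clipping against a low quantile of scores) may differ, so treat this as illustrative only.

```python
import numpy as np

def normalize_scores(scores_per_model):
    """Affine renormalization of one dataset's scores into [0, 1]:
    1 = best model on this dataset, 0 = worst."""
    vals = np.array(list(scores_per_model.values()), dtype=float)
    lo, hi = vals.min(), vals.max()
    if hi == lo:  # all models tie on this dataset
        return {m: 1.0 for m in scores_per_model}
    return {m: (v - lo) / (hi - lo) for m, v in scores_per_model.items()}

# Example: per-model test accuracies on one dataset; averaged across datasets afterwards
print(normalize_scores({"XGBoost": 0.87, "RandomForest": 0.85, "MLP": 0.80}))
```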

Data preparation (Pre-processing)

  • Gaussianized features (quantile transformation to a normal distribution), transformed regression targets, OneHotEncoder for categorical features; see the pipeline sketch below.
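
A minimal scikit-learn sketch of this preprocessing. The column names are hypothetical, and the paper's exact per-dataset choices (e.g. for the target transform) may differ.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer

numeric_cols = ["num_0", "num_1"]   # hypothetical column names
categorical_cols = ["cat_0"]

preprocess = ColumnTransformer([
    # "Gaussianize" numerical features: quantile transform to a normal distribution
    ("num", QuantileTransformer(output_distribution="normal"), numeric_cols),
    # One-hot encode the (low-cardinality) categorical features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
```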

Tree-based models still outperform deep learning on tabular data

  • Tree-based models benchmarked: Random Forest, Gradient Boosting Trees, XGBoost
  • Deep models benchmarked: MLP, ResNet, FT-Transformer
  • Benchmark on medium-sized datasets

Results

  • Tuning hyper-parameters does not make neural networks state-of-the-art
    ➔ Tree-based models are superior for every random search budget.
  • Categorical variables are not the main weakness of neural networks
    ➔ Most of the gap persists when learning on numerical features only.

Empirical investigation: why do tree-based models still outperform deep learning on tabular data?

Methodology: uncovering inductive biases

  • Inherent properties of the models explain their performance on tabular data
    ➔ The best methods on tabular data share two traits: they are ensemble methods, and their base learners are decision trees.
  • We apply various transformations to tabular datasets which either narrow or widen the generalization performance gap between neural networks and tree-based models
    ➔ This emphasizes their different inductive biases.

Finding 1: Neural networks are biased to overly smooth solutions

  • We transform each train set by smoothing the output with a Gaussian Kernel smoother for varying length-scale values of the kernel.
  • Plotting performance as a function of the length-scale of the smoothing kernel shows that the target functions in our datasets are not smooth, and that neural networks struggle to fit these irregular functions compared to tree-based models (a smoothing sketch follows below).
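
A minimal sketch (assuming NumPy arrays of scaled features) of smoothing the training targets with a Gaussian kernel smoother; the idea is from the paper, but the function name and the O(n²) implementation are assumptions.

```python
import numpy as np

def smooth_targets(X_train, y_train, length_scale):
    """Replace each training target by a Gaussian-kernel-weighted average of all
    training targets; a larger length_scale gives a smoother target function."""
    sq_dists = ((X_train[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    weights = np.exp(-sq_dists / (2.0 * length_scale ** 2))
    weights /= weights.sum(axis=1, keepdims=True)  # row-normalize the kernel weights
    return weights @ y_train
```

The pairwise-distance matrix makes this O(n²) in memory, which is workable for the medium-sized (≤ 10K sample) training sets used in this benchmark.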

Finding 2: Uninformative features affect more MLP-like neural networks

  • Tabular datasets contain many uninformative features
    ➔ The classification accuracy of a GBT is not much affected by removing up to half of the features.
  • MLP-like architectures are not robust to uninformative features
    ➔ Removing uninformative features reduces the performance gap between MLPs and the other models, while adding uninformative features widens the gap (see the sketch below).
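
A sketch of the two transformations. The helper names are hypothetical; ranking feature importance with a Random Forest is one natural choice here, not necessarily the paper's exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def add_uninformative_features(X, n_extra, seed=0):
    """Append standard-normal noise columns that carry no information about y."""
    noise = np.random.default_rng(seed).standard_normal((X.shape[0], n_extra))
    return np.hstack([X, noise])

def drop_least_important_features(X, y, keep_fraction=0.5):
    """Rank features with a Random Forest and keep only the most important ones."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    n_keep = max(1, int(keep_fraction * X.shape[1]))
    keep = np.argsort(rf.feature_importances_)[::-1][:n_keep]
    return X[:, keep]
```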

Finding 3: Data are non invariant by rotation, so should be learning procedures

  • The learning procedure which learns an MLP on a training set and evaluates it on a testing set is unchanged when applying a rotation to the features of both the training and testing sets.
  • Tree-based models are not rotationally invariant, as they attend to each feature separately.
    ➔ Neither is the FT-Transformer: its initial FT Tokenizer implements a feature-wise (pointwise) operation.
  • Intuitively, to remove uninformative features, a rotationally invariant algorithm has to first find the original orientation of the features, and then select the least informative ones: the information contained in the orientation of the data is lost.
  • The change in test accuracy when randomly rotating our datasets confirms that only ResNets are rotationally invariant.
    ➔ Random rotations reverse the performance order: rotation invariance is not desirable.
  • Removing the least important half of the features in each dataset before rotating still drops the performance of all models except ResNets, but the decrease is smaller than when rotating with all features (a rotation sketch follows below).
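
A minimal sketch of applying a random rotation to the feature space; the same orthogonal matrix must be applied to both the train and test sets. The QR-based construction is a standard way to draw a random orthogonal matrix and is an assumption, not necessarily the paper's code.

```python
import numpy as np

def random_rotation_matrix(d, seed=0):
    """Draw a random d x d orthogonal matrix: QR-decompose a Gaussian matrix and
    fix the column signs so the orthogonal factor is uniformly distributed."""
    q, r = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, d)))
    return q * np.sign(np.diag(r))

# Apply the SAME rotation to train and test features:
# R = random_rotation_matrix(X_train.shape[1])
# X_train_rot, X_test_rot = X_train @ R, X_test @ R
```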

Conclusion

  • Tree-based models more easily yield good predictions, at much lower computational cost. This superiority is explained by specific features of tabular data: irregular patterns in the target function, uninformative features, and non-rotationally-invariant data where linear combinations of features misrepresent the information.