Santander Customer Satisfaction Competition
- Host : Santander Bank, a British bank, wholly owned by the Spanish Santander Group.
- Prize : $ 60,000
- Problem : Binary Classification
- Evaluation : AUC
- Period : Mar 2 2016 ~ May 2 2016 (61 days)
Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late.
This competition requires you to deal effectively with a combination of sparse, noisy, and weak predictors. Because the feature names are anonymized, there is little room for hand-crafted feature interactions, so feature engineering generally consists of clipping outliers, dropping duplicate and sparse columns, extracting PCA/K-means/t-SNE features, and adding a count of zeros. Even with thorough feature engineering, a single model's prediction has difficulty climbing the Private LB.
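As a rough illustration of the simpler steps listed above, here is a minimal NumPy sketch. It is not the code used in this repository: the function name `basic_features` is hypothetical, the count of zeros is assumed to be taken per row, and outlier clipping is assumed to mean clipping test values to the range observed in train.

```python
import numpy as np

def basic_features(train, test):
    """Drop constant/duplicate columns, add a zero count, clip test to train range."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)

    # Drop columns that never vary in train (they carry no signal).
    keep = train.std(axis=0) > 0
    train, test = train[:, keep], test[:, keep]

    # Drop exact duplicate columns, keeping the first occurrence of each.
    _, first_idx = np.unique(train, axis=1, return_index=True)
    cols = np.sort(first_idx)
    train, test = train[:, cols], test[:, cols]

    # Count zero entries per row (assumption: per-row count) before clipping.
    train_zeros = (train == 0).sum(axis=1)
    test_zeros = (test == 0).sum(axis=1)

    # Clip test values to the min/max seen in train to tame outliers.
    test = np.clip(test, train.min(axis=0), train.max(axis=0))

    # Append the zero count as one extra feature column.
    return (np.column_stack([train, train_zeros]),
            np.column_stack([test, test_zeros]))
```

PCA, K-means and t-SNE features would be appended in the same way, as extra columns computed from the cleaned matrix.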
This competition requires an ensemble of multiple models, a variety of feature+model combinations, and repeated blending to rise to the top of the ladder. The 3rd place team had 6 members, each with multiple feature+model predictions; they all used 5-fold CV repeated over 20 seeds and ensembled all of their predictions to reach 3rd place. A single solution from one member of the 3rd place team ranks 509th (Top 10%) by itself, which shows how much ensembling matters in this competition.
Although time-consuming, a well-structured multi-seed CV scheme and added diversity from combining multiple feature+model pairs lead to more generalizable results.
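The "5-fold x 20 seeds" scheme can be sketched roughly as follows. This is an illustrative outline, not the 3rd place team's code: `repeated_kfold` and `averaged_oof` are hypothetical names, the folds here are plain shuffled folds (not stratified), and `fit_predict` stands for any callable that trains a model and scores the validation rows.

```python
import numpy as np

def repeated_kfold(n_samples, n_folds=5, seeds=range(20)):
    """Yield (seed, train_idx, valid_idx) for n_folds splits under each seed."""
    for seed in seeds:
        # A different seed gives a different shuffle, hence different folds.
        order = np.random.RandomState(seed).permutation(n_samples)
        for valid_idx in np.array_split(order, n_folds):
            train_idx = np.setdiff1d(order, valid_idx)
            yield seed, train_idx, valid_idx

def averaged_oof(X, y, fit_predict, n_folds=5, seeds=range(20)):
    """Average out-of-fold predictions over all seeds.

    fit_predict(X_train, y_train, X_valid) is a hypothetical callable that
    fits any model and returns one score per row of X_valid.
    """
    X, y = np.asarray(X), np.asarray(y)
    seeds = list(seeds)
    oof = np.zeros(len(y), dtype=float)
    for _, tr, va in repeated_kfold(len(y), n_folds, seeds):
        oof[va] += fit_predict(X[tr], y[tr], X[va])
    # Each row is in the validation fold exactly once per seed.
    return oof / len(seeds)
```

Averaging over seeds reduces the dependence of the CV estimate on one particular split, at the cost of training `n_folds * len(seeds)` models per feature+model combination.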
My re-implementations only partially reproduce the winners' solutions, hence their Private LB ranks are not as high as the original ranks.
Submission | CV AUC | Public LB | Rank | Private LB | Rank |
---|---|---|---|---|---|
bare_minimum | 0.800430 | 0.797948 | 3986 | 0.785805 | 3958 |
kweonwooj redux | 0.821226 | 0.836279 | 3029 | 0.822967 | 2917 |
kweonwooj | - | 0.840797 | 1667 | 0.826500 | 947 |
toshi_k redux | 0.8418 | 0.840002 | 2122 | 0.826658 | 810 |
wpppj redux | 0.841137 | 0.839622 | 2232 | 0.826814 | 556 |
mathias from Team Leustago | - | 0.839441 | 2302 | 0.826881 | 509 |
Total teams : 5,123
[Data]
Place data in the input directory. You can download data from here.
[Code]
The above results can be replicated by running
python code/main.py
for each of the directories.
Make sure you are on Python 2.7 with the library versions specified in requirements.txt.
[Submit]
Submit the resulting csv file here and verify the score.
for bare minimum
for mathias's solution from Team Leustago (3rd place)
- Importance of ensemble in squeezing the private LB score
- Techniques of adding diversity to the prediction
  - combination of feature sets
  - combination of algorithms
  - multiple sets of 5-fold cross validation
- Power of rank averaging in combining multiple submissions
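Rank averaging suits this competition because AUC depends only on the ordering of the predictions, not on their scale, so submissions from differently calibrated models can be blended on their ranks. A minimal sketch, assuming every submission scores the same rows in the same order (the function name `rank_average` is hypothetical, and ties are broken by position rather than averaged):

```python
import numpy as np

def rank_average(predictions):
    """Blend several score vectors by averaging their normalized ranks.

    predictions: sequence of equal-length score vectors, one per submission.
    Returns one blended score per row, scaled to [0, 1].
    """
    preds = np.asarray(predictions, dtype=float)
    # argsort of argsort turns each row of scores into 0-based ranks.
    ranks = preds.argsort(axis=1).argsort(axis=1)
    # Average the ranks across submissions and rescale to [0, 1].
    return ranks.mean(axis=0) / (preds.shape[1] - 1)
```

For example, `rank_average([[0.1, 0.9, 0.5], [0.2, 0.8, 0.3]])` blends two submissions whose raw scores are on different scales but agree on the ordering.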
- 3rd place solution on Forum, Github by Dmitry Efimov
- 7th place solution on Forum by Francisco Javier Díaz Herrera
- 12th place solution on Forum by Antonio José Navarro Céspedes
- 13th place solution on Forum by Ouranos
- 34th place solution on Forum, Github by wpppj
- 44th place solution on Forum, Github by toshi_k