A submission for HUAWEI - 2020 DIGIX GLOBAL AI CHALLENGE
team Melbourne dağları
members: @mustafahakkoz, @Aysenuryilmazz
Turkey region rank: 4 / 103
score (AUC): 0.679876
dataset: advertising behavior data. A heavily imbalanced and very large (out-of-core) dataset containing advertising behavior collected over seven consecutive days.
- training dataset (6.09 GB, 43M rows, 36 cols)
- 2 testing datasets (153 MB, 1M rows, 36 cols)
The main ideas of the project are:
- Reading the dataset in chunks and downcasting to fit into memory.
- Target encoding with smoothing.
- SGD model with mini-batches.
- class_weights to balance classes.
Implementation details can be found in the document DIGIX Implementation Instruction.docx.
1. reading dataset with chunks

- We read the whole train dataset (~42M rows) with a chunk size of 10K and apply downcasting to reduce its size in memory, as in the sketch below.
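A minimal sketch of this step. The file name and the `|` separator are assumptions, not necessarily the repo's actual settings:

```python
import pandas as pd

def downcast(df):
    # Shrink numeric columns to the smallest dtype that holds the values.
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
    return df

# "train_data.csv" and sep="|" are assumptions about the raw csv.
chunks = (downcast(c) for c in pd.read_csv("train_data.csv", sep="|", chunksize=10_000))
train = pd.concat(chunks, ignore_index=True)
```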
2. target encoding with smoothing
- We implemented target encoding on the columns using a custom function which smooths the standard target encoding with the global mean of each column (a sketch follows this list).
- We dropped the uid and pt_d columns from the train dataset.
- We shuffled the dataset and split it into 40M rows for training and the rest (~2M rows) for testing.
- We produced the train dataset across several notebooks due to the hard disk limitation of the Kaggle platform (only 5 GB).
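The repo's custom function is not reproduced here; a common formulation of smoothed target encoding that matches the description looks like this (the smoothing weight, the label column name, and the column list are assumptions):

```python
import pandas as pd

def smoothed_target_encode(train, col, target, weight=100):
    # Global mean of the target over the whole training set.
    global_mean = train[target].mean()
    # Per-category mean and count of the target.
    agg = train.groupby(col)[target].agg(["mean", "count"])
    # Blend the category mean with the global mean; rare categories
    # get pulled toward the global mean.
    smooth = (agg["count"] * agg["mean"] + weight * global_mean) / (agg["count"] + weight)
    return train[col].map(smooth)

# Hypothetical usage; "label" and categorical_cols are assumptions.
# for col in categorical_cols:
#     train[col] = smoothed_target_encode(train, col, "label")
```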
3. SGD model with mini-batches

- We chose the SGD model of scikit-learn with default parameters and fed it batches of 10K rows.
- For every batch we used class weights to balance the classes.
- After evaluating our model on the hold-out test split (AUC score of ~70%), we refit the model on the whole training set (~42M rows) and exported it.
- Using the exported model, we ran predictions on the submission dataset test_data_B.csv. For this step we used the training means to fill the NA values which target encoding produces for newly encountered categories (see the sketches after this list).
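A minimal sketch of the mini-batch training loop, assuming X_train and y_train are the encoded feature matrix and label array from the steps above. Using SGDClassifier with a logistic loss is an assumption; note that scikit-learn does not accept class_weight="balanced" together with partial_fit, so balanced weights are computed per batch and passed as sample weights:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.utils.class_weight import compute_class_weight

# loss="log_loss" (named "log" in scikit-learn < 1.1) is an assumption;
# it makes predict_proba available for AUC-style scoring.
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])

for start in range(0, len(X_train), 10_000):
    X_batch = X_train[start:start + 10_000]
    y_batch = y_train[start:start + 10_000]
    # Balanced weights recomputed per batch and passed as sample weights,
    # since class_weight="balanced" is not supported with partial_fit.
    present = np.unique(y_batch)
    weights = compute_class_weight("balanced", classes=present, y=y_batch)
    weight_map = dict(zip(present, weights))
    sample_weight = np.array([weight_map[label] for label in y_batch])
    model.partial_fit(X_batch, y_batch, classes=classes, sample_weight=sample_weight)
```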
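And a sketch of the prediction step, assuming model, train, feature_cols, and the `|` separator carry over from the sketches above:

```python
import pandas as pd

# Means of the encoded training features; used to fill the NaNs that
# target encoding produces for categories never seen during training.
train_means = train[feature_cols].mean()

test = pd.read_csv("test_data_B.csv", sep="|")
# ... the same smoothed target encoding is applied to the test columns ...
test[feature_cols] = test[feature_cols].fillna(train_means)

# Probability of the positive class for the submission file.
scores = model.predict_proba(test[feature_cols])[:, 1]
```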
- We didn't use any cross-validation or hyperparameter tuning techniques for this contest due to the computational constraints of online platforms.
- We also didn't perform any feature engineering techniques.
- We also tried Decision Tree, XGBoost, CatBoost and LightGBM with several parameters, but they didn't work out due to memory errors.
- This repo contains only the final versions. Experiments were implemented on the Kaggle platform. All of the notebooks, including scratch work, can also be found at this Kaggle link under the CTR tag.