This repo tackles the Kaggle Home Credit Default Risk competition in a pipeline fashion. The pipeline is not necessarily a top performer (PB 0.792 / LB 0.796, top 25~30%), but it aims to bring some automation and uses config files to control the process. It does: 1) data set construction and caching, 2) feature transformation, 3) hyperparameter search, and 4) stacking models to generate the submission file.
The necessary data set and descriptions can be found here: https://www.kaggle.com/c/home-credit-default-risk
`ModelPipeline.py` is exposed to execute all the tasks, and the other main modules are located in `lib`. In `lib`, `DataProvider.py` constructs the data set and uses `FeatureTransformer.py` to do the feature engineering. `ScikitOptimize.py` uses the Bayesian optimization in Scikit-Optimize to perform the hyperparameter search. `AutoStacker.py` uses mlxtend to build the stacking classifier. Shared variables and templates are kept in `LibConfigs.py`, and some small tools are implemented in `Utility.py`. `DataFileIO.py` is responsible for saving and loading HDF5 and CSV files.
The cached data are stored in `data`, found hyperparameters in `params`, and results for submission in `output`.
In `configs`, there are three configs to control the whole modeling process:

- configs for feature generation -- `SampleDataConfigs.py`
- configs for hyperparameter search -- `SampleModelConfigs.py`
- configs for the stacking model -- `SampleStackerConfigs.py`
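As a purely hypothetical sketch of the config-driven approach (the actual keys and structure live in the `Sample*Configs.py` files and will differ), a feature-generation config might look like a plain Python dict:

```python
# Hypothetical sketch only -- illustrates driving a pipeline from a
# Python config dict; the real keys in SampleDataConfigs.py differ.
SampleDataConfigs = {
    "cache_prefix": "sample",          # prefix for cached HDF5 files in data/
    "features": {
        "AMT_CREDIT_TO_INCOME": {      # an engineered ratio feature
            "op": "ratio",
            "numerator": "AMT_CREDIT",
            "denominator": "AMT_INCOME_TOTAL",
        },
    },
}

# The pipeline can then iterate the declared features and apply each op.
for name, spec in SampleDataConfigs["features"].items():
    print(name, "->", spec["op"])
```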
```
python3 ModelPipeline.py -a cache_prefix
python3 ModelPipeline.py --refresh-cache
python3 ModelPipeline.py --compute-hpo -t LGBM,LossGuideXGB
python3 LGBMSelectedFeatures.py
python3 ModelPipeline.py --compute-stack --refresh-meta
```
Add filenames such as `probs_*.hdf5` to the variable `ExternalMetaConfigs` in the `*StackerConfigs.py` in use, then run:

```
python3 ModelPipeline.py --compute-stack
```

Optional flags: `--debug --enable-gpu`.