This GitHub repository contains our code for the American Express Default Prediction competition. Developed in collaboration with Ben See Jian Rong and Younes Siyar, our team used LightGBM to predict credit card defaults. The solution highlights our approach to handling a massive dataset with a complex structure, focusing on efficiency and scalability to meet the challenges posed by real-world financial data.
- Competition Description
- Challenges of the Problem
- Proposed Solution
- Solution Effectiveness
- Conclusion
The competition, hosted by American Express, focused on using machine learning to predict credit default. The challenge involved handling an industrial-scale dataset with complex features including delinquency, spend, payment, balance, and risk variables.
Key challenges included managing the large dataset size, addressing data quality issues, and meeting the intensive computational requirements.
Initial experiments with Random Forest provided useful insights but did not scale to the competition's dataset. We therefore opted for LightGBM, a gradient boosting framework that uses tree-based learning algorithms and is known for its efficiency and scalability on large datasets.
- Data Preprocessing: To handle the vast amounts of data, we employed techniques such as handling missing values, encoding categorical variables, and reducing dimensionality where feasible.
- Feature Engineering: We crafted features that could capture nuances in the data, significantly impacting the model's predictive power.
- Model Configuration: Tuning LightGBM parameters like learning rate was crucial to optimize our model for accuracy and efficiency.
- Validation Strategy: To ensure the robustness of our model, we implemented a cross-validation strategy, which helped in identifying stable and reliable model configurations.
Extensive data preprocessing and hyperparameter tuning led to significant improvements:
- AUC: 0.97966
- Binary Log Loss: 0.167891
- AMEX Score: 0.78410
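For context, the AMEX Score is the competition's official rank metric, M = 0.5 · (G + D): the mean of the normalized weighted Gini coefficient and the default rate captured in the highest-ranked 4% of predictions, with non-default rows weighted 20× to adjust for the downsampling of negatives. The sketch below follows the widely circulated NumPy reimplementation of that metric; it is illustrative, not our exact evaluation code.

```python
# Sketch of the AMEX competition metric, following the commonly shared
# NumPy reimplementation: M = 0.5 * (normalized weighted Gini + default
# rate captured at the top 4% by cumulative weight).
import numpy as np

def amex_metric(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    def _rank_stats(labels: np.ndarray, scores: np.ndarray):
        order = np.argsort(scores)[::-1]            # rank by score, best first
        labels = labels[order]
        weight = np.where(labels == 0, 20.0, 1.0)   # non-defaults weighted 20x
        cum_w = np.cumsum(weight) / weight.sum()
        lorentz = np.cumsum(labels * weight) / (labels * weight).sum()
        gini = np.sum((lorentz - cum_w) * weight)
        return labels, cum_w, gini

    labels, cum_w, gini = _rank_stats(y_true, y_pred)
    # D: fraction of defaults captured in the top 4% by cumulative weight
    d = labels[cum_w <= 0.04].sum() / labels.sum()
    # G: Gini normalized by the Gini of a perfect ranking
    _, _, gini_max = _rank_stats(y_true, y_true.astype(float))
    return 0.5 * (gini / gini_max + d)

# Sanity check on synthetic labels: a perfect ranking scores 1.0
rng = np.random.default_rng(0)
y = (rng.random(10000) < 0.25).astype(int)
print(round(amex_metric(y, y.astype(float)), 4))  # prints 1.0
```

Because D only looks at the top 4% of the ranking, the metric rewards models that concentrate true defaults at the very top, which is why it behaves differently from plain AUC.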
This project provided a profound learning experience in handling real-world data and emphasized the importance of model selection, data quality, preprocessing, and collaborative teamwork.