This project presents an exploratory data analysis (EDA) and regression modelling approach for the WiDS Datathon 2024 Challenge #2. The objective was to predict the metastatic diagnosis period based on various patient demographic and socio-economic characteristics. The analysis highlights the steps taken to clean, preprocess, and model the data while optimizing feature selection and modelling parameters to improve predictive performance.
EDA was conducted to inspect initial data distributions, identify key variables, and examine relationships between features. This included:
- Initial data inspection and analysis of categorical/numerical variables
- Mutual information and pairwise correlation to assess feature relevance
- Investigation of inconsistencies and missing values
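The relevance checks listed above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the column names, the generated target, and the use of scikit-learn's `mutual_info_regression` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic stand-in data; column names are illustrative assumptions.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "patient_age": rng.integers(20, 90, size=n).astype(float),
    "bmi": rng.normal(28, 5, size=n),
    "noise": rng.normal(0, 1, size=n),
})
# Hypothetical target that depends on age only.
target = 2.0 * df["patient_age"] + rng.normal(0, 5, size=n)

# Mutual information between each feature and the target, highest first.
mi = pd.Series(
    mutual_info_regression(df, target, random_state=0),
    index=df.columns,
).sort_values(ascending=False)

# Pairwise Pearson correlation between the numerical features.
corr = df.corr()

# Share of missing values per column, as a first check for gaps.
missing_share = df.isna().mean()
```

Mutual information can pick up non-linear dependence that a correlation matrix misses, which is one reason to look at both.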
Several imputation strategies were applied to handle missing values, including standard and group-based methods. Grouped imputations, such as mean and mode imputation by patient demographics, helped retain feature relationships while filling gaps in the dataset. Various imputation techniques were tested to identify the most effective approach based on model performance.
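A group-based imputation of this kind might look like the sketch below. The helper `impute_by_group` is hypothetical, and the toy columns (`patient_race`, `bmi`, `payer_type`) are assumptions chosen for illustration; the notebook may group on different demographics.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; column names are illustrative assumptions.
df = pd.DataFrame({
    "patient_race": ["A", "A", "A", "B", "B", "B"],
    "bmi": [22.0, np.nan, 26.0, 30.0, 34.0, np.nan],
    "payer_type": ["X", "X", None, "Y", None, "Y"],
})

def impute_by_group(frame, group_col, num_cols, cat_cols):
    """Fill numerical gaps with the group mean and categorical gaps with the group mode."""
    out = frame.copy()
    for col in num_cols:
        group_mean = out.groupby(group_col)[col].transform("mean")
        # Fall back to the overall mean if a whole group is missing.
        out[col] = out[col].fillna(group_mean).fillna(out[col].mean())
    for col in cat_cols:
        # Most frequent non-missing value per group (NaN if the group has none).
        modes = out.groupby(group_col)[col].agg(
            lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan
        )
        out[col] = out[col].fillna(out[group_col].map(modes))
    return out

imputed = impute_by_group(df, "patient_race", ["bmi"], ["payer_type"])
```

Filling within groups keeps the imputed values closer to what similar patients look like than a single global mean or mode would.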
Additional features were created based on domain knowledge to improve model predictions. New features were generated by grouping variables such as bmi and density into clusters, giving the models a coarser, more robust representation of these variables. Redundant features were removed before imputation, which made it easier to compare imputation methods and identify the most effective ones.
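One simple way to turn continuous variables into grouped features is binning, sketched below. This is an assumption about how such groups could be built, not the notebook's method: the BMI cut points follow the conventional WHO categories, the density split uses quantiles, and the new column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data; values are illustrative only.
df = pd.DataFrame({
    "bmi": [17.0, 22.5, 27.0, 33.0, np.nan],
    "density": [50.0, 400.0, 1500.0, 8000.0, 120.0],
})

# Fixed cut points: conventional WHO BMI categories (an assumption here).
bmi_edges = [0, 18.5, 25, 30, np.inf]
bmi_labels = ["underweight", "normal", "overweight", "obese"]
df["bmi_group"] = pd.cut(df["bmi"], bins=bmi_edges, labels=bmi_labels, right=False)

# Keep missing BMI as its own category instead of dropping the row.
df["bmi_group"] = df["bmi_group"].cat.add_categories("missing").fillna("missing")

# Quantile-based split for density: below or at the median vs. above it.
df["density_group"] = pd.qcut(df["density"], q=2, labels=["low", "high"])
```

A clustering algorithm such as k-means could replace the fixed bins if the group boundaries should be learned from the data instead.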
Several regression models were evaluated, with key models optimized using cross-validation and hyperparameter tuning. Steps included:
- Establishing baseline scores and SHAP values for feature importance
- Conducting backward feature selection for optimal feature sets
- Hyperparameter tuning
- Implementing a stacking meta-model approach with ensemble techniques
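The backward selection and stacking steps above could be wired together roughly as in the following sketch. It uses scikit-learn stand-ins throughout (`SequentialFeatureSelector`, `GradientBoostingRegressor`, `RandomForestRegressor`, and a `Ridge` meta-model) on synthetic data; the actual base models, selected features, and hyperparameters in the notebook are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the preprocessed features.
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=4, noise=10.0, random_state=0
)

# Backward selection: start from all features and drop the least useful ones.
selector = SequentialFeatureSelector(
    Ridge(alpha=1.0), n_features_to_select=4, direction="backward", cv=3
)
X_sel = selector.fit_transform(X, y)

# Stacking: base models feed out-of-fold predictions to a meta-model.
base_models = [
    ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
]
stack = StackingRegressor(
    estimators=base_models, final_estimator=Ridge(alpha=1.0), cv=3
)

# Cross-validated RMSE (scikit-learn reports it negated, so flip the sign).
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(
    stack, X_sel, y, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores
mean_rmse = rmse_per_fold.mean()
```

For a strict performance estimate, the selector would normally sit inside a pipeline so that feature selection is refit on each training fold; it is run once here only to keep the sketch short.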
The modelling results highlighted the effectiveness of using CatBoost with tailored imputation techniques and StratifiedKFold validation.
Group-based imputations combined with specific models yielded the best performance. CatBoost, with Constant Categorical imputation for categorical features and No Numerical Imputation, achieved the lowest RMSE score of 80.225 using a 9-fold StratifiedKFold.
Additionally, the average RMSE for the 5th and 7th folds was 80.154, marking the best scores across the private validation.
The second-best score of 80.182 was obtained by using Constant Categorical imputation for categorical features and KNN for numerical features.
Stratifying the 9-fold StratifiedKFold splits on breast_cancer_diagnosis_desc kept the mix of diagnosis categories consistent across folds, which helped the model capture categorical group structure and contributed to these RMSE scores.
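Because the target is continuous, StratifiedKFold cannot stratify on it directly; the categorical column is passed as the stratification labels instead. The sketch below shows the idea on synthetic data. The category values, the single `patient_age` feature, the generated target, and the `GradientBoostingRegressor` stand-in for CatBoost are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n = 270
df = pd.DataFrame({
    # Imbalanced placeholder categories for the diagnosis description.
    "breast_cancer_diagnosis_desc": np.repeat(
        ["desc_a", "desc_b", "desc_c"], [150, 90, 30]
    ),
    "patient_age": rng.integers(20, 90, size=n).astype(float),
})
# Hypothetical continuous target.
target = 100 + 0.5 * df["patient_age"] + rng.normal(0, 20, size=n)

X = df[["patient_age"]].to_numpy()
y = target.to_numpy()
strata = df["breast_cancer_diagnosis_desc"].to_numpy()

# Stratify on the categorical column, not on the regression target.
skf = StratifiedKFold(n_splits=9, shuffle=True, random_state=0)
fold_rmse = []
for train_idx, valid_idx in skf.split(X, strata):
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[valid_idx])
    fold_rmse.append(float(np.sqrt(mean_squared_error(y[valid_idx], pred))))

mean_rmse = float(np.mean(fold_rmse))
```

Each validation fold then contains roughly the same proportion of every diagnosis category, so the per-fold RMSE values are comparable.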
Feature selection based on SHAP values improved modelling performance by restricting CatBoost to the most predictive variables.
The notebook table presents detailed RMSE scores for each model and imputation combination, with GradientBoosting also showing promising results when using group-based imputation strategies.
This study underscores the significance of selecting appropriate imputation techniques and modelling approaches for predicting metastatic diagnosis periods. Combining standard and group-based imputations was particularly effective for handling datasets with diverse missing value patterns.
CatBoost emerged as the top-performing model, particularly due to its compatibility with categorical data and its ability to work well with features selected through SHAP values.
The findings demonstrate that structured feature selection and stratified grouping improve predictive accuracy in healthcare-related regression tasks by capturing meaningful relationships within the data, especially for tree-based models.