WiDS Datathon 2024 Challenge #2 - Regression of Metastatic Diagnosis Period

Introduction

This project presents an exploratory data analysis (EDA) and regression modelling approach for the WiDS Datathon 2024 Challenge #2. The objective was to predict the metastatic diagnosis period based on various patient demographic and socio-economic characteristics. The analysis highlights the steps taken to clean, preprocess, and model the data while optimizing feature selection and modelling parameters to improve predictive performance.

Data Exploration

EDA was conducted to inspect initial data distributions, identify key variables, and examine relationships between features. This included:

Initial data inspection and analysis of categorical/numerical variables
Mutual information and pairwise correlation to assess feature relevance
Investigation of inconsistencies and missing values
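
As a rough illustration of the relevance checks listed above, the following sketch ranks numeric features by mutual information and Pearson correlation with a target. The data is synthetic, and the column names (`age`, `bmi`, `noise`, `target`) and the helper `feature_relevance` are hypothetical stand-ins rather than the notebook's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def feature_relevance(df: pd.DataFrame, target: str, random_state: int = 0) -> pd.DataFrame:
    """Rank numeric features by mutual information and Pearson correlation with the target."""
    # Drop rows with any missing numeric value, since mutual_info_regression does not accept NaN.
    numeric = df.select_dtypes(include="number").dropna()
    X = numeric.drop(columns=[target])
    y = numeric[target]
    mi = mutual_info_regression(X, y, random_state=random_state)
    corr = X.corrwith(y)
    out = pd.DataFrame(
        {"mutual_info": mi, "pearson_r": corr.to_numpy()},
        index=X.columns,
    )
    return out.sort_values("mutual_info", ascending=False)


# Synthetic demo data (assumption: not the competition dataset).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame(
    {
        "age": rng.normal(60, 10, n),
        "bmi": rng.normal(28, 5, n),
        "noise": rng.normal(0, 1, n),
    }
)
demo["target"] = 3.0 * demo["age"] + rng.normal(0, 5, n)

relevance = feature_relevance(demo, "target")
print(relevance)
```

Mutual information can pick up non-linear dependence that a correlation coefficient misses, so looking at both columns side by side is a reasonable first filter before modelling.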

Data Cleaning and Imputation

Several imputation strategies were applied to handle missing values, including standard and group-based methods. Grouped imputations, such as mean and mode imputation by patient demographics, helped retain feature relationships while filling gaps in the dataset. Various imputation techniques were tested to identify the most effective approach based on model performance.
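
A minimal sketch of group-based mean and mode imputation, assuming pandas and a single grouping column. The helper `impute_by_group` and the columns `region` and `payer_type` are hypothetical; the notebook may group by different demographic fields.

```python
import numpy as np
import pandas as pd


def impute_by_group(df, group_col, num_cols, cat_cols):
    """Fill numeric gaps with the group mean and categorical gaps with the group mode."""
    out = df.copy()
    for col in num_cols:
        group_mean = out.groupby(group_col)[col].transform("mean")
        # Fall back to the overall mean when a whole group is missing.
        out[col] = out[col].fillna(group_mean).fillna(out[col].mean())
    for col in cat_cols:
        # Most frequent value per group (NaN if the group has no observed value).
        group_modes = out.groupby(group_col)[col].agg(
            lambda s: s.mode().iloc[0] if s.notna().any() else np.nan
        )
        out[col] = out[col].fillna(out[group_col].map(group_modes))
        # Fall back to the overall mode for anything still missing.
        overall_mode = out[col].mode()
        if not overall_mode.empty:
            out[col] = out[col].fillna(overall_mode.iloc[0])
    return out


# Tiny illustrative frame (assumption: not the competition data).
demo = pd.DataFrame(
    {
        "region": ["South", "South", "South", "West", "West", "West"],
        "bmi": [30.0, np.nan, 26.0, 22.0, 24.0, np.nan],
        "payer_type": ["COMMERCIAL", "COMMERCIAL", np.nan, "MEDICAID", np.nan, "MEDICAID"],
    }
)
imputed = impute_by_group(demo, "region", num_cols=["bmi"], cat_cols=["payer_type"])
print(imputed)
```

Filling within groups keeps the imputed values closer to what similar patients look like, which is the feature-relationship argument made above.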

Feature Engineering

Additional features were created based on domain knowledge to enhance model prediction. New features were generated by grouping variables such as bmi and density into clusters, resulting in improved feature representation. Before imputation, redundant features were removed, facilitating the identification of the most effective imputation methods.
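
One way such a grouping could look for bmi is sketched below, using the standard adult BMI cut-offs (18.5, 25, 30). This is an assumption for illustration: the notebook's actual bin edges, and how density was clustered, are not stated here. The helper `add_bmi_group` and the `bmi_group` column are hypothetical names.

```python
import numpy as np
import pandas as pd

# Standard adult BMI categories; the notebook's own edges may differ (assumption).
BMI_EDGES = [0, 18.5, 25, 30, np.inf]
BMI_LABELS = ["underweight", "normal", "overweight", "obese"]


def add_bmi_group(df, bmi_col="bmi"):
    """Add a categorical bmi_group column; missing bmi gets its own 'missing' level."""
    out = df.copy()
    # right=False makes the bins left-closed, e.g. [25, 30) is "overweight".
    groups = pd.cut(out[bmi_col], bins=BMI_EDGES, labels=BMI_LABELS, right=False)
    out["bmi_group"] = groups.cat.add_categories("missing").fillna("missing")
    return out


demo = pd.DataFrame({"bmi": [17.0, 22.0, 25.0, 31.5, np.nan]})
binned = add_bmi_group(demo)
print(binned)
```

Keeping an explicit "missing" level lets a tree-based model treat the absence of a bmi value as information in its own right instead of forcing an imputed number.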

Modelling Approach

Several regression models were evaluated, with key models optimized using cross-validation and hyperparameter tuning. Steps included:

Establishing baseline scores and SHAP values for feature importance
Conducting backward feature selection for optimal feature sets
Hyperparameter tuning
Implementing a stacking meta-model approach with ensemble techniques
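
The selection and stacking steps above can be sketched with scikit-learn alone. This is a simplified stand-in, not the notebook's pipeline: it uses `SequentialFeatureSelector` for backward selection instead of SHAP-driven selection, scikit-learn tree ensembles instead of CatBoost, and synthetic data from `make_regression`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (assumption: stands in for the real features).
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=10.0, random_state=0
)
cv = KFold(n_splits=3, shuffle=True, random_state=0)

# Backward selection: start from all features and drop one at a time,
# keeping the subset with the best cross-validated RMSE.
selector = SequentialFeatureSelector(
    Ridge(),
    n_features_to_select=3,
    direction="backward",
    scoring="neg_root_mean_squared_error",
    cv=cv,
)
selector.fit(X, y)
X_sel = selector.transform(X)

# Stacking: base models produce out-of-fold predictions that a Ridge meta-model combines.
stack = StackingRegressor(
    estimators=[
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(),
    cv=cv,
)
scores = cross_val_score(
    stack, X_sel, y, scoring="neg_root_mean_squared_error", cv=cv
)
rmse = -scores.mean()
print(f"selected features: {np.flatnonzero(selector.get_support())}, CV RMSE: {rmse:.2f}")
```

A baseline score from a single untuned model, computed with the same `cv` object, gives a reference point for judging whether selection, tuning, or stacking actually helps.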

Results

The modelling results highlighted the effectiveness of using CatBoost with tailored imputation techniques and StratifiedKFold validation.

Group-based imputations combined with specific models yielded the best performance. CatBoost, with Constant Categorical imputation for categorical features and No Numerical Imputation, achieved the lowest RMSE score of 80.225 using a 9-fold StratifiedKFold.

Additionally, the average RMSE for the 5th and 7th folds was 80.154, marking the best scores across the private validation.

The second-best score of 80.182 was obtained by using Constant Categorical imputation for categorical features and KNN for numerical features.
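
For readers unfamiliar with the combination named above, here is a minimal sketch of constant imputation for categorical columns and KNN imputation for numerical ones, using scikit-learn's `SimpleImputer` and `KNNImputer`. The helper `impute_constant_and_knn`, the fill value `"missing"`, the choice of `n_neighbors=2`, and the demo columns are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer


def impute_constant_and_knn(df, cat_cols, num_cols, n_neighbors=2):
    """Fill categorical gaps with a constant label and numeric gaps from nearest neighbours."""
    out = df.copy()
    # Constant imputation: every missing category becomes the literal "missing".
    cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
    out[cat_cols] = cat_imputer.fit_transform(out[cat_cols])
    # KNN imputation: average the feature over the closest rows,
    # with distance measured on the numeric columns that are present.
    knn_imputer = KNNImputer(n_neighbors=n_neighbors)
    out[num_cols] = knn_imputer.fit_transform(out[num_cols])
    return out


# Tiny illustrative frame (assumption: not the competition data).
demo = pd.DataFrame(
    {
        "payer_type": ["COMMERCIAL", np.nan, "MEDICAID", "COMMERCIAL"],
        "bmi": [30.0, np.nan, 24.0, 26.0],
        "age": [60.0, 61.0, 45.0, 59.0],
    }
)
filled = impute_constant_and_knn(demo, cat_cols=["payer_type"], num_cols=["bmi", "age"])
print(filled)
```

In this toy frame the row with missing bmi is closest in age to the patients aged 60 and 59, so its bmi is filled with the mean of their values. On real data the numeric columns would usually be scaled first so that no single feature dominates the distance.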

Using 9-fold StratifiedKFold, stratified by breast_cancer_diagnosis_desc, kept the distribution of diagnosis categories consistent across folds and contributed to these RMSE scores.
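
Because the target is continuous, StratifiedKFold cannot stratify on it directly; the folds are stratified on a categorical column instead. The sketch below shows that pattern on synthetic data, with RMSE computed per fold as the square root of the mean squared error. The column `diagnosis_code` is a hypothetical stand-in for breast_cancer_diagnosis_desc, and `GradientBoostingRegressor` is used in place of CatBoost to keep the example dependency-free.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold

# Synthetic data (assumption: not the competition dataset).
rng = np.random.default_rng(0)
n = 450
demo = pd.DataFrame(
    {
        "x1": rng.normal(size=n),
        "x2": rng.normal(size=n),
        "diagnosis_code": rng.choice(["A", "B", "C"], size=n, p=[0.6, 0.3, 0.1]),
    }
)
demo["target"] = 50 + 20 * demo["x1"] + rng.normal(0, 5, n)

X = demo[["x1", "x2"]]
y = demo["target"]

# Stratify on the categorical column so each fold keeps the same category mix.
skf = StratifiedKFold(n_splits=9, shuffle=True, random_state=0)
fold_rmse = []
for train_idx, valid_idx in skf.split(X, demo["diagnosis_code"]):
    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict(X.iloc[valid_idx])
    fold_rmse.append(float(np.sqrt(mean_squared_error(y.iloc[valid_idx], pred))))

mean_rmse = float(np.mean(fold_rmse))
print(f"per-fold RMSE: {np.round(fold_rmse, 2)}, mean: {mean_rmse:.2f}")
```

Passing the categorical column as the second argument to `split` is what drives the stratification; the regression target is only used when fitting and scoring.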

Feature selection based on SHAP values improved modelling performance by helping CatBoost identify the most predictive variables.

The notebook table presents detailed RMSE scores for each model and imputation combination, with GradientBoosting also showing promising results when using group-based imputation strategies.

Conclusion

This study underscores the significance of selecting appropriate imputation techniques and modelling approaches for predicting metastatic diagnosis periods. Combining standard and group-based imputations was particularly effective for handling datasets with diverse missing value patterns.

CatBoost emerged as the top-performing model, particularly due to its compatibility with categorical data and its ability to work well with features selected through SHAP values.

The findings demonstrate that structured feature selection and stratified grouping improve predictive accuracy in healthcare-related regression tasks by capturing meaningful relationships within the data, especially for tree-based models.


drkbluescience/WiDS2024_Challenge2_MetastaticDiagnosisRegression
