Project Context: This project involves using machine learning to predict the order in which contestants on the reality TV show "Survivor" are eliminated, based on various factors such as age, gender, and their performance in challenges.
Project Status: We have implemented a model that uses a contestant's age and gender to predict the order of their elimination. The model is evaluated with leave-one-group-out cross-validation, where each group corresponds to a season; this is appropriate because contestants in one season never interact with contestants from other seasons.
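For illustration, here is a minimal sketch of that season-wise evaluation using scikit-learn's LeaveOneGroupOut. The data is synthetic, and the linear model and numeric gender encoding are assumptions, not the project's actual choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Synthetic stand-in data: 3 seasons of 18 contestants each.
rng = np.random.default_rng(0)
n = 54
X = np.column_stack([
    rng.integers(21, 60, size=n),  # age
    rng.integers(0, 2, size=n),    # gender, assumed numerically encoded
])
y = 1 + rng.permutation(n)         # stand-in elimination order
groups = np.repeat([1, 2, 3], 18)  # season label per contestant

# Each fold trains on two seasons and tests on the held-out one,
# so contestants from the test season never appear in training.
scores = cross_val_score(
    LinearRegression(), X, y, cv=LeaveOneGroupOut(), groups=groups
)
print(scores)
```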
We have started integrating challenge data into the model. Specifically, we count the number of challenges each contestant has won up to the current episode, and this running count is used as a new feature alongside age and gender.
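A small sketch of how such a running count can be built with pandas, assuming a hypothetical long-format table with one row per contestant per episode (the column names are illustrative, not the project's schema):

```python
import pandas as pd

# Hypothetical long-format table: one row per contestant per episode.
results = pd.DataFrame({
    "contestant":    ["A", "A", "A", "B", "B", "B"],
    "episode":       [1, 2, 3, 1, 2, 3],
    "won_challenge": [1, 0, 1, 0, 1, 1],
})

# Running count of wins up to and including the current episode.
results = results.sort_values(["contestant", "episode"])
results["challenge_wins"] = (
    results.groupby("contestant")["won_challenge"].cumsum()
)
print(results)
```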
Challenge: Integrating the challenge data was non-trivial because the model must not see future data (i.e., challenge results from later episodes): in reality, only past and current episode data would be available when predicting upcoming eliminations. We therefore need a form of time-series cross-validation in which the training set always precedes the test set in time.
This kind of time-series cross-validation is also known as rolling-window or walk-forward cross-validation. It prevents data leakage: if future data were used in training, the model's performance estimates would be overly optimistic. We not only have a time dimension (the episode) but also a group dimension (the season), so it's a combination of group and time-series cross-validation!
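One plausible way to implement such a combined splitter, assuming hypothetical season and episode columns: the test fold is a single episode of one season, and the training fold is every earlier episode of that season plus all other seasons in full (treating seasons as independent in time). This is a sketch, not the project's actual code:

```python
import numpy as np
import pandas as pd

def group_walk_forward_splits(df):
    """Yield (train_idx, test_idx) pairs combining group and
    walk-forward CV: test on one episode of one season, train on
    that season's earlier episodes plus all other seasons."""
    for season in df["season"].unique():
        in_season = df["season"] == season
        episodes = np.sort(df.loc[in_season, "episode"].unique())
        # Skip the first episode: it has no earlier in-season data.
        for ep in episodes[1:]:
            test_mask = in_season & (df["episode"] == ep)
            train_mask = ~in_season | (in_season & (df["episode"] < ep))
            yield df.index[train_mask], df.index[test_mask]

# Tiny demo: two seasons, two episodes each.
df = pd.DataFrame({
    "season":  [1, 1, 1, 1, 2, 2, 2, 2],
    "episode": [1, 1, 2, 2, 1, 1, 2, 2],
})
for train_idx, test_idx in group_walk_forward_splits(df):
    print(list(train_idx), "->", list(test_idx))
```

Including other seasons in full is a design choice that relies on seasons being independent; a stricter variant would restrict training to the held-out season's past episodes only.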
The challenging, and fun, part is that we are simulating a real-world scenario in which the data is revealed progressively over time and the model must adapt to it as it unfolds. This brings the project closer to how machine learning models operate in many real-world situations.
Completed Steps:
Exploratory Data Analysis (EDA): Before diving into complex modeling, we conducted a comprehensive EDA to understand the data's structure, trends, anomalies, and patterns. This step was pivotal in identifying potential features and in providing direction for further analysis.
EDA Visualizations: We visualized the data to make relationships, correlations, and outliers easier to discern. These plots acted as a roadmap, providing insights that shaped our subsequent modeling approach.
Feature Engineering: Recognizing the importance of a robust feature set in predictive modeling, we engineered and transformed several data attributes. By creating new variables and modifying existing ones, we enriched the dataset and improved the predictive power of our models.
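As a small, hypothetical illustration (the column names and transformations are invented, not the project's actual features), one new variable and one modified variable might look like this:

```python
import pandas as pd

# Hypothetical per-episode rows; column names are illustrative only.
df = pd.DataFrame({
    "contestant":     ["A", "A", "B", "B"],
    "episode":        [1, 2, 1, 2],
    "gender":         ["F", "F", "M", "M"],
    "challenge_wins": [1, 1, 0, 1],  # running win count up to this episode
})

# New variable: challenge win rate so far.
df["win_rate"] = df["challenge_wins"] / df["episode"]

# Modified variable: encode gender numerically for the model.
df["gender"] = df["gender"].map({"F": 0, "M": 1})
print(df)
```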
Complex Model Building: After our preliminary modeling, we constructed more sophisticated models, drawing on advanced algorithms and hyperparameter tuning to improve accuracy, reduce overfitting, and ensure the models generalize to unseen data.
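A hedged sketch of what such tuning could look like; the random forest, the parameter grid, and the synthetic data are all assumptions, since the algorithms used aren't named above. Tuning inside the same leave-one-season-out scheme keeps hyperparameter selection consistent with the final evaluation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut

# Synthetic stand-in data: 60 contestants across 3 seasons.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))       # e.g. age, gender, challenge wins
y = rng.integers(1, 19, size=60)   # stand-in elimination order
groups = np.repeat([1, 2, 3], 20)  # season labels

# Hold whole seasons out of each tuning fold, mirroring the evaluation.
param_grid = {"n_estimators": [100, 300], "max_depth": [3, 5, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=LeaveOneGroupOut(),
    scoring="neg_mean_absolute_error",
)
search.fit(X, y, groups=groups)
print(search.best_params_, search.best_score_)
```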
Cluster Analysis: We performed cluster analysis using K-means, hierarchical clustering, and DBSCAN, with age, gender, and immunityWins as the clustering attributes.
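A minimal sketch of running the three algorithms with scikit-learn; the data is a synthetic stand-in for the real age/gender/immunityWins matrix, and the hyperparameters are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the age/gender/immunityWins feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Scaling matters for distance-based clustering methods.
X_scaled = StandardScaler().fit_transform(X)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_scaled)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_scaled)
```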
Optimal Number of Clusters: For methods like K-means, we determined a suitable cluster count using the Elbow Method, Silhouette Analysis, and the Gap Statistic.
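For example, a simple sweep over candidate values of k can surface both the inertia elbow and the silhouette peak (the Gap Statistic has no scikit-learn implementation, so it would need a custom or third-party routine):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(100, 3)))

# Look for the elbow in inertia (WCSS) and the peak silhouette score.
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```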
Cluster Validation: Internal validation measures such as the within-cluster sum of squares (WCSS) and the average silhouette score were used to assess cluster quality, and external validation measures compared the clusters to pre-existing benchmarks.
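A sketch of both kinds of validation with scikit-learn; the benchmark labelling used for the external check is random here and purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(100, 3)))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Internal validation: WCSS is exposed as inertia_; the silhouette score
# averages each point's fit to its own vs. the nearest other cluster.
print(km.inertia_, silhouette_score(X, km.labels_))

# External validation: compare clusters against a benchmark labelling
# (random here; in practice, e.g. a known grouping of contestants).
benchmark = rng.integers(0, 3, size=100)
print(adjusted_rand_score(benchmark, km.labels_))
```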
Result Interpretation: The clusters revealed distinct contestant groups and surfaced patterns in the data that had previously gone unnoticed.
Visualization: Beyond the EDA graphics, cluster visualizations such as scatter plots and dendrograms gave a clearer view of the derived data segments.
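A minimal sketch of the two main plot types, using synthetic data and assumed axis choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(60, 3)))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter of two features, colored by cluster assignment.
ax1.scatter(X[:, 0], X[:, 2], c=labels)
ax1.set(xlabel="age (scaled)", ylabel="immunityWins (scaled)")

# Dendrogram from Ward linkage for the hierarchical view.
dendrogram(linkage(X, method="ward"), ax=ax2, no_labels=True)
plt.tight_layout()
plt.show()
```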