This notebook is also available on Kaggle:
https://www.kaggle.com/code/asterfung/titanic-eda-randomforest-gridsearch
The sinking of the Titanic was one of the deadliest maritime disasters in history. The RMS Titanic, with an estimated 2,224 people on board, sank on 15 April 1912 in the North Atlantic Ocean, resulting in the deaths of more than 1,500 people. The disaster prompted major changes in maritime regulations to implement new safety measures, for example the provision of more lifeboats and the establishment of the International Ice Patrol.
While the competition rewards high prediction accuracy, this notebook also aims to understand the Titanic story :)
The number of casualties was reported by newspapers at the time, and the British Board of Trade later reported the finalised casualty figures. Kaggle has further prepared the dataset in tabular format to host a predictive modelling competition.
https://www.kaggle.com/competitions/titanic/data
Below is the dataset description (courtesy of the Kaggle team):
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
Age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
Age:
Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
Sibsp: The dataset defines family relations in this way:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
Parch: The dataset defines family relations in this way:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
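The variable notes above translate into a couple of simple derived columns. The sketch below runs on a tiny made-up table rather than the real data. The column names `SibSp`, `Parch` and `Age` follow the competition CSV, while `FamilySize`, `AgeEstimated` and the helper `add_family_and_age_flags` are names introduced here purely for illustration.

```python
import pandas as pd


def add_family_and_age_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with two illustrative derived columns (hypothetical names)."""
    out = df.copy()
    # Family size = siblings/spouses + parents/children + the passenger themself.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # Estimated ages are recorded as xx.5; ages below 1 are treated as genuine
    # fractions here, which is an assumption of this sketch.
    out["AgeEstimated"] = (out["Age"] >= 1) & (out["Age"] % 1 == 0.5)
    return out


# Made-up rows, not taken from the real dataset.
toy = pd.DataFrame(
    {
        "SibSp": [1, 0, 3, 0],
        "Parch": [0, 0, 2, 0],
        "Age": [22.0, 28.5, 0.75, None],
    }
)
flagged = add_family_and_age_flags(toy)
print(flagged)
```

Missing ages compare as false in both conditions, so they are simply left unflagged rather than raising an error.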
major updates
- Introduction: Brief introduction to the Titanic disaster
- Introduction: Information about the dataset
- Workflow overview: added a workflow overview section to guide readers
- EDA (inspect the dataset): added a bar plot comparing the counts of surviving and lost passengers
- EDA (inspect the dataset): added a line plot of survival/loss count vs. number of family members
- Feature engineering and selection: feature selection with a correlation matrix and mutual information gain
- Model optimization: tuned the random forest parameters with a grid search
- summarized findings in Highlights and Conclusion
minor changes
- fixed the section numbering
- removed unused code
A random forest model was built to predict whether a passenger on board the Titanic survived the disaster. In the labelled data, the probability of surviving was roughly 0.6 times the probability of dying (about 38% versus 62%). Among the various factors, age, sex and ticket class were the ones most strongly correlated with survival.
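The ratio of surviving to dying can be checked with a few lines of arithmetic, using the counts reported in the details below (342 survivors and 549 lost among 891 passengers). The variable names are only illustrative.

```python
# Counts reported in the details below for the 891 labelled passengers.
survived = 342
lost = 549
total = survived + lost

survival_rate = survived / total
loss_rate = lost / total
ratio = survival_rate / loss_rate  # identical to survived / lost

print(f"survived: {survival_rate:.2%}, lost: {loss_rate:.2%}, ratio: {ratio:.2f}")
# prints "survived: 38.38%, lost: 61.62%, ratio: 0.62"
```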
More details:
- In the training dataset, of a total of 891 passengers, 342 (38.38%) survived the Titanic disaster and 549 (61.62%) were lost.
- Younger passengers were more likely to survive the Titanic (p<0.05, t-test). The mean age of the passengers who survived was 28.344 years, compared with 30.627 years for those who were lost.
- Survival chance was partly determined by the ticket fare (as implied by passenger class). Passengers who survived paid more than those who were lost (p<0.01, Mann-Whitney test). The mean fare paid by the surviving group (37.8 pounds) was nearly double the mean fare paid by those who were lost (19.72 pounds).
- The correlation study shows that some features are correlated: Titles and Age, Fare and Pclass.
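The two group comparisons above could be run along the following lines. This is a sketch on made-up numbers, not the notebook's actual code: the toy values, the helper name `compare_groups`, and the choice of Welch's variant of the t-test (`equal_var=False`) are all assumptions introduced here.

```python
import pandas as pd
from scipy import stats


def compare_groups(df: pd.DataFrame, column: str) -> dict:
    """Compare `column` between survivors and non-survivors (hypothetical helper)."""
    survived = df.loc[df["Survived"] == 1, column].dropna()
    lost = df.loc[df["Survived"] == 0, column].dropna()
    # Welch's t-test compares the group means without assuming equal variances.
    t_stat, t_p = stats.ttest_ind(survived, lost, equal_var=False)
    # Mann-Whitney U is rank-based, so it suits skewed data such as fares.
    u_stat, u_p = stats.mannwhitneyu(survived, lost, alternative="two-sided")
    return {
        "mean_survived": survived.mean(),
        "mean_lost": lost.mean(),
        "t_p": t_p,
        "mw_p": u_p,
    }


# Made-up rows for illustration only; real results come from the competition data.
toy = pd.DataFrame(
    {
        "Survived": [1, 1, 1, 1, 0, 0, 0, 0],
        "Age": [4.0, 22.0, 35.0, None, 19.0, 40.0, 54.0, 61.0],
        "Fare": [71.3, 53.1, 26.0, 30.0, 7.25, 8.05, 7.9, 13.0],
    }
)
for col in ("Age", "Fare"):
    print(col, compare_groups(toy, col))
```

A rank-based test is a natural fit for fares because a handful of very expensive tickets can dominate the mean, whereas ages are closer to symmetric and are commonly compared with a t-test.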
1. Import dataset and Python modules
2. Exploratory data analysis
   - 2.1 Inspect the dataset
   - 2.2 Explore the features
3. Feature engineering and selection
4. Declare features and targets for models
5. Train-test split
6. Models
   - 6.1 Random forest
   - 6.2 XGBoost
7. Model optimization
8. Exporting prediction
9. Conclusion
10. Acknowledgements and reference
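The middle of this workflow (declaring features and target, splitting the data, fitting a random forest and tuning it with a grid search) can be sketched as below. The data are randomly generated stand-ins with Titanic-like column names, the 0/1 encoding of `Sex` is assumed, and the parameter grid is an arbitrary small example rather than the grid tuned in this notebook. Only the scikit-learn calls themselves (`train_test_split`, `RandomForestClassifier`, `GridSearchCV`) are standard.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (NOT the real Titanic dataset).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame(
    {
        "Pclass": rng.integers(1, 4, size=n),
        "Sex": rng.integers(0, 2, size=n),  # assumed encoding: 0 = male, 1 = female
        "Age": rng.uniform(1, 70, size=n),
        "Fare": rng.uniform(5, 100, size=n),
    }
)
# Toy target tied to sex and class so the model has a signal to find.
data["Survived"] = ((data["Sex"] == 1) | (data["Pclass"] == 1)).astype(int)

# Declare features and target.
X = data[["Pclass", "Sex", "Age", "Fare"]]
y = data["Survived"]

# Hold out part of the labelled data for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Tune a random forest over a small, illustrative parameter grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

valid_accuracy = search.best_estimator_.score(X_valid, y_valid)
print("best parameters:", search.best_params_)
print("validation accuracy:", valid_accuracy)
```

Keeping a validation split separate from the cross-validation folds gives an accuracy estimate on rows the grid search never saw, which is useful because the competition test set has no labels.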