The aim of this project is to predict the level of damage to buildings caused by the 2015 Gorkha earthquake in Nepal. The predictions are based on aspects of building location and construction. The data was collected through surveys by Kathmandu Living Labs and the Central Bureau of Statistics, which works under the National Planning Commission Secretariat of Nepal.
This survey is one of the largest post-disaster datasets ever collected. It contains valuable information on earthquake impacts, household conditions, and socio-economic-demographic statistics. The dataset contains approximately 260,000 labeled samples, each described by 38 features.
This project was completed as part of my Statistical Foundation of Machine Learning class (INFO-F422).
The project was graded 10/10. It reached a micro-averaged F1 score of 0.7426, ranking 802nd out of more than 6,000 participants. The top entry on the leaderboard scored 0.7558, only 0.0132 higher.
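For reference, the micro-averaged F1 score is conventionally defined as below. This is the standard textbook definition, not something taken from the report. The symbols are assumptions introduced here: $K$ is the number of damage levels, and $TP_k$, $FP_k$, $FN_k$ are the true positives, false positives, and false negatives for level $k$.

$$
P_{\text{micro}} = \frac{\sum_{k=1}^{K} TP_k}{\sum_{k=1}^{K} \left(TP_k + FP_k\right)}, \qquad
R_{\text{micro}} = \frac{\sum_{k=1}^{K} TP_k}{\sum_{k=1}^{K} \left(TP_k + FN_k\right)}, \qquad
F1_{\text{micro}} = \frac{2 \, P_{\text{micro}} \, R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}}
$$

When every building is assigned exactly one damage level, each misclassification counts as one false positive and one false negative, so $P_{\text{micro}} = R_{\text{micro}}$ and the score equals the overall classification accuracy.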
The project was implemented in R using R notebooks. It features data exploration, data processing, feature selection techniques, and model evaluation. These can be found in report.ipynb.
A feature relevance metric was also computed once the best model had been chosen according to the entropy. The following equations detail the computation:
where the