First of all, a great shout-out to the other co-owner of this project, Zehra! We put our effort on this analysis collectively.
This is a multi-class classification problem regarding water pumps in Tanzania, where the goal is to classify water pumps' status either as: "functional", "needs repair", or "nonfunctional". The data is provided by the Tanzanian Ministry of Water. If a good statistical model is built, pump maintenance can be improved, which would lead to better access to water.
Figure 1: Distribution of water pump status in the training dataset.
We started by exploring the dataset and performing basic data cleaning and preprocessing steps. The training data set contains 59,400 observations with the labels and the test set contains 14,850 observations (without the labels).
In training data set there were several number of numerical columns containing high number of zero values. So, we decided to handle zero values using categorical location-based data columns like "subvillage", "lga" and ward etc. These location columns had high cardinality, therefore after imputation of numerical location-based data columns we dropped some of the categorical locations columns because of two main reasons; (1) categorical data is required to be encoded into multiple binary columns thus would hinder the performance of the model (2) having a numerical longitude/latitude data would yield the same result therefore, would cause redundancy.
Figure 2 & 3: Histograms GPS Height and Longitude in the training dataset.
In longitude and gps_height columns there were high number 0 values. For Tanzania, these geographical values can not be 0. Whereas in latitude column, there were data points lying outside of Tanzania, on the ocean. Thus, zero and improper latitude values are replaced by NaN and then imputed by using mean value of each column for each unique grouping of 'basin', 'region', 'lga', 'ward', 'subvillage' variables.
After value imputation, gps_height plot shows that in lower heights of Tanzania the ratio of non functional pumps are higher than higher ratios. The functionality between 1000 and 2000 is better.
Figure 4: Histogram of imputed GPS Height in training dataset, grouped by pump status.
Mapping Coordinates
Creating a scatter plot with longitude and latitude, we can see that the number of functional water pumps is higher in some areas, while in some areas it is less, in some areas there are no records at all. The regions, latitude, longitude and gps height is in relation with pump functionality.
Figure 5: Scatter plot of imputed longitude and latitude in training dataset, grouped by pump status.
To handle missing values in this column, computed probability distribution of construction year and replaced missing values with random samples drawn from the normalized probability distribution and categorized all values to decades at the end.
Not surprisingly, the older pumps have a higher ratio in non-functioning pumps compared to the newer water pumps. Besides, more than half of the water pump are constructed last 20 years.
Figure 6: Bar chart of construction year in training dataset, grouped by pump status.
In population, several observations had missing values. Similar to the method stated for longitude-latitude-gps_height above, missing values are replaced using the mean value of each column for each unique combination of 'basin', 'region', 'lga', 'ward', 'subvillage' variables.
The distribution of classes among variables are relatively even, except dry and unknown water pumps are largely non-functional.
Figure 7: Bar chart of quantity group in training dataset, grouped by pump status.
Over 50k of waterpumps have good quality of water. However, people cannot reach this quality type due to half of these are non-functional waterpumps. Unlike the others, the rate of non-functional waterpumps in unknown variable is very high.
Figure 8: Bar chart of quality group in training dataset, grouped by pump status.
According to waterpoint types of water pump, other class have too much more non-functional pumps than functional. Almost half of waterpoint_type is communal standpipe and more than half of this type is functional.
Figure 9: Bar chart of waterpoint type in training dataset, grouped by pump status.
The distribution of classes among all variables are relatively even. But, nearly 50k of water pumps use groundwater.
Figure 10: Bar chart of source class in training dataset, grouped by pump status.
According to extraction types of water pump, other and mono class may not well maintained, and they have much more non-functional pumps than functional. Almost half of extraction class is gravity and these are mostly functional waterpumps.
Figure 11: Bar chart of extraction type in training dataset, grouped by pump status.
We extracted record month from date_recorded column to see season effect. And, the ratio of functional water pumps in spring season records is slightly higher than other months. It may be a proof of season effect in records. To observe this in the analysis, we separated the date data into three numerical columns of DD-MM-YYYY.
Figure 12: Bar chart of recordMonth in training dataset, grouped by pump status.
We converted payment data to binary where never pay is 0, and other cases as 1 for ease of analysis. The ratio of functional water pumps with payment is higher than no payment.
Figure 13: Bar chart of payment in training dataset, grouped by pump status.
Permit, Public Meeting, Scheme_Management _columns seems to have 5-6% missing. "_scheme_management" , "permit", "public meeting" were highly orrelated with management with a few exceptions. Therefore, they are grouped by management and filled by respective group's mode. At the end, permit and public meeting are binarized and false columns were dropped.
Scheme management and public meeting columns have imbalanced distribution.
Figure 14 & 15: Bar charts of scheme_management (left) and public meeting (right) in training dataset, grouped by pump status.
Some of the columns stated below are dropped for different reasons:
- Subvillage column is only 0.6% missing. But it is a categoric varibale with high cardinality. Longitude, Latitude and GPS Height columns were imputed by the help of this and other categoric location based columns (__'subvillage', 'ward', 'region', 'basin', 'lga'..) and also a high cardinality column at the end would yield numerous encoded columns therefore this and other categoric location columns _(__'subvillage', 'ward', 'region', 'basin', 'lga'..)_were removed.
- 'recorded_by' only has 1 unique value, would have no effect on the model and therefore removed.
- Some columns had very high number of unique values (high cardinality) and would hinder model consistency_: 'funder', 'installer', 'lga', 'scheme_name', 'subvillage', 'ward', 'wpt_name'._
- _#Amount__tsh is 70% empty, num_private 98.7% empty therefore will be removed.
- Construction_year will be dropped since it is divided into seperate columns.
- Extraction_type and extraction _type_group are omitted since same as extraction_type_class.
- Similar to above, WaterQuality/Quality Group , Quantity/Quantity group, Management/Management Group, Region/RegionCode, Source/SourceType/SourceClass, WaterpointType/Group, Payment/PaymentType were columns stating the same detail for observations and were repeating themselves. Thus, these repetitions are dropped.
- Scheme_name__dropped considering that 47% missing data.
Figure 16: Chart of columns in training dataset having missing values, prior to imputation.
At the end we have columns of :
'id', 'gps_height', 'longitude', 'latitude', 'region_code',
'district_code', 'population', 'public_meeting', 'scheme_management',
'permit', 'extraction_type', 'payment', 'quality_group',
'quantity_group', 'source_class', 'waterpoint_type'
'recordMonth'.
At the end of the preprocessing, we dummified features that can be used in models due to dataset contains categorical variables. Binary columns' false dummies were dropped.
After engineering our features, we have trained and evaluated various classification models to see which one performs best on the data. Trained models are based on; K-Nearest Neighbors (i.e. KNN) model, Logistic Regression, Random Forest, Random Forest with Grid-Search, and Gradient Boosting Classifier.
Accuracy of each model is stated below:
- Random Forest Classifier = 0.80
- XGBoost = 0.78
- Gradient Boosting Classifier = 0.74
- Logistic Regression = 0.71
- KNN = 0.52
For KNN, we had to select a K hyperparameter, which governs the number of points to be taken under consideration for a given observation in the model.
Charting the error rate for K values in between range 1 and 10, it was observed that K=8 had the minimum error rate thus K=8 is selected.
Figure 17: Error Rate per K values in KNN model trials.
Upon analysis of KNN model, confusion matrix provided the results below. 0.51 test set accuracy was quite low, therefore we had to search for alternative model types.
Figure 18: Classification report for KNN model based on K=8.
As our second model, we have used a multiclass logistic regression model. We used a multiclass model because the regular "logit" model under statsmodels was not supporting multiple classifiers in the target column (in this case, status_group of the pumps).
Figure 19: Classification report for Logistic Regression model.
As seen in the figure above, LogisticRegression model has achieved a 71% accuracy. While this was pretty accurate, we were curious about exploring tree-based statistical models.
Gradient Boosting is a known model type in classification and regression alike. We have decided to use the Gradient Boosting Classifier model in a hope to increase accuracy of the model.
Figure 20: Confusion Matrix and classification report for Gradient Boosting model.
As it is seen from the figure above, we have achieved an accuracy of 74%, ahead of Logistic Regression model. While this is quite good, we had to use XGBoost, an improved version of Gradient Boosting model, to see whether it would provide better accuracy over the data.
XGBoost (eXtreme Gradient Boosting) is an improved version of the Gradient Boosting model using GB with decision trees and is superior in terms of speed, accuracy, data handling and so on. In this project, we expected an accuracy score above the Gradient Boosting model.
Figure 21: Confusion Matrix and classification report for XGBoost model.
As a result, we have achieved an accuracy of 78%, much higher than Gradient Boosting model. As hyperparameters, we got our score based on n_estimators=100, reg_alpha=1 for L1 and reg_lambda=0.01 for L2 regularization.
As the 5th and final model, we decided to opt for a Random Forest Classifier model. We have trained a Random Forest Classifier model having maximum depth of the trees in the forest to 20 and used it to make predictions on the test data. In this case, our output showed the accuracy of the test set as 80%, achieving our best test accuracy score at all.
Figure 22: Classification report of Random Forest Classifier model.
To sum up, we carefully explored and preprocessed the data, engineered informative features, and experimented with different classification models to see which one works best for this problem. Our best model, Random Forest Classifier, predicted the functionality of the pumps with an accuracy of 80%.
Figure 23: Submission Score of Drivendata for results of our Random Forest Classifier model.