GitHub - naren-mohan/World-Indicators-Clustering: Given a dataset with different features for different countries, implementing a clustering algorithm to find groups of countries based on the given features using R

title	author	date	output
Clustering of various countries	Naren Mohan	11/23/2021	html_document

Objective

The world indicators dataset compares different countries based on selected attributes.

To use K-means and hierarchical clustering methods to group similar countries together
To use Internal validation metrics such as Silhouette co-efficient to report the cluster quality
Tune the hyperparameters to report the best clustering solution.
To give a detailed list of all the groups and the countries included within each group
To generate few plots that would explain the characteristics of each cluster

library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(factoextra)
library(tidyr)
library(scales)
library(corrplot)
library(ClusterR)
library(clValid)

set.seed(22)

Reading the dataset

df <- read_csv("World Indicators.csv")

Reading the structure of the dataset

str(df)

Checking for NA values

sort(colSums(is.na(df)))

Energy Usage and Lending Interest columns have around 35% of NA values. Mostly we will be dropping these columns during our analysis

Converting GDP and Health Exp % GDP to numeric form

df$GDP <- parse_number(df$GDP)
df$`Health Exp/Capita` <- parse_number(df$`Health Exp/Capita`)
df$`Business Tax Rate` <- parse_number(df$`Business Tax Rate`)

Analyzing the correlation between various features

df_num <- as.data.frame(select_if(df, is.numeric))
corval <- cor(df_num, use="na.or.complete")
corrplot(corval, type="upper")

In the correlation plot above, it can be observed that few columns correlate well with the other columns. This means that few columns (or features) are redundant and can be dropped from the dataframe. Columns such as Birth Rate, Infant Mortality Rate, Internet Usage, Life Expectancy Female, Life Expectancy Male, Population 0-14, Population 15-64 correlate highly with each other. Another such group is Energy Usage and GDP as well as GDP, Health Exp % GDP, Health Exp/Capita. Instead of dropping all of these columns, we have to retain one column to represent these groups. For the last two groups, it is easy to determine the set of columns that we have to retain; we are going to retain GDP from both the groups. However, for the first group, it is difficult to externally determine the representing column. We will retain the column that has the highest correlation value with all other columns. To find out which column has the highest correlation, let's use the "corval" matrix.

Determining the column with the highest correlation value among the above listed columns

colvec <- c("Birth Rate", "Infant Mortality Rate", "Internet Usage", "Life Expectancy Female",
           "Life Expectancy Male", "Population 0-14", "Population 15-64")
cval1 <- corval[colvec, colvec]
sort(colMeans(abs(cval1)), decreasing = TRUE)[1]

It can be concluded that Birth Rate is the column that we want to retain among these 7 columns.

Dropping the other 9 columns

colvec2 <- c("Infant Mortality Rate", "Internet Usage", "Life Expectancy Female",
           "Life Expectancy Male", "Population 0-14", "Population 15-64", "Energy Usage", 
           "Health Exp % GDP", "Health Exp/Capita")

df2 <- df[, !(names(df) %in% colvec2)]
df_num <- df_num[, !(names(df_num) %in% colvec2)]

Check NA values

sort(colSums(is.na(df2)), decreasing=TRUE)

Lets go ahead and drop the columns which has more than 10% (>20 NA values) data missing. We are doing so because if we replace these values with either mean or median or zero; then we will introduce bias and these records with replaced values will have similar features and eventually form a cluster, which we would want to avoid. For the remaining features, let's replace the NA values with the mean in the numerical columns.


coldrop <- c("Lending Interest", "Hours to do Tax", "Business Tax Rate", "Days to Start Business")
df2 <- df2[, !names(df2) %in% coldrop]
df_num <- df_num[, !names(df_num) %in% coldrop]

colmean <- as.list(colMeans(df_num, na.rm=TRUE))
df_num <- replace_na(df_num, colmean)

df2 <- replace_na(df2, colmean)

We have the final set of 5 columns (or features) that we are going to perform clustering.

Let us feature scale each of the columns to obtain best clustering result. But before that, let us boxplot each of these features quickly to see if there are any outliers

for (i in (1:5)) {
  boxplot(df2[i], xlab=names(df2[i]))
}

There are many outliers especially for the column GDP. Hence, we will have to scale the values using Standard Scaler function defined by --

df2[1:5] <- scale(df2[1:5], center=TRUE, scale=TRUE)

Now we are ready for clustering. However, we need to find the optimal number of clusters first.

Let us implement Elbow method and Silhouette method to get the number of clusters.

#kmeans(df2[1:5], )
fviz_nbclust(df2[1:5], FUNcluster = kmeans, method = "wss")
fviz_nbclust(df2[1:5], FUNcluster = kmeans, method = "silhouette")

Both Elbow method (WSS) and Silhouette method suggests that 2 is the optimal number of clusters.

However, let's perform kmeans clustering with centers from 2 to 10 in a loop and analyze each. We will use Dunn Index and Silhouette Co-efficient as two internal analysis metrics to evaluate.

centersloop=2:10
dunns <- list()
sse <- list()
ssb <- list()
s1 <- list()
d1 <- dist(df2[1:5], method = "euclidean")
for (centers in centersloop) {
  k1 <- kmeans(df2[1:5], centers = centers, iter.max=100000, nstart=200)
  sse <- append(sse, k1$tot.withinss) #Sum of Squared Errors within cluster
  ssb <- append(ssb, k1$totss)        #Group Sum of Squares 
  dunns <- append(dunns, dunn(clusters=k1$cluster, Data=df2[1:5]))  #Dunn index
  s1 <- append(s1, summary(silhouette(k1$cluster, d1))$avg.width)   #Silhouette Average Width
}

Implementing Hierarchical Clustering

clusters <- 2:10
d1 <- dist(df2[1:5], method = "euclidean")
hc1 <- hclust(d1, method = "complete")

dunns2 <- list()
s2 <- list()

plot(hc1, cex = 0.6, hang = -1)

for (c in clusters) {
  hclust1 <- cutree(hc1, k = c)
  dunns2 <- append(dunns2, dunn(clusters=hclust1, Data=df2[1:5]))  #Dunn index
  s2 <- append(s2, summary(silhouette(hclust1, d1))$avg.width)   #Silhouette Average Width
}

Comparing the Silhouette scores of K means and Hierarchical clustering to choose the optimal number of clusters and the algorithm


plot(x=2:10, y=s1, type="b", xlab="Number of Clusters", ylab="Silhouette co-efficient of K means")
plot(x=2:10, y=s2, type="b", xlab="Number of Clusters", 
     ylab="Silhouette co-efficient of Hierarchical clustering")
#Hierarchical clustering with 2 centers gives erroneous results and we are ignoring it deliberately

The above plots indicate that we can get the best clustering solution by implementing kmeans with center equals 3.

Implementing K means with 3 centers.

k1 <- kmeans(df2[1:5], centers = 3, iter.max=100000, nstart=20)
print("Number of countries in each cluster")
print(table(k1$cluster))

c1 <- subset(df, k1$cluster == 1)
c2 <- subset(df, k1$cluster == 2)
c3 <- subset(df, k1$cluster == 3)

print("Countries in the First Cluster")
print(c1[, c("Country", "Region", "GDP", "Life Expectancy Female", "Birth Rate")])

print("Countries in the Second Cluster")
print(c2[, c("Country", "Region", "GDP", "Life Expectancy Female", "Birth Rate")])

print("Countries in the Third Cluster")
print(c3[, c("Country", "Region", "GDP", "Life Expectancy Female", "Birth Rate")])

Lets understand the details of these 3 clusters by plotting a heatmap of the centers of these clusters.

print(k1$centers)
heatmap(k1$centers, Rowv=NULL, Colv=NULL, xlab="Features", ylab="Clusters")

The above heatmap can give a gist of the characteristics of each of the cluster created by the algorithm

Scatter plot between Birth Rate and GDP and the clusters

df$Clusters <- k1$cluster

fig1 <- plot_ly(data = df, 
          x=df$`Birth Rate`, y=df$GDP, 
          type="scatter", mode="markers", 
          color = as.factor(df$Clusters),
          colors = "Set1",
          alpha = 0.7,
          text = ~Country,
          hovertemplate= '<b>%{text}</b>'
          )
fig1 <- fig1 %>%
  layout(title = "Birth Rate vs GDP",
      xaxis = list(title = 'Birth Rate'),
      yaxis = list(title = 'GDP', type="log")
    )

fig1


# ggplot(df, aes(x=df$`Birth Rate`, y=GDP, color=as.factor(Clusters)))+
#   geom_point()+
#   scale_y_continuous(trans="log", labels=dollar)+
#   labs(x="Birth Rate", y="GDP", color="Clusters", title="Birth Rate vs GDP")
  #theme(legend.title = "Clusters")

Here, it can be observed that Birth Rate and GDP are inversely related. And the clusters divided can be observed to be as - (The order might differ with each successive runs)

Lower GDP and Higher Birth Rate
Higher GDP and Lower Birth Rate
Medium GDP and Medium Birth Rate

Scatter plot between Health Exp/Capita vs Infant Mortality Rate abd the clusters


fig1 <- plot_ly(data = df, 
          x=df$`Health Exp/Capita`, y=df$`Infant Mortality Rate`, 
          type="scatter", mode="markers", 
          color = as.factor(df$Clusters),
          colors = "Set1",
          alpha = 0.7,
          text = ~Country,
          hovertemplate= '<b>%{text}</b>'
          )
fig1 <- fig1 %>%
  layout(title = "Health Exp/Capita vs Infant Mortality Rate",
      xaxis = list(type="log", title = 'Health Exp/Capita'),
      yaxis = list(title = 'Infant Mortality Rate')
    )

fig1

# ggplot(df, aes(x=df$`Health Exp/Capita`, y=df$`Infant Mortality Rate`, color=as.factor(Clusters)))+
#   geom_point()+
#   scale_x_continuous(trans="log", labels=dollar)+
#   labs(x="Health Exp/Capita", y="Infant Mortality Rate", color="Clusters")
#   #theme(legend.title = "Clusters")

As one would expect, Infant mortality rate and Health Expenditure Per capita is inversely proportional and again we observe 3 clusters -

High IMR and Low Health Expenditure per capita
Moderate IMR and Moderate Health Expenditure per capita
Low IMR and High Health Expenditure per capita

Scatter plot between Population Urban vs GDP and the clusters

#df$Clusters <- k1$cluster

fig1 <- plot_ly(data = df, 
          x=df$`Population Urban`, y=df$GDP, 
          type="scatter", mode="markers", 
          color = as.factor(df$Clusters),
          colors = "Set1",
          alpha = 0.7,
          text = ~Country,
          hovertemplate= '<b>%{text}</b>'
          )
fig1 <- fig1 %>%
  layout(title = "Population Urban vs GDP",
      xaxis = list(title = 'Population Urban'),
      yaxis = list(type="log", title = 'GDP')
    )

fig1

# ggplot(df, aes(x=df$`Population Urban`, y=GDP, color=as.factor(Clusters)))+
#   geom_point()+
#   scale_y_continuous(trans="log", labels=dollar)+
#   labs(x="Population Urban", y="GDP", color="Clusters")
#   #theme(legend.title = "Clusters")

A positive trend can be observed between population living in the urban and GDP. The clusters divided can be observed as -

Lower population in the urban and Lower GDP
Higher population in the urban and Higher GDP
Scattered cluster

Conclusion

The clusters divided make sense even in the real world where the top performing countries like USA, China, Japan are in one cluster and then the moderate performing (majority of the European countries) and the poorly performing countries (majority of African countries). As the dataset contains the features that are mostly related to economy and healthy lifestyle, we get these clusters from the algorithm. If we wish to see these based on other factors (natural resources availability, human resources etc.), then we should add more features.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
assets		assets
Countries-Clustering.Rmd		Countries-Clustering.Rmd
Countries-Clustering.html		Countries-Clustering.html
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Objective

Reading the dataset

Reading the structure of the dataset

Checking for NA values

Converting GDP and Health Exp % GDP to numeric form

Analyzing the correlation between various features

Determining the column with the highest correlation value among the above listed columns

Dropping the other 9 columns

Check NA values

Let us feature scale each of the columns to obtain best clustering result. But before that, let us boxplot each of these features quickly to see if there are any outliers

There are many outliers especially for the column GDP. Hence, we will have to scale the values using Standard Scaler function defined by --

Let us implement Elbow method and Silhouette method to get the number of clusters.

However, let's perform kmeans clustering with centers from 2 to 10 in a loop and analyze each. We will use Dunn Index and Silhouette Co-efficient as two internal analysis metrics to evaluate.

Implementing Hierarchical Clustering

Comparing the Silhouette scores of K means and Hierarchical clustering to choose the optimal number of clusters and the algorithm

Implementing K means with 3 centers.

Lets understand the details of these 3 clusters by plotting a heatmap of the centers of these clusters.

Scatter plot between Birth Rate and GDP and the clusters

Scatter plot between Health Exp/Capita vs Infant Mortality Rate abd the clusters

Scatter plot between Population Urban vs GDP and the clusters

Conclusion

About

Languages

naren-mohan/World-Indicators-Clustering

Folders and files

Latest commit

History

Repository files navigation

Objective

Reading the dataset

Reading the structure of the dataset

Checking for NA values

Converting GDP and Health Exp % GDP to numeric form

Analyzing the correlation between various features

Determining the column with the highest correlation value among the above listed columns

Dropping the other 9 columns

Check NA values

Let us feature scale each of the columns to obtain best clustering result. But before that, let us boxplot each of these features quickly to see if there are any outliers

There are many outliers especially for the column GDP. Hence, we will have to scale the values using Standard Scaler function defined by --

Let us implement Elbow method and Silhouette method to get the number of clusters.

However, let's perform kmeans clustering with centers from 2 to 10 in a loop and analyze each. We will use Dunn Index and Silhouette Co-efficient as two internal analysis metrics to evaluate.

Implementing Hierarchical Clustering

Comparing the Silhouette scores of K means and Hierarchical clustering to choose the optimal number of clusters and the algorithm

Implementing K means with 3 centers.

Lets understand the details of these 3 clusters by plotting a heatmap of the centers of these clusters.

Scatter plot between Birth Rate and GDP and the clusters

Scatter plot between Health Exp/Capita vs Infant Mortality Rate abd the clusters

Scatter plot between Population Urban vs GDP and the clusters

Conclusion

About

Topics

Resources

Stars

Watchers

Forks

Languages