Cannot reproduce Optimal_Clusters_KMeans with criterion = "silhouette" #42
@daniepi I'm sorry for not replying earlier.

```r
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)

require(ClusterR)

silh_score <- ClusterR::Optimal_Clusters_KMeans(data = iris,
                                                max_clusters = 1:5,
                                                num_init = 3,
                                                max_iters = 10,
                                                criterion = "silhouette",
                                                initializer = "kmeans++",
                                                plot_clusters = TRUE)
```

Give me a few days to look into this.
Hi @mlampros, no worries. I have version …
@daniepi after looking once again into the code, it seems the `Optimal_Clusters_KMeans()` function is not the one to use for your purpose. The silhouette metric of `Optimal_Clusters_KMeans()` averages all the computed silhouette values of all clusters for each iteration. That means, if you intend to compute the optimal clusters 2 to 5 for the iris dataset, then the silhouette metric for, say, cluster 2 will initially return 2 values (a silhouette for cluster 1 and a silhouette for cluster 2), and because `Optimal_Clusters_KMeans()` has to return a single value for each iteration (from 2 to 5), you simply receive the average of the values that correspond to each cluster iteration, i.e. for cluster 2 the average of the 2 silhouette values, for cluster 3 the average of the 3 silhouette values, etc. I included the `silhouette_of_clusters()` function, which has similar functionality to the `cluster::silhouette()` function:

```r
require(ClusterR)

data(dietary_survey_IBS)
dat = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]
dat = center_scale(dat)

clusters = 2
set_seed = 1

# compute k-means
km = KMeans_rcpp(dat, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)

# compute the silhouette width
silh_km = silhouette_of_clusters(data = dat, clusters = km$clusters)

str(silh_km$silhouette_matrix)
# num [1:400, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
# - attr(*, "dimnames")=List of 2
#  ..$ : NULL
#  ..$ : chr [1:3] "cluster" "intra_cluster_dissim" "silhouette"

silh_km$silhouette_summary
#   cluster size avg_intra_dissim avg_silhouette
# 1       1  200         4.187004     0.60082682
# 2       2  200         9.750614     0.06605173

# compare results with the 'cluster' R package
diss = ClusterR::distance_matrix(dat, method = "euclidean", upper = TRUE, diagonal = TRUE, threads = 1L)
set.seed(seed = set_seed)
silh = cluster::silhouette(x = km$clusters, dist = diss)
summary(silh)
# Silhouette of 400 units in 2 clusters from silhouette.default(x = km$clusters, dist = diss) :
#  Cluster sizes and average silhouette widths:
#        200        200
# 0.60082682 0.06605173
# Individual silhouette widths:
#     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
# -0.09454  0.07441  0.35524  0.33344  0.60343  0.66267

# iris dataset
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
mt = center_scale(mt)

clusters = 3
set_seed = 1

# compute k-means
km = KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
silh_km = silhouette_of_clusters(data = mt, clusters = km$clusters)

str(silh_km$silhouette_matrix)
# num [1:150, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
# - attr(*, "dimnames")=List of 2
#  ..$ : NULL
#  ..$ : chr [1:3] "cluster" "intra_cluster_dissim" "silhouette"

silh_km$silhouette_summary
#   cluster size avg_intra_dissim avg_silhouette
# 1       1   51         1.327643      0.3304151
# 2       2   50         1.175155      0.6331807
# 3       3   49         1.168557      0.4078544

diss = ClusterR::distance_matrix(mt, method = "euclidean", upper = TRUE, diagonal = TRUE, threads = 1L)
silh = cluster::silhouette(x = km$clusters, dist = diss)
summary(silh)
# Silhouette of 150 units in 3 clusters from silhouette.default(x = km$clusters, dist = diss) :
#  Cluster sizes and average silhouette widths:
#        51        50        49
# 0.3304151 0.6331807 0.4078544
# Individual silhouette widths:
#     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
# -0.05003  0.35959  0.47506  0.45663  0.59352  0.73251
```
You can install the updated version using:

```r
remotes::install_github('mlampros/ClusterR', upgrade = 'always',
                        dependencies = TRUE,
                        repos = 'https://cloud.r-project.org/')
```

I'll update the ClusterR package on CRAN probably next week, because I have to modify a few more functions (not related to your issue).
Regarding performance, I used the microbenchmark R package, and it seems that `cluster::silhouette()` is faster than `silhouette_of_clusters()`:

```r
require(ClusterR)

data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
mt = center_scale(mt)

clusters = 3
set_seed = 1

# compute k-means
km = ClusterR::KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)

# 150 observations
microbenchmark::microbenchmark(silhouette_of_clusters(mt, km$clusters), cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")))
# Unit: microseconds
#                                                                                expr      min        lq      mean    median        uq      max neval
#                                             silhouette_of_clusters(mt, km$clusters) 1731.439 1832.6275 1890.0648 1862.8150 1903.6090 2333.522   100
#  cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean"))  284.070  301.1695  316.1756  309.3755  326.7565  474.051   100

clusters = 4
COLS = 5

# 3000 observations
ROWS = 3000
mt = matrix(runif(n = ROWS * COLS), nrow = ROWS, ncol = COLS)
dim(mt)
km = ClusterR::KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)

microbenchmark::microbenchmark(silhouette_of_clusters(mt, km$clusters), cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")), times = 10)
# Unit: milliseconds
#                                                                                expr      min       lq     mean   median       uq      max neval
#                                             silhouette_of_clusters(mt, km$clusters) 558.8527 561.2622 563.9357 562.5054 563.6190 581.7394    10
#  cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")) 158.9886 160.8787 369.5834 366.5273 571.2039 588.6946    10

# 10000 observations
ROWS = 10000
mt = matrix(runif(n = ROWS * COLS), nrow = ROWS, ncol = COLS)
dim(mt)
km = ClusterR::KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)

microbenchmark::microbenchmark(silhouette_of_clusters(mt, km$clusters), cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")), times = 10)
# Unit: seconds
#                                                                                expr      min       lq     mean   median       uq      max neval
#                                             silhouette_of_clusters(mt, km$clusters) 6.253863 6.261514 6.279802 6.289321 6.291960 6.297662    10
#  cluster::silhouette(x = km$clusters, dist = stats::dist(mt, method = "euclidean")) 1.590505 1.601715 1.683070 1.602795 1.610305 2.022306    10
```
Hi @mlampros. Maybe I have misunderstood the definition of the average silhouette score. If I understand the definition correctly, the average silhouette score should be the grand mean of the individual silhouette scores.

In other words, if the clusters are of quite different sizes and the intra-cluster average silhouette scores vary a lot, these two approaches will give two quite different scores.
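A toy example (hypothetical numbers, not from any dataset in this thread) makes the difference concrete: with two clusters of very unequal sizes, the grand mean of the individual silhouette widths and the mean of the per-cluster averages diverge sharply:

```r
# hypothetical silhouette widths for two clusters of unequal size
s1 = rep(0.9, 190)   # large, well-separated cluster
s2 = rep(0.1, 10)    # small, poorly separated cluster

grand_mean = mean(c(s1, s2))                         # weights every observation equally
mean_of_cluster_means = mean(c(mean(s1), mean(s2)))  # weights every cluster equally

grand_mean             # 0.86
mean_of_cluster_means  # 0.5
```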
Yes, that's true. I think that by having now the …
Now the function returns similar results to the `sklearn.silhouette_score` you mentioned, because internally the silhouette values are averaged in a comparable way:

```r
require(ClusterR)

# iris dataset
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
mt = center_scale(mt)

clusters = 3
set_seed = 1

km = KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)

# requires the reticulate R package
skl = reticulate::import('sklearn.metrics')
skl_score = skl$silhouette_score(X = mt, labels = km$clusters)
skl_score
# 0.4566338

silh_score = silhouette_of_clusters(data = mt, clusters = km$clusters)
silh_score$silhouette_summary

mean(silh_score$silhouette_summary$avg_silhouette)
# 0.4571501
```
In case one wants to use the official average silhouette score definition (as used by `sklearn`) …
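One way to compute that grand mean directly (a sketch, relying on the `silhouette_matrix` layout that `silhouette_of_clusters()` returns, as shown earlier in this thread) is to average the 'silhouette' column over all observations:

```r
require(ClusterR)

data(iris)
iris$Species = NULL
mt = center_scale(as.matrix(iris))

km = KMeans_rcpp(mt, clusters = 3, num_init = 5, max_iters = 100,
                 initializer = 'kmeans++', seed = 1)
silh = silhouette_of_clusters(data = mt, clusters = km$clusters)

# grand mean over all 150 observations (the 'official' definition)
mean(silh$silhouette_matrix[, "silhouette"])
# ~0.4566, the sklearn value reported in this thread for the same setup
```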
Thanks for making me aware of this issue. The silhouette criterion for `Optimal_Clusters_KMeans()` was one of those metrics for which I didn't have a reference. I added the 'silhouette_global_average' value to the output, and the following example shows that the results now match:

```r
require(ClusterR)

# iris dataset
data(iris)
iris$Species = NULL
str(iris)
mt = as.matrix(iris)
mt = center_scale(mt)

clusters = 3
set_seed = 1

# compute k-means
km = KMeans_rcpp(mt, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', seed = set_seed)
silh = silhouette_of_clusters(mt, km$clusters)
glob_avg = silh$silhouette_global_average

silh_score <- ClusterR::Optimal_Clusters_KMeans(data = mt,
                                                max_clusters = 5L,
                                                num_init = 5L,
                                                max_iters = 100L,
                                                criterion = "silhouette",
                                                initializer = "kmeans++",
                                                plot_clusters = TRUE)

# requires the reticulate R package
skl = reticulate::import('sklearn.metrics')
skl_score = skl$silhouette_score(X = mt, labels = km$clusters, metric = 'euclidean')
skl_score
# 0.4566338

silh_score[clusters] == glob_avg
# [1] TRUE

all.equal(glob_avg, skl_score, tolerance = sqrt(.Machine$double.eps))
# [1] TRUE
```

Feel free to close the issue if the code now works as expected. I'll upload an updated version to CRAN next week.
Looks good. Thanks for looking into this.
Hi, I am struggling to reproduce the output of `Optimal_Clusters_KMeans` with `criterion = "silhouette"`. For my dataset, the differences are quite big. It first peaks at `k=2`, then slightly drops and increases again, peaking even higher at `k=7`.

My dataset is ~1.4M records, so I run `Optimal_Clusters_KMeans` on subsets of the data (31 exclusive subsets) in parallel. Hence the plot.

However, if I cluster with `KMeans_rcpp`, calculate distances with `distance_matrix`, and calculate the silhouette with `cluster::silhouette`, I get very different results, suggesting that `k=2` is by far the 'best' choice, peaking on average at `0.55` average silhouette width. I ran this only for 10 subsets due to memory usage.

This more or less matches the results I see using `cuML`'s `KMeans` and `silhouette_score`, with `init=k-means++`, run on a 300k sample.

Example with `Iris`:

With `Iris` the differences are not as significant as with my dataset.

I get similar differences using `initializer = "random"`.

Appreciate any input. Thanks.
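For reference, the manual pipeline described above (cluster with `KMeans_rcpp`, distances with `distance_matrix`, silhouette with `cluster::silhouette`) can be sketched on iris; the parameters here are illustrative, not the exact ones used on the 1.4M-row dataset:

```r
require(ClusterR)

data(iris)
iris$Species = NULL
mt = center_scale(as.matrix(iris))

# cluster with k = 2, then score the partition with cluster::silhouette
km = KMeans_rcpp(mt, clusters = 2, num_init = 5, max_iters = 100,
                 initializer = 'kmeans++', seed = 1)
diss = distance_matrix(mt, method = "euclidean", upper = TRUE, diagonal = TRUE)

sil = cluster::silhouette(x = km$clusters, dist = diss)
mean(sil[, "sil_width"])   # average silhouette width for k = 2
```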
R session: