KMeans_rcpp sometimes produces invalid results containing NANs #25
First of all, thanks for reporting this. It seems that this is an error related to the kmeans++ initializer. I ran the internal `kmeans_pp_init()` function, and for your small example use case it returns duplicated centroids, as the following code snippet shows:

```cpp
# include <RcppArmadillo.h>
# include <ClusterRHeader.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends(ClusterR)]]
// [[Rcpp::plugins(cpp11)]]

/*
 * res_clust = expose_kmeans_pp(data = data, clusters = 6, medoids = FALSE)
 * res_clust
 *
 *      [,1] [,2]
 * [1,]    0    1
 * [2,]    0    1
 * [3,]    1    1
 * [4,]    1    1
 * [5,]    2    2
 * [6,]    1    0
 *
 */

// [[Rcpp::export]]
arma::mat expose_kmeans_pp(arma::mat& data, int clusters, bool medoids = false) {
  clustR::ClustHeader clsh;
  return clsh.kmeans_pp_init(data, clusters, medoids);
}
```

The kmeans++ initializer is a new feature in sklearn (starting from version 0.24), and I wasn't able to install that version on my OS to reproduce your results. I just copy-pasted the code of the function from the GitHub repo, but I'm not sure that I used the correct parameter setting, because when I run the `kmeans_plusplus()` function I receive duplicated centroids as well:

```
[[6.91013936e-310 2.55737153e-316]
 [1.00000000e+000 1.00000000e+000]
 [1.00000000e+000 0.00000000e+000]
 [0.00000000e+000 1.00000000e+000]
 [0.00000000e+000 0.00000000e+000]
 [0.00000000e+000 0.00000000e+000]]
```

As far as I am aware, the flexclust R package currently also includes a kmeans++ implementation, and it gives the following centroids for your data:

```r
fml = flexclust::kccaFamily('kmeans')
flx_kmpp = flexclust:::kmeanspp(x = data, k = k, family = fml)
flx_kmpp

#      [,1] [,2]
# [1,]    4    2
# [2,]    0    1
# [3,]    6    2
# [4,]    2    2
# [5,]    1    0
# [6,]    3    1
```

If we move to a real-world example dataset, we can see small differences in the mean SSE (including sklearn via the reticulate package):

```r
require(ClusterR)
require(glue)
require(reticulate)
require(KernelKnn)

data("Boston")
kmeans_sklearn = reticulate::import('sklearn.cluster', convert = TRUE)

myseed = function() {
  abs(as.integer((Sys.time() - Sys.time()) * 10E7))
}

dat_norm = center_scale(Boston)

k = 6
runs = 300           # 'max_iter' parameter in sklearn
n = 0
num_init = 10
verbose = FALSE

vec_kmeanpp = vec_random = random_base = sklearn_elkan = sklearn_auto = rep(NA_real_, runs)

for (i in 1:runs) {
  L_kmpp = KMeans_rcpp(dat_norm, clusters = k, num_init = num_init, initializer = "kmeans++", seed = myseed(), verbose = verbose)
  L_random = KMeans_rcpp(dat_norm, clusters = k, num_init = num_init, initializer = "random", seed = myseed(), verbose = verbose)
  base_kmlloyd = stats::kmeans(x = dat_norm, centers = k, iter.max = runs, nstart = num_init, algorithm = 'Lloyd', trace = F)
  km_sklearn_elkan = kmeans_sklearn$k_means(X = dat_norm, n_clusters = as.integer(k), algorithm = 'elkan', random_state = as.integer(myseed()))
  km_sklearn_auto = kmeans_sklearn$k_means(X = dat_norm, n_clusters = as.integer(k), algorithm = 'auto', random_state = as.integer(myseed()))

  vec_kmeanpp[i] = sum(L_kmpp$WCSS_per_cluster)
  vec_random[i] = sum(L_random$WCSS_per_cluster)
  random_base[i] = sum(base_kmlloyd$withinss)
  sklearn_elkan[i] = km_sklearn_elkan[[3]]
  sklearn_auto[i] = km_sklearn_auto[[3]]
}

cat(glue::glue("ClusterR_kmeanpp_mean_SSE: {mean(vec_kmeanpp)}\n ClusterR_random_mean_SSE: {mean(vec_random)}\n base_kmeans_mean_SSE: {mean(random_base)}\n Sklearn_elkan_mean_SSE: {mean(sklearn_elkan)}\n Sklearn_auto_mean_SSE: {mean(sklearn_auto)}"), '\n')

# ClusterR_kmeanpp_mean_SSE: 2736.26625096268
# ClusterR_random_mean_SSE: 2742.1682750766
# base_kmeans_mean_SSE: 2741.30445465433
# Sklearn_elkan_mean_SSE: 2745.93712977821
# Sklearn_auto_mean_SSE: 2737.18611532736

# kmeans++ for the normalized data
bst_kmpp = expose_kmeans_pp(data = dat_norm, clusters = 6, medoids = FALSE)
bst_kmpp
any(duplicated(bst_kmpp))

flx_kmpp = flexclust:::kmeanspp(x = dat_norm, k = k, family = fml)
flx_kmpp
any(duplicated(flx_kmpp))
```

I'll have a second look at the function at the weekend, when I'll have more time. If you have the time, please also add the mean SSE for the sklearn version of kmeans++, because I currently use sklearn version 0.22.2 and I'm not able to install the newest version due to the version of my OS.
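For reference, here is a minimal pure-Python sketch of the standard kmeans++ seeding (D² sampling) — this is illustrative only, not the actual C++ code used in ClusterR or sklearn. The point is that correct D² sampling assigns zero probability to any point that coincides with an already chosen center, so duplicated centroids (as in the outputs above) indicate a bug in the implementation rather than expected behavior:

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """Reference D^2 sampling: each new center is drawn with probability
    proportional to its squared distance to the nearest chosen center,
    so an already-chosen point (distance 0) can never be re-drawn."""
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < k:
        # squared distance of every point to its nearest chosen center
        d2 = [min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
              for px, py in points]
        total = sum(d2)
        r = rng.random() * total   # total > 0 while unchosen distinct points remain
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if w > 0 and acc >= r:  # zero-weight points can never be picked
                centers.append(p)
                break
    return centers

# the 8-point example from this issue, k = 6
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (3, 1), (4, 2), (6, 2)]
centers = kmeans_pp_init(pts, 6)
print(len(set(centers)))  # 6 -> all centroids distinct, no duplicates
```

Since every appended center has a strictly positive distance to all existing centers, the six seeds are always distinct for this data.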
I'm sorry, I forgot your second point.
In KMeans_rcpp() I restrict the parameter k to "clusters > nrow(data) - 2", i.e. the number of clusters should be at most nrow(data) - 2, mainly because, when I implemented the ClusterR package back in 2017, I relied mainly on RcppArmadillo, and the Armadillo C++ documentation for kmeans( means, data, k, seed_mode, n_iter, print_mode ) mentions: "The k parameter indicates the number of centroids; the number of samples in the data matrix should be much larger than k". Furthermore, I observed error cases for the datasets that I used when k was close to nrow(data). If you ask me, I don't know if it makes sense to set the number of clusters (k) equal to the number of rows of a dataset, especially for real data, because the purpose of clustering is to group the data.
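One concrete way to see why k close to nrow(data) is fragile: if the initializer produces duplicated centers, the ties in the assignment step mean some clusters receive no points at all, and the subsequent update step averages an empty set, which is exactly how NaN centroids appear. A minimal sketch of the assignment step of Lloyd's algorithm (illustrative only, not the package's code; the duplicated initial centers are chosen by hand to mimic the buggy `kmeans_pp_init()` output above):

```python
def lloyd_assign(points, centers):
    """One assignment step of Lloyd's algorithm: label each point with
    its nearest center (first index on ties), then report clusters
    that received no points."""
    labels = []
    for px, py in points:
        d = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers]
        labels.append(d.index(min(d)))
    empty = [j for j in range(len(centers)) if j not in labels]
    return labels, empty

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (3, 1), (4, 2), (6, 2)]
# hand-picked duplicated centers: indices 1 and 3 repeat indices 0 and 2
dup_centers = [(0, 1), (0, 1), (1, 1), (1, 1), (2, 2), (1, 0)]

labels, empty = lloyd_assign(pts, dup_centers)
print(empty)  # [1, 3] -> the duplicated centers' clusters stay empty
```

The later copy of each duplicated center can never win a tie, so its cluster stays empty and its centroid update would be the mean of zero points.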
This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond.
added a short message to keep this open
This issue was automatically closed because of being stale. Feel free to re-open a closed issue and the Human-lampros will respond.
I updated the CRAN version to 1.2.6 and fixed the internal kmeans_pp_init() function so that duplicated centroids are no longer produced. The new version fixes this issue as well (NAs are no longer present):

```r
require(ClusterR)
require(glue)

myseed = function() {
  as.integer((Sys.time() - Sys.time()) * 10E7)
}

data = matrix(c(0,0, 0,1, 1,0, 1,1, 2,2, 3,1, 4,2, 6,2), ncol = 2, byrow = TRUE)

k = 6
runs = 10
n = 0
num_init = 10
verbose = FALSE
initializer = "kmeans++"

for (i in 1:runs) {
  L = KMeans_rcpp(data[1:8, ], clusters = k, num_init = num_init, initializer = initializer, seed = myseed(), verbose = verbose)
  nansum = sum(is.na(L$centroids))
  sse = sum(L$WCSS_per_cluster)
  if (nansum > 0) {
    n = n + 1
    print(glue("SSE={sse} solution contains {nansum} NAN values"))
  } else {
    print(glue("SSE={sse} "))
  }
}

print(glue('Percentage of runs with NANs: {sprintf("%.1f %%", 100*n/runs)}'))
```

```
SSE=1
SSE=1
SSE=1
SSE=1
SSE=1
SSE=1
SSE=1
SSE=1
SSE=1
SSE=1
Percentage of runs with NANs: 0.0 %
```
Problem
KMeans_rcpp sometimes produces invalid solutions containing NANs. Below is a mini-example (8 data vectors, k=6) where this occurs.
minimal reproducible example
program Output
Typical solution from R:
Equivalent Python program
program output
Typical solution from python:
Remarks
Normally the results should be comparable, but the Python version always produces valid solutions, and the solutions are always good (likely optimal for this mini-problem).
Moreover, the reason for the restriction in KMeans_rcpp that k must be smaller than nrow(data) - 1 is unclear to me. In scikit-learn I can set k=8 and still get a valid solution (each centroid on one data point).
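The scikit-learn behavior described above (k equal to the number of samples) is at least mathematically well defined: with each centroid placed on one data point, every point belongs to its own cluster and the within-cluster SSE is exactly zero. A small illustrative sketch (not scikit-learn's code):

```python
def sse(points, centers, labels):
    """Within-cluster sum of squared errors for a 2-D dataset."""
    return sum((px - centers[l][0]) ** 2 + (py - centers[l][1]) ** 2
               for (px, py), l in zip(points, labels))

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (3, 1), (4, 2), (6, 2)]

# k = n: every point is its own centroid -> a valid solution with SSE 0
centers = list(pts)
labels = list(range(len(pts)))
print(sse(pts, centers, labels))  # 0
```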
Environment
Ubuntu 20.04 LTS
R version 3.6.3 (2020-02-29)