
KMeans_rcpp sometimes produces invalid results containing NANs #25

Closed
gittar opened this issue Nov 16, 2021 · 14 comments

@gittar commented Nov 16, 2021

Problem

KMeans_rcpp sometimes produces invalid solutions containing NaNs. Below is a minimal example (8 data vectors, k = 6) in which roughly half of the runs return centroid matrices containing NaN values.

Minimal reproducible example

library(ClusterR)
library(glue)
myseed = function() {
  as.integer((Sys.time()-Sys.time())*10E7)
}
data=matrix(c(0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2),ncol=2,byrow=TRUE)
k=6
runs=10
n=0
num_init=10
verbose=FALSE
initializer="kmeans++"
for (i in 1:runs) {
  L=KMeans_rcpp(data[1:8,],clusters=k,num_init=num_init,initializer=initializer,seed=myseed(),verbose=verbose)
  nansum=sum(is.na(L$centroids))
  sse<-sum(L$WCSS_per_cluster)
  if (nansum>0) {
    n <- n+1
    print(glue("SSE={sse} solution contains {nansum} NAN values"))
  } else {
    print(glue("SSE={sse} "))
  }
}
print(glue('Percentage of runs with NANs: {sprintf("%.1f %%", 100*n/runs)}'))

Program output:
  SSE=3 
  SSE=3.5 solution contains 2 NAN values
  SSE=2.5 
  SSE=3.5 solution contains 2 NAN values
  SSE=3.5 solution contains 2 NAN values
  SSE=3.5 solution contains 2 NAN values
  SSE=3.33333333333333 solution contains 2 NAN values
  SSE=2.5 
  SSE=2.5 
  SSE=2.5 
  Percentage of runs with NANs: 50.0 %

Typical solution from R: (screenshot: mini_ClusterR)

Equivalent Python program

from sklearn.cluster import KMeans
import numpy as np

X = np.array([0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2]).reshape(-1,2)
km = KMeans(n_clusters=6)
for i in range(10):
    km.fit(X)
    print(f"SSE: {km.inertia_}")

Program output:

SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0

Typical solution from python: (screenshot: mini_scikit-learn)

Remarks

Normally the results should be comparable, but the Python version always produces valid solutions, and those solutions are always good (likely optimal for this mini-problem).

Moreover, the reason for the restriction in KMeans_rcpp that k must be smaller than size_of(data) - 1 is unclear to me. In scikit-learn I can set k = 8 and still get a valid solution (each centroid on one data point).
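A plausible mechanism for the NaNs, sketched as a hypothetical numpy illustration (not ClusterR's actual code): if two initial centroids coincide, one of them ends up owning no points after the assignment step, and recomputing the mean over an empty cluster yields NaN.

```python
import numpy as np

# hypothetical illustration: a duplicated initial centroid leaves one
# cluster empty, and the mean over an empty slice is NaN
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
centroids = np.array([[0., 1.], [0., 1.], [1., 0.]])   # rows 0 and 1 coincide

# assign each point to its nearest centroid (ties go to the lower index,
# so the duplicated centroid at index 1 receives no points)
labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

with np.errstate(invalid="ignore", divide="ignore"):
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(3)])

print(np.isnan(new_centroids).sum())  # 2 NaN values, as in the report above
```

This matches the symptom in the output above, where the invalid solutions each contain exactly 2 NaN values (one per coordinate of the empty cluster's centroid).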

Environment

Ubuntu 20.04 LTS

R version 3.6.3 (2020-02-29)

@mlampros (Owner) commented Nov 17, 2021

@gittar,

first of all, thanks for reporting this. It seems the error is related to the kmeans++ initializer. I ran the internal kmeans_pp_init() function, and for your small example it returns duplicated centroids, as the following code snippet shows:

# include <RcppArmadillo.h>
# include <ClusterRHeader.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends(ClusterR)]]
// [[Rcpp::plugins(cpp11)]]


/*
 * res_clust = expose_kmeans_pp(data = data, clusters = 6, medoids = FALSE)
 * res_clust
 *
 *       [,1] [,2]
 * [1,]    0    1
 * [2,]    0    1
 * [3,]    1    1
 * [4,]    1    1
 * [5,]    2    2
 * [6,]    1    0
 *
 */


// [[Rcpp::export]]
arma::mat expose_kmeans_pp(arma::mat& data, int clusters, bool medoids = false) {

  clustR::ClustHeader clsh;

  return clsh.kmeans_pp_init(data, clusters, medoids);
}
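For comparison, the textbook k-means++ D² sampling can be sketched in a few lines of numpy (a simplified illustration, not ClusterR's implementation). Because an already-selected point is at distance 0 from its nearest centroid, its sampling probability is 0 and it can never be drawn again, so with 8 distinct points and k = 6 the initializer should never return duplicated centroids:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    # first centroid: a uniformly random data point
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        # sample the next centroid proportional to D(x)^2;
        # already-chosen points have d2 == 0 and can never be re-drawn
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X = np.array([0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2], dtype=float).reshape(-1, 2)
centers = kmeans_pp_init(X, 6, np.random.default_rng(0))
print(len(np.unique(centers, axis=0)))  # 6 distinct centroids
```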

The kmeans++ function is a new feature in sklearn (as of version 0.24), and I wasn't able to install that version on my OS to reproduce your results. I just copy-pasted the function's code from the GitHub repo, but I'm not sure I'm using the correct parameter settings, because when I run the kmeans_plusplus() function I receive duplicated centroids as well:

[[6.91013936e-310 2.55737153e-316]
 [1.00000000e+000 1.00000000e+000]
 [1.00000000e+000 0.00000000e+000]
 [0.00000000e+000 1.00000000e+000]
 [0.00000000e+000 0.00000000e+000]
 [0.00000000e+000 0.00000000e+000]]
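For reference, newer scikit-learn exposes the initializer as a standalone function; assuming a version >= 0.24 is installed, it can be called on the example data directly (rather than copy-pasting the function's source):

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus  # available since sklearn 0.24

X = np.array([0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2], dtype=float).reshape(-1, 2)
centers, indices = kmeans_plusplus(X, n_clusters=6, random_state=0)
print(centers)  # 6 rows drawn from the 8 data points
```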

As far as I'm aware, the flexclust R package also includes a kmeans++ implementation, which gives the following centroids for your data:

fml = flexclust::kccaFamily('kmeans')
flx_kmpp = flexclust:::kmeanspp(x = data, k = k, family = fml)
flx_kmpp

     [,1] [,2]
[1,]    4    2
[2,]    0    1
[3,]    6    2
[4,]    2    2
[5,]    1    0
[6,]    3    1

If we move to a real-world dataset, we can see small differences in the mean SSE (including sklearn via the reticulate package):

require(ClusterR)
require(glue)
require(reticulate)
require(KernelKnn)
data("Boston")

kmeans_sklearn = reticulate::import('sklearn.cluster', convert = TRUE)

myseed = function() {
  abs(as.integer((Sys.time()-Sys.time())*10E7))
}

dat_norm = center_scale(Boston)

k=6
runs=300           # 'max_iter' parameter in sklearn  
n=0
num_init=10
verbose=FALSE

vec_kmeanpp = vec_random = random_base = sklearn_elkan = sklearn_auto = rep(NA_real_, runs)

for (i in 1:runs) {
  L_kmpp = KMeans_rcpp(dat_norm,clusters=k,num_init=num_init,initializer="kmeans++",seed=myseed(),verbose=verbose)
  L_random = KMeans_rcpp(dat_norm,clusters=k,num_init=num_init,initializer="random",seed=myseed(),verbose=verbose)
  base_kmlloyd = stats::kmeans(x = dat_norm, centers = k, iter.max = runs, nstart = num_init, algorithm = 'Lloyd', trace = F)
  km_sklearn_elkan = kmeans_sklearn$k_means(X = dat_norm, n_clusters = as.integer(k), algorithm = 'elkan', random_state = as.integer(myseed()))
  km_sklearn_auto = kmeans_sklearn$k_means(X = dat_norm, n_clusters = as.integer(k), algorithm = 'auto', random_state = as.integer(myseed()))

  sse_kmeanspp<-sum(L_kmpp$WCSS_per_cluster)
  vec_kmeanpp[i] = sse_kmeanspp
  
  sse_random<-sum(L_random$WCSS_per_cluster)
  vec_random[i] = sse_random
  
  sse_base = sum(base_kmlloyd$withinss)
  random_base[i] = sse_base
  
  sklearn_elkan[i] = km_sklearn_elkan[[3]]
  sklearn_auto[i] = km_sklearn_auto[[3]]
}

cat(glue::glue("ClusterR_kmeanpp_mean_SSE: {mean(vec_kmeanpp)}\n  ClusterR_random_mean_SSE: {mean(vec_random)}\n  base_kmeans_mean_SSE: {mean(random_base)}\n  Sklearn_elkan_mean_SSE: {mean(sklearn_elkan)}\n  Sklearn_auto_mean_SSE: {mean(sklearn_auto)}"), '\n')
# ClusterR_kmeanpp_mean_SSE: 2736.26625096268
# ClusterR_random_mean_SSE: 2742.1682750766
# base_kmeans_mean_SSE: 2741.30445465433
# Sklearn_elkan_mean_SSE: 2745.93712977821
# Sklearn_auto_mean_SSE: 2737.18611532736 

# kmeans++ for the normalized data

bst_kmpp = expose_kmeans_pp(data = dat_norm, clusters = 6, medoids = FALSE)
bst_kmpp
 
any(duplicated(bst_kmpp))
 
flx_kmpp = flexclust:::kmeanspp(x = dat_norm, k = k, family = fml)
flx_kmpp
 
any(duplicated(flx_kmpp))

I'll take a second look at the function over the weekend when I have more time. If you have the time, please also add the mean SSE for the sklearn version of kmeans++, because I currently use sklearn version 0.22.2 and can't install the newest version due to the version of my OS.

@mlampros (Owner) commented

I'm sorry, I forgot your second point:

Moreover the reason for the restriction in KMeans_rcpp that k must be smaller than size_of(data)-1 is unclear to me. In scikit-learn I can set k=8 and still get a valid solution (each centroid on one data point)

In KMeans_rcpp() I restrict the parameter k with the check "clusters > nrow(data) - 2", i.e. the number of clusters should be at most nrow(data) - 2,

mainly because when I implemented the ClusterR package back in 2017 I relied heavily on RcppArmadillo, and the Armadillo C++ documentation for kmeans(means, data, k, seed_mode, n_iter, print_mode) mentions:

"The k parameter indicates the number of centroids; the number of samples in the data matrix should be much larger than k"

and also because I observed errors on the datasets I used when k was close to nrow(data).

If you ask me, I don't know whether it makes sense to set the number of clusters (k) equal to the number of rows of a dataset, especially for real data, because the purpose of clustering is to group the data.
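On the border case itself: when k equals the number of (distinct) rows, every point can serve as its own centroid and the within-cluster SSE is exactly zero, so the solution is well defined even if not very useful. A quick numpy check (not using ClusterR):

```python
import numpy as np

# the 8 points from the minimal example; with k = n each point is its own
# centroid and the total within-cluster SSE is exactly 0
X = np.array([0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2], dtype=float).reshape(-1, 2)
centroids = X.copy()
labels = np.arange(len(X))
sse = ((X - centroids[labels]) ** 2).sum()
print(sse)  # 0.0
```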

@gittar (Author) commented Nov 18, 2021

  • The mean SSE for scikit-learn's KMeans class is 1.0 for the above problem (zero variance), generated with scikit-learn 0.23.2.
  • k-means++ has been included in scikit-learn for many years. What is new since 0.24.0 is the ability to call the k-means++ initialization without performing any Lloyd iterations; at least one Lloyd iteration was unavoidable previously, since max_iter=0 was not accepted by KMeans.fit(): https://scikit-learn.org/stable/whats_new/v0.24.html#version-0-24-0
  • One aspect which may make the scikit-learn initialization particularly good is local trials, as implemented in https://github.com/scikit-learn/scikit-learn/blob/0d378913b/sklearn/cluster/_kmeans.py#L187 (also present in previous versions such as 0.23.2).
  • Regarding the statement that the number of clusters should be much smaller than the number of data points: this is of course true, but I would not trust an implementation that cannot handle border cases (for reasons not obvious to me) to handle the more typical cases correctly, where problems are much harder to detect.
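The "local trials" refinement mentioned above can be sketched as follows; this is a simplified numpy version of the greedy k-means++ idea (draw several D²-sampled candidates per step and keep the one that most reduces the total potential), not scikit-learn's actual implementation:

```python
import numpy as np

def greedy_kmeans_pp(X, k, rng, n_local_trials=3):
    # first centroid: a uniformly random data point
    centers = [X[rng.integers(len(X))]]
    d2 = ((X - centers[0]) ** 2).sum(-1)  # squared distance to nearest center
    for _ in range(k - 1):
        probs = d2 / d2.sum()
        # draw several candidate centroids by D^2 sampling ("local trials")
        cand = rng.choice(len(X), size=n_local_trials, p=probs)
        # distances each candidate would produce if it were added;
        # shape (n_local_trials, n_points)
        cand_d2 = np.minimum(d2, ((X[:, None] - X[cand][None]) ** 2).sum(-1).T)
        # keep the candidate with the smallest resulting total potential
        best = np.argmin(cand_d2.sum(axis=1))
        centers.append(X[cand[best]])
        d2 = cand_d2[best]
    return np.array(centers)

X = np.array([0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2], dtype=float).reshape(-1, 2)
C = greedy_kmeans_pp(X, 6, np.random.default_rng(0))
print(len(np.unique(C, axis=0)))  # 6 distinct centroids
```

As with plain D² sampling, already-chosen points have zero sampling probability, so the greedy variant also cannot return duplicated centroids on distinct data points.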

@github-actions bot commented

This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond.

@github-actions github-actions bot added the stale label Nov 30, 2021
@mlampros (Owner) commented Dec 1, 2021

added a short message to keep this open

@github-actions bot commented Dec 8, 2021

This issue was automatically closed because of being stale. Feel free to re-open a closed issue and the Human-lampros will respond.

@github-actions github-actions bot closed this as completed Dec 8, 2021
@mlampros mlampros reopened this Dec 8, 2021
@mlampros mlampros reopened this Dec 15, 2021
@mlampros mlampros reopened this Dec 22, 2021
@mlampros mlampros reopened this Dec 29, 2021
@github-actions github-actions bot closed this as completed Jan 5, 2022
@mlampros mlampros reopened this Jan 6, 2022
@mlampros mlampros reopened this Jan 13, 2022
@mlampros mlampros reopened this Jan 20, 2022
@mlampros mlampros reopened this Jan 27, 2022
@mlampros (Owner) commented

I updated the CRAN version to 1.2.6 and fixed the internal kmeans_pp_init() function so that duplicated centroids no longer occur. The new version also fixes this issue (no NAs are produced):

require(ClusterR)
require(glue)
myseed = function() {
  as.integer((Sys.time()-Sys.time())*10E7)
}
data=matrix(c(0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2),ncol=2,byrow=TRUE)
k=6
runs=10
n=0
num_init=10
verbose=FALSE
initializer="kmeans++"
for (i in 1:runs) {
  L=KMeans_rcpp(data[1:8,],clusters=k,num_init=num_init,initializer=initializer,seed=myseed(),verbose=verbose)
  nansum=sum(is.na(L$centroids))
  sse<-sum(L$WCSS_per_cluster)
  if (nansum>0) {
    n <- n+1
    print(glue("SSE={sse} solution contains {nansum} NAN values"))
  } else {
    print(glue("SSE={sse} "))
  }
}
print(glue('Percentage of runs with NANs: {sprintf("%.1f %%", 100*n/runs)}'))

SSE=1 
SSE=1 
SSE=1 
SSE=1 
SSE=1 
SSE=1 
SSE=1 
SSE=1 
SSE=1 
SSE=1 

Percentage of runs with NANs: 0.0 %
