
KMeans_rcpp sometimes produces invalid results containing NANs #25

Closed
gittar opened this issue Nov 16, 2021 · 14 comments

@gittar commented Nov 16, 2021

Problem

KMeans_rcpp sometimes produces invalid solutions containing NaNs. Below is a minimal example (8 data vectors, k = 6) in which roughly half of the runs return centroid matrices containing NaN values.

Minimal reproducible example

library(ClusterR)
library(glue)
myseed = function() {
  as.integer((Sys.time()-Sys.time())*10E7)
}
data=matrix(c(0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2),ncol=2,byrow=TRUE)
k=6
runs=10
n=0
num_init=10
verbose=FALSE
initializer="kmeans++"
for (i in 1:runs) {
  L=KMeans_rcpp(data[1:8,],clusters=k,num_init=num_init,initializer=initializer,seed=myseed(),verbose=verbose)
  nansum=sum(is.na(L$centroids))
  sse<-sum(L$WCSS_per_cluster)
  if (nansum>0) {
    n <- n+1
    print(glue("SSE={sse} solution contains {nansum} NAN values"))
  } else {
    print(glue("SSE={sse} "))
  }
}
print(glue('Percentage of runs with NANs: {sprintf("%.1f %%", 100*n/runs)}'))

Program output:
  SSE=3 
  SSE=3.5 solution contains 2 NAN values
  SSE=2.5 
  SSE=3.5 solution contains 2 NAN values
  SSE=3.5 solution contains 2 NAN values
  SSE=3.5 solution contains 2 NAN values
  SSE=3.33333333333333 solution contains 2 NAN values
  SSE=2.5 
  SSE=2.5 
  SSE=2.5 
  Percentage of runs with NANs: 50.0 %

Typical solution from R: (screenshot: mini_ClusterR)

Equivalent Python program

from sklearn.cluster import KMeans
import numpy as np

X = np.array([0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2]).reshape(-1,2)
km = KMeans(n_clusters=6)
for i in range(10):
    km.fit(X)
    print(f"SSE: {km.inertia_}")

Program output:

SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0
SSE: 1.0

Typical solution from python: (screenshot: mini_scikit-learn)

Remarks

Normally the results should be comparable, but the Python version always produces valid solutions, and those solutions are always good (likely optimal for this mini-problem).

Moreover, the reason for the restriction in KMeans_rcpp that k must be smaller than size_of(data) - 1 is unclear to me. In scikit-learn I can set k = 8 and still get a valid solution (each centroid on one data point).
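A plausible mechanism for the NaNs, sketched as a hypothetical numpy illustration (not ClusterR's actual code): if two initial centroids coincide, one of them ends up owning no points after the assignment step, and recomputing the mean over an empty cluster yields NaN.

```python
import numpy as np

# hypothetical illustration: a duplicated initial centroid leaves one
# cluster empty, and the mean over an empty slice is NaN
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
centroids = np.array([[0., 1.], [0., 1.], [1., 0.]])   # rows 0 and 1 coincide

# assign each point to its nearest centroid (ties go to the lower index,
# so the duplicated centroid at index 1 receives no points)
labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

with np.errstate(invalid="ignore", divide="ignore"):
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(3)])

print(np.isnan(new_centroids).sum())  # 2 NaN values, as in the report above
```

This matches the symptom in the output above, where the invalid solutions each contain exactly 2 NaN values (one per coordinate of the empty cluster's centroid).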

Environment

Ubuntu 20.04 LTS

R version 3.6.3 (2020-02-29)

@mlampros (Owner) commented Nov 17, 2021

@gittar,

first of all, thanks for reporting this. It seems the error is related to the kmeans++ initializer. I ran the internal kmeans_pp_init() function, and for your small example it returns duplicated centroids, as the following code snippet shows:

# include <RcppArmadillo.h>
# include <ClusterRHeader.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends(ClusterR)]]
// [[Rcpp::plugins(cpp11)]]


/*
 * res_clust = expose_kmeans_pp(data = data, clusters = 6, medoids = FALSE)
 * res_clust
 *
 *       [,1] [,2]
 * [1,]    0    1
 * [2,]    0    1
 * [3,]    1    1
 * [4,]    1    1
 * [5,]    2    2
 * [6,]    1    0
 *
 */


// [[Rcpp::export]]
arma::mat expose_kmeans_pp(arma::mat& data, int clusters, bool medoids = false) {

  clustR::ClustHeader clsh;

  return clsh.kmeans_pp_init(data, clusters, medoids);
}
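For comparison, the textbook k-means++ D² sampling can be sketched in a few lines of numpy (a simplified illustration, not ClusterR's implementation). Because an already-selected point is at distance 0 from its nearest centroid, its sampling probability is 0 and it can never be drawn again, so with 8 distinct points and k = 6 the initializer should never return duplicated centroids:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    # first centroid: a uniformly random data point
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        # sample the next centroid proportional to D(x)^2;
        # already-chosen points have d2 == 0 and can never be re-drawn
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X = np.array([0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2], dtype=float).reshape(-1, 2)
centers = kmeans_pp_init(X, 6, np.random.default_rng(0))
print(len(np.unique(centers, axis=0)))  # 6 distinct centroids
```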

The kmeans++ function is a new feature in sklearn (as of version 0.24), and I wasn't able to install that version on my OS to reproduce your results. I just copy-pasted the function's code from the GitHub repo, but I'm not sure I'm using the correct parameter settings, because when I run the kmeans_plusplus() function I receive duplicated centroids as well:

[[6.91013936e-310 2.55737153e-316]
 [1.00000000e+000 1.00000000e+000]
 [1.00000000e+000 0.00000000e+000]
 [0.00000000e+000 1.00000000e+000]
 [0.00000000e+000 0.00000000e+000]
 [0.00000000e+000 0.00000000e+000]]
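For reference, newer scikit-learn exposes the initializer as a standalone function; assuming a version >= 0.24 is installed, it can be called on the example data directly (rather than copy-pasting the function's source):

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus  # available since sklearn 0.24

X = np.array([0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2], dtype=float).reshape(-1, 2)
centers, indices = kmeans_plusplus(X, n_clusters=6, random_state=0)
print(centers)  # 6 rows drawn from the 8 data points
```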

As far as I'm aware, the flexclust R package also includes a kmeans++ implementation, which gives the following centroids for your data:

fml = flexclust::kccaFamily('kmeans')
flx_kmpp = flexclust:::kmeanspp(x = data, k = k, family = fml)
flx_kmpp

     [,1] [,2]
[1,]    4    2
[2,]    0    1
[3,]    6    2
[4,]    2    2
[5,]    1    0
[6,]    3    1

If we move to a real-world dataset, we can see small differences in the mean SSE (including sklearn via the reticulate package):

require(ClusterR)
require(glue)
require(reticulate)
require(KernelKnn)
data("Boston")

kmeans_sklearn = reticulate::import('sklearn.cluster', convert = TRUE)

myseed = function() {
  abs(as.integer((Sys.time()-Sys.time())*10E7))
}

dat_norm = center_scale(Boston)

k=6
runs=300           # 'max_iter' parameter in sklearn  
n=0
num_init=10
verbose=FALSE

vec_kmeanpp = vec_random = random_base = sklearn_elkan = sklearn_auto = rep(NA_real_, runs)

for (i in 1:runs) {
  L_kmpp = KMeans_rcpp(dat_norm,clusters=k,num_init=num_init,initializer="kmeans++",seed=myseed(),verbose=verbose)
  L_random = KMeans_rcpp(dat_norm,clusters=k,num_init=num_init,initializer="random",seed=myseed(),verbose=verbose)
  base_kmlloyd = stats::kmeans(x = dat_norm, centers = k, iter.max = runs, nstart = num_init, algorithm = 'Lloyd', trace = F)
  km_sklearn_elkan = kmeans_sklearn$k_means(X = dat_norm, n_clusters = as.integer(k), algorithm = 'elkan', random_state = as.integer(myseed()))
  km_sklearn_auto = kmeans_sklearn$k_means(X = dat_norm, n_clusters = as.integer(k), algorithm = 'auto', random_state = as.integer(myseed()))

  sse_kmeanspp<-sum(L_kmpp$WCSS_per_cluster)
  vec_kmeanpp[i] = sse_kmeanspp
  
  sse_random<-sum(L_random$WCSS_per_cluster)
  vec_random[i] = sse_random
  
  sse_base = sum(base_kmlloyd$withinss)
  random_base[i] = sse_base
  
  sklearn_elkan[i] = km_sklearn_elkan[[3]]
  sklearn_auto[i] = km_sklearn_auto[[3]]
}

cat(glue::glue("ClusterR_kmeanpp_mean_SSE: {mean(vec_kmeanpp)}\n  ClusterR_random_mean_SSE: {mean(vec_random)}\n  base_kmeans_mean_SSE: {mean(random_base)}\n  Sklearn_elkan_mean_SSE: {mean(sklearn_elkan)}\n  Sklearn_auto_mean_SSE: {mean(sklearn_auto)}"), '\n')
# ClusterR_kmeanpp_mean_SSE: 2736.26625096268
# ClusterR_random_mean_SSE: 2742.1682750766
# base_kmeans_mean_SSE: 2741.30445465433
# Sklearn_elkan_mean_SSE: 2745.93712977821
# Sklearn_auto_mean_SSE: 2737.18611532736 

# kmeans++ for the normalized data

bst_kmpp = expose_kmeans_pp(data = dat_norm, clusters = 6, medoids = FALSE)
bst_kmpp
 
any(duplicated(bst_kmpp))
 
flx_kmpp = flexclust:::kmeanspp(x = dat_norm, k = k, family = fml)
flx_kmpp
 
any(duplicated(flx_kmpp))

I'll take a second look at the function over the weekend when I have more time. If you have the time, please also add the mean SSE for the sklearn version of kmeans++, because I currently use sklearn version 0.22.2 and can't install the newest version due to the version of my OS.

@mlampros (Owner) commented

I'm sorry, I forgot your second point:

Moreover the reason for the restriction in KMeans_rcpp that k must be smaller than size_of(data)-1 is unclear to me. In scikit-learn I can set k=8 and still get a valid solution (each centroid on one data point)

In KMeans_rcpp() I restrict the parameter k with the check "clusters > nrow(data) - 2", i.e. the number of clusters should be at most nrow(data) - 2,

mainly because when I implemented the ClusterR package back in 2017 I relied heavily on RcppArmadillo, and the Armadillo C++ documentation for kmeans(means, data, k, seed_mode, n_iter, print_mode) mentions:

"The k parameter indicates the number of centroids; the number of samples in the data matrix should be much larger than k"

and also because I observed errors on the datasets I used when k was close to nrow(data).

If you ask me, I don't know whether it makes sense to set the number of clusters (k) equal to the number of rows of a dataset, especially for real data, because the purpose of clustering is to group the data.
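On the border case itself: when k equals the number of (distinct) rows, every point can serve as its own centroid and the within-cluster SSE is exactly zero, so the solution is well defined even if not very useful. A quick numpy check (not using ClusterR):

```python
import numpy as np

# the 8 points from the minimal example; with k = n each point is its own
# centroid and the total within-cluster SSE is exactly 0
X = np.array([0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2], dtype=float).reshape(-1, 2)
centroids = X.copy()
labels = np.arange(len(X))
sse = ((X - centroids[labels]) ** 2).sum()
print(sse)  # 0.0
```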

@gittar (Author) commented Nov 18, 2021

  • The mean SSE for scikit-learn's KMeans class is 1.0 for the above problem (zero variance), generated with scikit-learn 0.23.2.
  • k-means++ has been included in scikit-learn for many years. What is new since 0.24.0 is the ability to call the k-means++ initialization without performing any Lloyd iterations; at least one Lloyd iteration was unavoidable previously, since max_iter=0 was not accepted by KMeans.fit(): https://scikit-learn.org/stable/whats_new/v0.24.html#version-0-24-0
  • One aspect which may make the scikit-learn initialization particularly good is local trials, as implemented in https://github.com/scikit-learn/scikit-learn/blob/0d378913b/sklearn/cluster/_kmeans.py#L187 (also present in previous versions such as 0.23.2).
  • Regarding the statement that the number of clusters should be much smaller than the number of data points: this is of course true, but I would not trust an implementation that cannot handle border cases (for reasons not obvious to me) to handle the more typical cases correctly, where problems are much harder to detect.
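The "local trials" refinement mentioned above can be sketched as follows; this is a simplified numpy version of the greedy k-means++ idea (draw several D²-sampled candidates per step and keep the one that most reduces the total potential), not scikit-learn's actual implementation:

```python
import numpy as np

def greedy_kmeans_pp(X, k, rng, n_local_trials=3):
    # first centroid: a uniformly random data point
    centers = [X[rng.integers(len(X))]]
    d2 = ((X - centers[0]) ** 2).sum(-1)  # squared distance to nearest center
    for _ in range(k - 1):
        probs = d2 / d2.sum()
        # draw several candidate centroids by D^2 sampling ("local trials")
        cand = rng.choice(len(X), size=n_local_trials, p=probs)
        # distances each candidate would produce if it were added;
        # shape (n_local_trials, n_points)
        cand_d2 = np.minimum(d2, ((X[:, None] - X[cand][None]) ** 2).sum(-1).T)
        # keep the candidate with the smallest resulting total potential
        best = np.argmin(cand_d2.sum(axis=1))
        centers.append(X[cand[best]])
        d2 = cand_d2[best]
    return np.array(centers)

X = np.array([0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2], dtype=float).reshape(-1, 2)
C = greedy_kmeans_pp(X, 6, np.random.default_rng(0))
print(len(np.unique(C, axis=0)))  # 6 distinct centroids
```

As with plain D² sampling, already-chosen points have zero sampling probability, so the greedy variant also cannot return duplicated centroids on distinct data points.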

@github-actions bot commented

This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond.

@github-actions github-actions bot added the stale label Nov 30, 2021
@mlampros (Owner) commented Dec 1, 2021

added a short message to keep this open

@github-actions bot commented Dec 8, 2021

This issue was automatically closed because of being stale. Feel free to re-open a closed issue and the Human-lampros will respond.

@github-actions github-actions bot closed this as completed Dec 8, 2021
@mlampros mlampros reopened this Dec 8, 2021
@mlampros mlampros reopened this Dec 15, 2021
@mlampros mlampros reopened this Dec 22, 2021
@mlampros mlampros reopened this Dec 29, 2021
@github-actions github-actions bot closed this as completed Jan 5, 2022
@mlampros mlampros reopened this Jan 6, 2022
@mlampros mlampros reopened this Jan 13, 2022
@mlampros mlampros reopened this Jan 20, 2022
@mlampros mlampros reopened this Jan 27, 2022
@mlampros (Owner) commented

I updated the CRAN version to 1.2.6 and fixed the internal kmeans_pp_init() function so that duplicated centroids no longer occur. The new version also fixes this issue (no NAs are produced):

require(ClusterR)
require(glue)
myseed = function() {
  as.integer((Sys.time()-Sys.time())*10E7)
}
data=matrix(c(0,0,0,1,1,0,1,1,2,2,3,1,4,2,6,2),ncol=2,byrow=TRUE)
k=6
runs=10
n=0
num_init=10
verbose=FALSE
initializer="kmeans++"
for (i in 1:runs) {
  L=KMeans_rcpp(data[1:8,],clusters=k,num_init=num_init,initializer=initializer,seed=myseed(),verbose=verbose)
  nansum=sum(is.na(L$centroids))
  sse<-sum(L$WCSS_per_cluster)
  if (nansum>0) {
    n <- n+1
    print(glue("SSE={sse} solution contains {nansum} NAN values"))
  } else {
    print(glue("SSE={sse} "))
  }
}
print(glue('Percentage of runs with NANs: {sprintf("%.1f %%", 100*n/runs)}'))

SSE=1 
SSE=1 
SSE=1 
SSE=1 
SSE=1 
SSE=1 
SSE=1 
SSE=1 
SSE=1 
SSE=1 

Percentage of runs with NANs: 0.0 %
