DBSCAN produces different number of clusters using cuML compared to sklearn #63

raghavmi · 2018-12-13T22:51:19Z

DBSCAN generates a different number of clusters with cuML than with sklearn on the same data and parameters.

Dataset to reproduce:
https://github.com/PatWalters/gpu_kmeans/blob/master/fp.csv

Code to reproduce:

import pandas as pd
import cudf
from sklearn.cluster import DBSCAN as skDBSCAN
from cuml import DBSCAN as cumlDBSCAN

# Load the dataset linked above
X = pd.read_csv("fp.csv")
print('data', X.shape)

eps = 3
min_samples = 2

# CPU reference: scikit-learn
clustering_sk = skDBSCAN(eps=eps, min_samples=min_samples)
clustering_sk.fit(X)
# Note: this count includes the noise label (-1) if any point is noise
print("# of sklearn clusters", len(set(clustering_sk.labels_)))

# GPU: cuML on the same data, converted to a cuDF DataFrame
X = cudf.DataFrame.from_pandas(X)
clustering_cuml = cumlDBSCAN(eps=eps, min_samples=min_samples)
clustering_cuml.fit(X)
print("# of cuML clusters", clustering_cuml.labels_.unique_count())
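
A raw count of unique labels can hide where the two results diverge, since scikit-learn marks noise points with the label -1 and cluster IDs need not match between implementations. The following is a minimal sketch, not part of the original report, of two helpers for a stricter comparison. The names `summarize_labels` and `same_partition` are hypothetical, and the sketch assumes both libraries use -1 for noise and that the labels have already been copied into a plain host-side sequence.

```python
from collections import Counter

NOISE = -1  # assumption: both implementations label noise points as -1


def summarize_labels(labels):
    """Return (n_clusters, n_noise), counting noise separately from clusters."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(NOISE, 0)
    return len(counts), n_noise


def same_partition(labels_a, labels_b):
    """Return True if both labelings group the points identically.

    Cluster IDs may differ between the two labelings; only the grouping
    matters. A point that is noise in one labeling must be noise in the other.
    """
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        a, b = int(a), int(b)
        if (a == NOISE) != (b == NOISE):
            return False
        # Require a one-to-one mapping between the two sets of cluster IDs.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True
```

Even two correct DBSCAN implementations can disagree on border points that are reachable from more than one cluster, so a failed `same_partition` check is a starting point for inspection rather than proof of a bug. A difference in the number of clusters, as reported here, is a stronger signal.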

dantegd · 2018-12-13T22:54:51Z

@cjnolet has a new version of dbscan that is almost ready for the next version. Corey, if you get the chance, could you see if this issue is still there on the new dbscan?

cjnolet · 2018-12-17T22:25:21Z

I am able to reproduce the inconsistency in the clusters with the new algorithm. I believe this is related to how DBSCAN batches large datasets to scale on a single GPU. Deleting half of the data (so that it fits in a single batch), either from the beginning or from the end, yields consistent clusters.

Running the new DBSCAN on the exact same size of randomly generated (and dense) data consistently yields the same results as Sklearn.
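
The halving experiment described above can be sketched as a small harness that runs two implementations on the full data and on each half, then compares cluster counts. This is an illustrative sketch only: `compare_on_subsets`, `fit_a`, and `fit_b` are hypothetical names, each `fit_*` callable is assumed to take a dataset and return host-side labels, and noise is assumed to be labeled -1.

```python
def count_clusters(labels, noise=-1):
    """Count distinct cluster labels, excluding the assumed noise label."""
    return len({int(label) for label in labels} - {noise})


def compare_on_subsets(X, fit_a, fit_b):
    """Compare cluster counts of two implementations on full and halved data.

    X must support len() and row slicing (a list, NumPy array, or pandas
    DataFrame all qualify). Returns {subset_name: (n_a, n_b, counts_match)}.
    """
    half = len(X) // 2
    subsets = {
        "full": X,
        "first_half": X[:half],
        "second_half": X[half:],
    }
    report = {}
    for name, subset in subsets.items():
        n_a = count_clusters(fit_a(subset))
        n_b = count_clusters(fit_b(subset))
        report[name] = (n_a, n_b, n_a == n_b)
    return report
```

If the counts match on both halves but not on the full data, that pattern is consistent with the batching explanation, though it does not rule out other causes that depend on dataset size.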

datametrician · 2018-12-22T06:30:34Z

Closing this issue, as it will be resolved in cuML 0.5.

mike-wendt added the "? - Needs Triage" label Dec 20, 2018

dantegd added the "bug" label Dec 22, 2018

datametrician closed this as completed Dec 22, 2018

cjnolet reopened this Jan 11, 2019

cjnolet closed this as completed Jan 11, 2019

cjnolet mentioned this issue Jan 11, 2019

[BUG] DBSCAN results incorrect #80

Closed

dantegd pushed a commit that referenced this issue Sep 12, 2019

Merge pull request #63 from rapidsai/branch-0.10

13b9899

update 0.10

JasonAtNvidia mentioned this issue Jan 27, 2020

KalmanFilter pytest code failing on my RTX workstation #1603

Closed

hershkoy mentioned this issue Jul 15, 2020

[BUG] Semi-Supervised UMAP, with euclidean target_metric, reduction errors when input passes a certain size. #2333

Open

tkpudgy mentioned this issue Jul 29, 2021

[QST] Is there a size limit to the input data for the RandomForestClassifier's fit function? #4132

Closed

VibhuJawa mentioned this issue Jan 13, 2022

[BUG]cuML using memory outside of RMM Pool #4485

Open

jakirkham pushed a commit to jakirkham/cuml that referenced this issue Mar 30, 2023

Merge pull request rapidsai#63 from rapidsai/branch-0.10

394008f

update 0.10

seo-jaeyong mentioned this issue Jul 26, 2023

[QST] The error message is printed, but i can't solve this problem. Please help me. #5525

Open
