DBSCAN produces different number of clusters using cuML compared to sklearn #63
Comments
@cjnolet has a new version of DBSCAN that is almost ready for the next release. Corey, if you get the chance, could you check whether this issue still occurs with the new DBSCAN?
I am able to reproduce the inconsistency in the clusters with the new algorithm. I believe this is related to how DBSCAN batches large datasets to scale on a single GPU. Deleting half the data (so that it fits in a single batch), either from the beginning or from the end, yields consistent clusters. Running the new DBSCAN on randomly generated (and dense) data of exactly the same size consistently yields the same results as sklearn.
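For anyone who wants to repeat this kind of check, here is a minimal sketch of how two DBSCAN labelings could be compared. It uses only scikit-learn so it runs without a GPU. The helpers `n_clusters` and `compare_labelings`, the synthetic `make_blobs` data, and the `eps`/`min_samples` values are all assumptions made for illustration; they are not the code or parameters from this issue. In a real comparison, `candidate` would hold the labels produced by the other implementation on the same rows in the same order.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def n_clusters(labels):
    """Count clusters in a DBSCAN labeling, ignoring the noise label -1."""
    return len(set(labels.tolist()) - {-1})


def compare_labelings(labels_a, labels_b):
    """Return (cluster count of a, cluster count of b, adjusted Rand index)."""
    return (
        n_clusters(labels_a),
        n_clusters(labels_b),
        adjusted_rand_score(labels_a, labels_b),
    )


# Synthetic stand-in data (an assumption; the issue itself uses fp.csv).
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.4, random_state=0)

# eps and min_samples are illustrative values, not the ones from the issue.
reference = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Placeholder for the labels of the implementation under test. The same
# estimator is rerun here only so that the sketch is self-contained.
candidate = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

count_ref, count_cand, ari = compare_labelings(reference, candidate)
print(count_ref, count_cand, ari)
```

Comparing cluster counts alone can hide differences, because cluster IDs are arbitrary and two labelings may have the same number of clusters but different memberships. The adjusted Rand index is insensitive to how the IDs are numbered, so a value of 1.0 indicates the two partitions agree.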
Closing this issue, as it will be resolved in cuML 0.5.
DBSCAN generates a different number of clusters with cuML than with sklearn.
Dataset to reproduce:
https://github.com/PatWalters/gpu_kmeans/blob/master/fp.csv
Code to reproduce: