
[BUG] terminate called after throwing an instance of 'raft::cuda_error' #4474

Open

ztf-ucas opened this issue Jan 8, 2022 · 30 comments

Labels: ? - Needs Triage, bug, inactive-30d, inactive-90d

@ztf-ucas commented Jan 8, 2022

Hi, I'm using cuml.HDBSCAN and encountered the following problem:

terminate called after throwing an instance of 'raft::cuda_error'
what(): CUDA error encountered at: file=_deps/raft-src/cpp/include/raft/cudart_utils.h line=267: call='cudaMemcpyAsync(d_ptr1, d_ptr2, len * sizeof(Type), cudaMemcpyDeviceToDevice, stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 32 stack frames
#0 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x46) [0x7f1bd4f95056]
#1 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft10cuda_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xc9) [0x7f1bd4f95e39]
#2 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft10copy_asyncIiEEvPT_PKS1_mN3rmm16cuda_stream_viewE+0x138) [0x7f1bd522f948]
#3 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft9hierarchy6detail16build_sorted_mstIifN2ML7HDBSCAN22FixConnectivitiesRedOpIifEEEEvRKNS_8handle_tEPKT0_PKT_SF_SC_mmPSD_SG_PSA_SG_mT1_NS_8distance12DistanceTypeEi+0x4c2) [0x7f1bd527e942]
#4 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN2ML7HDBSCAN13build_linkageIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsERNSB_28robust_single_linkage_outputIT_S6_EE+0x372) [0x7f1bd5281512]
#5 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN2ML7hdbscanERKN4raft8handle_tEPKfmmNS0_8distance12DistanceTypeERNS_7HDBSCAN6Common13HDBSCANParamsERNS9_14hdbscan_outputIifEE+0x7e) [0x7f1bd521759e]
#6 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x43ec2) [0x7f1de251cec2]
#7 in python(PyObject_Call+0x24d) [0x56056760d35d]
#8 in python(_PyEval_EvalFrameDefault+0x21bf) [0x5605676b64ef]
#9 in python(_PyEval_EvalCodeWithName+0x2c3) [0x560567696db3]
#10 in python(PyEval_EvalCodeEx+0x39) [0x560567697e19]
#11 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x2c298) [0x7f1de2505298]
#12 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x2c4f9) [0x7f1de25054f9]
#13 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x3c072) [0x7f1de2515072]
#14 in python(PyObject_Call+0x24d) [0x56056760d35d]
#15 in python(_PyEval_EvalFrameDefault+0x21bf) [0x5605676b64ef]
#16 in python(_PyEval_EvalCodeWithName+0x2c3) [0x560567696db3]
#17 in python(+0x1b08b7) [0x5605676988b7]
#18 in python(_PyEval_EvalFrameDefault+0x4e03) [0x5605676b9133]
#19 in python(_PyFunction_Vectorcall+0x1a6) [0x560567697fc6]
#20 in python(_PyEval_EvalFrameDefault+0x947) [0x5605676b4c77]
#21 in python(_PyEval_EvalCodeWithName+0x2c3) [0x560567696db3]
#22 in python(PyEval_EvalCodeEx+0x39) [0x560567697e19]
#23 in python(PyEval_EvalCode+0x1b) [0x56056773a24b]
#24 in python(+0x2522e3) [0x56056773a2e3]
#25 in python(+0x26e543) [0x560567756543]
#26 in python(+0x273562) [0x56056775b562]
#27 in python(PyRun_SimpleFileExFlags+0x1b2) [0x56056775b742]
#28 in python(Py_RunMain+0x36d) [0x56056775bcbd]
#29 in python(Py_BytesMain+0x39) [0x56056775be79]
#30 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f1fc17a4b97]
#31 in python(+0x1e6d69) [0x5605676ced69]

Aborted (core dumped)

ztf-ucas added the ? - Needs Triage and bug labels on Jan 8, 2022
@cjnolet (Member) commented Jan 12, 2022

Thanks for opening an issue about this, @ztf-ucas. To isolate the cause of this failure, it would be helpful if you could provide a code snippet that we can use to reproduce it. It would also be useful to provide the dataset (or relevant details about it) if you are able.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@Brillone

Hi, the same issue is happening to me with different settings of the HDBSCAN model (some work).

For example, it fails with the following parameters:
model = HDBSCAN(min_cluster_size=15, min_samples=10)

A setting that did work:
model = HDBSCAN(min_cluster_size=5, min_samples=5)

My dataset has 2.5M samples with 64 dimensions (I can't provide the dataset).


@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@preet2312 commented Jan 25, 2023

@cjnolet @Brillone Hello, I encountered the same issue, where some parameter combinations work and some throw the same error when run as a Python script. (If I run it in a Jupyter notebook in VS Code, it instead gives an error like: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5.)

I am using 4.7M samples with 50 dimensions.

For example:
Doesn't work: model = HDBSCAN(min_cluster_size=1000, min_samples=10)
Works: model = HDBSCAN(min_cluster_size=1000, min_samples=5)

Please let me know if there's any update.

Thanks.

@mayurgd commented Mar 23, 2023

@cjnolet Hi, I'm facing this error while executing HDBSCAN:

terminate called after throwing an instance of 'raft::cuda_error'
what():  CUDA error encountered at: file=_deps/raft-src/cpp/include/raft/cudart_utils.h 
line=267: call='cudaMemcpyAsync(d_ptr1, d_ptr2, len * sizeof(Type),cudaMemcpyDeviceToDevice, stream)', 
Reason=cudaErrorInvalidValue:invalid argument

Below is reproducible code that gives the error for me:

import numpy as np
import pandas as pd
from cuml.cluster import HDBSCAN as HDBSCAN_gpu

X = np.array([[-14.01115608,  -5.37217331, 314.        ],
       [-17.31538773,  -6.12932587,  22.        ],
       [-17.88701439,  -7.00569153,  16.        ],
       [-17.91534615,  -7.40659523,  12.        ],
       [-13.57449722,  -3.70668411,  12.        ],
       [-14.97053146,  -6.00550461,  51.        ],
       [-15.5725193 ,  -5.07519722,   2.        ],
       [-13.31140137,  -3.99990654,   5.        ],
       [-13.84429169,  -4.01345634,   1.        ],
       [-17.02877998,  -6.42786789,  46.        ],
       [-15.09358597,  -5.4496851 ,  22.        ],
       [-17.52828217,  -6.86034393,   4.        ],
       [-15.57351112,  -5.61835861,   4.        ],
       [-14.20898056,  -4.61386681,   8.        ],
       [-14.45912552,  -5.47292137,   1.        ],
       [-15.27561951,  -4.74104977,   1.        ]])
test = pd.DataFrame(X, columns=['x','y','repeat'])
test = test.loc[test.index.repeat(test.repeat)].drop(columns='repeat')
hdb = HDBSCAN_gpu(
                min_samples=10,
                min_cluster_size=15,
                cluster_selection_method="eom",
                metric="euclidean",
                gen_min_span_tree=True,
            )

labels = hdb.fit_predict(test)

The HDBSCAN model runs without any error for min_samples < 5; anything greater than or equal to 5 gives the raft::cuda_error.
[cuML version 22.02.00]

@beckernick (Member)

We've made a variety of updates to HDBSCAN since v22.02. Does this error present if you use cuML 23.02?

@mayurgd commented Mar 23, 2023

@beckernick thanks for the response. I am in the process of upgrading my RAPIDS Docker image to version 23.02 and will update once that is done.
I'd also like one suggestion regarding HDBSCAN: should duplicate rows be removed before applying HDBSCAN, or should it be applied to the data with duplicate rows?
For example, per the above code snippet: should it be applied to X (the non-duplicated array) or test (the duplicated DataFrame)?

@beckernick (Member)

Duplicates can to some extent be seen as sample weights, and removing them might move your analysis farther away from the underlying ground-truth data distribution from which your data is implicitly sampled. I'd probably leave them in.

@beckernick (Member)

@preet2312, do you have any information about the environment (library versions) and system platform with which you experienced this issue?

@mayurgd commented Mar 28, 2023

@beckernick I updated RAPIDS to v23.02 using the rapidsai/rapidsai-core:23.02-cuda11.2-runtime-ubuntu20.04-py3.8 image. I still get the raft::cuda_error for the above-mentioned example.

Error logs:
The Databricks driver logs show:

terminate called after throwing an instance of 'raft::cuda_error'
what(): CUDA error encountered at: file=/databricks/conda/envs/rapids/include/raft/util/cudart_utils.hpp line=278:

The Databricks notebook shows:

ConnectException: Connection refused (Connection refused)
Error while obtaining a new communication channel
ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.

@beckernick (Member) commented Mar 28, 2023

Thanks for testing in 23.02 and creating a minimal reproducible example. I can reproduce this behavior.

The underlying error appears to be that a single-linkage solution can't be found in at least some scenarios, and this error is not caught and propagated back up to Python.

With REPS = 10000 I can reproduce this consistently. With smaller REPS, I can reproduce it intermittently.

import numpy as np
from cuml.cluster import HDBSCAN

REPS = 10000

X = np.arange(12)
tiled = np.tile(X, REPS).reshape(-1, 3)

clusterer = HDBSCAN()
clusterer.fit(tiled)
terminate called after throwing an instance of 'raft::logic_error'
  what():  RAFT failure at file=/opt/conda/conda-bld/work/cpp/src/hdbscan/detail/condense.cuh line=88: Multiple components found in MST or MST is invalid. Cannot find single-linkage solution. Found 79997 vertices total.
Obtained 56 stack frames
#0 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x3b) [0x7fe0782bfb8b]
#1 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft11logic_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd) [0x7fe0782c040d]
#2 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML7HDBSCAN6detail8Condense25build_condensed_hierarchyIifLi256EEEvRKN4raft8handle_tEPKT_PKT0_SA_iiRNS0_6Common18CondensedHierarchyIS8_SB_EE+0x10f6) [0x7fe07881a936]
#3 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML7HDBSCAN12_fit_hdbscanIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsEPT_PS6_RNSB_14hdbscan_outputISE_S6_EE+0x1d5) [0x7fe078835195]
#4 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML7hdbscanERKN4raft8handle_tEPKfmmNS0_8distance12DistanceTypeERNS_7HDBSCAN6Common13HDBSCANParamsERNS9_14hdbscan_outputIifEEPf+0x246) [0x7fe078750706]
#5 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/cluster/hdbscan/hdbscan.cpython-310-x86_64-linux-gnu.so(+0x74f2a) [0x7fdf60510f2a]
#6 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/base.cpython-310-x86_64-linux-gnu.so(+0x1c35f) [0x7fdf6096135f]
#7 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyObject_Call+0x209) [0x55e944f23209]
#8 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#9 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c6e1) [0x55e944f226e1]
#10 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#11 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#12 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#13 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c581) [0x55e944f22581]
#14 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x4d0d) [0x55e944f0b23d]
#15 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1db6a2) [0x55e944fb16a2]
#16 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyEval_EvalCode+0x87) [0x55e944fb15e7]
#17 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e2c30) [0x55e944fb8c30]
#18 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x140d14) [0x55e944f16d14]
#19 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x332) [0x55e944f06862]
#20 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e530d) [0x55e944fbb30d]
#21 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x1bb1) [0x55e944f080e1]
#22 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e530d) [0x55e944fbb30d]
#23 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x1bb1) [0x55e944f080e1]
#24 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e530d) [0x55e944fbb30d]
#25 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1faada) [0x55e944fd0ada]
#26 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14b41f) [0x55e944f2141f]
#27 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#28 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#29 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x332) [0x55e944f06862]
#30 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#31 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#32 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c581) [0x55e944f22581]
#33 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x13d0) [0x55e944f07900]
#34 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#35 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#36 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#37 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#38 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#39 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#40 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c581) [0x55e944f22581]
#41 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyObject_Call+0xb8) [0x55e944f230b8]
#42 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#43 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#44 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x332) [0x55e944f06862]
#45 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1db6a2) [0x55e944fb16a2]
#46 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyEval_EvalCode+0x87) [0x55e944fb15e7]
#47 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x20e3fc) [0x55e944fe43fc]
#48 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x2092d4) [0x55e944fdf2d4]
#49 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x9758d) [0x55e944e6d58d]
#50 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyRun_SimpleFileObject+0x1b5) [0x55e944fd94f5]
#51 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyRun_AnyFileObject+0x43) [0x55e944fd90a3]
#52 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(Py_RunMain+0x399) [0x55e944fd6279]
#53 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(Py_BytesMain+0x39) [0x55e944fa3dc9]
#54 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fe0fe28d083]
#55 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1cdcc1) [0x55e944fa3cc1]

Aborted (core dumped)

cc @cjnolet @tarang-jain @divyegala, as you may have looked at this code recently

@cjnolet (Member) commented Mar 28, 2023

@beckernick I think the scenario causing the convergence error is generally unlikely to happen in practice. It looks like the number of duplicated rows is causing the MST to disregard additional edges. If that case does in fact end up becoming a show-stopper on real datasets, I think we should definitely figure out a way around it; however, the error you are receiving explains exactly what's going on: there's just not enough information to connect the graph because of the duplicated edges, and we need a connected graph in order to build the dendrogram.

I slightly tweaked the input and was able to reproduce the originally reported error. I do think we should investigate this one further (cc @tarang-jain, who is looking into this):

>>> import numpy as np
>>> from cuml.cluster import HDBSCAN

>>> 
>>> 
>>> REPS = 10000
>>> X = np.arange(500)
>>> tiled = np.tile(X, REPS).reshape(-1, 3)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ValueError: cannot reshape array of size 5000000 into shape (3)
>>> tiled = np.tile(X, REPS).reshape(-1, 10)
>>> clusterer = HDBSCAN()
>>> clusterer.fit(tiled)
terminate called after throwing an instance of 'raft::cuda_error'
 what():  CUDA error encountered at: file=/home/cjnolet/miniconda3/envs/cuml_2304_032323/include/raft/util/cudart_utils.hpp line=244: call='cudaMemcpyAsync(d_ptr1, d_ptr2, len * sizeof(Type), cudaMemcpyDeviceToDevice, stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 29 stack frames
#0 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x84) [0x7f02ed253f84]
#1 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd) [0x7f02ed2549dd]
#2 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN4raft10copy_asyncIiEEvPT_PKS1_mN3rmm16cuda_stream_viewE+0x19a) [0x7f02ed6f7dfa]
#3 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN2ML7HDBSCAN13build_linkageIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsEPS6_RNSB_28robust_single_linkage_outputIT_S6_EE+0x19fa) [0x7f02ed7aa64a]
#4 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN2ML7HDBSCAN12_fit_hdbscanIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsEPT_PS6_RNSB_14hdbscan_outputISE_S6_EE+0xf1) [0x7f02ed7abdc1]
#5 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN2ML7hdbscanERKN4raft8handle_tEPKfmmNS0_8distance12DistanceTypeERNS_7HDBSCAN6Common13HDBSCANParamsERNS9_14hdbscan_outputIifEEPf+0x25a) [0x7f02ed6d87fa]
#6 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/python3.10/site-packages/cuml/cluster/hdbscan/hdbscan.cpython-310-x86_64-linux-gnu.so(+0x75f7d) [0x7f01c212df7d]
#7 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/python3.10/site-packages/cuml/internals/base.cpython-310-x86_64-linux-gnu.so(+0x248ea) [0x7f01c3fe98ea]
#8 in python(PyObject_Call+0x209) [0x55f302743139]
#9 in python(_PyEval_EvalFrameDefault+0x2ec2) [0x55f302729cb2]
#10 in python(+0x14b7a1) [0x55f3027427a1]
#11 in python(_PyEval_EvalFrameDefault+0x2ec2) [0x55f302729cb2]
#12 in python(_PyFunction_Vectorcall+0x6f) [0x55f302736f8f]
#13 in python(_PyEval_EvalFrameDefault+0x2ec2) [0x55f302729cb2]
#14 in python(+0x14b641) [0x55f302742641]
#15 in python(_PyEval_EvalFrameDefault+0x4d0d) [0x55f30272bafd]
#16 in python(+0x1d8a82) [0x55f3027cfa82]
#17 in python(PyEval_EvalCode+0x87) [0x55f3027cf9c7]
#18 in python(+0x20b82c) [0x55f30280282c]
#19 in python(+0x206704) [0x55f3027fd704]
#20 in python(+0x1173ae) [0x55f30270e3ae]
#21 in python(_PyRun_InteractiveLoopObject+0xcc) [0x55f30270e544]
#22 in python(+0x96790) [0x55f30268d790]
#23 in python(PyRun_AnyFileExFlags+0x4b) [0x55f30270e6be]
#24 in python(+0x93931) [0x55f30268a931]
#25 in python(Py_BytesMain+0x39) [0x55f3027c2089]
#26 in /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f03ff629d90]
#27 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f03ff629e40]
#28 in python(+0x1caf81) [0x55f3027c1f81]

Aborted (core dumped)

@divyegala (Member) commented Mar 29, 2023

It looks like the number of duplicated rows is causing the MST to disregard additional edges.

I agree with this analysis.

You can probably find a solution by artificially introducing minor random noise in [0, delta) (where delta is the narrowest edge difference) for every point before doing the reshape, so that the number of duplicates is reduced or eliminated. Maybe you can try that in your script, @beckernick; see the sketch below.
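
A minimal sketch of that jitter workaround, applied to the repro above. Note this is an illustration only: the noise scale delta and the seed are assumptions, not the exact "narrowest edge difference" described above.

import numpy as np
from cuml.cluster import HDBSCAN

rng = np.random.default_rng(42)

REPS = 10000
tiled = np.tile(np.arange(12), REPS).reshape(-1, 3).astype(np.float32)

# Assumed noise scale; ideally this would be smaller than the narrowest
# nonzero pairwise difference in the data.
delta = 1e-4
jittered = tiled + rng.uniform(0.0, delta, size=tiled.shape).astype(np.float32)

# With the exact duplicates broken up, the MST should have enough distinct
# edges to form a connected graph.
clusterer = HDBSCAN()
clusterer.fit(jittered)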

@beckernick (Member)

Thanks for the suggestions. I agree that the error is clear, but it's uncaught and causes a segfault. Python user code should ideally never cause a segfault, even in rare scenarios like this (I know this is unlikely to occur naturally). Can we catch and propagate this error up?

@tarang-jain (Contributor)

I was able to reproduce both @beckernick's error and @cjnolet's error by tweaking the arange parameter. I agree with @cjnolet's analysis, given the repeated zero-weight edges in the KNN graph. Also, since the number of repeated points is greater than min_samples, the core distances of all points would be zero. When I adjusted min_samples to be just greater than REPS, the error did not occur. I'll dig deeper to find the exact piece of code that causes this error.
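
To illustrate the zero-core-distance point, here is a small sketch (using scikit-learn's NearestNeighbors purely to inspect distances; the repeat count and min_samples values are arbitrary). With more exact copies of each row than min_samples, the distance to the min_samples-th nearest neighbor, i.e. the core distance, collapses to zero for every point.

import numpy as np
from sklearn.neighbors import NearestNeighbors

min_samples = 5
X = np.tile(np.arange(12, dtype=np.float64), 10).reshape(-1, 3)  # 10 copies of each distinct row

nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
dists, _ = nn.kneighbors(X)
core_dists = dists[:, -1]   # distance to the min_samples-th neighbor
print(core_dists.max())     # 0.0: every core distance is zero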

@MartinKlefas

@beckernick I think the scenario causing the convergence error is generally unlikely to happen in practice. It looks like the number of duplicated rows is causing the MST to disregard additional edges. If that case does in fact end up becoming a show-stopper on real datasets, I think we should definitely figure out a way around it; however, the error you are receiving explains exactly what's going on: there's just not enough information to connect the graph because of the duplicated edges, and we need a connected graph in order to build the dendrogram.

Hi, I'm getting this with a "real" dataset.

It's a set of images that I'm using for anomaly/outlier detection. The code runs on smaller numbers of similar images, up to around 350,000-400,000 samples, but if I go much beyond that I get this same crash behaviour. This only happens when I reduce the image vectors down to certain sizes through PCA, though, suggesting that I have inadvertently created multiple identical data points as mentioned above.

I'm happy to provide code and input-data samples if that will help. What's the best way to get them to you? Enough of the dataset to reproduce the error is around 10 GB.

@tarang-jain (Contributor) commented Apr 20, 2023

@MartinKlefas Have you tried increasing min_samples? Adding non-zero edges to the KNN graph should lead to convergence. If you can compute the maximum number of repeated rows in your dataset and set min_samples to be greater than that, it should work; see the sketch below.
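
A minimal sketch of that calculation, assuming the data fits in host memory. The synthetic X, the helper name, and the min_cluster_size choice are placeholders, not part of cuML:

import numpy as np
from cuml.cluster import HDBSCAN

def max_duplicate_count(X: np.ndarray) -> int:
    """Multiplicity of the most-repeated row in X."""
    _, counts = np.unique(X, axis=0, return_counts=True)
    return int(counts.max())

# Placeholder data: each of the 4 distinct rows appears 100 times.
X = np.tile(np.arange(12, dtype=np.float32), 100).reshape(-1, 3)

m = max_duplicate_count(X)                       # 100 here
clusterer = HDBSCAN(min_samples=m + 1,           # strictly above the repeat count
                    min_cluster_size=2 * (m + 1))
labels = clusterer.fit_predict(X)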

@MartinKlefas

@MartinKlefas Have you tried increasing min_samples? Adding non-zero edges to the KNN graph should lead to convergence. If you can compute the maximum number of repeated rows in your dataset and set min_samples to be greater than that, it should work.

Thanks, I didn't do the full computation, but just multiplied min_samples by 10 and the clustering ran again.

@NitinVishalKulkarni

I am experiencing this as well. My dataset is generated from a Reinforcement Learning environment (Atari Pong).

@terramars

I just ran into what I think is something related to this bug; the dataset is from the Vesuvius Challenge and is spatial. Unfortunately, it looks like there are no duplicates, so I'm not sure how to get closer to identifying the problem.

@terramars

I managed to pull the segfaulting data:

import numpy as np
from cuml.cluster import HDBSCAN

x = np.array([[168.5, 174.75, 243. ],
[172. , 125. , 249.5 ],
[172. , 172. , 245.5 ],
[172. , 172. , 245.75],
[172. , 172. , 246. ],
[172. , 172. , 246.25],
[172. , 172. , 246.5 ],
[172. , 172. , 246.75],
[172. , 172. , 247. ],
[172. , 174. , 246.25],
[172. , 174. , 246.5 ],
[172. , 174. , 246.75],
[172. , 174. , 247. ],
[172.25, 125.5 , 249.75],
[172.25, 172.25, 246.25],
[172.25, 172.5 , 247. ],
[172.25, 172.5 , 247.25],
[172.25, 173.75, 200.5 ],
[172.25, 173.75, 200.75],
[172.25, 174. , 247. ],
[172.5 , 172. , 245. ],
[172.5 , 172.25, 246.5 ],
[172.5 , 172.25, 246.75],
[172.5 , 172.25, 247. ],
[172.5 , 172.5 , 245.5 ],
[172.5 , 174. , 246.25],
[172.75, 172. , 245. ],
[172.75, 172.25, 246.75],
[172.75, 172.25, 247. ],
[172.75, 172.5 , 245.25],
[172.75, 174. , 245.75],
[173. , 172. , 245.25],
[173. , 172.25, 246.75],
[173. , 172.25, 247. ],
[173. , 174. , 245.5 ],
[173.25, 125. , 249.25],
[173.25, 125.25, 249.75],
[173.25, 170.5 , 245.75],
[173.25, 170.5 , 247.25],
[173.25, 172. , 245.5 ],
[173.25, 172. , 246.5 ],
[173.25, 172. , 246.75],
[173.25, 172. , 247. ],
[173.25, 172.25, 246.75],
[173.25, 172.25, 247. ],
[173.25, 174.25, 243.5 ],
[173.5 , 125. , 249.5 ],
[173.5 , 125.25, 249.75],
[173.5 , 125.5 , 249.5 ],
[173.5 , 125.5 , 249.75],
[173.5 , 170.5 , 246.75],
[173.5 , 170.5 , 247. ],
[173.5 , 172. , 245.25],
[173.5 , 172. , 246.5 ],
[173.5 , 172. , 246.75],
[173.5 , 172. , 247. ],
[173.5 , 172.25, 246.75],
[173.5 , 174. , 244.25],
[173.5 , 174.5 , 242.25],
[173.75, 125. , 249.75],
[173.75, 125.5 , 249.5 ],
[173.75, 125.5 , 249.75],
[173.75, 171.75, 239.5 ],
[173.75, 171.75, 239.75],
[173.75, 172. , 244.75],
[173.75, 172. , 245. ],
[173.75, 172. , 245.25],
[173.75, 172. , 246.5 ],
[173.75, 172. , 246.75],
[173.75, 174. , 241.5 ],
[173.75, 174. , 244. ],
[173.75, 174.25, 242.25],
[173.75, 174.5 , 241.75],
[174. , 125.5 , 249.5 ],
[174. , 125.5 , 249.75],
[174. , 125.75, 249.25],
[174. , 125.75, 249.5 ],
[174. , 125.75, 249.75],
[174. , 169.75, 249. ],
[174. , 170.25, 247.5 ],
[174. , 170.5 , 245.5 ],
[174. , 171.75, 239. ],
[174. , 171.75, 239.25],
[174. , 171.75, 239.5 ],
[174. , 172. , 245. ],
[174. , 172. , 246.25],
[174. , 172. , 246.5 ],
[174. , 172. , 246.75],
[174. , 173.75, 240.75],
[174. , 174. , 241.5 ],
[174.25, 125.5 , 249.5 ],
[174.25, 125.5 , 249.75],
[174.25, 125.75, 248.75],
[174.25, 125.75, 249. ],
[174.25, 125.75, 249.25],
[174.25, 125.75, 249.5 ],
[174.25, 125.75, 249.75],
[174.25, 172. , 245.25],
[174.25, 174. , 241.75],
[174.25, 174. , 243.75],
[174.5 , 125.5 , 249.5 ],
[174.5 , 125.5 , 249.75],
[174.5 , 125.75, 201.75],
[174.5 , 125.75, 202.75],
[174.5 , 125.75, 248.5 ],
[174.5 , 125.75, 248.75],
[174.5 , 125.75, 249. ],
[174.5 , 125.75, 249.25],
[174.5 , 125.75, 249.5 ],
[174.5 , 125.75, 249.75],
[174.5 , 174. , 242.25],
[174.5 , 174. , 243.5 ],
[174.75, 125.5 , 249.5 ],
[174.75, 125.5 , 249.75],
[174.75, 125.75, 201.5 ],
[174.75, 125.75, 203. ],
[174.75, 125.75, 248.5 ],
[174.75, 125.75, 248.75],
[174.75, 125.75, 249. ],
[174.75, 125.75, 249.25],
[174.75, 125.75, 249.5 ],
[174.75, 125.75, 249.75],
[174.75, 170.25, 247. ],
[174.75, 171.5 , 241.5 ],
[174.75, 174. , 241.5 ],
[174.75, 174. , 242.5 ],
[174.75, 174. , 242.75],
[174.75, 174. , 243. ],
[174.75, 174.75, 234.5 ]]
)

hdb = HDBSCAN(min_samples=10, min_cluster_size=10, allow_single_cluster=True)
db = hdb.fit(x)

@terramars

This result is not sensitive to the min_samples or min_cluster_size settings: all of them segfault. Dropping the last data element avoids the segfault, but the fit then hangs indefinitely.

@divyegala (Member)

@terramars while there are no explicit duplicates, it looks to me like all the points are quite close in distance to one another. What precision are you running with? Can you try running with np.float64?

@terramars

@terramars while there are no explicit duplicates, it looks to me like all the points are quite close in distance to one another. What precision are you running with? Can you try running with np.float64?

I am giving it float64; it seems like the fit method converts to float32, and it errors if you disable the conversion:

hdb = hdbscan.HDBSCAN(min_samples=10, min_cluster_size=10, allow_single_cluster=True)
db = hdb.fit(x, convert_dtype=False)

File ~/miniconda3/envs/thaumato/lib/python3.10/site-packages/cuml/internals/array.py:1135, in CumlArray.from_input(cls, X, order, deepcopy, check_dtype, convert_to_dtype, check_mem_type, convert_to_mem_type, safe_dtype_conversion, check_cols, check_rows, fail_on_order, force_contiguous)
1133 else:
1134 if not convert_to_dtype:
-> 1135 raise TypeError(
1136 f"Expected input to be of type in {check_dtype} but got"
1137 f" {arr.dtype}"
1138 )
1140 conversion_required = convert_to_dtype or (
1141 convert_to_mem_type and (convert_to_mem_type != arr.mem_type)
1142 )
1144 if conversion_required:

TypeError: Expected input to be of type in [dtype('float32')] but got float64

@terramars

This does appear to be a float32 vs. float64 issue; the following works:

hdb.fit_predict(x - x.mean(axis=0))

The returned labels match the 64-bit labels computed with sklearn.

It seems like there are two bugs here: (1) there is no option to run in float64, and (2) the data is not centered before clustering is run, which should be desirable in all circumstances.
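
For anyone else hitting this, a minimal sketch of that reported centering workaround, assuming x is the float64 array posted earlier in this thread and reusing the same parameter values:

import numpy as np
from cuml.cluster import HDBSCAN

# x is the float64 array from the comment above. Centering shrinks the
# coordinate magnitudes, so less precision is lost when the input is cast
# down to float32 internally.
x_centered = x - x.mean(axis=0)

hdb = HDBSCAN(min_samples=10, min_cluster_size=10, allow_single_cluster=True)
labels = hdb.fit_predict(x_centered)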

@divyegala (Member)

@terramars I don't think (2) is a bug, because we are trying to maintain fidelity to the hdbscan package API, and as far as I am aware it does not automatically center the data before clustering (please correct me if I am wrong).

We will look into solving (1).

@terramars commented Oct 7, 2024 via email

@divyegala (Member)

@terramars I think the issue with centering in general is that it goes into the territory of manipulating the data without transparency to the user. Not to mention that we would need to create a copy of the data and then center it, and GPU memory is expensive.

I agree, of course, that the segfault is a problem, and we'll prioritize making the UX smoother. At the very least, we should try to fail gracefully with a clearer error message.
