
[BUG] terminate called after throwing an instance of 'raft::cuda_error' #4474

Open

ztf-ucas opened this issue Jan 8, 2022 · 30 comments

Labels: ? - Needs Triage, bug, inactive-30d, inactive-90d

@ztf-ucas commented Jan 8, 2022

Hi, I'm using cuml.HDBSCAN and encountered the following problem:

terminate called after throwing an instance of 'raft::cuda_error'
what(): CUDA error encountered at: file=_deps/raft-src/cpp/include/raft/cudart_utils.h line=267: call='cudaMemcpyAsync(d_ptr1, d_ptr2, len * sizeof(Type), cudaMemcpyDeviceToDevice, stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 32 stack frames
#0 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x46) [0x7f1bd4f95056]
#1 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft10cuda_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xc9) [0x7f1bd4f95e39]
#2 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft10copy_asyncIiEEvPT_PKS1_mN3rmm16cuda_stream_viewE+0x138) [0x7f1bd522f948]
#3 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft9hierarchy6detail16build_sorted_mstIifN2ML7HDBSCAN22FixConnectivitiesRedOpIifEEEEvRKNS_8handle_tEPKT0_PKT_SF_SC_mmPSD_SG_PSA_SG_mT1_NS_8distance12DistanceTypeEi+0x4c2) [0x7f1bd527e942]
#4 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN2ML7HDBSCAN13build_linkageIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsERNSB_28robust_single_linkage_outputIT_S6_EE+0x372) [0x7f1bd5281512]
#5 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN2ML7hdbscanERKN4raft8handle_tEPKfmmNS0_8distance12DistanceTypeERNS_7HDBSCAN6Common13HDBSCANParamsERNS9_14hdbscan_outputIifEE+0x7e) [0x7f1bd521759e]
#6 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x43ec2) [0x7f1de251cec2]
#7 in python(PyObject_Call+0x24d) [0x56056760d35d]
#8 in python(_PyEval_EvalFrameDefault+0x21bf) [0x5605676b64ef]
#9 in python(_PyEval_EvalCodeWithName+0x2c3) [0x560567696db3]
#10 in python(PyEval_EvalCodeEx+0x39) [0x560567697e19]
#11 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x2c298) [0x7f1de2505298]
#12 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x2c4f9) [0x7f1de25054f9]
#13 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x3c072) [0x7f1de2515072]
#14 in python(PyObject_Call+0x24d) [0x56056760d35d]
#15 in python(_PyEval_EvalFrameDefault+0x21bf) [0x5605676b64ef]
#16 in python(_PyEval_EvalCodeWithName+0x2c3) [0x560567696db3]
#17 in python(+0x1b08b7) [0x5605676988b7]
#18 in python(_PyEval_EvalFrameDefault+0x4e03) [0x5605676b9133]
#19 in python(_PyFunction_Vectorcall+0x1a6) [0x560567697fc6]
#20 in python(_PyEval_EvalFrameDefault+0x947) [0x5605676b4c77]
#21 in python(_PyEval_EvalCodeWithName+0x2c3) [0x560567696db3]
#22 in python(PyEval_EvalCodeEx+0x39) [0x560567697e19]
#23 in python(PyEval_EvalCode+0x1b) [0x56056773a24b]
#24 in python(+0x2522e3) [0x56056773a2e3]
#25 in python(+0x26e543) [0x560567756543]
#26 in python(+0x273562) [0x56056775b562]
#27 in python(PyRun_SimpleFileExFlags+0x1b2) [0x56056775b742]
#28 in python(Py_RunMain+0x36d) [0x56056775bcbd]
#29 in python(Py_BytesMain+0x39) [0x56056775be79]
#30 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f1fc17a4b97]
#31 in python(+0x1e6d69) [0x5605676ced69]

Aborted (core dumped)

ztf-ucas added the ? - Needs Triage and bug labels on Jan 8, 2022
@cjnolet (Member) commented Jan 12, 2022

Thanks for opening an issue about this, @ztf-ucas. To isolate the cause of this failure, it would be helpful if you could provide a code snippet that we can use to reproduce it. It would also be useful to provide the dataset (or relevant details about it) if you are able.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@Brillone

Hi, the same issue is happening to me with different settings of the HDBSCAN model (some work).

For example, it fails with the following parameters:
model = HDBSCAN(min_cluster_size=15, min_samples=10)

A setting that did work:
model = HDBSCAN(min_cluster_size=5, min_samples=5)

My dataset has 2.5M samples with 64 dimensions (I can't provide the dataset).


@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@preet2312 commented Jan 25, 2023

@cjnolet @Brillone Hello, I encountered the same issue, where some parameter combinations work and some throw the same error when run as a Python script. (If I run it in a Jupyter notebook in VS Code, it instead gives an error like: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5.)

I am using 4.7M samples with 50 dimensions.

For example:
Doesn't work: model = HDBSCAN(min_cluster_size=1000, min_samples=10)
Works: model = HDBSCAN(min_cluster_size=1000, min_samples=5)

Please let me know if there's any update.

Thanks.

@mayurgd commented Mar 23, 2023

@cjnolet Hi, I'm facing this error while executing HDBSCAN:

terminate called after throwing an instance of 'raft::cuda_error'
what():  CUDA error encountered at: file=_deps/raft-src/cpp/include/raft/cudart_utils.h 
line=267: call='cudaMemcpyAsync(d_ptr1, d_ptr2, len * sizeof(Type),cudaMemcpyDeviceToDevice, stream)', 
Reason=cudaErrorInvalidValue:invalid argument

Below is reproducible code that gives the error for me:

import numpy as np
import pandas as pd
from cuml.cluster import HDBSCAN as HDBSCAN_gpu

X = np.array([[-14.01115608,  -5.37217331, 314.        ],
       [-17.31538773,  -6.12932587,  22.        ],
       [-17.88701439,  -7.00569153,  16.        ],
       [-17.91534615,  -7.40659523,  12.        ],
       [-13.57449722,  -3.70668411,  12.        ],
       [-14.97053146,  -6.00550461,  51.        ],
       [-15.5725193 ,  -5.07519722,   2.        ],
       [-13.31140137,  -3.99990654,   5.        ],
       [-13.84429169,  -4.01345634,   1.        ],
       [-17.02877998,  -6.42786789,  46.        ],
       [-15.09358597,  -5.4496851 ,  22.        ],
       [-17.52828217,  -6.86034393,   4.        ],
       [-15.57351112,  -5.61835861,   4.        ],
       [-14.20898056,  -4.61386681,   8.        ],
       [-14.45912552,  -5.47292137,   1.        ],
       [-15.27561951,  -4.74104977,   1.        ]])
test = pd.DataFrame(X, columns=['x','y','repeat'])
test = test.loc[test.index.repeat(test.repeat)].drop(columns='repeat')
hdb = HDBSCAN_gpu(
                min_samples=10,
                min_cluster_size=15,
                cluster_selection_method="eom",
                metric="euclidean",
                gen_min_span_tree=True,
            )

labels = hdb.fit_predict(test)

The HDBSCAN model runs without any error for min_samples < 5; anything greater than or equal to 5 gives the raft::cuda_error.
[cuML version 22.02.00]

@beckernick (Member)

We've made a variety of updates to HDBSCAN since v22.02. Does this error present if you use cuML 23.02?

@mayurgd commented Mar 23, 2023

@beckernick thanks for the response. I am in the process of upgrading my RAPIDS Docker image to version 23.02 and will update once that is done.
I'd also like one suggestion regarding HDBSCAN: should duplicate rows be removed before applying HDBSCAN, or should it be applied to the data with duplicate rows?
For example, per the above code snippet: should it be applied to X (the non-duplicated array) or test (the duplicated DataFrame)?

@beckernick (Member)

Duplicates can to some extent be seen as sample weights, and removing them might move your analysis farther away from the underlying ground-truth data distribution from which your data is implicitly sampled. I'd probably leave them in.

@beckernick (Member)

@preet2312, do you have any information about the environment (library versions) and system platform with which you experienced this issue?

@mayurgd commented Mar 28, 2023

@beckernick I updated RAPIDS to v23.02 using the rapidsai/rapidsai-core:23.02-cuda11.2-runtime-ubuntu20.04-py3.8 image. I still get the raft::cuda_error for the above-mentioned example.

Error logs:
The Databricks driver logs show:

terminate called after throwing an instance of 'raft::cuda_error'
what(): CUDA error encountered at: file=/databricks/conda/envs/rapids/include/raft/util/cudart_utils.hpp line=278:

The Databricks notebook shows:

ConnectException: Connection refused (Connection refused)
Error while obtaining a new communication channel
ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.

@beckernick (Member) commented Mar 28, 2023

Thanks for testing in 23.02 and creating a minimal reproducible example. I can reproduce this behavior.

The underlying error appears to be that a single-linkage solution can't be found in at least some scenarios, and this error is not caught and propagated back up to Python.

With REPS = 10000 I can reproduce this consistently. With smaller REPS, I can reproduce it intermittently.

import numpy as np
from cuml.cluster import HDBSCAN

REPS = 10000

X = np.arange(12)
tiled = np.tile(X, REPS).reshape(-1, 3)

clusterer = HDBSCAN()
clusterer.fit(tiled)
terminate called after throwing an instance of 'raft::logic_error'
  what():  RAFT failure at file=/opt/conda/conda-bld/work/cpp/src/hdbscan/detail/condense.cuh line=88: Multiple components found in MST or MST is invalid. Cannot find single-linkage solution. Found 79997 vertices total.
Obtained 56 stack frames
#0 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x3b) [0x7fe0782bfb8b]
#1 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft11logic_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd) [0x7fe0782c040d]
#2 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML7HDBSCAN6detail8Condense25build_condensed_hierarchyIifLi256EEEvRKN4raft8handle_tEPKT_PKT0_SA_iiRNS0_6Common18CondensedHierarchyIS8_SB_EE+0x10f6) [0x7fe07881a936]
#3 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML7HDBSCAN12_fit_hdbscanIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsEPT_PS6_RNSB_14hdbscan_outputISE_S6_EE+0x1d5) [0x7fe078835195]
#4 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML7hdbscanERKN4raft8handle_tEPKfmmNS0_8distance12DistanceTypeERNS_7HDBSCAN6Common13HDBSCANParamsERNS9_14hdbscan_outputIifEEPf+0x246) [0x7fe078750706]
#5 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/cluster/hdbscan/hdbscan.cpython-310-x86_64-linux-gnu.so(+0x74f2a) [0x7fdf60510f2a]
#6 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/base.cpython-310-x86_64-linux-gnu.so(+0x1c35f) [0x7fdf6096135f]
#7 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyObject_Call+0x209) [0x55e944f23209]
#8 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#9 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c6e1) [0x55e944f226e1]
#10 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#11 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#12 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#13 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c581) [0x55e944f22581]
#14 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x4d0d) [0x55e944f0b23d]
#15 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1db6a2) [0x55e944fb16a2]
#16 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyEval_EvalCode+0x87) [0x55e944fb15e7]
#17 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e2c30) [0x55e944fb8c30]
#18 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x140d14) [0x55e944f16d14]
#19 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x332) [0x55e944f06862]
#20 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e530d) [0x55e944fbb30d]
#21 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x1bb1) [0x55e944f080e1]
#22 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e530d) [0x55e944fbb30d]
#23 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x1bb1) [0x55e944f080e1]
#24 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e530d) [0x55e944fbb30d]
#25 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1faada) [0x55e944fd0ada]
#26 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14b41f) [0x55e944f2141f]
#27 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#28 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#29 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x332) [0x55e944f06862]
#30 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#31 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#32 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c581) [0x55e944f22581]
#33 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x13d0) [0x55e944f07900]
#34 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#35 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#36 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#37 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#38 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#39 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#40 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c581) [0x55e944f22581]
#41 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyObject_Call+0xb8) [0x55e944f230b8]
#42 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#43 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#44 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x332) [0x55e944f06862]
#45 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1db6a2) [0x55e944fb16a2]
#46 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyEval_EvalCode+0x87) [0x55e944fb15e7]
#47 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x20e3fc) [0x55e944fe43fc]
#48 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x2092d4) [0x55e944fdf2d4]
#49 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x9758d) [0x55e944e6d58d]
#50 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyRun_SimpleFileObject+0x1b5) [0x55e944fd94f5]
#51 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyRun_AnyFileObject+0x43) [0x55e944fd90a3]
#52 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(Py_RunMain+0x399) [0x55e944fd6279]
#53 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(Py_BytesMain+0x39) [0x55e944fa3dc9]
#54 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fe0fe28d083]
#55 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1cdcc1) [0x55e944fa3cc1]

Aborted (core dumped)

cc @cjnolet @tarang-jain @divyegala, as you may have looked at this code recently

@cjnolet (Member) commented Mar 28, 2023

@beckernick I think the scenario causing the convergence error is generally unlikely to happen in practice. It looks like the number of duplicated rows is causing the MST to disregard additional edges. If that case does in fact end up becoming a show-stopper on real datasets, I think we should definitely figure out a way around it; however, the error you are receiving explains exactly what's going on: there's just not enough information to connect the graph because of the duplicated edges, and we need a connected graph in order to build the dendrogram.

I slightly tweaked the input and was able to reproduce the originally reported error. I do think we should investigate this one further (cc @tarang-jain, who is looking into this):

>>> import numpy as np
>>> from cuml.cluster import HDBSCAN

>>> 
>>> 
>>> REPS = 10000
>>> X = np.arange(500)
>>> tiled = np.tile(X, REPS).reshape(-1, 3)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ValueError: cannot reshape array of size 5000000 into shape (3)
>>> tiled = np.tile(X, REPS).reshape(-1, 10)
>>> clusterer = HDBSCAN()
>>> clusterer.fit(tiled)
terminate called after throwing an instance of 'raft::cuda_error'
 what():  CUDA error encountered at: file=/home/cjnolet/miniconda3/envs/cuml_2304_032323/include/raft/util/cudart_utils.hpp line=244: call='cudaMemcpyAsync(d_ptr1, d_ptr2, len * sizeof(Type), cudaMemcpyDeviceToDevice, stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 29 stack frames
#0 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x84) [0x7f02ed253f84]
#1 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd) [0x7f02ed2549dd]
#2 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN4raft10copy_asyncIiEEvPT_PKS1_mN3rmm16cuda_stream_viewE+0x19a) [0x7f02ed6f7dfa]
#3 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN2ML7HDBSCAN13build_linkageIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsEPS6_RNSB_28robust_single_linkage_outputIT_S6_EE+0x19fa) [0x7f02ed7aa64a]
#4 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN2ML7HDBSCAN12_fit_hdbscanIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsEPT_PS6_RNSB_14hdbscan_outputISE_S6_EE+0xf1) [0x7f02ed7abdc1]
#5 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/libcuml++.so(_ZN2ML7hdbscanERKN4raft8handle_tEPKfmmNS0_8distance12DistanceTypeERNS_7HDBSCAN6Common13HDBSCANParamsERNS9_14hdbscan_outputIifEEPf+0x25a) [0x7f02ed6d87fa]
#6 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/python3.10/site-packages/cuml/cluster/hdbscan/hdbscan.cpython-310-x86_64-linux-gnu.so(+0x75f7d) [0x7f01c212df7d]
#7 in /home/cjnolet/miniconda3/envs/cuml_2304_032323/lib/python3.10/site-packages/cuml/internals/base.cpython-310-x86_64-linux-gnu.so(+0x248ea) [0x7f01c3fe98ea]
#8 in python(PyObject_Call+0x209) [0x55f302743139]
#9 in python(_PyEval_EvalFrameDefault+0x2ec2) [0x55f302729cb2]
#10 in python(+0x14b7a1) [0x55f3027427a1]
#11 in python(_PyEval_EvalFrameDefault+0x2ec2) [0x55f302729cb2]
#12 in python(_PyFunction_Vectorcall+0x6f) [0x55f302736f8f]
#13 in python(_PyEval_EvalFrameDefault+0x2ec2) [0x55f302729cb2]
#14 in python(+0x14b641) [0x55f302742641]
#15 in python(_PyEval_EvalFrameDefault+0x4d0d) [0x55f30272bafd]
#16 in python(+0x1d8a82) [0x55f3027cfa82]
#17 in python(PyEval_EvalCode+0x87) [0x55f3027cf9c7]
#18 in python(+0x20b82c) [0x55f30280282c]
#19 in python(+0x206704) [0x55f3027fd704]
#20 in python(+0x1173ae) [0x55f30270e3ae]
#21 in python(_PyRun_InteractiveLoopObject+0xcc) [0x55f30270e544]
#22 in python(+0x96790) [0x55f30268d790]
#23 in python(PyRun_AnyFileExFlags+0x4b) [0x55f30270e6be]
#24 in python(+0x93931) [0x55f30268a931]
#25 in python(Py_BytesMain+0x39) [0x55f3027c2089]
#26 in /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f03ff629d90]
#27 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f03ff629e40]
#28 in python(+0x1caf81) [0x55f3027c1f81]

Aborted (core dumped)

@divyegala (Member) commented Mar 29, 2023

It looks like the number of duplicated rows is causing the MST to disregard additional edges.

I agree with this analysis.

You can probably find a solution by artificially introducing minor random noise in [0, delta) (where delta is the narrowest edge difference) for every point before doing the reshape, so that the number of duplicates is reduced or eliminated. Maybe you can try that in your script, @beckernick; see the sketch below.
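
A minimal sketch of that jitter workaround, applied to the repro above. Note this is an illustration only: the noise scale delta and the seed are assumptions, not the exact "narrowest edge difference" described above.

import numpy as np
from cuml.cluster import HDBSCAN

rng = np.random.default_rng(42)

REPS = 10000
tiled = np.tile(np.arange(12), REPS).reshape(-1, 3).astype(np.float32)

# Assumed noise scale; ideally this would be smaller than the narrowest
# nonzero pairwise difference in the data.
delta = 1e-4
jittered = tiled + rng.uniform(0.0, delta, size=tiled.shape).astype(np.float32)

# With the exact duplicates broken up, the MST should have enough distinct
# edges to form a connected graph.
clusterer = HDBSCAN()
clusterer.fit(jittered)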

@beckernick (Member)

Thanks for the suggestions. I agree that the error is clear, but it's uncaught and causes a segfault. Python user code should ideally never cause a segfault, even in rare scenarios like this (I know this is unlikely to occur naturally). Can we catch and propagate this error up?

@tarang-jain (Contributor)

I was able to reproduce both @beckernick's error and @cjnolet's error by tweaking the arange parameter. I agree with @cjnolet's analysis, given the repeated zero-weight edges in the KNN graph. Also, since the number of repeated points is greater than min_samples, the core distances of all points would be zero. When I adjusted min_samples to be just greater than REPS, the error did not occur. I'll dig deeper to find the exact piece of code that causes this error.
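
To illustrate the zero-core-distance point, here is a small sketch (using scikit-learn's NearestNeighbors purely to inspect distances; the repeat count and min_samples values are arbitrary). With more exact copies of each row than min_samples, the distance to the min_samples-th nearest neighbor, i.e. the core distance, collapses to zero for every point.

import numpy as np
from sklearn.neighbors import NearestNeighbors

min_samples = 5
X = np.tile(np.arange(12, dtype=np.float64), 10).reshape(-1, 3)  # 10 copies of each distinct row

nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
dists, _ = nn.kneighbors(X)
core_dists = dists[:, -1]   # distance to the min_samples-th neighbor
print(core_dists.max())     # 0.0: every core distance is zero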

@MartinKlefas

@beckernick I think the scenario causing the convergence error is generally unlikely to happen in practice. It looks like the number of duplicated rows is causing the MST to disregard additional edges. If that case does in fact end up becoming a show-stopper on real datasets, I think we should definitely figure out a way around it; however, the error you are receiving explains exactly what's going on: there's just not enough information to connect the graph because of the duplicated edges, and we need a connected graph in order to build the dendrogram.

Hi, I'm getting this with a "real" dataset.

It's a set of images that I'm using for anomaly/outlier detection. The code runs on smaller numbers of similar images, up to around 350,000-400,000 samples, but if I go much beyond that I get this same crash behaviour. This only happens when I reduce the image vectors down to certain sizes through PCA, though, suggesting that I have inadvertently created multiple identical data points as mentioned above.

I'm happy to provide code and input-data samples if that will help. What's the best way to get them to you? Enough of the dataset to reproduce the error is around 10 GB.

@tarang-jain (Contributor) commented Apr 20, 2023

@MartinKlefas Have you tried increasing min_samples? Adding non-zero edges to the KNN graph should lead to convergence. If you can compute the maximum number of repeated rows in your dataset and set min_samples to be greater than that, it should work; see the sketch below.
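
A minimal sketch of that calculation, assuming the data fits in host memory. The synthetic X, the helper name, and the min_cluster_size choice are placeholders, not part of cuML:

import numpy as np
from cuml.cluster import HDBSCAN

def max_duplicate_count(X: np.ndarray) -> int:
    """Multiplicity of the most-repeated row in X."""
    _, counts = np.unique(X, axis=0, return_counts=True)
    return int(counts.max())

# Placeholder data: each of the 4 distinct rows appears 100 times.
X = np.tile(np.arange(12, dtype=np.float32), 100).reshape(-1, 3)

m = max_duplicate_count(X)                       # 100 here
clusterer = HDBSCAN(min_samples=m + 1,           # strictly above the repeat count
                    min_cluster_size=2 * (m + 1))
labels = clusterer.fit_predict(X)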

@MartinKlefas

@MartinKlefas Have you tried increasing min_samples? Adding non-zero edges to the KNN graph should lead to convergence. If you can compute the maximum number of repeated rows in your dataset and set min_samples to be greater than that, it should work.

Thanks, I didn't do the full computation, but just multiplied min_samples by 10 and the clustering ran again.

@NitinVishalKulkarni

I am experiencing this as well. My dataset is generated from a Reinforcement Learning environment (Atari Pong).

@terramars

I just ran into what I think is something related to this bug; the dataset is from the Vesuvius Challenge and is spatial. Unfortunately, it looks like there are no duplicates, so I'm not sure how to get closer to identifying the problem.

@terramars

I managed to pull the segfaulting data:

import numpy as np
from cuml.cluster import HDBSCAN

x = np.array([[168.5, 174.75, 243. ],
[172. , 125. , 249.5 ],
[172. , 172. , 245.5 ],
[172. , 172. , 245.75],
[172. , 172. , 246. ],
[172. , 172. , 246.25],
[172. , 172. , 246.5 ],
[172. , 172. , 246.75],
[172. , 172. , 247. ],
[172. , 174. , 246.25],
[172. , 174. , 246.5 ],
[172. , 174. , 246.75],
[172. , 174. , 247. ],
[172.25, 125.5 , 249.75],
[172.25, 172.25, 246.25],
[172.25, 172.5 , 247. ],
[172.25, 172.5 , 247.25],
[172.25, 173.75, 200.5 ],
[172.25, 173.75, 200.75],
[172.25, 174. , 247. ],
[172.5 , 172. , 245. ],
[172.5 , 172.25, 246.5 ],
[172.5 , 172.25, 246.75],
[172.5 , 172.25, 247. ],
[172.5 , 172.5 , 245.5 ],
[172.5 , 174. , 246.25],
[172.75, 172. , 245. ],
[172.75, 172.25, 246.75],
[172.75, 172.25, 247. ],
[172.75, 172.5 , 245.25],
[172.75, 174. , 245.75],
[173. , 172. , 245.25],
[173. , 172.25, 246.75],
[173. , 172.25, 247. ],
[173. , 174. , 245.5 ],
[173.25, 125. , 249.25],
[173.25, 125.25, 249.75],
[173.25, 170.5 , 245.75],
[173.25, 170.5 , 247.25],
[173.25, 172. , 245.5 ],
[173.25, 172. , 246.5 ],
[173.25, 172. , 246.75],
[173.25, 172. , 247. ],
[173.25, 172.25, 246.75],
[173.25, 172.25, 247. ],
[173.25, 174.25, 243.5 ],
[173.5 , 125. , 249.5 ],
[173.5 , 125.25, 249.75],
[173.5 , 125.5 , 249.5 ],
[173.5 , 125.5 , 249.75],
[173.5 , 170.5 , 246.75],
[173.5 , 170.5 , 247. ],
[173.5 , 172. , 245.25],
[173.5 , 172. , 246.5 ],
[173.5 , 172. , 246.75],
[173.5 , 172. , 247. ],
[173.5 , 172.25, 246.75],
[173.5 , 174. , 244.25],
[173.5 , 174.5 , 242.25],
[173.75, 125. , 249.75],
[173.75, 125.5 , 249.5 ],
[173.75, 125.5 , 249.75],
[173.75, 171.75, 239.5 ],
[173.75, 171.75, 239.75],
[173.75, 172. , 244.75],
[173.75, 172. , 245. ],
[173.75, 172. , 245.25],
[173.75, 172. , 246.5 ],
[173.75, 172. , 246.75],
[173.75, 174. , 241.5 ],
[173.75, 174. , 244. ],
[173.75, 174.25, 242.25],
[173.75, 174.5 , 241.75],
[174. , 125.5 , 249.5 ],
[174. , 125.5 , 249.75],
[174. , 125.75, 249.25],
[174. , 125.75, 249.5 ],
[174. , 125.75, 249.75],
[174. , 169.75, 249. ],
[174. , 170.25, 247.5 ],
[174. , 170.5 , 245.5 ],
[174. , 171.75, 239. ],
[174. , 171.75, 239.25],
[174. , 171.75, 239.5 ],
[174. , 172. , 245. ],
[174. , 172. , 246.25],
[174. , 172. , 246.5 ],
[174. , 172. , 246.75],
[174. , 173.75, 240.75],
[174. , 174. , 241.5 ],
[174.25, 125.5 , 249.5 ],
[174.25, 125.5 , 249.75],
[174.25, 125.75, 248.75],
[174.25, 125.75, 249. ],
[174.25, 125.75, 249.25],
[174.25, 125.75, 249.5 ],
[174.25, 125.75, 249.75],
[174.25, 172. , 245.25],
[174.25, 174. , 241.75],
[174.25, 174. , 243.75],
[174.5 , 125.5 , 249.5 ],
[174.5 , 125.5 , 249.75],
[174.5 , 125.75, 201.75],
[174.5 , 125.75, 202.75],
[174.5 , 125.75, 248.5 ],
[174.5 , 125.75, 248.75],
[174.5 , 125.75, 249. ],
[174.5 , 125.75, 249.25],
[174.5 , 125.75, 249.5 ],
[174.5 , 125.75, 249.75],
[174.5 , 174. , 242.25],
[174.5 , 174. , 243.5 ],
[174.75, 125.5 , 249.5 ],
[174.75, 125.5 , 249.75],
[174.75, 125.75, 201.5 ],
[174.75, 125.75, 203. ],
[174.75, 125.75, 248.5 ],
[174.75, 125.75, 248.75],
[174.75, 125.75, 249. ],
[174.75, 125.75, 249.25],
[174.75, 125.75, 249.5 ],
[174.75, 125.75, 249.75],
[174.75, 170.25, 247. ],
[174.75, 171.5 , 241.5 ],
[174.75, 174. , 241.5 ],
[174.75, 174. , 242.5 ],
[174.75, 174. , 242.75],
[174.75, 174. , 243. ],
[174.75, 174.75, 234.5 ]]
)

hdb = HDBSCAN(min_samples=10, min_cluster_size=10, allow_single_cluster=True)
db = hdb.fit(x)

@terramars

This result is not sensitive to the min_samples or min_cluster_size settings: all of them segfault. Dropping the last data element avoids the segfault, but the fit then hangs indefinitely.

@divyegala (Member)

@terramars while there are no explicit duplicates, it looks to me like all the points are quite close in distance to one another. What precision are you running with? Can you try running with np.float64?

@terramars

@terramars while there are no explicit duplicates, it looks to me like all the points are quite close in distance to one another. What precision are you running with? Can you try running with np.float64?

I am giving it float64; it seems like the fit method converts to float32, and it errors if you disable the conversion:

hdb = hdbscan.HDBSCAN(min_samples=10, min_cluster_size=10, allow_single_cluster=True)
db = hdb.fit(x, convert_dtype=False)

File ~/miniconda3/envs/thaumato/lib/python3.10/site-packages/cuml/internals/array.py:1135, in CumlArray.from_input(cls, X, order, deepcopy, check_dtype, convert_to_dtype, check_mem_type, convert_to_mem_type, safe_dtype_conversion, check_cols, check_rows, fail_on_order, force_contiguous)
1133 else:
1134 if not convert_to_dtype:
-> 1135 raise TypeError(
1136 f"Expected input to be of type in {check_dtype} but got"
1137 f" {arr.dtype}"
1138 )
1140 conversion_required = convert_to_dtype or (
1141 convert_to_mem_type and (convert_to_mem_type != arr.mem_type)
1142 )
1144 if conversion_required:

TypeError: Expected input to be of type in [dtype('float32')] but got float64

@terramars

This does appear to be a float32 vs. float64 issue; the following works:

hdb.fit_predict(x - x.mean(axis=0))

The returned labels match the 64-bit labels computed with sklearn.

It seems like there are two bugs here: (1) there is no option to run in float64, and (2) the data is not centered before clustering is run, which should be desirable in all circumstances.
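
For anyone else hitting this, a minimal sketch of that reported centering workaround, assuming x is the float64 array posted earlier in this thread and reusing the same parameter values:

import numpy as np
from cuml.cluster import HDBSCAN

# x is the float64 array from the comment above. Centering shrinks the
# coordinate magnitudes, so less precision is lost when the input is cast
# down to float32 internally.
x_centered = x - x.mean(axis=0)

hdb = HDBSCAN(min_samples=10, min_cluster_size=10, allow_single_cluster=True)
labels = hdb.fit_predict(x_centered)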

@divyegala (Member)

@terramars I don't think (2) is a bug, because we are trying to maintain fidelity to the hdbscan package API, and as far as I am aware it does not automatically center the data before clustering (please correct me if I am wrong).

We will look into solving (1).

@terramars commented Oct 7, 2024 via email

@divyegala (Member)

@terramars I think the issue with centering in general is that it goes into the territory of manipulating the data without transparency to the user. Not to mention that we would need to create a copy of the data and then center it, and GPU memory is expensive.

I agree, of course, that the segfault is a problem, and we'll prioritize making the UX smoother. At the very least, we should try to fail gracefully with a clearer error message.
