[BUG] terminate called after throwing an instance of 'raft::cuda_error' #4474
Thanks for opening an issue about this @ztf-ucas. To isolate the cause of this failure, it would be helpful if you could provide a code snippet that we can use to reproduce it. It would also be useful to provide the dataset (or relevant details) if you are able.
Hi, the same issue is happening to me with certain settings for the HDBSCAN model, while other settings work. My dataset has 2.5M samples with 64 dimensions (I can't provide the dataset).
@cjnolet @Brillone Hello, I encountered the same issue: some parameter combinations work and some throw the same error when run as a Python script, while running in a Jupyter notebook in VS Code gives a different error. I am using 4.7M samples with a dimension of 50. Please let me know if there's any update. Thanks.
@cjnolet Hi, I'm facing this error while executing HDBSCAN. Below is reproducible code that gives an error for me:

```python
import numpy as np
import pandas as pd
from cuml.cluster import HDBSCAN as HDBSCAN_gpu
X = np.array([[-14.01115608, -5.37217331, 314. ],
[-17.31538773, -6.12932587, 22. ],
[-17.88701439, -7.00569153, 16. ],
[-17.91534615, -7.40659523, 12. ],
[-13.57449722, -3.70668411, 12. ],
[-14.97053146, -6.00550461, 51. ],
[-15.5725193 , -5.07519722, 2. ],
[-13.31140137, -3.99990654, 5. ],
[-13.84429169, -4.01345634, 1. ],
[-17.02877998, -6.42786789, 46. ],
[-15.09358597, -5.4496851 , 22. ],
[-17.52828217, -6.86034393, 4. ],
[-15.57351112, -5.61835861, 4. ],
[-14.20898056, -4.61386681, 8. ],
[-14.45912552, -5.47292137, 1. ],
[-15.27561951, -4.74104977, 1. ]])
test = pd.DataFrame(X, columns=['x','y','repeat'])
test = test.loc[test.index.repeat(test.repeat)].drop(columns='repeat')
hdb = HDBSCAN_gpu(
min_samples=10,
min_cluster_size=15,
cluster_selection_method="eom",
metric="euclidean",
gen_min_span_tree=True,
)
labels = hdb.fit_predict(test)
```

The HDBSCAN model runs without any error for min_samples < 5; anything greater than or equal to 5 gives raft::cuda_error.
We've made a variety of updates to HDBSCAN since v22.02. Does this error present if you use cuML 23.02? |
@beckernick thanks for the response. I am in the process of upgrading my RAPIDS docker image to version 23.02. Will update once that is done.
Duplicates can to some extent be seen as sample weights, and removing them might move your analysis farther away from the underlying ground-truth data distribution from which your data is implicitly sampled. I'd probably leave them in.
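For illustration, here is a minimal NumPy sketch (not cuML-specific) of the duplicates-as-sample-weights view; `X` is just synthetic data:

```python
import numpy as np

# Rounding low-dimensional random data manufactures duplicate rows.
X = np.random.default_rng(0).random((1000, 2)).round(1)

# Collapse duplicates into unique rows plus per-row counts; the counts
# are exactly the sample weights the deduplicated rows would carry.
X_unique, counts = np.unique(X, axis=0, return_counts=True)
print(X.shape[0], X_unique.shape[0], counts.max())
```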
@preet2312 , do you have any information about your environment (library versions) and system platforms with which you experienced this issue? |
@beckernick I updated RAPIDS to v23.02 using the rapidsai/rapidsai-core:23.02-cuda11.2-runtime-ubuntu20.04-py3.8 image. I still get 'raft::cuda_error' for the above-mentioned example, and the Databricks notebook shows the error as well.
Thanks for testing in 23.02 and creating a minimal reproducible example. I can reproduce this behavior. The underlying error appears to be that a single-linkage solution can't be found in at least some scenarios, and this error is not caught and propagated back up to Python. With the following example:

```python
import numpy as np
from cuml.cluster import HDBSCAN
REPS = 10000
X = np.arange(12)
tiled = np.tile(X, REPS).reshape(-1, 3)
clusterer = HDBSCAN()
clusterer.fit(tiled)
```

```text
terminate called after throwing an instance of 'raft::logic_error'
what(): RAFT failure at file=/opt/conda/conda-bld/work/cpp/src/hdbscan/detail/condense.cuh line=88: Multiple components found in MST or MST is invalid. Cannot find single-linkage solution. Found 79997 vertices total.
Obtained 56 stack frames
#0 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x3b) [0x7fe0782bfb8b]
#1 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN4raft11logic_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd) [0x7fe0782c040d]
#2 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML7HDBSCAN6detail8Condense25build_condensed_hierarchyIifLi256EEEvRKN4raft8handle_tEPKT_PKT0_SA_iiRNS0_6Common18CondensedHierarchyIS8_SB_EE+0x10f6) [0x7fe07881a936]
#3 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML7HDBSCAN12_fit_hdbscanIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsEPT_PS6_RNSB_14hdbscan_outputISE_S6_EE+0x1d5) [0x7fe078835195]
#4 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/../../../../libcuml++.so(_ZN2ML7hdbscanERKN4raft8handle_tEPKfmmNS0_8distance12DistanceTypeERNS_7HDBSCAN6Common13HDBSCANParamsERNS9_14hdbscan_outputIifEEPf+0x246) [0x7fe078750706]
#5 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/cluster/hdbscan/hdbscan.cpython-310-x86_64-linux-gnu.so(+0x74f2a) [0x7fdf60510f2a]
#6 in /home/nicholasb/miniconda3/envs/rapids-23.02/lib/python3.10/site-packages/cuml/internals/base.cpython-310-x86_64-linux-gnu.so(+0x1c35f) [0x7fdf6096135f]
#7 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyObject_Call+0x209) [0x55e944f23209]
#8 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#9 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c6e1) [0x55e944f226e1]
#10 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#11 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#12 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#13 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c581) [0x55e944f22581]
#14 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x4d0d) [0x55e944f0b23d]
#15 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1db6a2) [0x55e944fb16a2]
#16 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyEval_EvalCode+0x87) [0x55e944fb15e7]
#17 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e2c30) [0x55e944fb8c30]
#18 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x140d14) [0x55e944f16d14]
#19 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x332) [0x55e944f06862]
#20 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e530d) [0x55e944fbb30d]
#21 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x1bb1) [0x55e944f080e1]
#22 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e530d) [0x55e944fbb30d]
#23 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x1bb1) [0x55e944f080e1]
#24 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1e530d) [0x55e944fbb30d]
#25 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1faada) [0x55e944fd0ada]
#26 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14b41f) [0x55e944f2141f]
#27 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#28 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#29 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x332) [0x55e944f06862]
#30 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#31 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#32 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c581) [0x55e944f22581]
#33 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x13d0) [0x55e944f07900]
#34 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#35 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#36 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#37 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#38 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#39 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x735) [0x55e944f06c65]
#40 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x14c581) [0x55e944f22581]
#41 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyObject_Call+0xb8) [0x55e944f230b8]
#42 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x2ec3) [0x55e944f093f3]
#43 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyFunction_Vectorcall+0x6f) [0x55e944f16b1f]
#44 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyEval_EvalFrameDefault+0x332) [0x55e944f06862]
#45 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1db6a2) [0x55e944fb16a2]
#46 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(PyEval_EvalCode+0x87) [0x55e944fb15e7]
#47 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x20e3fc) [0x55e944fe43fc]
#48 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x2092d4) [0x55e944fdf2d4]
#49 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x9758d) [0x55e944e6d58d]
#50 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyRun_SimpleFileObject+0x1b5) [0x55e944fd94f5]
#51 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(_PyRun_AnyFileObject+0x43) [0x55e944fd90a3]
#52 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(Py_RunMain+0x399) [0x55e944fd6279]
#53 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(Py_BytesMain+0x39) [0x55e944fa3dc9]
#54 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fe0fe28d083]
#55 in /home/nicholasb/miniconda3/envs/rapids-23.02/bin/python(+0x1cdcc1) [0x55e944fa3cc1]
Aborted (core dumped)
```

cc @cjnolet @tarang-jain @divyegala, as you may have looked at this code recently.
@beckernick I think the scenario causing the convergence error you're seeing is generally not likely to happen in practice. It looks like the duplicated rows are causing the MST to disregard additional edges. If that case does in fact end up becoming a show-stopper on real datasets, I think we should definitely figure out a way around it; however, the error you are receiving explains what's going on: there's just not enough information provided to connect the graph because of the duplicated edges, and we need a connected graph in order to build the dendrogram. I slightly tweaked the input and was able to reproduce the originally reported error. I do think we should investigate this one further (cc @tarang-jain, who is looking into this).
I agree with this analysis. You can probably find a solution by artificially introducing minor random noise into the dataset.
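For instance, a rough sketch along these lines (the noise scale of 1e-4 is a guess and would need tuning so it stays far below meaningful distances in your data):

```python
import numpy as np
from cuml.cluster import HDBSCAN

# Heavily duplicated rows, as in the repro above.
X = np.tile(np.arange(12, dtype=np.float32), 10000).reshape(-1, 3)

# Tiny Gaussian jitter breaks the exact ties between duplicate rows,
# so the MST has enough distinct edges to connect the graph.
rng = np.random.default_rng(42)
X_jittered = X + rng.normal(scale=1e-4, size=X.shape).astype(X.dtype)

labels = HDBSCAN().fit_predict(X_jittered)
```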
Thanks for the suggestions. I agree that the error is clear, but it's uncaught and causes a segfault. Python user code should ideally not cause a segfault, even in rare scenarios like this (I know this is unlikely to occur naturally). Can we catch and propagate this error up?
I was able to reproduce both @beckernick's error and @cjnolet's error by tweaking the
Hi, I'm getting this with a "real" dataset. It's a set of images that I'm using for anomaly/outlier detection. The code seems to run on smaller numbers of similar images, up to around 350,000-400,000 samples, but if I go much beyond that then I get this same crash behaviour. This only happens when I reduce the image vectors down to certain sizes through PCA, though, suggesting that I have inadvertently created multiple identical data points as mentioned above. I'm happy to provide code and input-data samples if that'll help; how's it best to get it over to you? Enough of the dataset to reproduce the error is around 10GB.
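For reference, a quick way to check that suspicion might look like this (`X_pca` is a hypothetical stand-in for the reduced image vectors):

```python
import numpy as np

# X_pca stands in for the PCA-reduced vectors; low precision plus low
# dimensionality makes exact row collisions likely.
X_pca = np.random.default_rng(0).random((400_000, 3)).astype(np.float32).round(2)

n_unique = np.unique(X_pca, axis=0).shape[0]
print(f"{X_pca.shape[0] - n_unique} duplicate rows out of {X_pca.shape[0]}")
```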
@MartinKlefas Have you tried to increase
Thanks, I didn't do the full computation, but just multiplied
I am experiencing this as well. My dataset is generated from a Reinforcement Learning environment (Atari Pong).
I just ran into what I think is something related to this bug. The dataset is related to the Vesuvius Challenge and is spatial. Unfortunately, it looks like there are no duplicates, though, and consequently I'm not sure how to get closer to identifying the problem.
Managed to pull the segfaulting data:

```python
x = np.array([[168.5, 174.75, 243. ],
              ...

hdb = HDBSCAN(min_samples=10, min_cluster_size=10, allow_single_cluster=True)
```
This result is not sensitive to the number of samples or cluster size; all segfault. Dropping the last data element doesn't segfault, but it does hang indefinitely.
@terramars While there are no explicit duplicates, it looks to me like all the points are quite close in distance to the previous point. What precision are you running with? Can you try running with
I am giving it float64; it seems like the fit method converts to float32 and errors if you disable conversion:

```python
hdb = hdbscan.HDBSCAN(min_samples=10, min_cluster_size=10, allow_single_cluster=True)
```

```text
File ~/miniconda3/envs/thaumato/lib/python3.10/site-packages/cuml/internals/array.py:1135, in CumlArray.from_input(cls, X, order, deepcopy, check_dtype, convert_to_dtype, check_mem_type, convert_to_mem_type, safe_dtype_conversion, check_cols, check_rows, fail_on_order, force_contiguous)

TypeError: Expected input to be of type in [dtype('float32')] but got float64
```
This does appear to be a float32 vs. float64 issue; the following works:

```python
hdb.fit_predict(x - x.mean(axis=0))
```

The returned labels match the 64-bit labels computed from sklearn. It seems like there are two bugs here: 1) there is no option to run in float64, and 2) the data is not centered before clustering is run, which should be desirable in all circumstances.
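To illustrate why centering helps here, a small plain-NumPy demonstration (independent of cuML): at magnitudes of a few hundred, float32 cannot represent differences much below 1e-5, so nearby float64 points collapse into exact duplicates when converted.

```python
import numpy as np

a = np.float64(168.5)
b = a + 1e-6                            # distinct from a in float64
print(np.float32(a) == np.float32(b))   # True: the cast merges them

# np.spacing gives the gap to the next representable value; float32
# resolution near 168.5 is ~1.5e-5, but near 0.5 it is ~6e-8.
print(np.spacing(np.float32(168.5)), np.spacing(np.float32(0.5)))
```

Centering shifts the coordinates toward zero, where float32 spacing is orders of magnitude finer, so nearby points stay distinct after the conversion.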
@terramars I don't think 2 is a bug, because we are trying to maintain fidelity to the hdbscan package API, and as far as I am aware they do not automatically center the data before clustering - please correct me if I am wrong. We will look into solving 1.
That would still be great. There's definitely precedent of the likes of sklearn centering data before solving and storing the centers in a variable to address numerical stability issues. It definitely isn't API compatible to segfault on reasonable input either!
@terramars I think the issue with centering in general is that it goes into the territory of manipulating the data without transparency to the user. Not to mention that we'd need to create a copy of the data and then center it, and GPU memory is expensive. I agree, of course, that the segfault is a problem, and we'll prioritize making the UX smoother. At the very least we should try to fail gracefully with a clearer error message.
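In the meantime, a user-side guard along these lines might avoid the crash entirely (a hypothetical, untested helper that combines the centering and jitter suggestions from earlier in the thread):

```python
import numpy as np

def fit_hdbscan_defensively(model, X, jitter=1e-6, seed=0):
    """Center, then break exact duplicate rows with tiny noise before
    fitting, since duplicates are what trigger the crash above."""
    X = np.asarray(X, dtype=np.float32)
    X = X - X.mean(axis=0)  # center for better float32 resolution
    if np.unique(X, axis=0).shape[0] < X.shape[0]:
        rng = np.random.default_rng(seed)
        X = X + rng.normal(scale=jitter, size=X.shape).astype(X.dtype)
    return model.fit_predict(X)
```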
Hi, I'm using cuml.HDBSCAN and encountered the following problem:
```text
terminate called after throwing an instance of 'raft::cuda_error'
what(): CUDA error encountered at: file=_deps/raft-src/cpp/include/raft/cudart_utils.h line=267: call='cudaMemcpyAsync(d_ptr1, d_ptr2, len * sizeof(Type), cudaMemcpyDeviceToDevice, stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 32 stack frames
#0 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x46) [0x7f1bd4f95056]
#1 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft10cuda_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xc9) [0x7f1bd4f95e39]
#2 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft10copy_asyncIiEEvPT_PKS1_mN3rmm16cuda_stream_viewE+0x138) [0x7f1bd522f948]
#3 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN4raft9hierarchy6detail16build_sorted_mstIifN2ML7HDBSCAN22FixConnectivitiesRedOpIifEEEEvRKNS_8handle_tEPKT0_PKT_SF_SC_mmPSD_SG_PSA_SG_mT1_NS_8distance12DistanceTypeEi+0x4c2) [0x7f1bd527e942]
#4 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN2ML7HDBSCAN13build_linkageIifEEvRKN4raft8handle_tEPKT0_mmNS2_8distance12DistanceTypeERNS0_6Common13HDBSCANParamsERNSB_28robust_single_linkage_outputIT_S6_EE+0x372) [0x7f1bd5281512]
#5 in /opt/conda/envs/cuml-dev-11.0/lib/libcuml++.so(_ZN2ML7hdbscanERKN4raft8handle_tEPKfmmNS0_8distance12DistanceTypeERNS_7HDBSCAN6Common13HDBSCANParamsERNS9_14hdbscan_outputIifEE+0x7e) [0x7f1bd521759e]
#6 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x43ec2) [0x7f1de251cec2]
#7 in python(PyObject_Call+0x24d) [0x56056760d35d]
#8 in python(_PyEval_EvalFrameDefault+0x21bf) [0x5605676b64ef]
#9 in python(_PyEval_EvalCodeWithName+0x2c3) [0x560567696db3]
#10 in python(PyEval_EvalCodeEx+0x39) [0x560567697e19]
#11 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x2c298) [0x7f1de2505298]
#12 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x2c4f9) [0x7f1de25054f9]
#13 in /opt/conda/envs/cuml-dev-11.0/lib/python3.8/site-packages/cuml/cluster/hdbscan.cpython-38-x86_64-linux-gnu.so(+0x3c072) [0x7f1de2515072]
#14 in python(PyObject_Call+0x24d) [0x56056760d35d]
#15 in python(_PyEval_EvalFrameDefault+0x21bf) [0x5605676b64ef]
#16 in python(_PyEval_EvalCodeWithName+0x2c3) [0x560567696db3]
#17 in python(+0x1b08b7) [0x5605676988b7]
#18 in python(_PyEval_EvalFrameDefault+0x4e03) [0x5605676b9133]
#19 in python(_PyFunction_Vectorcall+0x1a6) [0x560567697fc6]
#20 in python(_PyEval_EvalFrameDefault+0x947) [0x5605676b4c77]
#21 in python(_PyEval_EvalCodeWithName+0x2c3) [0x560567696db3]
#22 in python(PyEval_EvalCodeEx+0x39) [0x560567697e19]
#23 in python(PyEval_EvalCode+0x1b) [0x56056773a24b]
#24 in python(+0x2522e3) [0x56056773a2e3]
#25 in python(+0x26e543) [0x560567756543]
#26 in python(+0x273562) [0x56056775b562]
#27 in python(PyRun_SimpleFileExFlags+0x1b2) [0x56056775b742]
#28 in python(Py_RunMain+0x36d) [0x56056775bcbd]
#29 in python(Py_BytesMain+0x39) [0x56056775be79]
#30 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f1fc17a4b97]
#31 in python(+0x1e6d69) [0x5605676ced69]
Aborted (core dumped)
```