[BUG] Kmeans MNMG Notebook: NCCL Error #2458

Closed
Quentin-Anthony opened this issue Jun 21, 2020 · 11 comments
Labels
bug Something isn't working

Comments

@Quentin-Anthony

Describe the bug
I'm running the k-means MNMG notebook example on an HPC system, and calling 'kmeans_cuml.fit(X_cudf)' gives the error:

Exception occured! file=/home/qanthony/cuml-work/cuml/cpp/comms/std/src/cuML_std_comms_impl.cpp line=451: ERROR: NCCL call='ncclBroadcast(buff, buff, count, getNCCLDatatype(datatype), root, _nccl_comm, stream)'. Reason:invalid argument

Full trace on Client:

distributed.worker - WARNING -  Compute Failed
Function:  _func_fit
args:      (b'\xbe\x02W=\x02cD\xf0\x9a\x8fK\x1b \xa6\xea\xd2', [               0         1
0       2.922320 -9.099419
1       2.912196 -9.070423
2       2.764461 -9.133838
3       2.816539 -9.198503
4       2.786984 -9.283189
...          ...       ...
499995 -3.900687  0.506668
499996 -3.847557  0.497181
499997 -3.718308  0.505410
499998 -3.928917  0.781890
499999 -3.872833  0.709343

[500000 rows x 2 columns]], 'cudf')
kwargs:    {'init': 'k-means||', 'n_clusters': 5, 'random_state': 100}
Exception: RuntimeError("Exception occured! file=/home/qanthony/cuml-work/cuml/cpp/comms/std/src/cuML_std_comms_impl.cpp line=451: ERROR: NCCL call='ncclBroadcast(buff, buff, count, getNCCLDatatype(datatype), root, _nccl_comm, stream)'. Reason:invalid argument

Full trace on Worker:

Obtained 33 stack frames
#0 in /home/qanthony/cuml-work/miniconda3/lib/python3.7/site-packages/cuml/utils/pointer_utils.cpython-37m-x86_64-linux-gnu.so(_ZN8MLCommon9Exception16collectCallStackEv+0x2b) [0x2af2cdbe2edb]
#1 in /home/qanthony/cuml-work/miniconda3/lib/python3.7/site-packages/cuml/utils/pointer_utils.cpython-37m-x86_64-linux-gnu.so(_ZN8MLCommon9ExceptionC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x62) [0x2af2cdbe3c22]
#2 in /home/qanthony/cuml-work/miniconda3/lib/libcumlcomms.so(_ZNK2ML24cumlStdCommunicator_impl5bcastEPviN8MLCommon16cumlCommunicator10datatype_tEiP11CUstream_st+0x142) [0x2af2d3fa14f2]
#3 in /home/qanthony/cuml-work/miniconda3/lib/libcumlprims.so(_ZN2ML6kmeans3opg18initKMeansPlusPlusIfiEEvRKNS_15cumlHandle_implERKNS0_12KMeansParamsERNS_6TensorIT_Li2ET0_EERN8MLCommon11buffer_baseISA_NSE_15deviceAllocatorEEERNSF_IcSG_EE+0x413) [0x2af2d42b4cc3]
#4 in /home/qanthony/cuml-work/miniconda3/lib/libcumlprims.so(_ZN2ML6kmeans3opg3fitIfiEEvRKNS_15cumlHandle_implERKNS0_12KMeansParamsEPKT_iiPS9_RS9_Ri+0x561) [0x2af2d42b32d1]
#5 in /home/qanthony/cuml-work/miniconda3/lib/libcumlprims.so(_ZN2ML6kmeans3opg3fitERKNS_10cumlHandleERKNS0_12KMeansParamsEPKfiiPfRfRi+0x59) [0x2af2d42ae1b9]
#6 in /home/qanthony/cuml-work/miniconda3/lib/python3.7/site-packages/cuml/cluster/kmeans_mg.cpython-37m-x86_64-linux-gnu.so(+0xf2a8) [0x2af357ff22a8]
#7 in /home/qanthony/cuml-work/miniconda3/lib/python3.7/site-packages/cuml/cluster/kmeans_mg.cpython-37m-x86_64-linux-gnu.so(+0x125a5) [0x2af357ff55a5]
#8 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyObject_FastCallKeywords+0x48b) [0x55f631b3700b]
#9 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x51d1) [0x55f631b9b9a1]
#10 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyEval_EvalCodeWithName+0x2f9) [0x55f631adf2b9]
#11 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyFunction_FastCallDict+0x400) [0x55f631ae0610]
#12 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x1d4a) [0x55f631b9851a]
#13 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyEval_EvalCodeWithName+0x5da) [0x55f631adf59a]
#14 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyFunction_FastCallDict+0x400) [0x55f631ae0610]
#15 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x1d4a) [0x55f631b9851a]
#16 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyFunction_FastCallDict+0x10b) [0x55f631ae031b]
#17 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x1d4a) [0x55f631b9851a]
#18 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x55f631b2f20b]
#19 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x55f631b96e70]
#20 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyFunction_FastCallDict+0x10b) [0x55f631ae031b]
#21 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x1d4a) [0x55f631b9851a]
#22 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x55f631b2f20b]
#23 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x55f631b96e70]
#24 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x55f631b2f20b]
#25 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x55f631b96e70]
#26 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyFunction_FastCallDict+0x10b) [0x55f631ae031b]
#27 in /home/qanthony/cuml-work/miniconda3/bin/python(_PyObject_Call_Prepend+0x63) [0x55f631afeb93]
#28 in /home/qanthony/cuml-work/miniconda3/bin/python(PyObject_Call+0x6e) [0x55f631af195e]
#29 in /home/qanthony/cuml-work/miniconda3/bin/python(+0x224fa7) [0x55f631beefa7]
#30 in /home/qanthony/cuml-work/miniconda3/bin/python(+0x1dfc48) [0x55f631ba9c48]
#31 in /lib64/libpthread.so.0(+0x7dd5) [0x2af21daa2dd5]
#32 in /lib64/libc.so.6(clone+0x6d) [0x2af21ddb4ead]
, Exception occured! file=/home/qanthony/cuml-work/cuml/cpp/comms/std/src/cuML_std_comms_impl.cpp line=451: ERROR: NCCL call='ncclBroadcast(buff, buff, count, getNCCLDatatype(datatype), root, _nccl_comm, stream)'. Reason:invalid argument

Steps/Code to reproduce bug

  • Initialize dask cluster via dask-mpi
  • Run following script:
from cuml.dask.cluster.kmeans import KMeans as cuKMeans
from cuml.dask.common import to_dask_df
from cuml.dask.datasets import make_blobs
from cuml.metrics import adjusted_rand_score
from dask.distributed import Client, wait
from dask_ml.cluster import KMeans as skKMeans

c = Client(scheduler_file='/path/to/scheduler.json')

n_samples = 1000000
n_features = 2
n_total_partitions = len(list(c.has_what().keys()))

X_cudf, Y_cudf = make_blobs(n_samples,
                            n_features,
                            centers = 5,
                            n_parts = n_total_partitions,
                            cluster_std=0.1,
                            verbose=True)

wait(X_cudf)

X_df = to_dask_df(X_cudf)

kmeans_cuml = cuKMeans(init="k-means||",
                       n_clusters=5,
                       random_state=100)

kmeans_cuml.fit(X_cudf)

Environment details (please complete the following information):

  • Environment location: [HPC cluster]
  • Linux Distro/Architecture: [CentOS 7.5.1810, x86_64]
  • GPU Model/Driver: [P100 and driver 418.67]
  • CUDA: [10.1]
  • Method of cuDF & cuML install: [from source]
    • gcc 7.2.0
    • cmake 3.14.0
    • cudf 0.13.0 (from conda)
    • cuml (from source) (commit hash: 7544c43)

Additional context
I have no issues running MNMG random forests or MNMG nearest neighbors with the same methods. This issue seems to be specific to k-means.
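For context, the init="k-means||" setting used here is the scalable k-means++ initialization (the initKMeansPlusPlus frame in the trace above is where the failing ncclBroadcast happens). A minimal single-process sketch of the idea in pure Python, with a simplified candidate-reduction step (this is not cuML's actual implementation):

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two points (tuples).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cost(points, centers):
    # Total squared distance from each point to its nearest center.
    return sum(min(dist2(p, c) for c in centers) for p in points)

def kmeans_parallel_init(points, k, oversample=2.0, rounds=5, seed=0):
    # Sketch of k-means|| initialization (Bahmani et al., 2012):
    # oversample candidate centers in a few passes, then reduce to k.
    rng = random.Random(seed)
    centers = [rng.choice(points)]      # one uniform seed point
    for _ in range(rounds):
        total = cost(points, centers)
        if total == 0:                  # every point already covered
            break
        # Sample each point with probability proportional to its
        # contribution to the current cost (l = oversample * k).
        new = [p for p in points
               if rng.random() < oversample * k *
                  min(dist2(p, c) for c in centers) / total]
        centers.extend(new)
    # Weight each candidate by the number of points closest to it.  The
    # full algorithm reclusters the weighted candidates down to k with
    # k-means++; keeping the k heaviest candidates is a simplified
    # stand-in here.
    weights = [0] * len(centers)
    for p in points:
        weights[min(range(len(centers)),
                    key=lambda i: dist2(p, centers[i]))] += 1
    heaviest = sorted(range(len(centers)), key=lambda i: -weights[i])[:k]
    return [centers[i] for i in heaviest]
```

In the multi-node multi-GPU version, the sampling pass runs per partition and the candidates are shared across workers, which is where the NCCL broadcast in the stack trace comes in.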

@Quentin-Anthony Quentin-Anthony added ? - Needs Triage Need team to review and classify bug Something isn't working labels Jun 21, 2020
@otavioon

Hello,

I am also having the same problem with the MNMG K-means implementation in an HPC environment.

@cjnolet
Member

cjnolet commented Jun 26, 2020

@Quentin-Anthony @otavioon,

Can you provide the output of conda list?

@Quentin-Anthony
Author

Quentin-Anthony commented Jun 26, 2020

@cjnolet

Sure.

$ conda list

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
arrow-cpp                 0.15.0           py37h090bef1_2    conda-forge
autoconf                  2.69            pl526h14c3975_9    conda-forge
automake                  1.16.2                  pl526_0    conda-forge
blas                      1.0                    openblas
bokeh                     2.1.0            py37hc8dfbb8_0    conda-forge
boost-cpp                 1.70.0               h8e57a91_2    conda-forge
brotli                    1.0.7             he1b5a44_1002    conda-forge
bzip2                     1.0.8                h516909a_2    conda-forge
c-ares                    1.15.0            h516909a_1001    conda-forge
ca-certificates           2020.6.20            hecda079_0    conda-forge
certifi                   2020.6.20        py37hc8dfbb8_0    conda-forge
cffi                      1.14.0           py37he30daa8_1
chardet                   3.0.4                 py37_1003
click                     7.1.2                    pypi_0    pypi
cloudpickle               1.4.1                    pypi_0    pypi
cmake                     3.14.0               h52cb24c_0
conda                     4.8.3            py37hc8dfbb8_1    conda-forge
conda-package-handling    1.6.1            py37h7b6447c_0
cryptography              2.9.2            py37h1ba5d50_0
cudatoolkit               10.1.243             h6bb024c_0
cudf                      0.13.0                   py37_0    rapidsai
cudnn                     7.6.5                cuda10.1_0
cuml                      0.13.0                   pypi_0    pypi
cupy                      7.5.0            py37h0632833_0    conda-forge
cython                    0.29.20          py37h3340039_0    conda-forge
cytoolz                   0.10.1           py37h516909a_0    conda-forge
dask                      2.18.1                   pypi_0    pypi
dask-core                 2.18.1                     py_0    conda-forge
dask-cuda                 0.13.0                   py37_0    rapidsai
dask-cudf                 0.13.0                   py37_0    rapidsai
dask-glm                  0.2.0                      py_1    conda-forge
dask-ml                   1.4.0                      py_0
dask-mpi                  2.0.0+14.g9564954          pypi_0    pypi
distributed               2.18.0+26.g5172678d          pypi_0    pypi
dlpack                    0.2                  he1b5a44_1    conda-forge
double-conversion         3.1.5                he1b5a44_2    conda-forge
expat                     2.2.9                he1b5a44_2    conda-forge
fastavro                  0.23.4           py37h8f50634_0    conda-forge
fastrlock                 0.5              py37h3340039_0    conda-forge
freetype                  2.10.2               he06d7ca_0    conda-forge
fsspec                    0.6.3                      py_0    conda-forge
gflags                    2.2.2             he1b5a44_1002    conda-forge
glog                      0.4.0                h49b9bf7_3    conda-forge
grpc-cpp                  1.23.0               h18db393_0    conda-forge
heapdict                  1.0.1                    pypi_0    pypi
icu                       64.2                 he1b5a44_1    conda-forge
idna                      2.9                        py_1
jinja2                    2.11.2             pyh9f0ad1d_0    conda-forge
joblib                    0.14.1             pyh9f0ad1d_0    conda-forge
jpeg                      9d                   h516909a_0    conda-forge
krb5                      1.17.1               h173b8e3_0
ld_impl_linux-64          2.33.1               h53a641e_7
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
libcudf                   0.13.0               cuda10.1_0    rapidsai
libcumlprims              0.13.0               cuda10.1_0    nvidia
libcurl                   7.69.1               hf7181ac_0    conda-forge
libedit                   3.1.20181209         hc058e9b_0
libevent                  2.1.10               h72c5cf5_0    conda-forge
libffi                    3.3                  he6710b0_1
libgcc-ng                 9.1.0                hdf63c60_0
libgfortran-ng            7.5.0                hdf63c60_6    conda-forge
libhwloc                  2.1.0                h3c4fd83_0    conda-forge
libiconv                  1.15              h516909a_1006    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
libnvstrings              0.13.0               cuda10.1_0    rapidsai
libopenblas               0.3.7                h5ec1e0e_6    conda-forge
libpng                    1.6.37               hed695b0_1    conda-forge
libprotobuf               3.8.0                h8b12597_0    conda-forge
librmm                    0.13.0               cuda10.1_0    rapidsai
libssh2                   1.9.0                hab1572f_2    conda-forge
libstdcxx-ng              9.1.0                hdf63c60_0
libtiff                   4.1.0                hfc65ed5_0    conda-forge
libtool                   2.4.6             h14c3975_1002    conda-forge
libxml2                   2.9.10               hee79883_0    conda-forge
llvmlite                  0.31.0           py37hd408876_0
locket                    0.2.0                      py_2    conda-forge
lz4-c                     1.8.3             he1b5a44_1001    conda-forge
m4                        1.4.18            h14c3975_1001    conda-forge
make                      4.3                  h516909a_0    conda-forge
markupsafe                1.1.1            py37h8f50634_1    conda-forge
mpi4py                    3.1.0a0                  pypi_0    pypi
msgpack                   1.0.0                    pypi_0    pypi
msgpack-python            1.0.0            py37h99015e2_1    conda-forge
multipledispatch          0.6.0                      py_0    conda-forge
nccl                      2.6.4.1              h51cf6c1_0    conda-forge
ncurses                   6.2                  he6710b0_1
numba                     0.48.0           py37h0573a6f_0
numpy                     1.18.5           py37h8960a57_0    conda-forge
nvstrings                 0.13.0                   py37_0    rapidsai
olefile                   0.46                       py_0    conda-forge
openssl                   1.1.1g               h516909a_0    conda-forge
packaging                 20.4               pyh9f0ad1d_0    conda-forge
pandas                    0.25.3           py37hb3f55d8_0    conda-forge
parquet-cpp               1.5.1                         2    conda-forge
partd                     1.1.0                      py_0    conda-forge
perl                      5.26.2            h516909a_1006    conda-forge
pillow                    5.3.0           py37h00a061d_1000    conda-forge
pip                       20.0.2                   py37_3
pkg-config                0.29.2            h516909a_1006    conda-forge
psutil                    5.7.0            py37h8f50634_1    conda-forge
pyarrow                   0.15.0           py37h8b68381_1    conda-forge
pycosat                   0.6.3            py37h7b6447c_0
pycparser                 2.20                       py_0
pynvml                    8.0.4                      py_0    conda-forge
pyopenssl                 19.1.0                   py37_0
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pysocks                   1.7.1                    py37_0
python                    3.7.7                hcff3b4d_5
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
pytz                      2020.1             pyh9f0ad1d_0    conda-forge
pyyaml                    5.1.2            py37h516909a_0    conda-forge
re2                       2020.04.01           he1b5a44_0    conda-forge
readline                  8.0                  h7b6447c_0
requests                  2.23.0                   py37_0
rhash                     1.3.8                h1ba5d50_0
rmm                       0.13.0                   py37_0    rapidsai
ruamel_yaml               0.15.87          py37h7b6447c_0
scikit-learn              0.22.1           py37h22eb022_0
scipy                     1.4.1            py37ha3d9a3c_3    conda-forge
setuptools                47.3.1           py37hc8dfbb8_0    conda-forge
six                       1.14.0                   py37_0
snappy                    1.1.8                he1b5a44_2    conda-forge
sortedcontainers          2.2.2                    pypi_0    pypi
sqlite                    3.31.1               h62c20be_1
tbb                       2020.0               hfd86e86_0
tblib                     1.6.0                    pypi_0    pypi
thrift-cpp                0.12.0            hf3afdfd_1004    conda-forge
tk                        8.6.8                hbc83047_0
toolz                     0.10.0                   pypi_0    pypi
tornado                   6.0.4                    pypi_0    pypi
tqdm                      4.46.0                     py_0
typing_extensions         3.7.4.2                    py_0    conda-forge
ucx                       1.7.0dev+g430ae7e      cuda10.1_0    rapidsai
ucx-proc                  1.0.0                       gpu    conda-forge
ucx-py                    0.15.0a0+78.g3e64dbb          pypi_0    pypi
uriparser                 0.9.3                he1b5a44_1    conda-forge
urllib3                   1.25.8                   py37_0
wheel                     0.34.2                   py37_0
xz                        5.2.5                h7b6447c_0
yaml                      0.1.7                had09818_2
zict                      2.0.0                    pypi_0    pypi
zlib                      1.2.11               h7b6447c_3
zstd                      1.4.3                h3b9ef0a_0    conda-forge

@cjnolet
Member

cjnolet commented Jun 28, 2020

@Quentin-Anthony,

In general, we expect the cuML Dask algorithms to be executed on a cluster that is running a set of dask-cuda-worker processes, or on a LocalCUDACluster from the Dask-CUDA project.

I’m not sure it’s possible to launch Dask-CUDA workers with Dask-MPI. cuML’s distributed algorithms create an NCCL clique behind the scenes across the Dask workers, so you shouldn’t need Dask-MPI at all if you are just working with cuML estimators in Python.
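A scheduler-file workflow similar to the dask-mpi one above can be approximated with dask-cuda-worker processes; a rough sketch, assuming dask-cuda-worker accepts the same --scheduler-file option as the standard dask-worker CLI:

```shell
# On one node: start the scheduler, writing its address to a shared file.
dask-scheduler --scheduler-file /path/to/scheduler.json &

# On each GPU node: start dask-cuda-worker, which spawns one worker
# process per visible GPU on that node.
dask-cuda-worker --scheduler-file /path/to/scheduler.json &

# The client then connects exactly as in the original script:
#   c = Client(scheduler_file='/path/to/scheduler.json')
```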

@cjnolet cjnolet removed the ? - Needs Triage Need team to review and classify label Jun 28, 2020
@Quentin-Anthony
Author

Quentin-Anthony commented Jun 30, 2020

@cjnolet

We previously tried using the Dask-CUDA project via the instructions in the kmeans mnmg notebook, but came across the following error:

>>> kmeans_cuml.fit(X_cudf)
distributed.worker - WARNING -  Compute Failed
...
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/qanthony/cuml-work/miniconda3/lib/python3.7/site-packages/cuml/utils/memory_utils.py", line 55, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/home/qanthony/cuml-work/miniconda3/lib/python3.7/site-packages/cuml/dask/cluster/kmeans.py", line 143, in fit
    raise_exception_from_futures(kmeans_fit)
  File "/home/qanthony/cuml-work/miniconda3/lib/python3.7/site-packages/cuml/dask/common/utils.py", line 144, in raise_exception_from_futures
    len(errs), len(futures), ", ".join(map(str, errs))
RuntimeError: 4 of 4 worker jobs failed: Exception occured! file=/conda/conda-bld/libcumlprims_1585671138690/work/cpp/build/cuml/src/cuml/cpp/src/kmeans/common.cuh line=235: FAIL: call='cub::DeviceReduce::Reduce( nullptr, temp_storage_bytes, minClusterDistance.data(), clusterCost, minClusterDistance.numElements(), reduction_op, DataT(), stream)'. Reason:invalid device function
...
...

@cjnolet
Member

cjnolet commented Jul 2, 2020

@Quentin-Anthony @otavioon,

I just encountered the (NCCL) error from this issue's description in the nightly, but after downgrading the NCCL conda package to 2.5, I was no longer able to reproduce it.

On a fresh install of 0.13, I was able to reproduce your latest error, and the NCCL version didn't seem to matter. It's possible this is some combination of updated Dask/Distributed versions and binary compatibility of recent NCCL versions.

I was able to successfully execute your reproducible example in 0.14, though the make_blobs API changed slightly. Here's the conda install I used:

conda create --name cuml_014_070220
conda activate cuml_014_070220
conda install -c rapidsai -c nvidia -c conda-forge cuml=0.14 dask-cuda=0.14 cudf=0.14 dask-cudf=0.14 dask distributed cudatoolkit=10.2 scipy scikit-learn

And here's the updated distributed k-means example:

from cuml.dask.cluster.kmeans import KMeans as cuKMeans
from cuml.dask.datasets import make_blobs
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
c = Client(cluster)

n_samples = 1000000
n_features = 2
n_total_partitions = len(list(c.has_what().keys()))

X, Y = make_blobs(n_samples,
                  n_features,
                  centers=5,
                  n_parts=n_total_partitions,
                  cluster_std=0.1,
                  verbose=True)

kmeans_cuml = cuKMeans(init="k-means||",
                       n_clusters=5,
                       random_state=100)
kmeans_cuml.fit(X)

print(str(kmeans_cuml.predict(X).compute()))

Can you give this a try and let us know if it runs successfully for you?

@cjnolet
Member

cjnolet commented Jul 2, 2020

Also created #2504 to update the k-means notebook.

@otavioon

otavioon commented Jul 2, 2020

Hello @cjnolet

Unfortunately I no longer have access to the machines used.

@Quentin-Anthony @otavioon,

Can you provide the output of conda list?

I was using a fresh n1-highmem Google Cloud instance with 4x NVIDIA T4 GPUs, using CUDA 10.2.
RAPIDS was executed inside a Docker container, from the image rapidsai/rapidsai:cuda10.2-runtime-ubuntu18.04-py3.7, without any other new packages.

Can you give this a try and let us know if it runs successfully for you?

I will try to request access again.

Best regards,

@github-actions

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@Nanthini10
Contributor

I'm unable to reproduce this error with the updated code from Corey. This seems to have been resolved.
