[BUG] Increasing memory usage leading to OOM when running UMAP in loop #4068

Open
ietz opened this issue Jul 19, 2021 · 6 comments
Labels
inactive-30d, inactive-90d, Waiting for more information (The described problem requires more information for triage)

@ietz

ietz commented Jul 19, 2021

Describe the bug
When I fit multiple UMAP models one after another, GPU memory usage increases with most iterations, even though I do not keep any references to prior models or their results. Eventually this ends in an out-of-memory (OOM) error. Since no references remain, I would expect the old models and their data to be garbage collected, preventing the OOM.

Steps/Code to reproduce bug
Here is a link to my Jupyter notebook on Google Colab: https://colab.research.google.com/drive/1mZew58DdWdI2cBuSRW5F3uUD7lXjHMUk

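The issue occurs in this code segment (`data` and `knn_graph` are defined earlier in the notebook):

```python
import itertools

import cuml

# Fit a fresh model on every iteration without keeping any reference to it.
for i in itertools.count():
    cuml.UMAP(n_neighbors=15).fit(data, knn_graph=knn_graph)
```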

Looking at GPU memory usage over time, I can see that the model is not always garbage collected between iterations. Memory accumulates over several iterations and is then partially released every so often, but never fully. Eventually this always seems to end in an out-of-memory error. With the input data shape I chose for the Colab demo this took much longer than I expected (approx. 20 minutes, 641 iterations), but I think a plot of memory usage over time shows the issue quite clearly:

[Figure: Memory Usage (GPU memory usage over time)]

With larger datasets, such as the ones I was using when I originally encountered this issue, the OOM happens after far fewer iterations (maybe 10). In the image you can see small and large "teeth". I think that when I originally hit this issue, the OOM occurred on one of the small teeth, even before the first large drop in memory usage.
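One plausible reading of the sawtooth, offered as an assumption rather than a diagnosis: if each fitted model ends up in a reference cycle, CPython's reference counting cannot free it, and it lingers until the cyclic garbage collector runs. That collector is triggered by counts of Python object allocations (see `gc.get_threshold()`), not by bytes of GPU memory, so a small Python object holding a large device buffer can survive several iterations. The sketch below simulates this in plain Python with hypothetical `FakeDeviceBuffer` and `FakeModel` classes that only update a byte counter; no cuML or GPU is involved, and nothing here confirms which cuML objects (if any) form cycles.

```python
import gc

# Hypothetical stand-in for GPU memory: a global byte counter.
allocated_bytes = 0


class FakeDeviceBuffer:
    """Pretends to own `nbytes` of device memory until it is finalized."""

    def __init__(self, nbytes):
        global allocated_bytes
        self.nbytes = nbytes
        allocated_bytes += nbytes

    def __del__(self):
        global allocated_bytes
        allocated_bytes -= self.nbytes


class FakeModel:
    """Hypothetical model whose fit() leaves it in a reference cycle."""

    def fit(self, nbytes):
        self.embedding = FakeDeviceBuffer(nbytes)
        self._self_ref = self  # cycle: reference counting alone never frees this
        return self


def run_loop(iterations, nbytes):
    """Fit and discard models, recording simulated memory after each one."""
    history = []
    for _ in range(iterations):
        FakeModel().fit(nbytes)  # no reference kept, like the loop above
        history.append(allocated_bytes)
    return history


gc.disable()  # suppress automatic collection so the demo is deterministic
try:
    print(gc.get_threshold())  # allocation-count thresholds, not byte sizes
    print(run_loop(5, 100))    # → [100, 200, 300, 400, 500]
    gc.collect()
    print(allocated_bytes)     # → 0
finally:
    gc.enable()
```

With automatic collection enabled, the counter would instead drop whenever a collection happens to run, which is the kind of irregular partial release seen in the plot.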

Expected behavior
I would expect memory usage to stay bounded rather than grow until an OOM error occurs.

Environment details

  • Environment location: Google Colab
  • Linux Distro/Architecture: Ubuntu 18.04.5 x86_64
  • GPU Model/Driver: V100 and driver 460.32.03
  • CUDA: nvidia-smi reports CUDA 11.2, but the conda installation includes cudatoolkit 11.0.221 (see cell outputs in Colab). Not sure which of the two is relevant.
  • Method of cuDF & cuML install: conda, using the scripts from rapidsai/rapidsai-csp-utils
conda list
# packages in environment at /usr/local:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
abseil-cpp                20210324.1           h9c3ff4c_0    conda-forge
aiohttp                   3.7.4.post0      py37h5e8e339_0    conda-forge
anyio                     3.2.1            py37h89c1867_0    conda-forge
appdirs                   1.4.4              pyh9f0ad1d_0    conda-forge
argon2-cffi               20.1.0           py37h5e8e339_2    conda-forge
arrow-cpp                 1.0.1           py37haa335b2_40_cuda    conda-forge
arrow-cpp-proc            3.0.0                      cuda    conda-forge
async-timeout             3.0.1                   py_1000    conda-forge
async_generator           1.10                       py_0    conda-forge
attrs                     21.2.0             pyhd8ed1ab_0    conda-forge
aws-c-cal                 0.5.11               h95a6274_0    conda-forge
aws-c-common              0.6.2                h7f98852_0    conda-forge
aws-c-event-stream        0.2.7               h3541f99_13    conda-forge
aws-c-io                  0.10.5               hfb6a706_0    conda-forge
aws-checksums             0.1.11               ha31a3da_7    conda-forge
aws-sdk-cpp               1.8.186              hb4091e7_3    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
blazingsql                21.6.0                   pypi_0    pypi
bleach                    3.3.1              pyhd8ed1ab_0    conda-forge
blinker                   1.4                        py_1    conda-forge
bokeh                     2.2.3            py37h89c1867_0    conda-forge
boost                     1.72.0           py37h48f8a5e_1    conda-forge
boost-cpp                 1.72.0               h9d3c048_4    conda-forge
brotli                    1.0.9                h7f98852_5    conda-forge
brotli-bin                1.0.9                h7f98852_5    conda-forge
brotlipy                  0.7.0           py37h5e8e339_1001    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.17.1               h7f98852_1    conda-forge
ca-certificates           2021.5.30            ha878542_0    conda-forge
cachetools                4.2.2              pyhd8ed1ab_0    conda-forge
cairo                     1.16.0            h6cf1ce9_1008    conda-forge
certifi                   2021.5.30        py37h89c1867_0    conda-forge
cffi                      1.14.5           py37hc58025e_0    conda-forge
cfitsio                   3.470                hb418390_7    conda-forge
chardet                   4.0.0            py37h89c1867_1    conda-forge
click                     7.1.2              pyh9f0ad1d_0    conda-forge
click-plugins             1.1.1                      py_0    conda-forge
cligj                     0.7.2              pyhd8ed1ab_0    conda-forge
cloudpickle               1.6.0                      py_0    conda-forge
colorcet                  2.0.6              pyhd8ed1ab_0    conda-forge
conda                     4.10.3           py37h89c1867_0    conda-forge
conda-package-handling    1.7.2            py37hb5d75c8_0    conda-forge
cryptography              3.4.5            py37h5d9358c_1    conda-forge
cudatoolkit               11.0.221             h6bb024c_0    nvidia
cudf                      21.06.01        cuda_11.0_py37_g101fc0fda4_2    rapidsai
cudf_kafka                21.06.01        py37_g101fc0fda4_2    rapidsai
cugraph                   21.06.00        py37_gf9ffd2de_0    rapidsai
cuml                      21.06.02        cuda11.0_py37_g7dfbf8d9e_0    rapidsai
cupy                      9.0.0            py37h4fdb0f7_0    conda-forge
curl                      7.77.0               hea6ffbf_0    conda-forge
cusignal                  21.06.00        py38_ga78207b_0    rapidsai
cuspatial                 21.06.00        py37_g37798cd_0    rapidsai
custreamz                 21.06.01        py37_g101fc0fda4_2    rapidsai
cuxfilter                 21.06.00        py37_g9459467_0    rapidsai
cycler                    0.10.0                     py_2    conda-forge
cyrus-sasl                2.1.27               h230043b_2    conda-forge
cytoolz                   0.11.0           py37h5e8e339_3    conda-forge
dask                      2021.5.0           pyhd8ed1ab_0    conda-forge
dask-core                 2021.5.0           pyhd8ed1ab_0    conda-forge
dask-cuda                 21.06.00                 py37_0    rapidsai
dask-cudf                 21.06.01        py37_g101fc0fda4_2    rapidsai
datashader                0.11.1             pyh9f0ad1d_0    conda-forge
datashape                 0.5.4                      py_1    conda-forge
decorator                 4.4.2                      py_0    conda-forge
defusedxml                0.7.1              pyhd8ed1ab_0    conda-forge
distributed               2021.5.0         py37h89c1867_0    conda-forge
dlpack                    0.5                  h9c3ff4c_0    conda-forge
entrypoints               0.3             pyhd8ed1ab_1003    conda-forge
expat                     2.4.1                h9c3ff4c_0    conda-forge
faiss-proc                1.0.0                      cuda    rapidsai
fastavro                  1.4.3            py37h5e8e339_0    conda-forge
fastrlock                 0.6              py37hcd2ae1e_1    conda-forge
fiona                     1.8.20           py37ha0cc35a_0    conda-forge
fontconfig                2.13.1            hba837de_1005    conda-forge
freetype                  2.10.4               h0708190_1    conda-forge
freexl                    1.0.6                h7f98852_0    conda-forge
fsspec                    2021.7.0           pyhd8ed1ab_0    conda-forge
future                    0.18.2           py37h89c1867_3    conda-forge
gcsfs                     2021.7.0           pyhd8ed1ab_0    conda-forge
gdal                      3.2.2            py37hb0e9ad2_0    conda-forge
geopandas                 0.9.0              pyhd8ed1ab_1    conda-forge
geopandas-base            0.9.0              pyhd8ed1ab_1    conda-forge
geos                      3.9.1                h9c3ff4c_2    conda-forge
geotiff                   1.6.0                hcf90da6_5    conda-forge
gettext                   0.19.8.1          h0b5b191_1005    conda-forge
gflags                    2.2.2             he1b5a44_1004    conda-forge
giflib                    5.2.1                h36c2ea0_2    conda-forge
glog                      0.5.0                h48cff8f_0    conda-forge
google-auth               1.33.0             pyh6c4a22f_0    conda-forge
google-auth-oauthlib      0.4.4              pyhd8ed1ab_0    conda-forge
google-cloud-cpp          1.28.0               hbd34f9f_0    conda-forge
greenlet                  1.1.0            py37hcd2ae1e_0    conda-forge
grpc-cpp                  1.38.0               h2519f57_0    conda-forge
hdf4                      4.2.15               h10796ff_3    conda-forge
hdf5                      1.10.6          nompi_h6a2412b_1114    conda-forge
heapdict                  1.0.1                      py_0    conda-forge
icu                       68.1                 h58526e2_0    conda-forge
idna                      2.10               pyh9f0ad1d_0    conda-forge
importlib-metadata        4.6.1            py37h89c1867_0    conda-forge
ipykernel                 5.5.5            py37h085eea5_0    conda-forge
ipython                   7.25.0           py37h085eea5_1    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
ipywidgets                7.6.3              pyhd3deb0d_0    conda-forge
jedi                      0.18.0           py37h89c1867_2    conda-forge
jinja2                    3.0.1              pyhd8ed1ab_0    conda-forge
joblib                    1.0.1              pyhd8ed1ab_0    conda-forge
jpeg                      9d                   h36c2ea0_0    conda-forge
jpype1                    1.3.0            py37h2527ec5_0    conda-forge
json-c                    0.15                 h98cffda_0    conda-forge
jsonschema                3.2.0              pyhd8ed1ab_3    conda-forge
jupyter-server-proxy      3.1.0              pyhd8ed1ab_0    conda-forge
jupyter_client            6.1.12             pyhd8ed1ab_0    conda-forge
jupyter_core              4.7.1            py37h89c1867_0    conda-forge
jupyter_server            1.9.0              pyhd8ed1ab_0    conda-forge
jupyterlab_pygments       0.1.2              pyh9f0ad1d_0    conda-forge
jupyterlab_widgets        1.0.0              pyhd8ed1ab_1    conda-forge
kealib                    1.4.14               hcc255d8_2    conda-forge
kiwisolver                1.3.1            py37h2527ec5_1    conda-forge
krb5                      1.19.1               hcc1bbae_0    conda-forge
lcms2                     2.12                 hddcbb42_0    conda-forge
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
libarchive                3.5.1                h3f442fb_1    conda-forge
libblas                   3.9.0                9_openblas    conda-forge
libbrotlicommon           1.0.9                h7f98852_5    conda-forge
libbrotlidec              1.0.9                h7f98852_5    conda-forge
libbrotlienc              1.0.9                h7f98852_5    conda-forge
libcblas                  3.9.0                9_openblas    conda-forge
libcrc32c                 1.1.1                h9c3ff4c_2    conda-forge
libcudf                   21.06.01        cuda11.0_g101fc0fda4_2    rapidsai
libcudf_kafka             21.06.01          g101fc0fda4_2    rapidsai
libcugraph                21.06.00        cuda11.0_gf9ffd2de_0    rapidsai
libcuml                   21.06.02        cuda11.0_g7dfbf8d9e_0    rapidsai
libcumlprims              21.06.00        cuda11.0_gfda2e6c_0    nvidia
libcurl                   7.77.0               h2574ce0_0    conda-forge
libcuspatial              21.06.00        cuda11.0_g37798cd_0    rapidsai
libdap4                   3.20.6               hd7c4107_2    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libevent                  2.1.10               hcdb4288_3    conda-forge
libfaiss                  1.7.0           cuda110h8045045_8_cuda    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 9.3.0               h2828fa1_18    conda-forge
libgcrypt                 1.9.3                h7f98852_1    conda-forge
libgdal                   3.2.2                h804b7da_0    conda-forge
libgfortran-ng            9.3.0               hff62375_19    conda-forge
libgfortran5              9.3.0               hff62375_19    conda-forge
libglib                   2.68.3               h3e27bee_0    conda-forge
libgomp                   9.3.0               h2828fa1_18    conda-forge
libgpg-error              1.42                 h9c3ff4c_0    conda-forge
libgsasl                  1.8.0                         2    conda-forge
libhwloc                  2.3.0                h5e5b7d1_1    conda-forge
libiconv                  1.16                 h516909a_0    conda-forge
libkml                    1.3.0             hd79254b_1012    conda-forge
liblapack                 3.9.0                9_openblas    conda-forge
libllvm10                 10.0.1               he513fc3_3    conda-forge
libnetcdf                 4.7.4           nompi_h56d31a8_107    conda-forge
libnghttp2                1.43.0               h812cca2_0    conda-forge
libntlm                   1.4               h7f98852_1002    conda-forge
libopenblas               0.3.15          pthreads_h8fe5266_1    conda-forge
libpng                    1.6.37               h21135ba_2    conda-forge
libpq                     13.3                 hd57d9b9_0    conda-forge
libprotobuf               3.16.0               h780b84a_0    conda-forge
librdkafka                1.5.3                hc49e61c_1    conda-forge
librmm                    21.06.00        cuda11.0_gee432a0_0    rapidsai
librttopo                 1.1.0                h1185371_6    conda-forge
libsodium                 1.0.18               h36c2ea0_1    conda-forge
libsolv                   0.7.17               h780b84a_0    conda-forge
libspatialindex           1.9.3                h9c3ff4c_3    conda-forge
libspatialite             5.0.1                h20cb978_4    conda-forge
libssh2                   1.9.0                ha56f1ee_6    conda-forge
libstdcxx-ng              9.3.0               h6de172a_18    conda-forge
libthrift                 0.14.1               he6d91bd_2    conda-forge
libtiff                   4.2.0                hbd63e13_2    conda-forge
libutf8proc               2.6.1                h7f98852_0    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libuv                     1.41.1               h7f98852_0    conda-forge
libwebp                   1.2.0                h3452ae3_0    conda-forge
libwebp-base              1.2.0                h7f98852_2    conda-forge
libxcb                    1.13              h7f98852_1003    conda-forge
libxgboost                1.4.2dev.rapidsai21.06      cuda11.0_0    rapidsai
libxml2                   2.9.12               h72842e0_0    conda-forge
llvmlite                  0.36.0           py37h9d7f4d0_0    conda-forge
locket                    0.2.0                      py_2    conda-forge
lz4-c                     1.9.3                h9c3ff4c_0    conda-forge
lzo                       2.10              h516909a_1000    conda-forge
mamba                     0.8.0            py37h7f483ca_0    conda-forge
mapclassify               2.4.2              pyhd8ed1ab_0    conda-forge
markdown                  3.3.4              pyhd8ed1ab_0    conda-forge
markupsafe                2.0.1            py37h5e8e339_0    conda-forge
matplotlib-base           3.4.2            py37hdd32ed1_0    conda-forge
matplotlib-inline         0.1.2              pyhd8ed1ab_2    conda-forge
mistune                   0.8.4           py37h5e8e339_1004    conda-forge
msgpack-python            1.0.2            py37h2527ec5_1    conda-forge
multidict                 5.1.0            py37h5e8e339_1    conda-forge
multipledispatch          0.6.0                      py_0    conda-forge
munch                     2.5.0                      py_0    conda-forge
nbclient                  0.5.3              pyhd8ed1ab_0    conda-forge
nbconvert                 6.1.0            py37h89c1867_0    conda-forge
nbformat                  5.1.3              pyhd8ed1ab_0    conda-forge
nccl                      2.10.3.1             h96e36e3_0    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
nest-asyncio              1.5.1              pyhd8ed1ab_0    conda-forge
netifaces                 0.10.9          py37h5e8e339_1003    conda-forge
networkx                  2.6.1              pyhd8ed1ab_1    conda-forge
nlohmann_json             3.9.1                h9c3ff4c_1    conda-forge
nodejs                    14.17.1              h92b4a50_1    conda-forge
notebook                  6.4.0              pyha770c72_0    conda-forge
numba                     0.53.1           py37hb11d6e1_1    conda-forge
numpy                     1.21.1           py37h038b26d_0    conda-forge
nvtx                      0.2.3            py37h5e8e339_0    conda-forge
oauthlib                  3.1.1              pyhd8ed1ab_0    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openjdk                   8.0.282              h7f98852_0    conda-forge
openjpeg                  2.4.0                hb52868f_1    conda-forge
openssl                   1.1.1k               h7f98852_0    conda-forge
orc                       1.6.7                h89a63ab_2    conda-forge
packaging                 21.0               pyhd8ed1ab_0    conda-forge
pandas                    1.2.5            py37h219a48f_0    conda-forge
pandoc                    2.14.0.3             h7f98852_0    conda-forge
pandocfilters             1.4.2                      py_1    conda-forge
panel                     0.10.3             pyhd8ed1ab_0    conda-forge
param                     1.11.1             pyh6c4a22f_0    conda-forge
parquet-cpp               1.5.1                         2    conda-forge
parso                     0.8.2              pyhd8ed1ab_0    conda-forge
partd                     1.2.0              pyhd8ed1ab_0    conda-forge
pcre                      8.45                 h9c3ff4c_0    conda-forge
pexpect                   4.8.0              pyh9f0ad1d_2    conda-forge
pickle5                   0.0.11           py37h5e8e339_0    conda-forge
pickleshare               0.7.5                   py_1003    conda-forge
pillow                    8.2.0            py37h4600e1f_1    conda-forge
pip                       21.0.1             pyhd8ed1ab_0    conda-forge
pixman                    0.40.0               h36c2ea0_0    conda-forge
poppler                   21.03.0              h93df280_0    conda-forge
poppler-data              0.4.10                        0    conda-forge
postgresql                13.3                 h2510834_0    conda-forge
proj                      8.0.0                h277dcde_0    conda-forge
prometheus_client         0.11.0             pyhd8ed1ab_0    conda-forge
prompt-toolkit            3.0.19             pyha770c72_0    conda-forge
protobuf                  3.16.0           py37hcd2ae1e_0    conda-forge
psutil                    5.8.0            py37h5e8e339_1    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
py-xgboost                1.4.2dev.rapidsai21.06  cuda11.0py37_0    rapidsai
pyarrow                   1.0.1           py37hb63ea2f_40_cuda    conda-forge
pyasn1                    0.4.8                      py_0    conda-forge
pyasn1-modules            0.2.7                      py_0    conda-forge
pycosat                   0.6.3           py37h5e8e339_1006    conda-forge
pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pyct                      0.4.6                      py_0    conda-forge
pyct-core                 0.4.6                      py_0    conda-forge
pydeck                    0.5.0              pyh9f0ad1d_0    conda-forge
pyee                      7.0.4              pyh9f0ad1d_0    conda-forge
pygments                  2.9.0              pyhd8ed1ab_0    conda-forge
pyhive                    0.6.4              pyhd8ed1ab_0    conda-forge
pyjwt                     2.1.0              pyhd8ed1ab_0    conda-forge
pynvml                    11.0.0             pyhd8ed1ab_0    conda-forge
pyopenssl                 20.0.1             pyhd8ed1ab_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pyppeteer                 0.2.2                      py_1    conda-forge
pyproj                    3.0.1            py37h2bb2a07_1    conda-forge
pyrsistent                0.17.3           py37h5e8e339_2    conda-forge
pysocks                   1.7.1            py37h89c1867_3    conda-forge
python                    3.7.10          hffdb5ce_100_cpython    conda-forge
python-confluent-kafka    1.5.0            py37h8f50634_0    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.7                     2_cp37m    conda-forge
pytz                      2021.1             pyhd8ed1ab_0    conda-forge
pyu2f                     0.1.5              pyhd8ed1ab_0    conda-forge
pyviz_comms               2.1.0              pyhd8ed1ab_0    conda-forge
pyyaml                    5.4.1            py37h5e8e339_0    conda-forge
pyzmq                     22.1.0           py37h336d617_0    conda-forge
rapids                    21.06.00        cuda11.0_py37_ge3c8282_427    rapidsai
rapids-blazing            21.06.00        cuda11.0_py37_ge3c8282_427    rapidsai
rapids-xgboost            21.06.00        cuda11.0_py37_ge3c8282_427    rapidsai
re2                       2021.04.01           h9c3ff4c_0    conda-forge
readline                  8.1                  h46c0cb4_0    conda-forge
reproc                    14.2.1               h36c2ea0_0    conda-forge
reproc-cpp                14.2.1               h58526e2_0    conda-forge
requests                  2.25.1             pyhd3deb0d_0    conda-forge
requests-oauthlib         1.3.0              pyh9f0ad1d_0    conda-forge
requests-unixsocket       0.2.0                      py_0    conda-forge
rmm                       21.06.00        cuda_11.0_py37_gee432a0_0    rapidsai
rsa                       4.7.2              pyh44b312d_0    conda-forge
rtree                     0.9.7            py37h0b55af0_1    conda-forge
ruamel_yaml               0.15.80         py37h5e8e339_1004    conda-forge
s2n                       1.0.10               h9b69904_0    conda-forge
sasl                      0.3.1            py37hcd2ae1e_0    conda-forge
scikit-learn              0.24.2           py37h18a542f_0    conda-forge
scipy                     1.7.0            py37h29e03ee_1    conda-forge
send2trash                1.7.1              pyhd8ed1ab_0    conda-forge
setuptools                49.6.0           py37h89c1867_3    conda-forge
shapely                   1.7.1            py37h2d1e849_5    conda-forge
simpervisor               0.4                pyhd8ed1ab_0    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
snappy                    1.1.8                he1b5a44_3    conda-forge
sniffio                   1.2.0            py37h89c1867_1    conda-forge
sortedcontainers          2.4.0              pyhd8ed1ab_0    conda-forge
spdlog                    1.8.5                h4bd325d_0    conda-forge
sqlalchemy                1.4.21           py37h5e8e339_0    conda-forge
sqlite                    3.34.0               h74cdb3f_0    conda-forge
streamz                   0.6.2              pyh44b312d_0    conda-forge
tblib                     1.7.0              pyhd8ed1ab_0    conda-forge
terminado                 0.10.1           py37h89c1867_0    conda-forge
testpath                  0.5.0              pyhd8ed1ab_0    conda-forge
threadpoolctl             2.2.0              pyh8a188c0_0    conda-forge
thrift                    0.13.0           py37hcd2ae1e_2    conda-forge
thrift_sasl               0.4.2            py37h8f50634_0    conda-forge
tiledb                    2.2.9                h91fcb0e_0    conda-forge
tk                        8.6.10               h21135ba_1    conda-forge
toolz                     0.11.1                     py_0    conda-forge
tornado                   6.1              py37h5e8e339_1    conda-forge
tqdm                      4.59.0             pyhd8ed1ab_0    conda-forge
traitlets                 5.0.5                      py_0    conda-forge
treelite                  1.3.0            py37hfdac9b6_0    conda-forge
treelite-runtime          1.3.0                    pypi_0    pypi
typing-extensions         3.10.0.0             hd8ed1ab_0    conda-forge
typing_extensions         3.10.0.0           pyha770c72_0    conda-forge
tzcode                    2021a                h7f98852_2    conda-forge
tzdata                    2021a                he74cb21_1    conda-forge
ucx                       1.9.0+gcd9efd3       cuda11.0_0    rapidsai
ucx-proc                  1.0.0                       gpu    rapidsai
ucx-py                    0.20.0          py37_gcd9efd3_0    rapidsai
urllib3                   1.26.3             pyhd8ed1ab_0    conda-forge
wcwidth                   0.2.5              pyh9f0ad1d_2    conda-forge
webencodings              0.5.1                      py_1    conda-forge
websocket-client          0.57.0           py37h89c1867_4    conda-forge
websockets                8.1              py37h5e8e339_3    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
widgetsnbextension        3.5.1            py37h89c1867_4    conda-forge
xarray                    0.18.2             pyhd8ed1ab_0    conda-forge
xerces-c                  3.2.3                h9d8b166_2    conda-forge
xgboost                   1.4.2dev.rapidsai21.06  cuda11.0py37_0    rapidsai
xorg-kbproto              1.0.7             h7f98852_1002    conda-forge
xorg-libice               1.0.10               h7f98852_0    conda-forge
xorg-libsm                1.2.3             hd9c2040_1000    conda-forge
xorg-libx11               1.7.2                h7f98852_0    conda-forge
xorg-libxau               1.0.9                h7f98852_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xorg-libxext              1.3.4                h7f98852_1    conda-forge
xorg-libxrender           0.9.10            h7f98852_1003    conda-forge
xorg-renderproto          0.11.1            h7f98852_1002    conda-forge
xorg-xextproto            7.3.0             h7f98852_1002    conda-forge
xorg-xproto               7.0.31            h7f98852_1007    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
yaml                      0.2.5                h516909a_0    conda-forge
yarl                      1.6.3            py37h5e8e339_2    conda-forge
zeromq                    4.3.4                h9c3ff4c_0    conda-forge
zict                      2.0.0                      py_0    conda-forge
zipp                      3.5.0              pyhd8ed1ab_0    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.4.9                ha95c52a_0    conda-forge
ietz added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels on Jul 19, 2021
@cjnolet
Member

cjnolet commented Jul 21, 2021

Hi @ietz, thank you for filing an issue for this. I ran your script on my V100 (RAPIDS 21.08 nightly packages) and was able to reproduce the trending sawtooth pattern you pointed out. I ran the loop for about 25 minutes while running watch -n 0.1 nvidia-smi in a separate window and noticed that memory peaked around 12-14 GB but didn't go any higher.

Adding a gc.collect() after each loop iteration seemed to make memory consistently peak around 4 GB and return to the same value (±0.1 GB) after the loop. If you are able, can you try adding gc.collect() after each iteration and let us know whether it fixes the problem?
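A minimal sketch of the suggested workaround, written for a parameter sweep like the original poster's: call `gc.collect()` once per iteration so anything the previous fit left in reference cycles is reclaimed before the next fit allocates. The names `sweep`, `fit_fn`, `summarize` and `CyclicModel` are hypothetical; in the notebook, `fit_fn` would wrap something like `cuml.UMAP(n_neighbors=n).fit(data, knn_graph=knn_graph)`. The stand-in model below is plain Python so the sketch runs without a GPU.

```python
import gc
import weakref


def sweep(fit_fn, param_grid, summarize, collect=True):
    """Fit one model per parameter set, keeping only a small summary of each."""
    summaries = []
    for params in param_grid:
        model = fit_fn(**params)
        summaries.append(summarize(model))  # keep small host-side results only
        del model  # drop the loop's own reference to the fitted model
        if collect:
            gc.collect()  # reclaim anything the model left in reference cycles
    return summaries


# Hypothetical stand-in for a fitted estimator that sits in a reference cycle.
class CyclicModel:
    def __init__(self, n_neighbors):
        self.n_neighbors = n_neighbors
        self.self_ref = self


probes = []  # weak references let us check whether each model was freed


def fit_fn(n_neighbors):
    model = CyclicModel(n_neighbors)
    probes.append(weakref.ref(model))
    return model


grid = [{"n_neighbors": k} for k in (5, 15, 50)]
print(sweep(fit_fn, grid, summarize=lambda m: m.n_neighbors))  # → [5, 15, 50]
print(all(p() is None for p in probes))  # → True
```

Keeping only a small summary per run matters as much as the collection itself: any result that still references the model (or its device arrays) would keep that memory alive regardless of `gc.collect()`.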

cjnolet added the "Waiting on author verification" label (Waiting for author to verify potential code changes or updates to the problem) and removed the "? - Needs Triage" label on Jul 21, 2021
@ietz
Author

ietz commented Jul 22, 2021

Hey @cjnolet, thank you for your response!

The gc.collect() call after every iteration does indeed resolve my issue, and my parameter sweep has now finished without further complications. I was not aware of this function, and had read in another issue here that using del to remove the reference should be enough. Thank you!
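The gap between `del` and `gc.collect()` can be shown in plain Python: `del` only removes a name and decrements a reference count, so an object that is part of a reference cycle stays alive until the cyclic collector runs. The sketch uses a hypothetical `Node` class and a `weakref` probe to observe when the object is actually freed; it illustrates the general mechanism and does not claim which cuML objects form such cycles.

```python
import gc
import weakref


class Node:
    """Hypothetical object that refers to itself, forming a reference cycle."""

    def __init__(self):
        self.self_ref = self


def survives_del():
    """Return (alive_after_del, alive_after_collect) for a cyclic object."""
    gc.disable()  # keep automatic collection from running mid-demo
    try:
        obj = Node()
        probe = weakref.ref(obj)  # a weak reference does not keep obj alive
        del obj                   # removes the name; the cycle keeps it alive
        alive_after_del = probe() is not None
        gc.collect()              # the cycle detector finds and frees the Node
        alive_after_collect = probe() is not None
        return alive_after_del, alive_after_collect
    finally:
        gc.enable()


print(survives_del())  # → (True, False)
```

Without the `self_ref` line, CPython would free the object as soon as `del` drops its last reference, which is why `del` alone is often enough.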

If you still want to reproduce the OOM without gc.collect(), you could increase the size of the data array. With a shape of (1_000_000, 500) I got an OOM after just 10 iterations, in less than 1 minute of execution time. With that shape, memory usage after each iteration was 4.4, 6.3, 8.2, 10.1, 12.0, 6.3, 8.2, 10.1, 12.0, and 13.8 GB, followed by the OOM.
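The per-iteration step in these numbers can be checked with a little arithmetic. Assuming (the thread does not say) that the (1_000_000, 500) array is stored as 4-byte float32, one copy of it is about 2.0 GB, or 1.86 GiB, close to the ~1.9 GB step between readings. That is consistent with, but does not prove, roughly one input-sized allocation being retained per fit.

```python
# Memory readings after each iteration, in GB, as reported in the comment above.
readings = [4.4, 6.3, 8.2, 10.1, 12.0, 6.3, 8.2, 10.1, 12.0, 13.8]

# Change between consecutive readings; the single negative step is the partial release.
deltas = [round(b - a, 1) for a, b in zip(readings, readings[1:])]
print(deltas)  # → [1.9, 1.9, 1.9, 1.9, -5.7, 1.9, 1.9, 1.9, 1.8]

# Size of one (1_000_000, 500) array, ASSUMING 4-byte float32 elements.
rows, cols, bytes_per_elem = 1_000_000, 500, 4
nbytes = rows * cols * bytes_per_elem
print(round(nbytes / 1e9, 2), round(nbytes / 2**30, 2))  # → 2.0 1.86  (GB, GiB)
```

The one large drop, from 12.0 to 6.3 GB, is 5.7 GB, exactly three of the ~1.9 GB steps, which would fit three retained allocations being released in a single collection; again, this is a hypothesis rather than something the thread confirms.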

In terms of results, it sadly seems that even with my parameter sweep I could not get outputs from cuML UMAP comparable to those of the umap-learn library, as ~¼ of the points are mapped to strange outlier positions far away from the main structure. I guess I'll just watch #3467 and try again once that is resolved.

@cjnolet
Member

cjnolet commented Jul 23, 2021

As a result of your experience with this problem in RAPIDS, do you think it might be helpful if we added some documentation about the use of gc.collect()? If so, we can convert this issue over to a feature request.

cjnolet added the "Waiting for more information" label (The described problem requires more information for triage) and removed the "Waiting on author verification" and "bug" labels on Jul 23, 2021
@ietz
Author

ietz commented Jul 24, 2021

Sure, I think a note about that could well help someone, as long as it is easy to find. My problem was that I thought I had to look for a RAPIDS-specific solution because the problem involved GPU memory. Since it's just standard Python, a short note like "gc.collect also works with RAPIDS" would probably have been enough in my case.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
