
Accelerate DirectLiNGAM by parallelising causal ordering on GPUs with CUDA #128

Merged (2 commits) · Mar 10, 2024

Conversation

@aknvictor (Contributor) commented Feb 22, 2024

This PR includes an implementation that drastically speeds up (up to 32x on a consumer GPU) DirectLiNGAM and its variants, e.g. VarLiNGAM.

The change adds an optional dependency, https://github.com/Viktour19/culingam, which implements custom CUDA kernels for the pairwise likelihood-ratio causal ordering method.

The implementation has been tested locally on an NVIDIA RTX 6000 on a Linux machine, but tests on other setups are needed.
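
A minimal usage sketch, assuming culingam is installed; the GPU path is selected via the new measure='pwling_fast' option (as used later in this thread):

import numpy as np
import lingam

# Placeholder data; any (n_samples, n_features) array works.
X = np.random.uniform(size=(100000, 3))

# 'pwling_fast' dispatches the pairwise likelihood-ratio ordering to
# culingam's CUDA kernels; the default 'pwling' runs on the CPU.
model = lingam.DirectLiNGAM(measure='pwling_fast')
model.fit(X)
print(model.causal_order_)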

@firmai commented Mar 1, 2024

Very interesting adaptation; looking forward to it.

@ikeuchi-screen (Collaborator)

Hi @Viktour19 , thanks for your contribution!

First of all, I could not install culingam in my Windows environment with pip install culingam (although I could install it in my Linux environment).

  • Windows10 Pro
  • Python 3.9.12
  • CUDA Toolkit 12.2

I attempted a manual installation using the procedure shown here, but it failed.

With several modifications, I was able to install it.
Please see below what I did, and use it to improve culingam (a build-script sketch combining the fixes follows the list).

Environment variable CUDA_HOME

Set the environment variable CUDA_HOME to the CUDA install path on Windows:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2

Inclusion of nvToolsExt.h

Changed the header include to the form shown on the official site:
#include <nvtx3/nvToolsExt.h>

Compile error for M_PI

Added -D_USE_MATH_DEFINES to extra_compile_args of CUDAExtension.
https://stackoverflow.com/questions/56319494/nvcc-compilation-errors-with-m-pi-and-or

nvToolsExt.lib is missing

Added the following path to library_dirs in CUDAExtension:
C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64

Also, link against nvToolsExt64_1.lib instead of nvToolsExt.lib:
pytorch/pytorch#101135
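
Putting it together: a hypothetical setup.py sketch combining the fixes above. The extension name and source path are illustrative, not culingam's actual build script.

# Hypothetical build script incorporating the Windows fixes above.
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

setup(
    name="culingam",
    ext_modules=[
        CUDAExtension(
            name="lingam_cuda",
            sources=["culingam/lingam_cuda.cu"],  # illustrative source path
            # -D_USE_MATH_DEFINES fixes the M_PI compile error under MSVC.
            extra_compile_args={
                "cxx": ["-D_USE_MATH_DEFINES"],
                "nvcc": ["-D_USE_MATH_DEFINES"],
            },
            # Directory containing nvToolsExt64_1.lib on Windows.
            library_dirs=[r"C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64"],
            libraries=["nvToolsExt64_1"],  # instead of nvToolsExt
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)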

@ikeuchi-screen (Collaborator)

Hi @Viktour19,
Here is a comparison of the existing DirectLiNGAM and the GPU version on the same data.
The GPU version estimates the wrong causal order.
Is this a problem with my environment?

import numpy as np
import pandas as pd
import graphviz
import lingam
from lingam.utils import make_dot
print([np.__version__, pd.__version__, graphviz.__version__, lingam.__version__])
np.random.seed(0)

['1.25.2', '2.2.0', '0.20', '1.8.3']

Test Data

x2 = np.random.uniform(size=100000)
x0 = 3.0*x2 + np.random.uniform(size=100000)
x1 = 1.0*x0 + 6.0*x2 + np.random.uniform(size=100000)
X = pd.DataFrame(np.array([x0, x1, x2]).T, columns=['x0', 'x1', 'x2'])
make_dot([[0.0, 0.0, 3.0], [1.0, 0.0, 6.0], [0.0, 0.0, 0.0]])

[make_dot output: the true causal graph]

CPU

%%time
model = lingam.DirectLiNGAM()
model = model.fit(X)

CPU times: total: 156 ms
Wall time: 169 ms

print('causal ordering:', model.causal_order_)
make_dot(model.adjacency_matrix_)

causal ordering: [2, 0, 1]

[make_dot output: graph estimated on CPU]

GPU

%%time
model = lingam.DirectLiNGAM(measure='pwling_fast')
model = model.fit(X)

CPU times: total: 141 ms
Wall time: 205 ms

print('causal ordering:', model.causal_order_)
make_dot(model.adjacency_matrix_)

causal ordering: [0, 1, 2]

[make_dot output: graph estimated on GPU]

@aknvictor (Contributor, Author)

Thanks for documenting the Windows setup!

I couldn't reproduce the issue on my setup. Here's the graph using the data provided:
[attached: graph]

Could you try running the example in DirectLiNGAM_fast.py? That includes an additional check that the compiler is available.

@ikeuchi-screen (Collaborator)

The output of the get_cuda_version function is as follows:

CUDA Version found:
 nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:42:34_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

The culingam installed via pip install is v0.0.7, but the GitHub repository has v0.0.6.
I am using v0.0.6, installed manually from GitHub, on Windows.
Could the discrepancy be due to the different culingam versions?

@ikeuchi-screen (Collaborator)

I installed culingam v0.0.7 on Linux with pip and ran DirectLiNGAM_fast.py, but got an AssertionError on assert np.allclose(model.adjacency_matrix_, m)

@ikeuchi-screen (Collaborator) commented Mar 3, 2024

I tried running it with culingam v0.0.7 alone.
I ran the following code in the Kaggle environment, but the causal order was incorrect.

Execution Result

https://github.com/Viktour19/culingam/blob/e2380d138d980196894a675691a978aa92490ee5/examples/basic.py

!pip install culingam

Collecting culingam
Downloading culingam-0.0.7.tar.gz (27 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from culingam) (1.26.4)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.10/site-packages (from culingam) (4.66.1)
Building wheels for collected packages: culingam
Building wheel for culingam (pyproject.toml) ... done
Created wheel for culingam: filename=culingam-0.0.7-cp310-cp310-linux_x86_64.whl size=89289 sha256=b56e51c13260bece05ff0a9e4f17f81bc52f0c503ddb8bff87ddd669f0ab9eba
Stored in directory: /root/.cache/pip/wheels/4d/90/ee/7192c3880f1d0903b6f0a50af63669c5b4f55107f44f120e78
Successfully built culingam
Installing collected packages: culingam
Successfully installed culingam-0.0.7

import numpy as np
import subprocess

# [[ 0.          0.          0.          2.99982982  0.          0.        ]
#  [ 2.99997222  0.          2.00008518  0.          0.          0.        ]
#  [ 0.          0.          0.          5.99981965  0.          0.        ]
#  [ 0.          0.          0.          0.          0.          0.        ]
#  [ 7.99857006  0.         -0.99911522  0.          0.          0.        ]
#  [ 3.99974733  0.          0.          0.          0.          0.        ]]
# [3, 0, 2, 5, 4, 1]

def get_cuda_version():
    try:
        nvcc_version = subprocess.check_output(["nvcc", "--version"]).decode('utf-8')
        print("CUDA Version found:\n", nvcc_version)
        return True
    except Exception as e:
        print("CUDA not found or nvcc not in PATH:", e)
        return False

def main():
    np.random.seed(42)
    size = 100000
    x3 = np.random.uniform(size=size)
    x0 = 3.0*x3 + np.random.uniform(size=size)
    x2 = 6.0*x3 + np.random.uniform(size=size)
    x1 = 3.0*x0 + 2.0*x2 + np.random.uniform(size=size)
    x5 = 4.0*x0 + np.random.uniform(size=size)
    x4 = 8.0*x0 - 1.0*x2 + np.random.uniform(size=size)

    X = np.array([x0, x1, x2, x3, x4, x5]).T

    dlm = DirectLiNGAM(12)
    dlm.fit(X, disable_tqdm=False)

    np.set_printoptions(precision=3, suppress=True)

    print(dlm._adjacency_matrix)
    print(dlm.causal_order_)

# Check for CUDA availability before importing CUDA-dependent packages
if get_cuda_version():
    try:
        from culingam.directlingam import DirectLiNGAM
        main()

    except ImportError as e:
        print("Failed to import CUDA-dependent package:", e)
else:
    print("CUDA is not available. Please ensure CUDA is installed and correctly configured.")

CUDA Version found:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

100%|██████████| 6/6 [00:00<00:00, 17.03it/s]
[[ 0. 0. 0. 0. 0. 0. ]
[ 6.596 0. 0. 0. 0. 0. ]
[-1.331 0.474 0. 0. 0. 0. ]
[ 0.065 0. 0.131 0. 0. 0. ]
[ 8. 0. -1. 0. 0. 0. ]
[ 3.999 0. 0. 0. 0. 0. ]]
[0, 1, 2, 3, 4, 5]

@aknvictor (Contributor, Author) commented Mar 3, 2024

Thanks for your patience! It seems I needed to allow for a broader range of CUDA GPU compute capabilities, e.g. the P100 on Kaggle is sm_60. I've updated the package on PyPI and on GitHub. Let me know if that works.
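
For anyone curious, supporting older architectures amounts to passing additional -gencode flags to nvcc; a hedged sketch (the exact architecture list here is illustrative):

# Illustrative: build fat binaries from sm_60 (e.g. Kaggle's P100) upwards.
nvcc_flags = []
for arch in ("60", "70", "75", "80", "86"):
    nvcc_flags += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
# These flags would then go into CUDAExtension's extra_compile_args["nvcc"].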

@ikeuchi-screen (Collaborator)

@Viktour19
Thanks for responding!
Both the PyPI and GitHub versions worked fine!
I'll check a little more before merging the code.

It would be great if you could also support installing on Windows via pip install culingam!

@ikeuchi-screen (Collaborator)

@Viktour19
You said the GPU was 32 times faster than the CPU; what number of variables and what sample size did you use?
I tried the following combinations and found no difference between CPU and GPU (a benchmarking sketch follows the list):
Number of variables: {10, 20, 50, 100}
Sample size: {1000, 2000, 5000}
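
A minimal benchmarking sketch of the comparison being discussed, assuming culingam is installed ('pwling' is the default CPU measure; the SEM data generator here is illustrative):

import time
import numpy as np
import lingam

def time_fit(n_samples, n_features, measure, seed=0):
    rng = np.random.default_rng(seed)
    # Random SEM x = Bx + e with strictly lower-triangular B.
    B = np.tril(rng.uniform(-1.0, 1.0, (n_features, n_features)), k=-1)
    E = rng.uniform(size=(n_samples, n_features))
    X = np.linalg.solve(np.eye(n_features) - B, E.T).T
    start = time.perf_counter()
    lingam.DirectLiNGAM(measure=measure).fit(X)
    return time.perf_counter() - start

for n in (1000, 5000, 100000):
    cpu = time_fit(n, 100, "pwling")       # default CPU measure
    gpu = time_fit(n, 100, "pwling_fast")  # CUDA path via culingam
    print(f"n={n}: CPU {cpu:.2f}s, GPU {gpu:.2f}s")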

@aknvictor (Contributor, Author) commented Mar 7, 2024

I benchmarked with sample sizes from 1k to 1M and dimensions from 10 to 100.

Here's the wall-clock time for the GPU on my setup. Can you share yours? How does it compare with the CPU time on your setup?

[figure: heatmap of GPU wall-clock time across sample sizes and dimensions]

PS: I'm working on getting a Windows machine to test on.

@ikeuchi-screen (Collaborator)

I fixed the number of variables at 100, based on the heatmap you showed me.
There was no difference when the sample size was below 5000, but above that the GPU was clearly faster!
[figure: CPU vs GPU wall-clock time with 100 variables]

@aknvictor (Contributor, Author)

Excellent!

@ikeuchi-screen (Collaborator) commented Mar 10, 2024

I temporarily reverted this because I found that the CI tests and the docs build do not pass in an environment without culingam installed.

The error is due to the following code (direct_lingam.py):

from lingam_cuda import causal_order as causal_order_gpu

Installing culingam avoids the error in the above code.
However, culingam cannot be installed without CUDA (and cannot be pip-installed on Windows), which would make CUDA a requirement for using lingam at all.

@ikeuchi-screen (Collaborator)

Changed the import location and reverted again. e64892b
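
A minimal sketch of the lazy-import pattern, not the actual lingam code (the call signature of causal_order is an assumption):

# Sketch: defer the culingam import so that `import lingam` succeeds
# in environments without CUDA or culingam.
def _causal_order_gpu(X):
    try:
        # Only evaluated when the GPU measure is actually requested.
        from lingam_cuda import causal_order as causal_order_gpu
    except ImportError as e:
        raise ImportError(
            "measure='pwling_fast' requires the optional culingam package"
        ) from e
    return causal_order_gpu(X)  # hypothetical call signature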
