
Accelerate DirectLiNGAM by parallelising causal ordering on GPUs with CUDA #128

Merged (2 commits) · Mar 10, 2024

Conversation

@aknvictor (Contributor) commented Feb 22, 2024

This PR includes an implementation that drastically speeds up (up to 32x on a consumer GPU) DirectLiNGAM and its variants, e.g. VarLiNGAM.

The change adds an optional dependency, https://github.com/Viktour19/culingam, which implements custom CUDA kernels for the pairwise likelihood-ratio causal ordering method.

The implementation has been tested locally on an NVIDIA RTX 6000 on a Linux machine, but tests on other setups are needed.
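
A minimal usage sketch, assuming culingam is installed; the GPU path is selected via the new measure='pwling_fast' option (as used later in this thread):

import numpy as np
import lingam

# Placeholder data; any (n_samples, n_features) array works.
X = np.random.uniform(size=(100000, 3))

# 'pwling_fast' dispatches the pairwise likelihood-ratio ordering to
# culingam's CUDA kernels; the default 'pwling' runs on the CPU.
model = lingam.DirectLiNGAM(measure='pwling_fast')
model.fit(X)
print(model.causal_order_)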

@firmai commented Mar 1, 2024

Very interesting adaptation; looking forward to it.

@ikeuchi-screen (Collaborator)

Hi @Viktour19 , thanks for your contribution!

First of all, I could not install culingam in my Windows environment with pip install culingam (although I could install it in my Linux environment).

  • Windows10 Pro
  • Python 3.9.12
  • CUDA Toolkit 12.2

I attempted a manual installation using the procedure shown here, but it failed.

With several modifications, I was able to install it.
Please see below what I did, and use it to improve culingam (a build-script sketch combining the fixes follows the list).

Environment variable CUDA_HOME

Set the environment variable CUDA_HOME to the CUDA install path on Windows:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2

Inclusion of nvToolsExt.h

Changed the header include to the form shown on the official site:
#include <nvtx3/nvToolsExt.h>

Compile error for M_PI

Added -D_USE_MATH_DEFINES to extra_compile_args of CUDAExtension.
https://stackoverflow.com/questions/56319494/nvcc-compilation-errors-with-m-pi-and-or

nvToolsExt.lib is missing

Added the following path to library_dirs in CUDAExtension:
C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64

Also, link against nvToolsExt64_1.lib instead of nvToolsExt.lib:
pytorch/pytorch#101135
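
Putting it together: a hypothetical setup.py sketch combining the fixes above. The extension name and source path are illustrative, not culingam's actual build script.

# Hypothetical build script incorporating the Windows fixes above.
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

setup(
    name="culingam",
    ext_modules=[
        CUDAExtension(
            name="lingam_cuda",
            sources=["culingam/lingam_cuda.cu"],  # illustrative source path
            # -D_USE_MATH_DEFINES fixes the M_PI compile error under MSVC.
            extra_compile_args={
                "cxx": ["-D_USE_MATH_DEFINES"],
                "nvcc": ["-D_USE_MATH_DEFINES"],
            },
            # Directory containing nvToolsExt64_1.lib on Windows.
            library_dirs=[r"C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64"],
            libraries=["nvToolsExt64_1"],  # instead of nvToolsExt
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)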

@ikeuchi-screen (Collaborator)

Hi @Viktour19,
Here is a comparison of the existing DirectLiNGAM and the GPU version on the same data.
The GPU version estimates the wrong causal order.
Is this a problem with my environment?

import numpy as np
import pandas as pd
import graphviz
import lingam
from lingam.utils import make_dot
print([np.__version__, pd.__version__, graphviz.__version__, lingam.__version__])
np.random.seed(0)

['1.25.2', '2.2.0', '0.20', '1.8.3']

Test Data

x2 = np.random.uniform(size=100000)
x0 = 3.0*x2 + np.random.uniform(size=100000)
x1 = 1.0*x0 + 6.0*x2 + np.random.uniform(size=100000)
X = pd.DataFrame(np.array([x0, x1, x2]).T, columns=['x0', 'x1', 'x2'])
make_dot([[0.0, 0.0, 3.0], [1.0, 0.0, 6.0], [0.0, 0.0, 0.0]])

[make_dot output: the true causal graph]

CPU

%%time
model = lingam.DirectLiNGAM()
model = model.fit(X)

CPU times: total: 156 ms
Wall time: 169 ms

print('causal ordering:', model.causal_order_)
make_dot(model.adjacency_matrix_)

causal ordering: [2, 0, 1]

[make_dot output: graph estimated on CPU]

GPU

%%time
model = lingam.DirectLiNGAM(measure='pwling_fast')
model = model.fit(X)

CPU times: total: 141 ms
Wall time: 205 ms

print('causal ordering:', model.causal_order_)
make_dot(model.adjacency_matrix_)

causal ordering: [0, 1, 2]

[make_dot output: graph estimated on GPU]

@aknvictor (Contributor, Author)

Thanks for documenting the Windows setup!

I couldn't reproduce the issue on my setup. Here's the graph using the data provided:
[attached: graph]

Could you try running the example in DirectLiNGAM_fast.py? That includes an additional check that the compiler is available.

@ikeuchi-screen (Collaborator)

The output of the get_cuda_version function is as follows:

CUDA Version found:
 nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:42:34_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

The culingam installed via pip install is v0.0.7, but the GitHub repository has v0.0.6.
I am using v0.0.6, installed manually from GitHub, on Windows.
Could the discrepancy be due to the different culingam versions?

@ikeuchi-screen (Collaborator)

I installed culingam v0.0.7 on Linux with pip and ran DirectLiNGAM_fast.py, but got an AssertionError on assert np.allclose(model.adjacency_matrix_, m)

@ikeuchi-screen (Collaborator) commented Mar 3, 2024

I tried running it with culingam v0.0.7 alone.
I ran the following code in the Kaggle environment, but the causal order was incorrect.

Execution Result

https://github.com/Viktour19/culingam/blob/e2380d138d980196894a675691a978aa92490ee5/examples/basic.py

!pip install culingam

Collecting culingam
Downloading culingam-0.0.7.tar.gz (27 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from culingam) (1.26.4)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.10/site-packages (from culingam) (4.66.1)
Building wheels for collected packages: culingam
Building wheel for culingam (pyproject.toml) ... done
Created wheel for culingam: filename=culingam-0.0.7-cp310-cp310-linux_x86_64.whl size=89289 sha256=b56e51c13260bece05ff0a9e4f17f81bc52f0c503ddb8bff87ddd669f0ab9eba
Stored in directory: /root/.cache/pip/wheels/4d/90/ee/7192c3880f1d0903b6f0a50af63669c5b4f55107f44f120e78
Successfully built culingam
Installing collected packages: culingam
Successfully installed culingam-0.0.7

import numpy as np
import subprocess

# [[ 0.          0.          0.          2.99982982  0.          0.        ]
#  [ 2.99997222  0.          2.00008518  0.          0.          0.        ]
#  [ 0.          0.          0.          5.99981965  0.          0.        ]
#  [ 0.          0.          0.          0.          0.          0.        ]
#  [ 7.99857006  0.         -0.99911522  0.          0.          0.        ]
#  [ 3.99974733  0.          0.          0.          0.          0.        ]]
# [3, 0, 2, 5, 4, 1]

def get_cuda_version():
    try:
        nvcc_version = subprocess.check_output(["nvcc", "--version"]).decode('utf-8')
        print("CUDA Version found:\n", nvcc_version)
        return True
    except Exception as e:
        print("CUDA not found or nvcc not in PATH:", e)
        return False

def main():
    np.random.seed(42)
    size = 100000
    x3 = np.random.uniform(size=size)
    x0 = 3.0*x3 + np.random.uniform(size=size)
    x2 = 6.0*x3 + np.random.uniform(size=size)
    x1 = 3.0*x0 + 2.0*x2 + np.random.uniform(size=size)
    x5 = 4.0*x0 + np.random.uniform(size=size)
    x4 = 8.0*x0 - 1.0*x2 + np.random.uniform(size=size)

    X = np.array([x0, x1, x2, x3, x4, x5]).T

    dlm = DirectLiNGAM(12)
    dlm.fit(X, disable_tqdm=False)

    np.set_printoptions(precision=3, suppress=True)

    print(dlm._adjacency_matrix)
    print(dlm.causal_order_)

# Check for CUDA availability before importing CUDA-dependent packages
if get_cuda_version():
    try:
        from culingam.directlingam import DirectLiNGAM
        main()

    except ImportError as e:
        print("Failed to import CUDA-dependent package:", e)
else:
    print("CUDA is not available. Please ensure CUDA is installed and correctly configured.")

CUDA Version found:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

100%|██████████| 6/6 [00:00<00:00, 17.03it/s]
[[ 0. 0. 0. 0. 0. 0. ]
[ 6.596 0. 0. 0. 0. 0. ]
[-1.331 0.474 0. 0. 0. 0. ]
[ 0.065 0. 0.131 0. 0. 0. ]
[ 8. 0. -1. 0. 0. 0. ]
[ 3.999 0. 0. 0. 0. 0. ]]
[0, 1, 2, 3, 4, 5]

@aknvictor (Contributor, Author) commented Mar 3, 2024

Thanks for your patience! It seems I needed to allow for a broader range of CUDA GPU compute capabilities, e.g. the P100 on Kaggle is sm_60. I've updated the package on PyPI and on GitHub. Let me know if that works.
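
For anyone curious, supporting older architectures amounts to passing additional -gencode flags to nvcc; a hedged sketch (the exact architecture list here is illustrative):

# Illustrative: build fat binaries from sm_60 (e.g. Kaggle's P100) upwards.
nvcc_flags = []
for arch in ("60", "70", "75", "80", "86"):
    nvcc_flags += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
# These flags would then go into CUDAExtension's extra_compile_args["nvcc"].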

@ikeuchi-screen (Collaborator)

@Viktour19
Thanks for responding!
Both the PyPI and GitHub versions worked fine!
I'll check a little more before merging the code.

It would be great if you could also support installing on Windows via pip install culingam!

@ikeuchi-screen (Collaborator)

@Viktour19
You said the GPU was 32 times faster than the CPU; what number of variables and what sample size did you use?
I tried the following combinations and found no difference between CPU and GPU (a benchmarking sketch follows the list):
Number of variables: {10, 20, 50, 100}
Sample size: {1000, 2000, 5000}
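
A minimal benchmarking sketch of the comparison being discussed, assuming culingam is installed ('pwling' is the default CPU measure; the SEM data generator here is illustrative):

import time
import numpy as np
import lingam

def time_fit(n_samples, n_features, measure, seed=0):
    rng = np.random.default_rng(seed)
    # Random SEM x = Bx + e with strictly lower-triangular B.
    B = np.tril(rng.uniform(-1.0, 1.0, (n_features, n_features)), k=-1)
    E = rng.uniform(size=(n_samples, n_features))
    X = np.linalg.solve(np.eye(n_features) - B, E.T).T
    start = time.perf_counter()
    lingam.DirectLiNGAM(measure=measure).fit(X)
    return time.perf_counter() - start

for n in (1000, 5000, 100000):
    cpu = time_fit(n, 100, "pwling")       # default CPU measure
    gpu = time_fit(n, 100, "pwling_fast")  # CUDA path via culingam
    print(f"n={n}: CPU {cpu:.2f}s, GPU {gpu:.2f}s")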

@aknvictor (Contributor, Author) commented Mar 7, 2024

I benchmarked with sample sizes from 1k to 1M and dimensions from 10 to 100.

Here's the wall-clock time for the GPU on my setup. Can you share yours? How does it compare with the CPU time on your setup?

[figure: heatmap of GPU wall-clock time across sample sizes and dimensions]

PS: I'm working on getting a Windows machine to test on.

@ikeuchi-screen (Collaborator)

I fixed the number of variables at 100, based on the heatmap you showed me.
There was no difference when the sample size was below 5000, but above that the GPU was clearly faster!
[figure: CPU vs GPU wall-clock time with 100 variables]

@aknvictor (Contributor, Author)

Excellent!

@ikeuchi-screen (Collaborator) commented Mar 10, 2024

I temporarily reverted this because I found that the CI tests and the docs build do not pass in an environment without culingam installed.

The error is due to the following code (direct_lingam.py):

from lingam_cuda import causal_order as causal_order_gpu

Installing culingam avoids the error in the above code.
However, culingam cannot be installed without CUDA (and cannot be pip-installed on Windows), which would make CUDA a requirement for using lingam at all.

@ikeuchi-screen (Collaborator)

Changed the import location and reverted again. e64892b
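
A minimal sketch of the lazy-import pattern, not the actual lingam code (the call signature of causal_order is an assumption):

# Sketch: defer the culingam import so that `import lingam` succeeds
# in environments without CUDA or culingam.
def _causal_order_gpu(X):
    try:
        # Only evaluated when the GPU measure is actually requested.
        from lingam_cuda import causal_order as causal_order_gpu
    except ImportError as e:
        raise ImportError(
            "measure='pwling_fast' requires the optional culingam package"
        ) from e
    return causal_order_gpu(X)  # hypothetical call signature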
