
【Hackathon 7th No.1】 Integrate PaddlePaddle as a New Backend #704

Open
wants to merge 3 commits into base: master

Conversation

AndPuQing

This PR introduces PaddlePaddle backend support to the PySR library, as part of my contribution to the 7th PaddlePaddle Hackathon.

Key Changes:

  • Introduced PaddlePaddle-specific modules and functions.
  • Updated the CI pipeline to include tests for the PaddlePaddle backend.
  • Modified the example code to support PaddlePaddle (a usage sketch follows below).
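
For context, a minimal usage sketch of the new backend option. The output_paddle_format flag is the one exercised in the reproduction script further down; the model.paddle() export method is an assumption on my part, by analogy with the existing model.pytorch() and model.jax() methods, so please check the diff for the actual name.

import numpy as np
from pysr import PySRRegressor

X = np.random.randn(100, 2)
y = X[:, 0] ** 2 + np.cos(X[:, 1])

model = PySRRegressor(
    niterations=5,
    output_paddle_format=True,  # request Paddle export of the discovered equations
)
model.fit(X, y)

paddle_module = model.paddle()  # hypothetical name, mirroring model.pytorch()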

Related Issues:

Issue Encountered:

During the integration, I discovered that the call to SymbolicRegression.equation_search causes the process to be terminated by a signal, as shown in the following code snippet:

class PySRRegressor(MultiOutputMixin, RegressorMixin, BaseEstimator):
    def _run(self, ...):  # abridged
        ...
        # <-- terminated by signal SIGSEGV (address boundary error)
        out = SymbolicRegression.equation_search(...)
        ...

Interestingly, this error does not occur when I set PYTHON_JULIACALL_THREADS=1. Below are some of the steps I took to investigate the issue:

After the termination, I checked dmesg, which returned the following:

[376826.886941] R13: 00007f3605750ba0 R14: 00007f35fe3ae8b0 R15: 00007f35fe3ae920
[376826.886942] R13: 00007f3605750ba0 R14: 00007f35fe3addc0 R15: 00007f35fe3ade30
[376826.886942] FS:  00007f358b7ff6c0 GS:  0000000000000000
[376826.886943] FS:  00007f35b60fc6c0 GS:  0000000000000000
[376826.886941] FS:  00007f3572ffe6c0 GS:  0000000000000000
[376826.886943] RBP: 00007f330b08ea10 R08: 00000000ffffffff R09: 00007f3605dc2dc8
[376826.886944] FS:  00007f3588ec06c0 GS:  0000000000000000
[376826.886943] RAX: 00007f360698f008 RBX: 00007f3605750b00 RCX: 0000000000000000
[376826.886945] FS:  00007f354f1fe6c0 GS:  0000000000000000
[376826.886945] R10: 00007f3605dc2820 R11: 00007f3605326040 R12: 00007f35fe3af6c0
[376826.886945] FS:  00007f359d3fc6c0 GS:  0000000000000000
[376826.886945] RDX: 0000000000000001 RSI: 00007f3605750b00 RDI: 0000000000000000
[376826.886946] R13: 00007f3605750ba0 R14: 00007f35fe3af6c0 R15: 00007f35fe3af730
[376826.886962] FS:  00007f357bfff6c0 GS:  0000000000000000
[376826.886962] RBP: 00007f330e88ea10 R08: 00000000ffffffff R09: 00007f3605dc2dc8
[376826.886963] R10: 00007f3605dc2820 R11: 00007f3605326040 R12: 00007f35fe3ae270
[376826.886964] R13: 00007f3605750ba0 R14: 00007f35fe3ae270 R15: 00007f35fe3ae2e0
[376826.886965] FS:  00007f35709fa6c0 GS:  0000000000000000
[376826.887626]  in libjulia-internal.so.1.10.4[7f3605453000+1c1000]
[376826.888187]  in libjulia-internal.so.1.10.4[7f3605453000+1c1000]

I also used gdb --args python -m pysr test paddle to inspect the stack trace. The output is as follows:

➜ gdb --args python -m pysr test paddle
...
(gdb) run
W0824 18:25:46.120416  7735 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.6, Runtime API Version: 12.3
W0824 18:25:46.129973  7735 gpu_resources.cc:164] device: 0, cuDNN Version: 9.0.
.Compiling Julia backend...

Thread 6 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd61fd6c0 (LWP 7882)]
0x00007ffff5abaab7 in _jl_mutex_wait (self=self@entry=0x7fffef5b8e20, lock=lock@entry=0x7ffff5d50b00 <jl_codegen_lock>, safepoint=safepoint@entry=1) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c:837
warning: 837    /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c: No such file or directory
(gdb) bt
#0  0x00007ffff5abaab7 in _jl_mutex_wait (self=self@entry=0x7fffef5b8e20, lock=lock@entry=0x7ffff5d50b00 <jl_codegen_lock>, safepoint=safepoint@entry=1) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c:837
#1  0x00007ffff5abab63 in _jl_mutex_lock (self=self@entry=0x7fffef5b8e20, lock=0x7ffff5d50b00 <jl_codegen_lock>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c:875
#2  0x00007ffff5926100 in jl_mutex_lock (lock=<optimized out>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/julia_locks.h:65
#3  jl_generate_fptr_impl (mi=0x7ffe4c8c8d10 <jl_system_image_data+6611984>, world=31536, did_compile=0x7ffcf408eabc) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/jitlayers.cpp:483
#4  0x00007ffff5a691b9 in jl_compile_method_internal (world=31536, mi=<optimized out>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2481
#5  jl_compile_method_internal (mi=<optimized out>, world=31536) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2368
#6  0x00007ffff5a6a1be in _jl_invoke (world=31536, mfunc=0x7ffe4c8c8d10 <jl_system_image_data+6611984>, nargs=3, args=0x7ffcf408ebe0, F=0x7ffe4c566390 <jl_system_image_data+3062416>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2887
#7  ijl_apply_generic (F=<optimized out>, args=0x7ffcf408ebe0, nargs=<optimized out>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
#8  0x00007ffe4c23a44f in macro expansion () at /home/happy/.julia/packages/SymbolicRegression/9q4ZC/src/SingleIteration.jl:123
#9  julia_#271#threadsfor_fun#4_21198 () at threadingconstructs.jl:215
#10 0x00007ffe4c0d7cd0 in #271#threadsfor_fun () at threadingconstructs.jl:182
#11 julia_#1_24737 () at threadingconstructs.jl:154
#12 0x00007ffe4c123847 in jfptr_YY.1_24738 () from /home/happy/.julia/compiled/v1.10/SymbolicRegression/X2eIS_URG6E.so
#13 0x00007ffff5a69f8e in _jl_invoke (world=<optimized out>, mfunc=0x7ffe4c856190 <jl_system_image_data+6142096>, nargs=0, args=0x7fffef5b8e58, F=0x7ffe59c89110) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2895
#14 ijl_apply_generic (F=<optimized out>, args=args@entry=0x7fffef5b8e58, nargs=nargs@entry=0) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
#15 0x00007ffff5a8d310 in jl_apply (nargs=1, args=0x7fffef5b8e50) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/julia.h:1982
#16 start_task () at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/task.c:1238
(gdb)

To be honest, I’m not very familiar with Julia, but it seems that the issue is related to multithreading within the SymbolicRegression library. I ran similar tests with the Torch backend, and below are the results:

➜ gdb --args python -m pysr test torch
(gdb) run
Thread 31 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff4dcc06c0 (LWP 9082)]
0x00007ffff5abaab7 in _jl_mutex_wait (self=self@entry=0x7fffee9addc0, lock=lock@entry=0x7ffff5d50b00 <jl_codegen_lock>, safepoint=safepoint@entry=1) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c:837
warning: 837    /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c: No such file or directory
(gdb) bt
#0  0x00007ffff5abaab7 in _jl_mutex_wait (self=self@entry=0x7fffee9addc0, lock=lock@entry=0x7ffff5d50b00 <jl_codegen_lock>, safepoint=safepoint@entry=1) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c:837
#1  0x00007ffff5abab63 in _jl_mutex_lock (self=self@entry=0x7fffee9addc0, lock=0x7ffff5d50b00 <jl_codegen_lock>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/threading.c:875
#2  0x00007ffff5926100 in jl_mutex_lock (lock=<optimized out>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/julia_locks.h:65
#3  jl_generate_fptr_impl (mi=0x7ffe56d90560, world=31536, did_compile=0x7ffd82c6207c) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/jitlayers.cpp:483
#4  0x00007ffff5a691b9 in jl_compile_method_internal (world=31536, mi=<optimized out>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2481
#5  jl_compile_method_internal (mi=<optimized out>, world=31536) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2368
#6  0x00007ffff5a6a1be in _jl_invoke (world=31536, mfunc=0x7ffe56d90560, nargs=2, args=0x7ffd82c621b0, F=0x7fffe22c81b0 <jl_system_image_data+3615344>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2887
#7  ijl_apply_generic (F=<optimized out>, args=0x7ffd82c621b0, nargs=<optimized out>) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
#8  0x00007ffe4bfe13c3 in julia_optimize_and_simplify_population_21133 () at /home/happy/.julia/packages/SymbolicRegression/9q4ZC/src/SingleIteration.jl:110

#30 0x00007fffe6459970 in jl_system_image_data () from /home/happy/micromamba/julia_env/pyjuliapkg/install/lib/julia/sys.so
#31 0x00007fffe2c14290 in jl_system_image_data () from /home/happy/micromamba/julia_env/pyjuliapkg/install/lib/julia/sys.so
#32 0x0000000000000001 in ?? ()
#33 0x00007ffff5a72617 in jl_smallintset_lookup (cache=<optimized out>, eq=0x40, key=0x14, data=0x7ffe4f3d0010, hv=3) at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/smallintset.c:121
#34 0x00007fffdf4f82e5 in ?? ()
#35 0x00007ffe4f414218 in ?? ()
#36 0x00007fffef48c890 in ?? ()
#37 0x00007fffebb7f8f0 in ?? ()
#38 0x00007ffe4f3d0010 in ?? ()
#39 0x00007ffd82c62a40 in ?? ()
#40 0x00007ffff592846a in __gthread_mutex_unlock (__mutex=0x1) at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/x86_64-linux-gnu/bits/gthr-default.h:779
#41 std::mutex::unlock (this=0x1) at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/bits/std_mutex.h:118
#42 std::lock_guard<std::mutex>::~lock_guard (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/local/x86_64-linux-gnu/include/c++/9.1.0/bits/std_mutex.h:165
#43 JuliaOJIT::ResourcePool<llvm::orc::ThreadSafeContext, 0ul, std::queue<llvm::orc::ThreadSafeContext, std::deque<llvm::orc::ThreadSafeContext, std::allocator<llvm::orc::ThreadSafeContext> > > >::release (resource=..., this=0x7ffd82c62ad0)
    at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/jitlayers.h:462
#44 JuliaOJIT::ResourcePool<llvm::orc::ThreadSafeContext, 0ul, std::queue<llvm::orc::ThreadSafeContext, std::deque<llvm::orc::ThreadSafeContext, std::allocator<llvm::orc::ThreadSafeContext> > > >::OwningResource::~OwningResource (this=0x0, __in_chrg=<optimized out>)
    at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/jitlayers.h:404
#45 0x00007fffdf4f83c4 in ?? ()
#46 0x0000000000000010 in ?? ()
#47 0x00007ffd82c62c60 in ?? ()
#48 0x0000000000000000 in ?? ()
(gdb)

I am encountering some challenges understanding the internal behavior of the SymbolicRegression library. I would greatly appreciate any guidance or suggestions on how to resolve this issue.

@MilesCranmer
Owner

MilesCranmer commented Aug 24, 2024

Thanks! Regarding the error, is this only on the most recent PySR version, or the previous one as well? There was a change in how multithreading was handled in juliacall so I wonder if it’s related.

Just to check, are you launching this with Python multithreading or multiprocessing? I am assuming it is a bug somewhere in PySR but just want to check if there’s anything non standard in how you are launching things.

Another thing to check — does it change if you import paddle first, before PySR? Or does the import order not matter? There is a known issue with PyTorch and JuliaCall where importing torch first prevents an issue with LLVM symbol conflicts (since Torch and Julia are compiled against different LLVM libraries if I remember correctly). Numba has something similar. Not sure if Paddle has something similar
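
For concreteness, a sketch of the import-order check (run each variant in a fresh interpreter; the idea, per the torch workaround mentioned above, is that whichever library is imported first can determine which LLVM symbols get loaded):

# Variant A: paddle first, then PySR
import paddle
from pysr import PySRRegressor

# Variant B, in a separate run: PySR first, then paddle
# from pysr import PySRRegressor
# import paddle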

@AndPuQing
Author

AndPuQing commented Aug 26, 2024

> Thanks! Regarding the error, is this only on the most recent PySR version, or the previous one as well? There was a change in how multithreading was handled in juliacall so I wonder if it’s related.
>
> Just to check, are you launching this with Python multithreading or multiprocessing? I am assuming it is a bug somewhere in PySR but just want to check if there’s anything non standard in how you are launching things.
>
> Another thing to check — does it change if you import paddle first, before PySR? Or does the import order not matter? There is a known issue with PyTorch and JuliaCall where importing torch first prevents an issue with LLVM symbol conflicts (since Torch and Julia are compiled against different LLVM libraries if I remember correctly). Numba has something similar. Not sure if Paddle has something similar

  1. I tested both the 0.18.4 and 0.18.1 release versions, and the issue occurs with both.
  2. I ran tests using pytest pysr/test/test_paddle.py -vv, and my Python version is 3.10.13.
  3. I adjusted the import order in test_paddle.py, but the issue persists regardless of whether pysr or paddle is imported first, so it doesn't seem related to the import order.
  4. As far as I know, Paddle depends on LLVM, which can be confirmed in the llvm.cmake file.

I believe the issue is indeed related to importing Paddle, as the error only occurs when SymbolicRegression.equation_search is executed after importing Paddle. It seems that importing Paddle might cause some side effects, but I'm not sure how to pinpoint the exact cause.
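
One way I could try to narrow down the side effect, sketched below under the assumption that it is related to signal handlers (which the existence of paddle.disable_signal_handler() hints at). Note this only observes handlers visible to Python, so an empty diff would not rule out changes made at the C level:

import signal

WATCHED = (signal.SIGSEGV, signal.SIGINT, signal.SIGTERM)

before = {s.name: signal.getsignal(s) for s in WATCHED}
import paddle  # the import whose side effects we want to observe
after = {s.name: signal.getsignal(s) for s in WATCHED}

for name in before:
    if before[name] != after[name]:
        print(f"{name}: {before[name]!r} -> {after[name]!r}")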

@MilesCranmer
Owner

Quick followup – did you also try 0.19.4? There were some changes to juliacall that were integrated in 0.19.4 of PySR.

@MilesCranmer
Owner

MilesCranmer commented Aug 26, 2024

It looks like Julia uses LLVM 18: https://github.com/JuliaLang/julia/blob/647753071a1e2ddbddf7ab07f55d7146238b6b72/deps/llvm.version#L8 (or LLVM 17 on the last version) whereas Paddle uses LLVM 11. I wonder if that is causing one of the issues.

Can you try running some generic juliacall stuff as shown in the guide here? https://juliapy.github.io/PythonCall.jl/stable/juliacall/. Hopefully we should be able to get a simpler example of the crash. I would be surprised if it is only PySR (and not Julia more broadly) but it could very well be just PySR.
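
For reference, a sketch of the kind of generic juliacall check meant here (basic usage per that guide, with paddle imported to match the failing setup; the threaded loop further below is the variant that actually matters for this bug):

import paddle
from juliacall import Main as jl

print(jl.seval("1 + 2"))        # basic evaluation of Julia code
jl.println("hello from Julia")  # calling a Julia function directly
x = jl.rand(3)                  # a Julia array wrapped for Python
print(list(x))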

@AndPuQing
Author

> Quick followup – did you also try 0.19.4? There were some changes to juliacall that were integrated in 0.19.4 of PySR.

It still occurs.

@AndPuQing
Author

> It looks like Julia uses LLVM 18: https://github.com/JuliaLang/julia/blob/647753071a1e2ddbddf7ab07f55d7146238b6b72/deps/llvm.version#L8 (or LLVM 17 on the last version) whereas Paddle uses LLVM 11. I wonder if that is causing one of the issues.
>
> Can you try running some generic juliacall stuff as shown in the guide here? https://juliapy.github.io/PythonCall.jl/stable/juliacall/. Hopefully we should be able to get a simpler example of the crash. I would be surprised if it is only PySR (and not Julia more broadly) but it could very well be just PySR.

I created a minimal reproducible example, as shown in the code below.

import numpy as np
import pandas as pd
from pysr import PySRRegressor
import paddle

paddle.disable_signal_handler()
X = pd.DataFrame(np.random.randn(100, 10))
y = np.ones(X.shape[0])
model = PySRRegressor(
    progress=True,
    max_evals=10000,
    model_selection="accuracy",
    extra_sympy_mappings={},
    output_paddle_format=True,
    # multithreading=True,
)
model.fit(X, y)

Interestingly, when running PySRRegressor with either procs=0 (serial execution) or procs=cpu_count() with multithreading=False (multiprocessing), no issues occur. Additionally, setting PYTHON_JULIACALL_THREADS=1 allows multithreading to run without problems.

To summarize: when the environment variable PYTHON_JULIACALL_THREADS is set to auto or to a value greater than 1, and SymbolicRegression runs in multithreading mode, the equation_search call terminates unexpectedly with SIGSEGV.
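
For reference, sketches of the configurations reported above as not crashing. The procs and multithreading arguments are standard PySRRegressor options; the assumption in the first sketch is that PYTHON_JULIACALL_THREADS must be set before juliacall starts Julia, i.e. before pysr is imported:

import os

# 1) Cap the embedded Julia at a single thread (set before importing pysr/juliacall).
os.environ["PYTHON_JULIACALL_THREADS"] = "1"

from multiprocessing import cpu_count
from pysr import PySRRegressor

# 2) Serial execution: no crash observed.
model_serial = PySRRegressor(procs=0, multithreading=False)

# 3) Multiprocessing instead of multithreading: no crash observed.
model_multiproc = PySRRegressor(procs=cpu_count(), multithreading=False)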

@MilesCranmer
Owner

Are you able to build a MWE that only uses juliacall, rather than PySR? We should hopefully be able to boil it down even further. Maybe you could do something like

from juliacall import Main as jl
import paddle
paddle.disable_signal_handler()
jl.seval("""
Threads.@threads for i in 1:5
    println(i)
end
""")

which is like the simplest way of using multithreading in Julia.

@AndPuQing
Author

> Are you able to build a MWE that only uses juliacall, rather than PySR? We should hopefully be able to boil it down even further. Maybe you could do something like
>
> from juliacall import Main as jl
> import paddle
> paddle.disable_signal_handler()
> jl.seval("""
> Threads.@threads for i in 1:5
>     println(i)
> end
> """)
>
> which is like the simplest way of using multithreading in Julia.

It works as expected:

➜ python test.py
1
2
3
4
5

@MilesCranmer
Owner

MilesCranmer commented Aug 27, 2024

Since those numbers appear in order it seems like you might not have multi-threading turned on for Julia? Normally they will appear in some random order as each will get printed by a different thread. (Note the environment variables I mentioned above.)
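
A quick way to confirm how many threads the embedded Julia actually started with, using only juliacall:

from juliacall import Main as jl

# Should print a value greater than 1 when PYTHON_JULIACALL_THREADS is "auto"
# (or an explicit value above 1) and is set before juliacall is first imported.
print(jl.seval("Threads.nthreads()"))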

@AndPuQing
Author

> Since those numbers appear in order it seems like you might not have multi-threading turned on for Julia? Normally they will appear in some random order as each will get printed by a different thread. (Note the environment variables I mentioned above.)

➜ PYTHON_JULIACALL_HANDLE_SIGNALS=yes PYTHON_JULIACALL_THREADS=auto python test.py
1
3
5
2
4

@lijialin03

Excuse me, is the current problem that Paddle does not support Julia multithreading unless PYTHON_JULIACALL_HANDLE_SIGNALS=yes and PYTHON_JULIACALL_THREADS=auto are set?
In addition, I tried the following code without paddle:

from juliacall import Main as jl
jl.seval("""
Threads.@threads for i in 1:5
    println(i)
end
""")

and the output is also 1 2 3 4 5, in order.
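
Ordered output on its own does not prove the loop ran on a single thread; a small variation that also prints the thread id for each iteration makes this unambiguous (same juliacall pattern as above):

from juliacall import Main as jl

jl.seval("""
Threads.@threads for i in 1:5
    println((i, Threads.threadid()))
end
""")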
