build results vary from parallelism #4611
Comments
Hi, and welcome to yt! Thanks for opening your first issue. We have an issue template that helps us gather the relevant information needed to diagnose and fix the issue.
I half-remember similar problems being raised in the past by downstream packagers (sometimes parallel builds may even crash). Any leads on what we should be on the lookout for?
From the context, it could come from somewhere in cython - it would not be the first compiler that has trouble with parallel processing. But I know too little about your setup.
Related issue: #4278
4.4.0 still has this problem, so it seems to be unrelated to #4278. I used … Also the …
Both … and there are a number of other files (that utilize …)
I found that the variations are only in 2 .cpp files: …
This is certainly before compiling and linking. Somehow the cpp code generation introduces non-determinism. I did a build with debuginfo that contains these diffs at the very end: …
The variations are somewhat random. After 5 tries, I also had these once in …
Is it possible that if we "upgraded" all the C++ to use the same C++ standard, the variations would lessen or disappear?
I don't think so, because this diff is already in the C++ source code. The code that generates …
Do we have any indication that this could (or couldn't) be an upstream bug in Cython?
Here is one data point: I found 160 packages in openSUSE that use cython, but the only one of them with this kind of variation in a .cpp file is …
That's a genuinely convincing data point. ;-) My thought process for my question was that if we're including them (via …)
Some progress I think -- First, I was able to reproduce the difference locally: setting OMP_NUM_THREADS in my environment (e.g., …). Actually, looking up what that …
The .pyx lines that generate these omp directives correspond to …

Potential fixes

pixelization_routines.pyx

The two prange loops re-acquire the gil to check for error signals; removing those checks removes the omp critical directives, i.e., building with the following diff:

diff --git a/yt/utilities/lib/pixelization_routines.pyx b/yt/utilities/lib/pixelization_routines.pyx
index a6a9f23d9..3846777b1 100644
--- a/yt/utilities/lib/pixelization_routines.pyx
+++ b/yt/utilities/lib/pixelization_routines.pyx
@@ -1213,10 +1213,6 @@ def pixelize_sph_kernel_projection(
             local_buff[i] = 0.0

         for j in prange(0, posx.shape[0], schedule="dynamic"):
-            if j % 100000 == 0:
-                with gil:
-                    PyErr_CheckSignals()
-
             xiter[1] = yiter[1] = ziter[1] = 999

             if check_period[0] == 1:
@@ -1569,9 +1565,6 @@ def pixelize_sph_kernel_slice(
             local_buff[i] = 0.0

         for j in prange(0, posx.shape[0], schedule="dynamic"):
-            if j % 100000 == 0:
-                with gil:
-                    PyErr_CheckSignals()

results in a …
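For anyone trying to build a smaller reproducer: below is a minimal, self-contained Cython sketch (the file, function, and variable names are made up, not yt code) of the pattern the diff above removes -- a prange loop that periodically re-acquires the GIL to poll for signals. Per the discussion above, it is this kind of gil block inside a prange that leads Cython to emit the #pragma omp critical(__pyx_parallel_lastprivatesN) sections.

# cython: language_level=3
# prange_signal_check.pyx -- hypothetical file name, sketch only
from cython.parallel import prange
from cpython.exc cimport PyErr_CheckSignals

def checked_sum(double[:] data):
    cdef Py_ssize_t j
    cdef double acc = 0.0
    # the "with gil" block inside prange adds an error-exit path to the loop,
    # which brings the omp critical bookkeeping into the generated code
    for j in prange(data.shape[0], nogil=True, schedule="dynamic"):
        if j % 100000 == 0:
            with gil:
                PyErr_CheckSignals()
        acc += data[j]
    return acc

Cythonizing a file like this twice, with and without the build parallelism, and diffing the generated output should show whether the critical-section naming is stable outside of yt's build.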
@chrishavlin wow, thanks for this in-depth inquiry!
Thanks @chrishavlin for the great analysis. I tested your patch and found that yt-4.4.0/yt/utilities/lib/image_samplers.cpp still has similar variations (the effect might be low-entropy, so it could need a few tries to trigger):

--- /var/tmp/build-root.12/.mount/home/abuild/rpmbuild/BUILD/yt-4.4.0/yt/utilities/lib/image_samplers.cpp 2024-12-06 14:55:24.336666665 +0000
+++ /var/tmp/build-root.12b/.mount/home/abuild/rpmbuild/BUILD/yt-4.4.0/yt/utilities/lib/image_samplers.cpp 2041-01-08 04:12:36.353333332 +0000
@@ -25213,7 +25213,7 @@
goto __pyx_L26;
__pyx_L26:;
#ifdef _OPENMP
- #pragma omp critical(__pyx_parallel_lastprivates0)
+ #pragma omp critical(__pyx_parallel_lastprivates1)
#endif /* _OPENMP */
{
__pyx_parallel_temp0 = __pyx_v_i;
@@ -26264,7 +26264,7 @@
goto __pyx_L27;
__pyx_L27:;
#ifdef _OPENMP
- #pragma omp critical(__pyx_parallel_lastprivates1)
+ #pragma omp critical(__pyx_parallel_lastprivates2)
#endif /* _OPENMP */
{
__pyx_parallel_temp0 = __pyx_v_i;
@bmwiedemann image_samplers.pyx includes a similar check for python error signals (re-acquiring the GIL) within a prange; I'll try removing that too and will let you know when I update my branch. @neutrinoceros ya, asking upstream is probably the thing to do, but I'll see if I can work out a simpler reproducible example first.
ok, removing the critical sections from the generated image_samplers.cpp is a bit more complicated -- I need to remove the python error check as expected, but I also need a small refactor due to these lines that occur within a prange:

yt/yt/utilities/lib/image_samplers.pyx, lines 285 to 286 in 6672c17

That access to …
A different image buffer for each thread would be a pretty big memory increase -- we're looking at Npix by Nch, which gets pretty big on images we'd want to be small. Can't we instead cache a reference to it (so that it's not doing …)?
Good point on the memory increase -- this does sound like a better approach. And I don't think this particular loop should actually have any overlap as it is, but I could use some ideas on which "standard OMP directives" to use here :) I know you can use omp functions directly...
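A minimal sketch of the caching idea, with made-up class, method, and attribute names (this is not yt's actual image-sampler API): bind the shared image buffer to a local typed memoryview while the GIL is still held, so the prange body only does memoryview indexing and never needs a Python attribute lookup.

# cython: language_level=3
# cached_buffer.pyx -- hypothetical names, sketch of the idea only
cimport cython
from cython.parallel import prange

cdef class ImageAccumulator:
    cdef public double[:, :, :] image  # (Npix, Npix, Nch) buffer shared by all threads

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def deposit(self, double[:] values, Py_ssize_t[:] xi, Py_ssize_t[:] yi, Py_ssize_t[:] ci):
        # cache self.image in a local typed memoryview before the parallel
        # region; inside prange we only index the memoryview, so no GIL
        # re-acquisition (and no extra critical section) is needed
        cdef double[:, :, :] im = self.image
        cdef Py_ssize_t k
        for k in prange(values.shape[0], nogil=True, schedule="static"):
            # assumes each (xi[k], yi[k], ci[k]) target is written by at most
            # one k, matching the "no overlap" expectation above
            im[xi[k], yi[k], ci[k]] = values[k]

If the no-overlap assumption ever breaks, the plain assignment would need to become an atomic or reduction-style update instead.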
Bug report
Bug summary
After fixing #4609, there is some other remaining issue, and my tools say it is related to the number of cores I give the build VM.
Code for reproduction
build once each in a 1-core-VM and a 2-core-VM
Actual outcome
bounding_volume_hierarchy.cpython-310-x86_64-linux-gnu.so
and other binaries vary
Expected outcome
It should be possible to create bit-identical results (currently, that works only by doing all builds in 1-core-VMs)
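For checking the outcome, here is a small sketch (with placeholder build-tree paths, not the actual locations from the report) that hashes the artifact from each VM and compares:

# compare_builds.py -- placeholder paths, adjust to the actual build locations
import hashlib
from pathlib import Path

ARTIFACT = "yt/utilities/lib/bounding_volume_hierarchy.cpython-310-x86_64-linux-gnu.so"

def sha256(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

one_core = sha256(Path("build-1core-vm") / ARTIFACT)
two_core = sha256(Path("build-2core-vm") / ARTIFACT)
print("bit-identical" if one_core == two_core else "binaries differ")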
Version Information
This bug was found while working on reproducible builds for openSUSE.