build results vary from parallelism #4611

Open
bmwiedemann opened this issue Jul 28, 2023 · 20 comments
Labels
build related to the build process

Comments

@bmwiedemann
Contributor

bmwiedemann commented Jul 28, 2023

Bug report

Bug summary

After fixing #4609, there is some other remaining issue, and my tools indicate that it depends on the number of cores I give the build VM.

Code for reproduction

Build once each in a 1-core VM and a 2-core VM:

osc co openSUSE:Factory/python-yt && cd $_
for N in 1 2 ; do
    osc build --noservice --vm-type=kvm -j$N --keep-pkg=RPMS.$N
    md5sum `find /var/tmp/build-root/standard-x86_64/home/abuild/rpmbuild/ -name \*.c|sort` > $N.md5
done
diff -u ?.md5

Actual outcome

bounding_volume_hierarchy.cpython-310-x86_64-linux-gnu.so and other binaries vary

/home/abuild/rpmbuild/BUILD/yt-4.1.4/build/lib.linux-x86_64-cpython-311/yt/utilities/lib/bounding_volume_hierarchy.cpp        2039-08-30 07:17:28.000000000 +0000
@@ -7059,7 +7059,7 @@            
                                 goto __pyx_L18;
                                 __pyx_L18:;
                                 #ifdef _OPENMP
-                                #pragma omp critical(__pyx_parallel_lastprivates1)                              
+                                #pragma omp critical(__pyx_parallel_lastprivates0)                                  
                                 #endif /* _OPENMP */
                                 {   
                                     __pyx_parallel_temp0 = __pyx_v_i;

Expected outcome

It should be possible to create bit-identical results (currently, that only works by doing all builds in 1-core VMs).

Version Information

  • Operating System: openSUSE-Tumbleweed 20230724
  • Python Version: 3.9 - 3.11
  • yt version: 4.1.4
  • Other Libraries (if applicable):

This bug was found while working on reproducible builds for openSUSE.

@welcome

welcome bot commented Jul 28, 2023

Hi, and welcome to yt! Thanks for opening your first issue. We have an issue template that helps us gather the relevant information needed to diagnose and fix the issue.

@neutrinoceros
Member

I half-remember similar problems being raised in the past by downstream packagers (sometimes parallel builds may even crash). Any leads on what we should be on the lookout for?

@bmwiedemann
Contributor Author

From the context, it could be coming from somewhere inside Cython; it would not be the first compiler that has trouble with parallel processing.

But I know too little about your setup. python-scikit-learn seems to have a similar problem.

@neutrinoceros added the build ("related to the build process") label on Jul 31, 2023
@neutrinoceros
Member

related issue: #4278

@bmwiedemann
Contributor Author

4.4.0 still has this problem, so it seems to be unrelated to #4278.

I used strace to find where pixelization_routines.cpp comes from and found that it is written twice by python3 -mpip wheel . Is there some internal parallelism? Maybe those two variants differ, and when building on multicore machines a race decides which variant wins (i.e., is written last)?

Also, the omp in the diff suggests that OpenMP could be involved.

@chrishavlin
Contributor

bounding_volume_hierarchy.cpython-310-x86_64-linux-gnu.so and other binaries vary

Also, the omp in the diff suggests that OpenMP could be involved.

Both pixelization_routines.pyx and bounding_volume_hierarchy.pyx contain extra distutils directives that add OpenMP compiler arguments:

# distutils: extra_compile_args = CPP14_FLAG OMP_ARGS
# distutils: extra_link_args = CPP14_FLAG OMP_ARGS

and there are a number of other files (that utilize cython.parallel) with the same directives -- here's the list of files that use the extra OMP args internally: bounding_volume_hierarchy.pyx, image_samplers.pyx, image_utilities.pyx, geometry_utils.pyx, pixelization_routines.pyx, _octree_raytracing.pyx. I'd be curious to know whether the list of binaries that vary matches that list, which could indicate an issue with how OMP is being linked.

@bmwiedemann
Contributor Author

I found that the variations are only in 2 .cpp files:

/home/abuild/rpmbuild/BUILD/yt-4.4.0/yt/utilities/lib/misc_utilities.cpp
/home/abuild/rpmbuild/BUILD/yt-4.4.0/yt/utilities/lib/pixelization_routines.cpp

This is definitely before compiling and linking; somehow the .cpp code generation introduces non-determinism.

I did a build with debuginfo that contains these diffs at the very end:
https://rb.zq1.de/compare.factory-20241129/diffs/python-yt-4.4-diff.txt
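
For reference, the comparison itself can be scripted; here is a minimal Python sketch (the build-root paths are hypothetical) that hashes the generated .cpp files in two build trees and lists the ones that differ:

import hashlib
from pathlib import Path

def digests(root):
    # md5 of every generated .cpp file below the given build tree
    return {str(p.relative_to(root)): hashlib.md5(p.read_bytes()).hexdigest()
            for p in sorted(Path(root).rglob("*.cpp"))}

a = digests("build-root-1core")   # hypothetical paths
b = digests("build-root-2core")
for name in sorted(set(a) | set(b)):
    if a.get(name) != b.get(name):
        print("varies:", name)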

@bmwiedemann
Contributor Author

The variations are somewhat random. After 5 tries, I also had these once in
/home/abuild/rpmbuild/BUILD/yt-4.4.0/yt/utilities/lib/image_samplers.cpp

@matthewturk
Member

Is it possible that if we "upgraded" all the C++ to use the same C++ standard the variations would lessen or disappear?

@bmwiedemann
Contributor Author

I don't think so, because this diff is already in the C++ source code. The code that generates pixelization_routines.cpp from pixelization_routines.pyx is the part that likely needs fixing. If it operates by the principle of "garbage in => garbage out", fixing the input data (.pyx) could also help.

@neutrinoceros
Member

Do we have any indication that this could (or couldn't) be an upstream bug in Cython?

@bmwiedemann
Contributor Author

Here is one data point: I found 160 packages in openSUSE that use cython, but the only one of them with this kind of variation in a .cpp file is python-yt.

@matthewturk
Member

That's a genuinely convincing data point. ;-)

The thinking behind my question was that if we're including them (via .pxd inclusion, for instance) and some temporary file is generated, the order in which that temporary file gets created could be the dominant factor.

@chrishavlin
Contributor

Some progress, I think. First, I was able to reproduce the difference locally: setting OMP_NUM_THREADS in my environment (e.g., export OMP_NUM_THREADS=3) and building as normal (I already had OpenMP installed), the generated pixelization_routines.cpp and misc_utilities.cpp end up with those #pragma omp critical(__pyx_parallel_lastprivates0) lines reported in @bmwiedemann's initial comment. On rebuilding, I'd occasionally get variations in the trailing integer of __pyx_parallel_lastprivates0, which leads to the diffs that @bmwiedemann described.

Looking up what that pragma omp critical directive actually means (link):

the omp critical directive identifies a section of code that must be executed by a single thread at a time.

The .pyx lines that generate these omp directives correspond to prange loops, which is potentially troubling if this directive means that those prange loops will be forced to run as single-threaded loops.
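
For illustration, here is a minimal Cython sketch (a hypothetical file, not taken from yt) of the pattern in question: a prange loop that re-acquires the GIL to call PyErr_CheckSignals. Cythonizing something like this with OpenMP enabled appears to be what emits the #pragma omp critical(__pyx_parallel_lastprivates<N>) block around the lastprivate copy-out (the __pyx_parallel_temp0 = __pyx_v_i assignment seen in the diff above), with <N> coming from an internal counter.

# minimal_repro.pyx -- hypothetical sketch, not yt code
from cython.parallel import prange
from cpython.exc cimport PyErr_CheckSignals

def walk(int n):
    cdef int i
    cdef double total = 0.0
    for i in prange(n, nogil=True, schedule="dynamic"):
        if i % 100000 == 0:
            with gil:
                # re-acquiring the GIL inside the prange body is what
                # appears to trigger the critical section in the generated C++
                PyErr_CheckSignals()
        total += 1.0
    return total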

Potential fixes

pixelization_routines.pyx

The two prange loops re-acquire the GIL to check for error signals; removing those checks removes the omp critical directives. I.e., building with the following diff:

diff --git a/yt/utilities/lib/pixelization_routines.pyx b/yt/utilities/lib/pixelization_routines.pyx
index a6a9f23d9..3846777b1 100644
--- a/yt/utilities/lib/pixelization_routines.pyx
+++ b/yt/utilities/lib/pixelization_routines.pyx
@@ -1213,10 +1213,6 @@ def pixelize_sph_kernel_projection(
             local_buff[i] = 0.0
 
         for j in prange(0, posx.shape[0], schedule="dynamic"):
-            if j % 100000 == 0:
-                with gil:
-                    PyErr_CheckSignals()
-
             xiter[1] = yiter[1] = ziter[1] = 999
 
             if check_period[0] == 1:
@@ -1569,9 +1565,6 @@ def pixelize_sph_kernel_slice(
             local_buff[i] = 0.0
 
         for j in prange(0, posx.shape[0], schedule="dynamic"):
-            if j % 100000 == 0:
-                with gil:
-                    PyErr_CheckSignals()

results in a pixelization_routines.cpp without those #pragma omp critical(__pyx_parallel_lastprivates0) lines. I'm not actually sure whether the directive would result in the entire prange loop running on a single thread (bad) or just the error check (probably fine).

misc_utilities.pyx

Slightly more complicated: misc_utilities.pyx similarly re-acquires the GIL within a prange loop in order to check for errors and update a progress bar, but it also includes a conditional break in the prange. I had to remove both, with the following diff, to avoid the #pragma omp critical(__pyx_parallel_lastprivates0) lines in the generated cpp:

diff --git a/yt/utilities/lib/misc_utilities.pyx b/yt/utilities/lib/misc_utilities.pyx
index 2a5e61efa..cee791500 100644
--- a/yt/utilities/lib/misc_utilities.pyx
+++ b/yt/utilities/lib/misc_utilities.pyx
@@ -979,8 +979,6 @@ def gravitational_binding_energy(
     cdef np.float64_t this_potential
 
     n_q = mass.size
-    pbar = get_pbar("Calculating potential for %d cells with %d thread(s)" % (n_q,num_threads),
-        n_q)
 
     # using reversed iterator in order to make use of guided scheduling
     # (inner loop is getting more and more expensive)
@@ -1003,14 +1001,7 @@ def gravitational_binding_energy(
                    (y_i - y_o) * (y_i - y_o) +
                    (z_i - z_o) * (z_i - z_o))
         total_potential += this_potential
-        if truncate and this_potential / kinetic > 1.:
-            break
-        with gil:
-            PyErr_CheckSignals()
-            # this call is not thread safe, but it gives a reasonable approximation
-            pbar.update()
-
-    pbar.finish()
+
     return total_potential

Removing the progress bar is not a big deal IMO. But it doesn't seem correct to simply remove the early break -- is this potentially a cython bug? Or does it make sense for a break to force single-threaded execution?
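
One possible direction, shown only as an untested sketch (the function and names below are hypothetical, not a proposed yt patch): replace the break with a shared stop flag that later iterations check and skip on. Whether this actually avoids the generated critical section, and whether the changed truncation semantics are acceptable, would both need to be verified.

import numpy as np
cimport cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def sum_until(double[:] values, double threshold):
    cdef Py_ssize_t j
    cdef double total = 0.0
    # one-element flag shared by all threads; the race on it is benign
    # because it only ever changes from 0 to 1
    cdef int[:] stop = np.zeros(1, dtype=np.intc)
    for j in prange(values.shape[0], nogil=True, schedule="guided"):
        if stop[0]:
            continue
        total += values[j]
        if values[j] > threshold:
            stop[0] = 1
    return total

Note that, unlike break, iterations already claimed by other threads will still run, so the cutoff point can differ slightly from a sequential early exit.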

thoughts

So it seems that either of the following in a prange results in the #pragma omp critical(__pyx_parallel_lastprivates0) directives being added to the generated cpp:

  • re-acquiring the gil within a prange loop (to check error state)
  • adding a break

And because those __pyx_parallel_lastprivates0 lines vary in their final integer depending on the number of threads (and, as I've also seen, between builds with the same number of threads), you end up with diffs in the cpp and in the resulting binaries.

What I don't know is whether these are yt bugs, potential cython bugs, or even bugs at all. But that's where it's coming from...

(note I did push up the above diffs to a branch here, but have not submitted a PR because I think some discussion is warranted first)

@neutrinoceros
Member

@chrishavlin wow, thanks for this in-depth inquiry!
I think it's fine to remove these calls to PyErr_CheckSignals (they should be mostly useful for debugging anyway), and I agree that dropping a progress bar isn't a big deal. I suspect the easiest way to learn whether a break inside prange creating a critical section is a bug in Cython or not would be to ask upstream.

@bmwiedemann
Contributor Author

Thanks @chrishavlin for the great analysis. I tested your patch and found that yt-4.4.0/yt/utilities/lib/image_samplers.cpp still has similar variations (the variation might be low-entropy, so it could need a few tries to trigger):

--- /var/tmp/build-root.12/.mount/home/abuild/rpmbuild/BUILD/yt-4.4.0/yt/utilities/lib/image_samplers.cpp       2024-12-06 14:55:24.336666665 +0000
+++ /var/tmp/build-root.12b/.mount/home/abuild/rpmbuild/BUILD/yt-4.4.0/yt/utilities/lib/image_samplers.cpp      2041-01-08 04:12:36.353333332 +0000
@@ -25213,7 +25213,7 @@
                                 goto __pyx_L26;
                                 __pyx_L26:;
                                 #ifdef _OPENMP
-                                #pragma omp critical(__pyx_parallel_lastprivates0)
+                                #pragma omp critical(__pyx_parallel_lastprivates1)
                                 #endif /* _OPENMP */
                                 {   
                                     __pyx_parallel_temp0 = __pyx_v_i;
@@ -26264,7 +26264,7 @@
                                 goto __pyx_L27;
                                 __pyx_L27:;
                                 #ifdef _OPENMP
-                                #pragma omp critical(__pyx_parallel_lastprivates1)
+                                #pragma omp critical(__pyx_parallel_lastprivates2)
                                 #endif /* _OPENMP */
                                 {
                                     __pyx_parallel_temp0 = __pyx_v_i;

@chrishavlin
Contributor

@bmwiedemann image_samplers.pyx includes a similar check for Python error signals (re-acquiring the GIL) within a prange. I'll try removing that too and will let you know when I update my branch.

@neutrinoceros yeah, asking upstream is probably the thing to do, but I'll see if I can work out a simpler reproducible example first.

@chrishavlin
Contributor

chrishavlin commented Dec 6, 2024

OK, removing the critical sections from the generated image_samplers.cpp is a bit more complicated -- the Python error check needs to be removed as expected, but a small refactor is also needed because of these lines that occur within a prange:

for i in range(Nch):
    self.image[vi, vj, i] = idata.rgba[i]

That access to self.image gives rise to one of the critical directives: the fix will be to refactor this to use a local image buffer for each omp thread (e.g., similar to what's done in pixelize_sph_kernel_projection in pixelization_routines.pyx with the local_buff variable here).
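
For reference, here is a rough, self-contained sketch of that local-buffer pattern (simplified and with hypothetical names; the real version is the local_buff logic in pixelize_sph_kernel_projection): each OpenMP thread allocates and fills its own scratch buffer, and the per-thread results are merged into the shared array once at the end.

cimport cython
from cython.parallel import parallel, prange
from libc.stdlib cimport malloc, free

@cython.boundscheck(False)
@cython.wraparound(False)
def accumulate(double[:] out, double[:] values):
    cdef Py_ssize_t i, j, n = out.shape[0]
    cdef double* local_buff
    with nogil, parallel():
        # one scratch buffer per OpenMP thread
        local_buff = <double*> malloc(sizeof(double) * n)
        for i in range(n):
            local_buff[i] = 0.0
        for j in prange(values.shape[0], schedule="dynamic"):
            local_buff[j % n] += values[j]
        # merge this thread's buffer into the shared output;
        # acquiring the GIL serializes the merge across threads
        with gil:
            for i in range(n):
                out[i] += local_buff[i]
        free(local_buff)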

@matthewturk
Member

A different image buffer for each thread would be a pretty big memory increase -- we're looking at Npix by Nch, which gets pretty big for images we'd want to keep small. Can't we instead cache a reference to it (so that it's not doing self.) and then use standard OMP directives to make sure there's no overlap?

@chrishavlin
Contributor

chrishavlin commented Dec 6, 2024

Can't we instead cache a reference to it (so that it's not doing self.) and then use standard OMP directives to make sure there's no overlap?

Good point on the memory increase -- this does sound like a better approach. And I don't think this particular loop should actually have any overlap as it is, but I could use some ideas on which "standard OMP directives" to use here :) I know you can use omp functions directly...
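
One possible reading of that suggestion, as an untested sketch (hypothetical names, not the actual image_samplers code): hoist the self.image access into a typed memoryview before the parallel loop so the prange body only touches C-level buffers; since, as noted above, each (vi, vj) pixel should be written by exactly one iteration, plain writes shouldn't need any omp critical section.

cimport cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def fill_pixels(double[:, :, :] image, double[:, :] rgba, long[:] vi, long[:] vj):
    # in the sampler this would be a one-time
    #     cdef np.float64_t[:, :, :] image = self.image
    # before the parallel region, so the loop never goes through `self`
    cdef Py_ssize_t k, i, nch = image.shape[2]
    for k in prange(vi.shape[0], nogil=True, schedule="static"):
        for i in range(nch):
            image[vi[k], vj[k], i] = rgba[k, i]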
