Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: make PyObject cleanup thread-safe in free-threaded Python and reduce contention #176

Conversation

jmao-denver
Copy link
Contributor

This is discovered when testing parallelization of https://github.com/deephaven/deephaven-core in a FT Python environment.

  1. we might consider making the tracking/cleanup of PyObject on a per-thread basis to further reduce contention among threads and improve the performance of parallel processing. (will file an enhancement request if agreed)
  2. this fix warrants a patch release of JPY.

@jmao-denver jmao-denver self-assigned this Nov 26, 2024
Copy link
Contributor

@cpwright cpwright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that we could be corrupting buffer from multiple threads without the synchronization; and I expect the GIL was "saving" us before and is not now - but want to make sure that understanding is accurate.

Also, I want to understand the intention behind the cleanup_on_init change. Perhaps we should add some Javadoc to those flags indicating why they should be set one way or anohter?

src/main/java/org/jpy/PyObject.java Outdated Show resolved Hide resolved
src/main/java/org/jpy/PyObjectReferences.java Outdated Show resolved Hide resolved
@devinrsmith
Copy link
Member

Essentially, there needs to be something / somebody responsible for invoking Py_DECREF when a org.jpy.PyObject is no longer referenced (GCd), otherwise there is python memory leakage.

Currently, there are 3 strategies. The original intentions were:

  1. CLEANUP_ON_INIT: this strategy assumes that when some PyObject is constructed via JNI it will hold the GIL, and this is an "ok" generic location to do other PyObject cleanup
  2. CLEANUP_ON_THREAD: this strategy starts a thread which is responsible for doing PyObject cleanup
  3. manual: external callers can invoke org.jpy.PyObject#cleanup.

There is this comment in the method threadSafeCleanup (originally titled cleanupOnlyUseFromGIL):

            // We really really really want to make sure we *already* have the GIL lock at this point in
            // time. Otherwise, we block here until the GIL is available for us, and stall all cleanup
            // related to our PyObjects.

Digging deeper into the existing code, while the code is still "safe", we are breaking the intention of the comment above due to change in #48 (essentially, this PR drops the GIL when c code calls into Java and re-acquires when the java method returns). That is, our assumption that when PyObject is called from JNI it will have the GIL is no longer true, and it's possible we are blocking during the creation of PyObject which is bad.

We also have some comments about "This should only be invoked through the proxy" which is no longer true given the ensureGil; the original intention was that the proxy flowed through python to ensure it would be invoked with the GIL, but again, with the dropping of the GIL PR, this is no longer true.

As such, I think we may want to re-work how cleanup is done; I think turning CLEANUP_ON_INIT to false or deleting the logic entirely is warranted. That said, we do have some deephaven core testing code that does not work with CLEANUP_ON_THREAD (deephaven/deephaven-core#651), and I suspect we may need to revisit those cases if we change how cleanup works.

@cpwright
Copy link
Contributor

cpwright commented Dec 2, 2024

If we remove the CLEANUP_ON_INIT option, does it even make sense to make the thread cleanup optional? The system will leak resources without some cleanup. And the "manual" cleanup does seem to be a dangerous thing to expect people to do.

Devin pointed out that some tests don't work with CLEANUP_ON_THREAD, why not?

@jmao-denver
Copy link
Contributor Author

If we remove the CLEANUP_ON_INIT option, does it even make sense to make the thread cleanup optional? The system will leak resources without some cleanup. And the "manual" cleanup does seem to be a dangerous thing to expect people to do.

Devin pointed out that some tests don't work with CLEANUP_ON_THREAD, why not?

Here are some of the comments found in the sources of Deephaven and JPY.

        return project.tasks.create (taskName.toString(), Test) {
            Test t ->
                t.group = 'python'
                t.description = "Run java tests with jpy setup using python environment $venvFullName at file:$dir"
                t.onlyIf { TestTools.shouldRunTests(project) }

                // Cleaning up on a dedicated thread has some issues when there is frequent starting
                // and stopping of the python virtual environment. We'll rely on cleaning up inline
                // when necessary.
                t.systemProperty"PyObject.cleanup_on_thread", "false"
//                t.environment("JAVA_HOME", Jvm.current().javaHome)
                t.dependsOn dependOn
                taskTestPython(project).dependsOn t

                applyProperties(t)
                if (classpath) {
                    t.classpath = classpath.runtimeClasspath
                    t.testClassesDirs = classpath.output.classesDirs
                }
                return
        }
def test_suite():
    suite = unittest.TestSuite()

    def test_python_with_java_runtime(self):
        assert 0 == test_python_java_rt()

    def test_python_with_java_classes(self):
        assert 0 == test_python_java_classes()

    def test_java(self):
        assert test_maven()

    suite.addTest(test_python_with_java_runtime)
    suite.addTest(test_python_with_java_classes)
    # comment out because the asynchronous nature of the PyObject GC in Java makes stopPython/startPython flakey.
    # suite.addTest(test_java)

    return suite

To summarize, the flag is needed to support testing. It is not documented and is true by default so there isn't much danger of user meddling with it to cause memory leaks.

cpwright
cpwright previously approved these changes Dec 5, 2024
cpwright
cpwright previously approved these changes Dec 5, 2024
@jmao-denver jmao-denver merged commit 7dcef19 into jpy-consortium:master Dec 5, 2024
70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants