
Creating an ungodly amount of sub interpreters in a short amount of time causes memory debug assertions. #123134

Open
bruxisma opened this issue Aug 19, 2024 · 87 comments
Labels: topic-C-API, topic-subinterpreters, type-bug (An unexpected behavior, bug, or error)


@bruxisma

bruxisma commented Aug 19, 2024

Bug report

Bug description:

Hello. While working on a small joke program, I found a possible memory corruption issue (it could also be a threading issue?) when using the Python C API in a debug-only build to quickly create, execute Python code in, and then destroy 463 sub-interpreters. Before I post the code sample and the debug output: I'm using a somewhat unusual build environment for a Windows developer.

  • clang++ 18.1.7 for x86_64-pc-windows-msvc
  • Visual Studio Build Tools 2022 17.11.0
  • CMake 3.30
  • Python 3.12.5
  • ninja 1.12.1

When running the code sample attached at the bottom of this post, I am unable to get exactly the same output each time, though the traceback does fire in the same location (due to its size I've not attached it; it's about 10 MB of text per thread). Additionally, I sometimes have to run the executable several times for the error to occur. Lastly, release builds do not exhibit any thread crashes or issues, as the debug assertions never fire or execute.

The error output also halts in some cases, either because of stack failure or some other issue I was unable to determine, which seemed to be outside the scope of the issue presented here. I have the entire error output from one execution where I was able to save it.

Debug memory block at address p=00000267272D6030: API 'p'
    16718206241729413120 bytes originally requested
    The 7 pad bytes at p-7 are not all FORBIDDENBYTE (0xfd):
        at p-7: 0xa0 *** OUCH
        at p-6: 0x00 *** OUCH
        at p-5: 0x00Hello, world! From Thread 31
Hello, world! From Thread 87
 *** OUCH
        at p-4: 0xfd
Hello, world! From Thread 43
Hello, world! From Thread 279
Generating thread state 314
        at p-3: 0xfd
        at p-2: 0xfd
Hello, world! From Thread 168
Generating thread state 315
        at p-1: 0xfd
    Because memory is corrupted at the start, the count of bytes requested
       may be bogus, and checking the trailing pad bytes may segfault.
Generating thread state 316
Generating thread state 317
    The 8 pad bytes at tail=E8030267272D6030 are

The output cut off after this, as the entire program crashed, taking my terminal with it 😅

You'll find the MRE code below. I've also added a minimal version of the CMakeLists.txt file I used so anyone can recreate the build with the code below (any warnings or additional settings I have do not affect whether the error occurs). Based on what debugging I was able to do with WinDbg, the code appears to break inside _PyObject_DebugDumpAddress.

Important

std::jthread calls .join() on destruction, so all threads auto-join once the std::vector goes out of scope.

Additionally, this code exhibits the same behavior regardless of whether state is thread_local or declared within the lambda passed to the thread.

main.cxx

#include <vector>
#include <thread>
#include <cstdlib>
#include <print>

#include <Python.h>

namespace {

static thread_local inline PyThreadState* state = nullptr;
static inline constexpr auto MAX_STATES = 463; 
static inline constexpr auto config = PyInterpreterConfig {
  .use_main_obmalloc = 0,
  .allow_fork = 0,
  .allow_exec = 0,
  .allow_threads = 0,
  .allow_daemon_threads = 0,
  .check_multi_interp_extensions = 1,
  .gil = PyInterpreterConfig_OWN_GIL,
};

} /* nameless namespace */

void execute () {
  std::vector<std::jthread> tasks { };
  tasks.reserve(MAX_STATES);
  for (auto count = 0zu; count < tasks.capacity(); count++) {
    std::println("Generating thread state {}", count);
    tasks.emplace_back([count] {
      if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status)) {
        std::println("Failed to initialize thread state {}", count);
        return;
      }
      auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
      auto globals = PyDict_New();
      auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
      auto result = PyEval_EvalCode(code, globals, globals);
      Py_DecRef(result);
      Py_DecRef(code);
      Py_DecRef(globals);
      Py_EndInterpreter(state);
      state = nullptr;
    });
  }
}

int main() {
  PyConfig config {};
  PyConfig_InitIsolatedConfig(&config);
  if (auto status = Py_InitializeFromConfig(&config); PyStatus_IsError(status)) {
    std::println("Failed to initialize with isolated config: {}", status.err_msg);
    return EXIT_FAILURE;
  }
  PyConfig_Clear(&config);
  execute();
  Py_Finalize();
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.30)
project(463-interpreters LANGUAGES C CXX)

find_package(Python 3.12 REQUIRED COMPONENTS Development.Embed)

set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded$<$<CONFIG:Debug>:Debug>")

add_executable(${PROJECT_NAME})
target_sources(${PROJECT_NAME} PRIVATE main.cxx)
target_compile_features(${PROJECT_NAME} PRIVATE cxx_std_23)
target_precompile_headers(${PROJECT_NAME} PRIVATE <Python.h>)
target_link_libraries(${PROJECT_NAME} PRIVATE Python::Python)

Command to build + run

$ cmake -Bbuild -S. -G "Ninja"
$ cmake --build build && .\build\463-interpreters.exe

CPython versions tested on:

3.12

Operating systems tested on:

Windows

@bruxisma added the type-bug label on Aug 19, 2024
@picnixz
Contributor

picnixz commented Aug 19, 2024

Can you reproduce this behaviour on Linux? Or macOS? Or what happens if you sleep a bit before spawning the sub-interpreters? I'm not an expert in threads but I'd say that interpreters are spawning too fast for the OS to follow up =/ Can you check whether the same occurs with HEAD instead of 3.12.5? (maybe there are some bugfixes that are not yet available)

@ZeroIntensity
Member

At a glance, this could possibly be related to a memory allocation failure or something similar.

  auto globals = PyDict_New();
  auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
  auto result = PyEval_EvalCode(code, globals, globals);

These lines need a NULL check if an error occurred, otherwise there will be a NULL pointer dereference in the call to Py_DecRef.
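
For illustration, a minimal sketch of what those checks could look like, using the names from the MRE above (error handling kept deliberately simple; this is a suggestion, not code from the issue):

  auto globals = PyDict_New();
  if (globals == nullptr) {
    Py_EndInterpreter(state); // allocation failed; tear the interpreter down and bail
    return;
  }
  auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
  if (code == nullptr) {
    PyErr_Print(); // report and clear the error indicator
    Py_DecRef(globals);
    Py_EndInterpreter(state);
    return;
  }
  auto result = PyEval_EvalCode(code, globals, globals);
  if (result == nullptr) {
    PyErr_Print(); // e.g. an exception raised by the executed code
  }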

@bruxisma
Author

At a glance, this could possibly be related to a memory allocation failure or something similar.

  auto globals = PyDict_New();
  auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
  auto result = PyEval_EvalCode(code, globals, globals);

These lines need a NULL check if an error occurred, otherwise there will be a NULL pointer dereference in the call to Py_DecRef.

Py_DecRef is the function version of Py_XDECREF, though I can see the confusion between Py_DecRef and Py_DECREF. If there were a nullptr, we'd have a different set of errors, and the debug memory assertions wouldn't be complaining that all memory was not written with the FORBIDDENBYTE value.
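
For reference, the distinction in miniature (standard documented semantics, not code from the MRE):

PyObject* o = nullptr;
Py_DecRef(o);    // fine: Py_DecRef, like Py_XDECREF, is a no-op on NULL
// Py_DECREF(o); // undefined behavior: Py_DECREF assumes a non-NULL argument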

@bruxisma
Author

Can you reproduce this behaviour on Linux? Or macOS?

I will work on getting the Linux/macOS behaviors checked when I can. My WSL2 environment is woefully out of date 😅. macOS might be more difficult, as I'm unaware of how the _d version of the Python.framework might be linked against.

Or what happens if you sleep a bit before spawning the sub-interpreters?

Sleeping might fix it, but it might also hide the issue. The goal here was just to run a bunch of interpreters, not to sleep for the sake of working around possible synchronization issues.

I'm not an expert in threads but I'd say that interpreters are spawning too fast for the OS to follow up =/

That is not how threads or operating systems work. There is a synchronization issue or an ABA problem with the debug memory assertions in the C API's internals.

Can you check whether the same occurs with HEAD instead of 3.12.5? (maybe there are some bugfixes that are not yet available)

I will also work on giving this a go.

@ZeroIntensity
Member

Py_DecRef is the function version of Py_XDECREF

Oh! TIL. I still would suggest adding a NULL check -- an error indicator could be set, which would break subsequent calls to PyErr_Occurred.

@bruxisma
Author

The error occurs outside the Python C API as one might usually interact with it; it happens only on subinterpreter initialization (i.e., creating a PyThreadState) or on subinterpreter destruction (Py_EndInterpreter), typically the latter. PyErr_Occurred is not being called anywhere in the MRE, nor does it appear in the actual traceback for the failure.

@ZeroIntensity
Member

Right, but it could cause memory corruption somewhere, and cause the error somewhere else down the line. Regardless, I wasn't able to reproduce this on Linux, so this might be Windows-specific.

@bruxisma
Author

@picnixz I can confirm the behavior still exists on Windows on HEAD. Same errors can occur

Right, but it could cause memory corruption somewhere, and cause the error somewhere else down the line. Regardless, I wasn't able to reproduce this on Linux, so this might be Windows-specific.

PyErr_Occurred only performs a read on the current thread state object. If an error is present when it shouldn't be, that still points to a memory corruption error occurring between the destruction and initialization of a PyThreadState.

@bruxisma
Author

Pleased to back up @ZeroIntensity and say that this appears to be a Windows-specific issue. I was unable to replicate this memory assertion error on Linux.

However, when executing this code under TSan, I got the following data race error, which seems to indicate the same kind of memory corruption COULD be taking place.

WARNING: ThreadSanitizer: data race (pid=11259)
  Write of size 8 at 0x7f24e35e16c0 by thread T10:
    #0 qsort_r <null> (463-interpreters+0x9bc2e) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #1 qsort <null> (463-interpreters+0x9bec7) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #2 setup_confname_table /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13543:5 (libpython3.12.so.1.0+0x3b6946) (BuildId: e2e0fbc52bf8cb7daf99595e16454e44217040f4)
    #3 setup_confname_tables /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13572:9 (libpython3.12.so.1.0+0x3b6946)
    #4 posixmodule_exec /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:16905:9 (libpython3.12.so.1.0+0x3b6946)
    ...
  Previous write of size 8 at 0x7f24e35e16c0 by thread T18:
    #0 qsort_r <null> (463-interpreters+0x9bc2e) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #1 qsort <null> (463-interpreters+0x9bec7) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #2 setup_confname_table /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13543:5 (libpython3.12.so.1.0+0x3b6946) (BuildId: e2e0fbc52bf8cb7daf99595e16454e44217040f4)
    #3 setup_confname_tables /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13572:9 (libpython3.12.so.1.0+0x3b6946)
    #4 posixmodule_exec /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:16905:9 (libpython3.12.so.1.0+0x3b6946)

  Location is global '??' at 0x7f24e2e85000 (libpython3.12.so.1.0+0x75c880)

This only happened 3 times, and around thread-generation step 42 on average.

@TBBle

TBBle commented Aug 20, 2024

However, when executing this code under TSan, I got the following data race error that seems to indicate the same kind of memory corruption COULD be taking place.

A quick look suggests this TSan failure is a local data race in posixmodule.c: there appears to be no protection against multiple threads importing posixmodule simultaneously, and hence multiple threads simultaneously trying to qsort one of the three static tables in place.

This module is used on Windows, but I'm not sure if any of the macros enabling those qsort calls (HAVE_FPATHCONF, HAVE_PATHCONF, HAVE_CONFSTR, or HAVE_SYSCONF) are enabled in the Windows build. My guess would be no...

I don't know if overlapping qsorts on the same array will actually corrupt memory, but I doubt there's any guarantee it won't. (qsort_r doesn't help, that makes the comparison function thread-safe, not the array being sorted, according to the manpage.)

So probably a separate issue, and a more-focused/reliable MRE could be made for it.
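
For anyone who wants to poke at that sub-problem in isolation, here is a minimal sketch of the suspected pattern (illustrative only, not CPython code): two threads qsort-ing the same static table in place, as concurrent posixmodule imports would.

#include <cstdlib>
#include <thread>

static int table[1024]; // stands in for one of posixmodule's static confname tables

static int cmp(const void* a, const void* b) {
    return *static_cast<const int*>(a) - *static_cast<const int*>(b);
}

int main() {
    // Two unsynchronized in-place sorts of the same array: qsort makes no
    // thread-safety guarantee about this, so TSan flags the concurrent writes.
    std::thread t1([] { std::qsort(table, 1024, sizeof(int), cmp); });
    std::thread t2([] { std::qsort(table, 1024, sizeof(int), cmp); });
    t1.join();
    t2.join();
}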

@ZeroIntensity
Member

Come to think of it, I'm not even sure subinterpreters are thread-safe in the first place; you might need to acquire the GIL. Though, I'm not all too familiar with them.

cc @ericsnowcurrently for insight

@TBBle

TBBle commented Aug 20, 2024

I think the point of PEP-684 (in Python 3.12) is that you can have a per-subinterpreter GIL, as is being done in this test-case. posixmodule.c claims compatibility with the per-interpreter GIL, so in that case I think it's a straight-up bug.

The check_multi_interp_extensions flag has also been set, so any modules known to not work with per-subinterpreter GIL should refuse to import before running any init code anyway.


Edit: Oops, perhaps I misunderstood your comment, and it wasn't directed at the posixmodule issue.

Indeed, the documentation for Py_NewInterpreterFromConfig says that the GIL must be held for that call, and the own-GIL case does interesting things with the GIL-holding state of both the new and the current interpreter.

Like all other Python/C API functions, the global interpreter lock must be held before calling this function and is still held when it returns. Likewise a current thread state must be set on entry. On success, the returned thread state will be set as current. If the sub-interpreter is created with its own GIL then the GIL of the calling interpreter will be released. When the function returns, the new interpreter’s GIL will be held by the current thread and the previously interpreter’s GIL will remain released here.

So I'm not clear whether the given example is doing the right thing here, but I suspect the first thing the tasks should do is take the main interpreter's GIL if they're not already holding it (and the call to Py_Finalize should do that too, in that case). Poking around: because it's a fresh thread, maybe it doesn't need the main GIL to call Py_NewInterpreterFromConfig, and so this code is GIL-safe as is. I'm not clear whether the new per-subinterpreter-GIL support changed things here, and if so, how.

@ZeroIntensity
Member

I'm guessing that's the problem. Judging by the comment, it looks like new_interpreter doesn't support being called without the GIL:

cpython/Python/pylifecycle.c

Lines 2261 to 2263 in bffed80

// XXX Might new_interpreter() have been called without the GIL held?
PyThreadState *save_tstate = _PyThreadState_GET();
PyThreadState *tstate = NULL;

I could see this being a bug, though. I'll wait for Eric to clarify.

@bruxisma
Author

Well this definitely brings me some concern then, as the GIL can be made optional in 3.13 😅

@ZeroIntensity
Member

Does this occur on the free threaded build?

@bruxisma
Author

Sorry for the delay. Had a busy work week. I'll gather this info this weekend, though the code seems to indicate "yes" at a glance.

@bruxisma
Author

@ZeroIntensity I can confirm the issue still happens when the GIL is disabled in 3.13 and 3.14 HEAD

@ZeroIntensity
Member

For now, I think I'm going to write this off as user error, as Py_NewInterpreterFromConfig (and friends) doesn't seem to support being called without the GIL. Though, it's a bit confusing as to why this only fails on Windows. Does it still occur if you call PyGILState_Ensure (and subsequently PyGILState_Release) before initializing the subinterpreter?

@bruxisma
Author

How can those work if the GIL is disabled (and thus can't be engaged) on 3.13 and later, and each new interpreter has its own GIL created?

You cannot call PyGILState_Release on these threads, because when you call Py_NewInterpreterFromConfig it sets the new interpreter as the current one for said thread. This would require swapping to the old state to unlock it and then swapping back to the new state.

@ZeroIntensity
Member

ZeroIntensity commented Aug 25, 2024

How can those work if the GIL is disabled (and thus can't be engaged) on 3.13 and later, and also each new interpreter is having its own GIL created?

AFAIK, PyGILState_* calls don't actually acquire the GIL on free-threaded builds, they just initialize the thread state structure to allow calling the API.

At a glance, the drop-in code would be:

void execute () {
  std::vector<std::jthread> tasks { };
  tasks.reserve(MAX_STATES);
  for (auto count = 0zu; count < tasks.capacity(); count++) {
    std::println("Generating thread state {}", count);
    tasks.emplace_back([count] {
      PyGILState_STATE gil = PyGILState_Ensure(); // Make sure we can call the API
      if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status)) {
        std::println("Failed to initialize thread state {}", count);
        PyGILState_Release(gil);
        return;
      }
      auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
      auto globals = PyDict_New();
      auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
      auto result = PyEval_EvalCode(code, globals, globals);
      Py_DecRef(result);
      Py_DecRef(code);
      Py_DecRef(globals);
      Py_EndInterpreter(state);
      state = nullptr;
      PyGILState_Release(gil); // Release the GIL (only on non-free-threaded builds)
    });
  }
}

There needs to be a Py_BEGIN_ALLOW_THREADS in there, though. Normally, that would happen around the calls that join the threads, but I'm assuming that happens in a destructor here -- I'm not too familiar with C++ threading, so you'll have to figure out where to put it.
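
For reference, one plausible placement, assuming the std::jthread destructors join inside execute() (this mirrors what a later repro in this thread does in main()):

  Py_BEGIN_ALLOW_THREADS;
  execute(); // jthread destructors join all workers before this returns
  Py_END_ALLOW_THREADS;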

@bruxisma
Author

It shouldn't require any of that. The interpreter variable is local to the thread. The Py_BEGIN_ALLOW_THREADS macro is required when manipulating the GIL. There is no GIL manipulation needed, because each of these is an isolated thread running on its own; and even when the GIL is disabled, there shouldn't be any manipulation of the GIL occurring anyway, because it's completely turned off.

The need for PyGILState_STATE goes away when using isolated subinterpreters. The state is only needed if you're running these subinterpreters so they can talk to each other. But these interpreters run entirely on their own, with no interaction between each other. Were I trying to register additional threads under a specific interpreter instance, your code would be warranted. However, the documentation for these functions clearly says:

Note that the PyGILState_* functions assume there is only one global interpreter (created automatically by Py_Initialize()). Python supports the creation of additional interpreters (using Py_NewInterpreter()), but mixing multiple interpreters and the PyGILState_* API is *unsupported*.

(emphasis mine)

So, either the documentation is wrong, or Python's C API is wrong. Either way, there is a bug here.

@ZeroIntensity
Member

It shouldn't require any of that. The interpreter variable is local to the thread.

Yes, but how do you initialize that interpreter variable? I don't think new_interpreter does it (unless I'm missing something -- please clarify!)

I think you're misinterpreting what I'm saying: you're right, you don't need to mess with the GIL when using the subinterpreter, but it seems (based on the code from new_interpreter) that you need to hold it in order to create a subinterpreter. But that could be wrong, I'm just going off a comment. I'm still not sure why it only happens after a certain number of threads -- it should happen on the first thread, and it certainly shouldn't be Windows-specific.

@bruxisma
Author

Yes, it is initialized; that's what auto status = Py_NewInterpreterFromConfig(&state, &config); does. If it didn't work, status would be an error.

I'm very certain this has to do with memory corruption, and with memory blocks within the debug assertions not being locked or cleared out properly, which leads to this invalid state. Why it only happens on Windows most likely comes down to the MSVC runtime in some capacity, and to some behavioral difference between Windows memory pages and the virtual allocator versus POSIX platforms.

That's the best I've been able to intuit, as the crash is indeterminate in nature, though as I stated in an earlier comment, ThreadSanitizer found some race conditions brought on by qsort on Linux.

@ZeroIntensity
Member

If it didn't work status would be an error.

It would probably segfault instead, FWIW -- C API calls don't have an extra check for the interpreter or thread state. I don't see where new_interpreter initializes the thread state information, so I would think the failure would be in _PyThreadState_GET. However, that doesn't seem to be the case, somehow.

Speculatively, the race could be caused here:

runtime->gilstate.check_enabled = 0;

Trying to set this across multiple threads could be problematic, since there's no lock.

@bruxisma
Author

If any of those failures were happening there, this same behavior would be observable in a release build. This situation only happens when the _d python library is used.

@ZeroIntensity
Member

Ok, I think the fix here would be to add a HEAD_LOCK to _PyMem_SetDefaultAllocator.

@TBBle

TBBle commented Oct 9, 2024

Ah, sorry, I was overly-succinct there.

_PyMem_SetDefaultAllocator itself is not buggy. The problem occurs between the calls to _PyMem_SetDefaultAllocator and the matching PyMem_SetAllocator. During that stretch of time, the allocator pointers in the runtime struct (shared by all interpreters) are overridden to point to the default allocator used by release builds of Python.

This leads to a failure in the situation where the configured allocator is not the default allocator used by release builds of Python, i.e. any one of the following: debug builds, PYTHONMALLOC=..._debug env-var, or embedded Python which has called PyMem_SetAllocator to redirect allocation to the host's allocator or PyMem_SetupDebugHooks to enable the debug allocators.

In this situation, if another interpreter (with its own GIL, or in a free-threading build) allocates or deallocates memory with the same domain during this period, the wrong method will be called, leading to a mismatch for that block of memory, and in debug builds, producing one of the two failure reports (either RtlValidateHeap or _PyObject_DebugDumpAddress) as described in this ticket.


Having looked this up, if I add PyMem_SetupDebugHooks(); to main in the last repro I posted, the problem reproduces with 3.13rc2 release builds. However, instead of the RtlValidateHeap failure we get Critical error detected c0000374, because this is using the release-build MSVC allocator, and it's blindly trusting us to pass valid objects to free.

So I suspect this repro will work on Linux and macOS too, since at least the _PyObject_DebugDumpAddress side doesn't rely on the local allocator behaviour.

3.12-onward repro, including release builds
#include <cstdlib>
#include <print>
#include <thread>
#include <vector>

#include <Python.h>

// https://developercommunity.visualstudio.com/t/Please-implement-P0330R8-Literal-Suffix/10410860#T-N10539586
static constexpr size_t operator""_uz(unsigned long long n)
{
    return size_t(n);
}

namespace
{

static thread_local inline PyThreadState* state = nullptr;
static inline constexpr auto MAX_STATES = 463;
static inline constexpr auto config = PyInterpreterConfig {
    .use_main_obmalloc = 0,
    .allow_fork = 0,
    .allow_exec = 0,
    .allow_threads = 0,
    .allow_daemon_threads = 0,
    .check_multi_interp_extensions = 1,
    .gil = PyInterpreterConfig_OWN_GIL,
};

} /* nameless namespace */

void execute(PyInterpreterState* interp)
{
    std::vector<std::jthread> tasks {};
    tasks.reserve(MAX_STATES);
    for (auto count = 0_uz; count < tasks.capacity(); count++)
    {
        std::println("Generating thread state {}", count);
        tasks.emplace_back(
            [count, interp]
            {
                auto tstate = PyThreadState_New(interp);
                auto no_thread_state = PyThreadState_Swap(tstate);
                assert(no_thread_state == nullptr);
                if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status))
                {
                    std::println("Failed to initialize thread state {}", count);
                    return;
                }
                // Swap back to the old thread state, clear it, and then switch back to our new state.
                auto new_thread_state = PyThreadState_Swap(tstate);
                assert(new_thread_state == state);
                PyThreadState_Clear(tstate);
                auto temp_thread_state = PyThreadState_Swap(state);
                assert(temp_thread_state == tstate);
                PyThreadState_Delete(tstate);

#if PY_VERSION_HEX >= 0x030d0000
                Py_SetProgramName(L"");
#endif
                auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
                auto globals = PyDict_New();
                auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
                auto result = PyEval_EvalCode(code, globals, globals);
                Py_DecRef(result);
                Py_DecRef(code);
                Py_DecRef(globals);
                Py_EndInterpreter(state);
                state = nullptr;
            });
    }
}

int main()
{
    PyMem_SetupDebugHooks();
    PyConfig config {};
    PyConfig_InitIsolatedConfig(&config);
    if (auto status = Py_InitializeFromConfig(&config); PyStatus_IsError(status))
    {
        std::println("Failed to initialize with isolated config: {}", status.err_msg);
        return EXIT_FAILURE;
    }
    PyConfig_Clear(&config);
    auto main_interpreter_state = PyInterpreterState_Get();
    Py_BEGIN_ALLOW_THREADS;
    execute(main_interpreter_state);
    Py_END_ALLOW_THREADS;
    Py_Finalize();
}

@TBBle

TBBle commented Oct 9, 2024

Linux-based repro (tested against Python 3.12.3 on Ubuntu 24.04 LTS under WSL2)

Required packages: gcc-14 g++-14 python3.12-dev python3-dev cmake

CMakeLists.txt:

cmake_minimum_required(VERSION 3.28.3)
project(463-interpreters LANGUAGES C CXX)

find_package(Python 3.12 EXACT REQUIRED COMPONENTS Development.Embed)

set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded$<$<CONFIG:Debug>:Debug>")

add_executable(${PROJECT_NAME})
target_sources(${PROJECT_NAME} PRIVATE main.cxx)
target_compile_features(${PROJECT_NAME} PRIVATE cxx_std_23)
target_precompile_headers(${PROJECT_NAME} PRIVATE <Python.h>)
target_link_libraries(${PROJECT_NAME} PRIVATE Python::Python)

main.cxx (Unchanged from previous post)

#include <cstdlib>
#include <print>
#include <thread>
#include <vector>

#include <Python.h>

// https://developercommunity.visualstudio.com/t/Please-implement-P0330R8-Literal-Suffix/10410860#T-N10539586
static constexpr size_t operator""_uz(unsigned long long n)
{
    return size_t(n);
}

namespace
{

static thread_local inline PyThreadState* state = nullptr;
static inline constexpr auto MAX_STATES = 463;
static inline constexpr auto config = PyInterpreterConfig {
    .use_main_obmalloc = 0,
    .allow_fork = 0,
    .allow_exec = 0,
    .allow_threads = 0,
    .allow_daemon_threads = 0,
    .check_multi_interp_extensions = 1,
    .gil = PyInterpreterConfig_OWN_GIL,
};

} /* nameless namespace */

void execute(PyInterpreterState* interp)
{
    std::vector<std::jthread> tasks {};
    tasks.reserve(MAX_STATES);
    for (auto count = 0_uz; count < tasks.capacity(); count++)
    {
        std::println("Generating thread state {}", count);
        tasks.emplace_back(
            [count, interp]
            {
                auto tstate = PyThreadState_New(interp);
                auto no_thread_state = PyThreadState_Swap(tstate);
                assert(no_thread_state == nullptr);
                if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status))
                {
                    std::println("Failed to initialize thread state {}", count);
                    return;
                }
                // Swap back to the old thread state, clear it, and then switch back to our new state.
                auto new_thread_state = PyThreadState_Swap(tstate);
                assert(new_thread_state == state);
                PyThreadState_Clear(tstate);
                auto temp_thread_state = PyThreadState_Swap(state);
                assert(temp_thread_state == tstate);
                PyThreadState_Delete(tstate);

#if PY_VERSION_HEX >= 0x030d0000
                Py_SetProgramName(L"");
#endif
                auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
                auto globals = PyDict_New();
                auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
                auto result = PyEval_EvalCode(code, globals, globals);
                Py_DecRef(result);
                Py_DecRef(code);
                Py_DecRef(globals);
                Py_EndInterpreter(state);
                state = nullptr;
            });
    }
}

int main()
{
    PyMem_SetupDebugHooks();
    PyConfig config {};
    PyConfig_InitIsolatedConfig(&config);
    if (auto status = Py_InitializeFromConfig(&config); PyStatus_IsError(status))
    {
        std::println("Failed to initialize with isolated config: {}", status.err_msg);
        return EXIT_FAILURE;
    }
    PyConfig_Clear(&config);
    auto main_interpreter_state = PyInterpreterState_Get();
    Py_BEGIN_ALLOW_THREADS;
    execute(main_interpreter_state);
    Py_END_ALLOW_THREADS;
    Py_Finalize();
}

Test command

CXX=g++-14 CC=gcc-14 cmake -Bbuild -S. -G "Ninja" && cmake --build build && build/463-interpreters

Failure output

...
Generating thread state 125
Generating thread state 126
munmap_chunk(): invalid pointer
Aborted

or

...
Generating thread state 149
Generating thread state 150
Debug memory block at address p=0x7f8e24078660: API '$'
    11529496521045180416 bytes originally requested
    The 7 pad bytes at p-7 are not all FORBIDDENBYTE (0xfd):
        at p-7: 0x02 *** OUCH
        at p-6: 0x00 *** OUCH
        at p-5: 0x00 *** OUCH
        at p-4: 0x00 *** OUCH
        at p-3: 0x00 *** OUCH
        at p-2: 0x00 *** OUCH
        at p-1: 0x00 *** OUCH
    Because memory is corrupted at the start, the count of bytes requested
       may be bogus, and checking the trailing pad bytes may segfault.
    The 8 pad bytes at tail=0xa0017f8e24078660 are Generating thread state 151
Segmentation fault

So we can remove the OS-windows tag from this, since it also happens on Linux in the same way; i.e., munmap_chunk(): invalid pointer is the same failure as RtlValidateHeap or Critical error detected c0000374, i.e., a bad pointer passed to free.

@encukou removed the OS-windows label on Oct 9, 2024
@ZeroIntensity
Member

Thanks! I'll see if I can come up with a way to reproduce this without embedding, because it's especially hard to debug core issues with embedded programs.

So, it looks like the problem here is that some very-high-level embedding functions (such as Py_CompileString and Py_SetProgramName) aren't isolated enough for subinterpreters, or at least don't lock the right thing when the interpreter state is getting copied. In theory, that should be simple to fix, and small enough to backport. (I suggested just locking the whole runtime in those functions, but maybe it's easier to just add extra locks to new_interpreter if that's the only case where things can go wrong.)

@ZeroIntensity
Member

Ok, I was able to reproduce this, but unfortunately Py_SetProgramName is deprecated, and has been since 3.11. (Sorry, I should have picked up on this earlier.)

That's probably why it's not thread-safe, we don't want to fix deprecated APIs. Though, maybe it's worth adding some extra locks (such as to the allocator) for new_interpreter. I'm not totally sure yet.

@TBBle

TBBle commented Oct 9, 2024

So, it looks like the problem here is that some very-high-level embedding functions (such as Py_CompileString and Py_SetProgramName) aren't isolated enough for subinterpreters, or at least don't lock the right thing when the interpreter state is getting copied.

No. There's no interpreter state copying happening at the relevant part of the repro. Please reread my multiple descriptions of the precise bug, e.g. this summary, as it seems like you're getting stuck on the idea that this is some kind of data-race or cross-thread memory corruption, somehow related specifically to new interpreter creation. That is not the case here.

For example, in an earlier version of the 3.13 repro, all of the interpreters were already completely created and functional before the bug repro occurred, through the use of a semaphore.

The problematic code is Py_SetProgramName in this repro, or anything else that contains a block starting with _PyMem_SetDefaultAllocator(DOMAIN, &x); and ending with PyMem_SetAllocator(DOMAIN, &x). (It happens that before 3.13, one such function was called inside new_interpreter, after the GIL was reinstated, so it could only conflict if the interpreter had its own GIL.)
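
Schematically, the problematic pattern looks like this (a simplified sketch of the description above, not verbatim CPython code):

PyMemAllocatorEx old_alloc;
_PyMem_SetDefaultAllocator(PYMEM_DOMAIN_RAW, &old_alloc);
// ... window: the runtime-wide allocator pointers now name the release-build
// default, so a PyMem_RawMalloc/PyMem_RawFree from any other thread during
// this stretch goes through the wrong allocator whenever a debug or custom
// allocator is configured ...
PyMem_SetAllocator(PYMEM_DOMAIN_RAW, &old_alloc);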

These problematic functions (by code inspection) could use a "stop-the-world" type lock (runtime lock plus more, the GIL-free build uses one of these to fully block all other threads for some non-threadsafe work like import module data updates), but that's probably a non-starter because most of these functions are only allowed to be used before or during interpreter creation, or during interpreter or runtime finalisation, i.e. they predate or postdate the Runtime, so they cannot rationally take a lock on/via it. And I'm pretty sure all such functions in the public API are labelled with "Do not call after Py_Initialize".

This code conflicts with any memory allocation from that DOMAIN, and I'm almost positive that adding locking to the memory allocation functions would be a non-starter solution. I also think I could repro this by replacing everything from auto text = ... up to Py_EndInterpreter(state); with something like PyMem_RawFree(PyMem_RawMalloc(100)); called 100 times in a loop.


Interesting thought I had earlier today: PyMem_RawMalloc and friends do not require the GIL to be held and are thread-safe, so I could repro this without all the GIL-related messing about, simply by having a background thread calling PyMem_RawFree(PyMem_RawMalloc(100)); repeatedly while the main thread either (for 3.12) creates and destroys classic sub-interpreters (shared GIL); or (for 3.13) calls Py_SetProgramName(""); in a loop.

So in this case, if you had a module that calls PyMem_RawMalloc and PyMem_RawFree in a loop without holding the GIL, you could maybe use the _xxsubinterpreters module and the PYTHONMALLOC=debug env-var with a 3.12 interpreter to repro this bug.

For a 3.13 repro, I doubt any of the problematic functions are exposed to the Python API (because they make no sense to call from Python), so you'd need a module to call those for you as well, but holding the GIL.

Edit: The following (pathological) structure might also reproduce the issue in 3.13 without calling any deprecated APIs. It's 5am so it's pseudocode.

Py_PreInit
Py_SetDebugHooks
run_in_thread([] {while (true) { PyMem_RawFree(PyMem_RawMalloc(100)); } });
Py_Initialise
Py_Finalise

Several of the (non-public-API) problematic functions are used during initial import subsystem setup, which only happens during the main Py_Initialise, not sub-interpreter creation.

I guess you could argue that Py_PreInit is not a sufficient condition to start calling PyMem_Raw* functions, even though setting up the allocators is approximately a third of that function's job. There's certainly nothing in the docs that suggests that is the case, but it doesn't say you can't call PyMem_Raw* before Py_PreInit either, and I suspect that would (should) just barf. Edit: Actually, the docs say that calling Py_PreInitialize is what lets you start calling PyMem_Raw*.

@TBBle

TBBle commented Oct 9, 2024

Ok, I was able to reproduce this, but unfortunately Py_SetProgramName is deprecated, and has been since 3.11. (Sorry, I should have picked up on this earlier.)

Yes. AFAIR, all the problematic functions are either deprecated, or not publicly exposed.

To be clear, in 3.12, you don't need to call Py_SetProgramName to repro the issue. But I don't feel 3.12 is worth trying to fix for this use-case.

That's probably why it's not thread-safe, we don't want to fix deprecated APIs.

Yes, that's why I said much earlier that there's probably nothing we can or should do for Python 3.13 onwards. (We could perhaps make the APIs that are marked "Do not call after Py_Initialize" actually assert that you're not doing that, but I suspect for don't-break-existing-code reasons that's not a change we'd want to make even in the non-deprecated functions.)

@ZeroIntensity
Member

ZeroIntensity commented Oct 9, 2024

No. There's no interpreter state copying happening at the relevant part of the repro. Please reread my multiple descriptions of the precise bug, e.g. this summary, as it seems like you're getting stuck on the idea that this is some kind of data-race or cross-thread memory corruption, somehow related specifically to new interpreter creation. That is not the case here.

Sorry! This is quite a long thread with a number of different analyses getting passed around, it's hard to keep track.

  • The pointer replacement in _PyMem_SetDefaultAllocator still happens in release builds, but since that's normally a no-op, there's no symptom.

Hmm, where do you see that? Looking at the source, _PyMem_SetDefaultAllocator isn't a no-op on release builds.

Again, I'm still hesitant to fix anything here, because the subinterpreter C APIs just aren't ready yet for public use in another thread. (Or at least, that's what my impression of the situation has been from Petr and Eric.)

To be clear, in 3.12, you don't need to call Py_SetProgramName to repro the issue.

Still a little lost on the compat between 3.12 and 3.13 here. What triggers this for 3.12? init_interpreter?

@bruxisma
Author

bruxisma commented Oct 9, 2024

Again, I'm still hesitant to fix anything here, because the subinterpreter C APIs just aren't ready yet for public use in another thread. (Or at least, that's what my impression of the situation has been from Petr and Eric.)

Even if nothing is fixed on the code side for this, it would be really great if this "not ready for public consumption" status were documented until the APIs are considered ready. As of right now, there is no indication that the APIs are not for public use, beyond a quick line about how they're not yet exposed to the Python language -- but not all C APIs are exposed, so that sort of line can't be relied on. 😔

(Also @TBBle thank you for doing so much legwork on this, I have so little free time right now that I've been only able to catch up on this thread once a week since mid september...)

@ZeroIntensity
Member

As of right now, there is no indication that the APIs are not for public use

It is, pretty much.

but mixing multiple interpreters and the PyGILState_* API is unsupported.

PyGILState_* is how you enable Python from a callback. We're only getting around it here with manual PyThreadState_ shenanigans because I happen to know how that works -- users won't be doing this kind of thing.
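
For contrast, the usual single-interpreter callback shape looks roughly like this (a sketch; callback here is a hypothetical PyObject* held by the embedder):

void on_event(PyObject* callback) {
    PyGILState_STATE gstate = PyGILState_Ensure(); // attach this thread to the one interpreter
    PyObject* result = PyObject_CallNoArgs(callback);
    if (result == nullptr) {
        PyErr_Print(); // report and clear any exception from the callback
    }
    Py_XDECREF(result);
    PyGILState_Release(gstate);
}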

@TBBle

TBBle commented Oct 9, 2024

  • The pointer replacement in _PyMem_SetDefaultAllocator still happens in release builds, but since that's normally a no-op, there's no symptom.

Hmm, where do you see that? Looking at the source, _PyMem_SetDefaultAllocator isn't a no-op on release builds.

The effect is normally a no-op in a release build because it's setting the same allocator that's already used by default, i.e. the function-pointer value does not change.

It becomes not-a-no-op if you've called PyMem_SetupDebugHooks(), or used the PYTHONMALLOC env-var or PyMem_SetAllocator() to change allocators away from the default. (I didn't realise at the time that you could enable the memory debug hooks in release builds, or I'd have framed this discussion around that rather than around debug builds.)

In debug builds, the PreInit code calls PyMem_SetupDebugHooks() for you, so they are always in the "not-a-no-op" state.

Again, I'm still hesitant to fix anything here, because the subinterpreter C APIs just aren't ready yet for public use in another thread. (Or at least, that's what my impression of the situation has been from Petr and Eric.)

Once again, as of 3.13, I believe this is not related to subinterpreters. I believe we can repro the issue with PyMem_RawMalloc in a separate thread with no interpreter, so we don't need any subinterpreters. (We still need to call a deprecated or private API, though, unless my pathological repro idea above is valid or some other way I've overlooked is found. That would also work in 3.12 and earlier, I expect, but the repro with sub-interpreters in 3.12 was closer to where embedding users would hit this issue.)

So I suspect I could repro this in Python 3.10, where Py_SetProgramName was not deprecated. It's still the wrong place to call it, and it doesn't add anything to the conversation.

(Side-note: The English-language docs for Py_SetProgramName say "This function should be called before Py_Initialize() is called for the first time, if it is called at all.", but the Japanese language docs are much more strongly (roughly) "If you're going to call this function, it must be called before you call Py_Initialize() for the first time." So we could upgrade the docs for the problematic functions from "should be" to "must be" maybe.)

To be clear, in 3.12, you don't need to call Py_SetProgramName to repro the issue.

Still a little lost on the compat between 3.12 and 3.13 here. What triggers this for 3.12? init_interpreter?

Yes. Specifically, new_interpreter -> init_interp_main -> init_sys_streams -> _Py_ClearStandardStreamEncoding.

@ZeroIntensity
Member

Ok, if I'm understanding this right, the issue is that Raw allocators can be set mid-execution by another thread, correct?

@TBBle

TBBle commented Oct 9, 2024

Yes, that's correct.

And here's the subinterpreter-free, deprecated-function-free, pathological repro for 3.12 and 3.13 release builds. Also works in free-threaded mode, as it happens.

Same CMake setup as before.

#include <thread>

#include <Python.h>

int main()
{
    auto pre_config = PyPreConfig {};
    PyPreConfig_InitIsolatedConfig(&pre_config);
    pre_config.allocator = PYMEM_ALLOCATOR_DEBUG;
    if (auto status = Py_PreInitialize(&pre_config); PyStatus_Exception(status))
    {
        Py_ExitStatusException(status);
    }

    auto allocthread = std::thread(
        []
        {
            while (true)
            {
                PyMem_RawFree(PyMem_RawMalloc(100));
            };
        });

    auto config = PyConfig {};
    PyConfig_InitIsolatedConfig(&config);
    if (auto status = Py_InitializeFromConfig(&config); PyStatus_Exception(status))
    {
        PyConfig_Clear(&config);
        Py_ExitStatusException(status);
    }
    PyConfig_Clear(&config);

    return Py_FinalizeEx();
}

Even though I'm only calling Py_InitializeFromConfig once, it has a 100% repro rate so far. I thought I was going to need to stick some kind of yield between the alloc and the free, but that turns out not to be necessary.

@ZeroIntensity
Copy link
Member

Hmm, I'm not certain what the best fix would be here. Maybe a per-thread raw allocator domain?

@TBBle
Copy link

TBBle commented Oct 9, 2024

Maybe, yeah. Initialising it would be a wrinkle, though. We need PyMem_RawCalloc to create the thread state, so we'd need to stash it in the interpreter too, for any sub-interpreters to copy the configured allocator out of. And I think it's legitimate, per the big red box in the PyMem_SetAllocator docs, to change allocators after initialisation as long as they wrap the existing allocator -- and that needs to happen across threads in the shared-obmalloc, shared-GIL case, as memory allocations can move between those threads freely.

I actually think that big red box in the PyMem_SetAllocator docs is wrong, as the requirement should be stronger: any newly-set allocator after init needs to be able to free allocations created by the allocator it wraps. The Memory Debug Hooks do not meet this stronger criterion (as they shift the pointer backwards to access accounting data before passing it on to the wrapped allocator), for example, which is why you can't call PyMem_SetupDebugHooks() after init. _PyMem_SetDefaultAllocator violates this rule, but deliberately, and (if you ignore other threads that aren't blocked by the GIL) safely, because it limits the period this change applies to and commits to freeing memory allocated in that scope under the same conditions.
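
For what it's worth, the "wrap the existing allocator" shape that the box describes looks roughly like this (a sketch under that stronger contract; the counting_* names are hypothetical):

static PyMemAllocatorEx wrapped; // the allocator being wrapped

static void* counting_malloc(void* ctx, size_t size) {
    // Accounting could happen here, but the pointer returned must be the wrapped
    // allocator's pointer, unshifted, so that wrapped.free can also free blocks
    // allocated before this wrapper was installed.
    return wrapped.malloc(wrapped.ctx, size);
}

static void* counting_calloc(void* ctx, size_t nelem, size_t elsize) {
    return wrapped.calloc(wrapped.ctx, nelem, elsize);
}

static void* counting_realloc(void* ctx, void* ptr, size_t new_size) {
    return wrapped.realloc(wrapped.ctx, ptr, new_size);
}

static void counting_free(void* ctx, void* ptr) {
    wrapped.free(wrapped.ctx, ptr);
}

void install_wrapper() {
    // Capture the current allocator, then point the domain at the wrapper.
    PyMem_GetAllocator(PYMEM_DOMAIN_RAW, &wrapped);
    PyMemAllocatorEx alloc = { nullptr, counting_malloc, counting_calloc,
                               counting_realloc, counting_free };
    PyMem_SetAllocator(PYMEM_DOMAIN_RAW, &alloc);
}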


Also, I have a slight new wrinkle to introduce that may or may not affect any resolution, as I found an error in an earlier statement.

As it happens, _PyMem_SetDefaultAllocator also reinstates the debug hooks if Py_DEBUG is defined. So technically, in the problematic block between _PyMem_SetDefaultAllocator(DOM, &x); and the paired PyMem_SetAllocator(DOM, &x); in debug builds, the raw allocator is correct, i.e. the same as the default outside that block.

However, in the following cases the allocator during that block is still wrong:

  • release builds where PyMem_SetupDebugHooks() has been called (inside the block, the debug hooks are not attached)
  • custom allocators installed by the embedding host

So why does this reproduce in debug builds?

I believe that set_default_allocator_unlocked is not actually a no-op even in the debug-build case, as it works via two function calls that mutate the allocator pointers in debug builds:

  • set_allocator_unlocked(domain, &new_alloc);
  • set_up_debug_hooks_domain_unlocked(domain);

And in my new minimal repro, it's easy to see that the allocation thread made or freed an allocation between those two instructions, while the allocator was temporarily wrong. It's a much smaller window, but it still repros 100% of the time for me here, so it's definitely hittable.

Screenshot and explanation of this case

[screenshot: debugger call stacks and watch window for both threads]

Here you can see that the background thread's PyMem_RawFree went straight to _PyMem_RawFree, which is the "allocator was in a bad state" case. (This is a debug build).

However, the thread which is in the "problematic range", as confirmed by the watch window, actually shows PyMem_RawMalloc going to _PyMem_DebugRawMalloc in its call stack.

So that background thread must have called PyMem_RawFree between those two calls in set_default_allocator_unlocked.


So this case could be resolved by changing set_default_allocator_unlocked to build an allocator with debug hooks attached and then call set_allocator_unlocked(domain, &new_alloc); with that allocator.

However, that wouldn't resolve any cases where the allocator in-use is not the default allocator for that build, so it's not a complete solution but might be a good improvement.

Anyway, maybe we're at a point where it's worth creating a new bug from my pathological case, to disentangle it from any discussion of subinterpreters or non-singular GILs. I've just unintentionally pulled an all-nighter on this, so I need to sleep on it first, but thoughts welcome.

@ZeroIntensity
Copy link
Member

I'm going to focus on the allocator issue for the time being; there's a lot of uncharted water in this, and I wouldn't be surprised if, like, 4 new issues are needed 😅

(If you would like to author PRs for docs change and whatnot, that would be great! You've clearly put much more time into this issue than I have.)

Ok, regarding the allocation issue: I'm worried about breaking code here, because theoretically someone could be relying on all threads being affected by modifying the Raw domain. Speculatively, we could make PyMem_Raw* choose an allocator depending on the current thread state -- for example, if it's NULL, then it's not safe to call whatever Python has registered, and it should just go for malloc. But even that could break code (think of someone who allocates something without the GIL, such as in a Py_BEGIN_ALLOW_THREADS block, and then tries to free that memory after the GIL has been re-acquired), and it wouldn't play well with debuggers.

@ZeroIntensity
Member

ZeroIntensity commented Oct 13, 2024

I guess a fix could be to try and make the uses of _PyMem_SetDefaultAllocator thread-local somehow, but that's going to be a PITA.

@TBBle

TBBle commented Oct 13, 2024

I couldn't think of any way to make _PyMem_SetDefaultAllocator generally usable; I suspect our best bet would be to make it assert that it's only called at safe times: before/during initialisation and finalisation of the main interpreter. And we can also note that PyMem_Raw* is not threadsafe during those times.

Although that's pretty messy, somewhat complicated, and may break existing non-crashing code.


There's probably a similar reproduction case (in release builds only) even without the _PyMem_SetDefaultAllocator issue:

(UNTESTED)

#include <thread>

#include <Python.h>

int main()
{
    auto pre_config = PyPreConfig {};
    PyPreConfig_InitIsolatedConfig(&pre_config);
    if (auto status = Py_PreInitialize(&pre_config); PyStatus_Exception(status))
    {
        Py_ExitStatusException(status);
    }

    auto a = PyMem_RawMalloc(100);
    PyMem_SetupDebugHooks();
    PyMem_RawFree(a);

    return 0;
}

This is currently valid: as I recall, the only valid place to call PyMem_SetupDebugHooks() (or PyMem_SetAllocator() when it's not a trivial wrapper for the existing allocator) is between Py_PreInitialize and Py_Initialize.

However, this will reproduce the same bug, as PyMem_SetupDebugHooks() will change the allocate/free function pointers, and a block allocated under the normal allocator will be freed by the debug allocator.

So we can possibly fix both holes by noting that PyMem_Raw* is not thread-safe before the main interpreter's Py_InitializeFromConfig, and must not be called before any use of PyMem_SetAllocator that does not meet the stronger version of PyMem_SetAllocator's contract I proposed in the second paragraph of #123134 (comment):

any newly-set allocator after init needs to be able to free allocations created by the allocator it wraps

We would also need to extend the case of "Python has finished initializing" to also include "or the PyMem_Raw* APIs have been called".

PyMem_SetupDebugHooks() does not currently meet this requirement, so it must not be called after PyMem_RawMalloc has been called, making this repro case invalid.

A stronger change that closes this second hole would be to deprecate and remove PyMem_SetupDebugHooks() in favour of the existing PyPreConfig parameter (like we did for Py_SetProgramName), plus a public API that can wrap a given allocator with debug hooks and an inverse unwrapping API, which could also be used to fix the small race condition in _PyMem_SetDefaultAllocator, i.e. the "new wrinkle".

@picrin

picrin commented Oct 13, 2024

Speaking as a user here: I've given up on the subinterpreter API. For my specific use case, which involves spawning ~8 Python threads for parallel processing of a bioinformatics workload (the solution is embedded in Go), subinterpreters weren't ready at all.

  1. I couldn't get them to work except when importing a pthreads library in Go (another engineer and I couldn't get them to work inside Go goroutines)
  2. Even after all the Cgo/pthreads magic in Go, the solution was 2 times slower than spawning separate Python processes and communicating with them via stdin/stdout
  3. I couldn't find a way to finalise the main Python interpreter that wouldn't segfault. I decided this wasn't good enough for production.

Obviously these were uncharted waters, but my reading of the subinterpreter API was that it was ready for multithreaded use. It's perhaps worth updating the documentation:

Using Py_NewInterpreterFromConfig() you can create a sub-interpreter that is completely isolated from other interpreters, including having its own GIL. The most important benefit of this isolation is that such an interpreter can execute Python code without being blocked by other interpreters or blocking any others. Thus a single Python process can truly take advantage of multiple CPU cores when running Python code.

@ZeroIntensity
Copy link
Member

ZeroIntensity commented Oct 13, 2024

I couldn't think of any way to make _PyMem_SetDefaultAllocator generally useable; I suspect our best bet would be to make it assert that it's only called at safe times: Before/during initialisation and finalisation of the main interpreter. And we can also note that PyMem_Raw* is not threadsafe during those times.

My idea was to make internal uses of _PyMem_SetDefaultAllocator affect only the current thread, so functions that need to use the raw allocator wouldn't be breaking thread safety. But if there are issues with PyMem_SetupDebugHooks as well, then that wouldn't work. I was able to confirm your reproducer with PyMem_SetupDebugHooks.

A deprecation in favor of a config option is probably better, and then I guess updating thread safety notes about PyMem_Raw* is the best way to go.

Obviously these were uncharted waters, but my reading of the subinterpreter API was that it was ready for multithreaded use. It's perhaps worth updating the documentation:

It is ready, in the sense that we have all the necessary subinterpreter isolation in place, but my understanding of the problem is that we sort of don't have public APIs to really take advantage of it yet. I think my solution of manually attaching a thread state for the main interpreter and then switching to the subinterpreter was the valid way to do it -- it's just not very pretty.

Specifically, which parts of subinterpreters "did not work", and what needs to be documented better? As far as I can see, most of the people that know about subinterpreters are the ones working on them, so it would be good to get some feedback from users on what needs to be made clearer.

@picrin

picrin commented Oct 13, 2024

I posted a C-only version of my pthreads code earlier in this thread, but it must have been deleted (the thread got quite unwieldy, so maybe somebody tried to clean it up and deleted that post). I wrote the code using the C API documentation as a starting point; I believe @TBBle was commenting on it.

My personal question would have been: is that code correct?

I think having a pthreads template (pthreads being effectively the multithreading standard for C) which works would be valuable.

I've tried to dig out that code, and it's something like this (the idea is to run a tight loop in each subinterpreter to test for potential segfaults, etc.). I believe that in this particular instance I was able to call Py_Finalize (I wasn't able to call that in Cgo without a segfault):

#ifndef SUBINTERPRETERS_C
#define SUBINTERPRETERS_C

#include <Python.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

// Define the number of threads to create
#define NUM_THREADS 4
// Define the upper limit for the range in our Python calculation
#define RANGE_LIMIT 10000000

// Configure the Python interpreter
PyInterpreterConfig config = {
    .check_multi_interp_extensions = 1,  // Check for multi-interpreter compatible extensions
    .gil = PyInterpreterConfig_OWN_GIL,  // Each interpreter gets its own GIL
};

// Define a struct to hold the result of each thread's execution
typedef struct {
    int error_code;
    char error_message[256];
} ThreadResult;

// Declare a global mutex to protect interpreter creation
//pthread_mutex_t interpreter_mutex = PTHREAD_MUTEX_INITIALIZER;

// Function that each thread will run
void *run_python_code(void *arg) {
    // Cast the argument to ThreadResult pointer
    ThreadResult *result = (ThreadResult *)arg;
    // Initialize the result
    result->error_code = 0;
    result->error_message[0] = '\0';

    PyThreadState *tstate = NULL;

    // Lock the mutex before creating a new interpreter
    //pthread_mutex_lock(&interpreter_mutex);
    
    // Create a new Python interpreter
    PyStatus status = Py_NewInterpreterFromConfig(&tstate, &config);
    
    // Unlock the mutex after creating the interpreter
    //pthread_mutex_unlock(&interpreter_mutex);

    // Check if interpreter creation was successful
    if (PyStatus_Exception(status)) {
        result->error_code = 1;
        snprintf(result->error_message, sizeof(result->error_message), 
                 "Couldn't spawn new interpreter");
        return NULL;
    }

    // Declare Python objects we'll use
    PyObject *pLocal, *pSum, *pRange, *pIter, *pItem;
    
    // Create a new dictionary for local variables
    pLocal = PyDict_New();
    // Create a Python long object initialized to 0
    pSum = PyLong_FromLong(0);
    // Add 'sum' to the local dictionary
    PyDict_SetItemString(pLocal, "sum", pSum);
    
    // Look up builtins.range; PyObject_GetAttrString returns a new reference
    // that must be released after the call
    PyObject *pRangeFunc = PyObject_GetAttrString(PyImport_AddModule("builtins"), "range");
    // Create a range object
    pRange = PyObject_CallFunction(pRangeFunc, "l", (long)RANGE_LIMIT);
    Py_DECREF(pRangeFunc);
    // Get an iterator from the range
    pIter = PyObject_GetIter(pRange);
    
    // Iterate over the range
    while ((pItem = PyIter_Next(pIter))) {
        // Add the current item to the sum
        PyObject *pNewSum = PyNumber_Add(pSum, pItem);
        Py_DECREF(pSum);
        pSum = pNewSum;
        Py_DECREF(pItem);
    }
    
    // Print the final sum
    PyObject_Print(pSum, stdout, 0);
    printf("\n");
    
    // Clean up Python objects
    Py_DECREF(pLocal);
    Py_DECREF(pSum);
    Py_DECREF(pRange);
    Py_DECREF(pIter);

    // End the interpreter
    Py_EndInterpreter(tstate);
    return NULL;
}

int main() {
    // Initialize the Python interpreter
    Py_Initialize();

    // Declare arrays to hold thread IDs and results
    pthread_t threads[NUM_THREADS];
    ThreadResult results[NUM_THREADS];
    int num_threads_created = 0;

    // Create threads
    for (int i = 0; i < NUM_THREADS; i++) {
        int ret = pthread_create(&threads[i], NULL, run_python_code, &results[i]);
        if (ret) {
            fprintf(stderr, "Error creating thread %d: %d\n", i, ret);
            break;
        } else {
            printf("Successfully created thread %d:\n", i);
        }
        num_threads_created++;
    }

    // Wait for all created threads to finish
    for (int i = 0; i < num_threads_created; i++) {
        pthread_join(threads[i], NULL);

        // Print any errors that occurred in the thread
        if (results[i].error_code) {
            fprintf(stderr, "Thread %d error: %s\n", i, results[i].error_message);
        }
        printf("Thread %d completed with error code: %d\n", i, results[i].error_code);
    }
    
    // Destroy the mutex
    //pthread_mutex_destroy(&interpreter_mutex);

    // Finalize the Python interpreter
    Py_Finalize();
    return 0;
}

#endif
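In case anyone wants to try this reproducer: on a Unix-ish setup with python3-config on PATH, it should build with something like gcc subinterpreters.c $(python3-config --includes) $(python3-config --ldflags --embed) -lpthread -o subinterpreters (flags may vary by platform and Python version).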

@ZeroIntensity
Member

At a glance, the bug there is calling Py_NewInterpreterFromConfig without the GIL. See what I did earlier in the thread with PyThreadState_New to get around that.
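(For anyone landing here: a minimal sketch of that dance, reconstructed from the descriptions later in this thread; it is not the exact code from the earlier comment, and the details may vary by CPython version. It assumes it runs inside a thread function, with interp saved from PyInterpreterState_Get() on the main thread and config a PyInterpreterConfig requesting its own GIL.)

PyThreadState *tstate = PyThreadState_New(interp);  // temporary main-interpreter state
PyEval_AcquireThread(tstate);                       // take the main GIL, make it current

PyThreadState *sub = NULL;
PyStatus status = Py_NewInterpreterFromConfig(&sub, &config);
if (PyStatus_Exception(status)) {
    // Assuming the failure left our temporary state current, clean it up.
    PyThreadState_Clear(tstate);
    PyEval_ReleaseThread(tstate);   // releases the GIL; current becomes NULL
    PyThreadState_Delete(tstate);
    return;
}

// On success `sub` is current, under the new interpreter's own GIL.
// Dispose of the temporary main-interpreter state, then switch back.
PyThreadState_Swap(tstate);         // reattach to the main interpreter
PyThreadState_Clear(tstate);        // requires holding that interpreter's GIL
PyEval_ReleaseThread(tstate);       // drop the main GIL; current becomes NULL
PyThreadState_Delete(tstate);       // no GIL needed after Clear()
PyThreadState_Swap(sub);            // back onto the subinterpreter's GIL

// ... run Python code in the subinterpreter ...
Py_EndInterpreter(sub);             // must be called with `sub` current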

@picrin

picrin commented Oct 13, 2024

As I said earlier, I've essentially moved on from this at this stage, and I'm using OS-level processes to manage the interaction between my Go and Python code, and it's been 2x faster than working with subinterpreters and much easier to work with. Personally I'm happy to leave it at that.

But again, speaking from the user perspective, I think having a working example in the documentation would have been helpful at the time (and by this I mean a full example: starting the threads, doing calculation and cleaning up, holding the GIL correctly, etc...). And I think the example should use the most standard multithreading library, like pthreads. I realise that the example will be quite long, and perhaps it will need to be linked from the documentation, but I think as it stands, it's essentially impossible to just read the documentation and come up with a correct example of multithreaded subinterpreter code.

@TBBle

TBBle commented Oct 14, 2024

I posted a C-only version of my pthreads code earlier on in this thread, but it must have been deleted (the thread got quite unwieldy, so maybe somebody tried to clean it up and deleted that post). I wrote the code using the C API documentation as a starting point; I believe @TBBle was commenting on it.

GitHub folds the middle of long issues, clicking this link should show it.

What appears to be the necessary dance for holding the GIL to create an interpreter on the new thread is shown in #123134 (comment): everything from auto tstate = PyThreadState_New(interp); to PyThreadState_Delete(tstate);. The first three lines are what the threading module's ThreadHandle_Start and thread_run do, and the rest of the stuff after Py_NewInterpreterFromConfig is basically cleaning up the temporary thread state, because Py_NewInterpreterFromConfig has created a new one for our separate-GIL interpreter. We've approximately open-coded PyGILState_Ensure for an explicit interpreter (the default is the main interpreter anyway) and PyGILState_Release for an explicit thread state, but without a bunch of safety rails for nested calls.

So I believe something like this would work, but is not supported:

// Fragment from the thread function; assumes `config` (a PyInterpreterConfig)
// and `count` (the thread's index) are defined as in the reproducer above.
PyThreadState *state = nullptr;

auto gilstate = PyGILState_Ensure();
assert(gilstate == PyGILState_UNLOCKED);
auto main_interpreter_state = PyThreadState_Get();

if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status))
{
    std::println("Failed to initialize thread state {}", count);
    // NOTE: Need to clean up the thread state we created earlier.
    // See https://github.com/python/cpython/blob/main/Python/crossinterp.c#L1833-L1841
    return;
}

// Swap back to the old thread state and release it, then switch back to our new state.
auto new_thread_state = PyThreadState_Swap(main_interpreter_state);
assert(new_thread_state == state);
PyGILState_Release(gilstate);

auto temp_thread_state = PyThreadState_Swap(state);
assert(temp_thread_state == NULL);

// Do stuff with your new interpreter
I've just realised Py_EndInterpreter probably doesn't clean up the thread state, so we probably should end with something like a PyThreadState_Clear/PyThreadState_Delete pair, but PyThreadState_Clear depends on holding the GIL, so does that need to be done before Py_EndInterpreter? (Or, per below, maybe that cleanup is done by Py_EndInterpreter already...)


Exploring more:

As it happens, the new module for exposing interpreter creation to Python drops the GIL before calling Py_NewInterpreterFromConfig, so maybe holding the GIL, or even having an active thread state, is not necessary for Py_NewInterpreterFromConfig after all? This code-flow also throws away the new interpreter's thread state (except in a unit test, which is actually quite close to our use-case here) and then creates a new thread state on-demand for executing code.

Anyway, as I just linked, there's a smaller and safer C API prefixed with _PyXI_ which wraps up a bunch of this stuff with a more structured approach (it's not public, I think; hopefully it will become so in future). It's exposed to Python code by the in-tree _interpreters module, which in turn is wrapped by the PEP-734 interpreters module on PyPI. In this new API, _PyXI_Enter and _PyXI_Exit are roughly analogous to PyGILState_Ensure and PyGILState_Release but for interpreters, and they simplify a bunch of this stuff.

I also note that _PyXI_EndInterpreter also does not clear up a temporary thread state used for Py_EndInterpreter (nor does the open-coded version in the unit tests), so maybe Py_EndInterpreter does already clear up the thread state for the interpreter, and I just didn't notice the code to do that.

@ZeroIntensity
Member

Yes, _PyXI* are Eric's magic functions that he's using to develop PEP-734, I think. We can't use them yet :)

As it happens, the new module for exposing interpreter creation to Python drops the GIL before calling Py_NewInterpreterFromConfig so maybe holding the GIL or even having an active thread state is not necessary for Py_NewInterpreterFromConfig after all?

It might be a bug. It's all private, so I doubt it's had the kind of stress testing we're doing here. Cross-interpreter stuff is all undocumented at the moment. If you read the discussion in my PR about the GIL and subinterpreters, the behavior that you see right now isn't necessarily intended, because it's untested and support is explicitly disallowed.

@TBBle

TBBle commented Oct 14, 2024

If you read the discussion in my PR about the GIL and subinterpreters, the behavior that you see right now isn't necessarily intended, because it's untested and support is explicitly disallowed.

I was only noting that if one puts the code we got working here side by side with the implementation of PyGILState_Ensure and PyGILState_Release (and with the threading module), one can see that they're all doing the same operations, as reassurance that we've ended up with the right way of establishing a GIL in a new thread and cleaning up afterwards. That's not really specific to sub-interpreters; it just happens that we create a sub-interpreter between the two calls and then switch back to the thread state we had before that happened.

I'm aware that PyGILState_* is not supported in combination with sub-interpreters. The reason I described _PyXI_Enter as "PyGILState_Ensure for interpreters" is because it does exactly the same thing as PyGILState_Ensure except it takes an interpreter state as an argument, while PyGILState_Ensure uses the runtime's "gilstate.autoInterpreterState" pointer, which is always the main interpreter.

It might be a bug. It's all private, so I doubt it's had the kind of stress testing like what we're doing here.

To the best of my knowledge, at no point did taking the GIL or not before Py_NewInterpreterFromConfig affect the results of any repro case we've had here. So the stress testing we've done indicates that it's not required, even though the docs of Py_NewInterpreterFromConfig state that it is. (They also at one point say it's not required, but the overall vibe seems to be that it is formally required, so that's probably an oversight or an abstraction-break.)

That's also what I understood from reading the function: before the point where it drops the GIL internally, it does nothing that actually needs the GIL, as far as I recall. The problems we saw inside Py_NewInterpreterFromConfig in 3.12 occur later in the function, when the new GIL has already been established (or the old GIL retaken), and are not related to the act of interpreter creation itself.

Anyway, it may become technically required in the future, and an API that is formally more restricted than its implementation is the correct way 'round IMHO.

I also don't want future discussions of this issue to get bogged down in unrelated API discussions, so being formally correct in repro cases is desirable.

@ZeroIntensity
Member

Ok, @TBBle, I think it's time to get some PRs down. I don't know if there's anything that can be done about the PyMem_SetupDebugHooks problem, but we can fix the issues with _PyMem_SetDefaultAllocator internally by making it only affect the current thread, or possibly switch the internal uses that need the true system malloc over to _PyMem_RawMalloc, as that can't be hooked. Would you like to deal with updating the documentation for the allocator contract?
