Creating an ungodly amount of sub interpreters in a short amount of time causes memory debug assertions. #123134
Can you reproduce this behaviour on Linux or macOS? Or what happens if you sleep a bit before spawning the sub-interpreters? I'm not an expert in threads, but I'd say the interpreters are spawning too fast for the OS to follow up =/ Can you check whether the same occurs with HEAD instead of 3.12.5? (maybe there are some bugfixes that are not yet available)
At a glance, this could possibly be related to a memory allocation failure or something similar:

auto globals = PyDict_New();
auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
auto result = PyEval_EvalCode(code, globals, globals);

These lines need a check for NULL return values.
Sleeping might fix it, but might also hide the issue. The goal here was just to run a bunch of interpreters, not sleep for the sake of working around possible synchronization issues.
This is not how threads or operating systems work. There is a synchronization issue or an ABA problem with the debug memory assertions in the C API's internals.
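(Aside for anyone trying to reproduce: the debug memory assertions under discussion don't strictly require a debug build of CPython. A release interpreter can enable the same debug hooks, the pad-byte checks whose FORBIDDENBYTE (0xfd) failures appear later in this thread, via the PYTHONMALLOC environment variable. A minimal sketch, assuming a `python3` on PATH:)

```shell
# Enable CPython's debug memory hooks on a release interpreter;
# no debug build required. Any allocator mismatch or pad-byte
# corruption will then be reported the same way as in this issue.
PYTHONMALLOC=debug python3 -c 'print("debug hooks active")'
```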
I will also work on giving this a go. |
Oh! TIL. I still would suggest adding a |
The error is outside of the Python C API as one might usually interact with it, i.e., it fires only on subinterpreter initialization (i.e., creating a […])
Right, but it could cause memory corruption somewhere, and cause the error somewhere else down the line. Regardless, I wasn't able to reproduce this on Linux, so this might be Windows-specific. |
@picnixz I can confirm the behavior still exists on Windows on […]
Pleased to back up @ZeroIntensity and say that this appears to be a Windows-specific issue. I was unable to replicate this memory assertion error on Linux. However, when executing this code under TSan, I got the following data race error that seems to indicate the same kind of memory corruption could be taking place:

WARNING: ThreadSanitizer: data race (pid=11259)
Write of size 8 at 0x7f24e35e16c0 by thread T10:
#0 qsort_r <null> (463-interpreters+0x9bc2e) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
#1 qsort <null> (463-interpreters+0x9bec7) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
#2 setup_confname_table /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13543:5 (libpython3.12.so.1.0+0x3b6946) (BuildId: e2e0fbc52bf8cb7daf99595e16454e44217040f4)
#3 setup_confname_tables /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13572:9 (libpython3.12.so.1.0+0x3b6946)
#4 posixmodule_exec /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:16905:9 (libpython3.12.so.1.0+0x3b6946)
...
Previous write of size 8 at 0x7f24e35e16c0 by thread T18:
#0 qsort_r <null> (463-interpreters+0x9bc2e) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
#1 qsort <null> (463-interpreters+0x9bec7) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
#2 setup_confname_table /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13543:5 (libpython3.12.so.1.0+0x3b6946) (BuildId: e2e0fbc52bf8cb7daf99595e16454e44217040f4)
#3 setup_confname_tables /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13572:9 (libpython3.12.so.1.0+0x3b6946)
#4 posixmodule_exec /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:16905:9 (libpython3.12.so.1.0+0x3b6946)
Location is global '??' at 0x7f24e2e85000 (libpython3.12.so.1.0+0x75c880)

This only happened 3 times, and around the 42-thread generation step on average.
A quick look suggests this TSan failure is a data race local to posixmodule.c: there appears to be no protection against multiple threads importing posixmodule simultaneously, and hence multiple threads trying to qsort one of the three static tables in place at the same time. This module is used on Windows, but I'm not sure if any of the macros enabling those tables are defined there. I don't know if overlapping qsorts on the same array will actually corrupt memory, but I doubt there's any guarantee it won't. So this is probably a separate issue, and a more focused/reliable MRE could be made for it.
Come to think of it, I'm not even sure subinterpreters are thread-safe in the first place; you might need to acquire the GIL. Though, I'm not all too familiar with them. cc @ericsnowcurrently for insight
I think the point of PEP 684 (in Python 3.12) is that you can have a per-interpreter GIL, as is being done in this test case. posixmodule.c claims compatibility with the per-interpreter GIL, so in that case I think it's a straight-up bug.

Edit: Oops, perhaps I misunderstood your comment, and it wasn't directed at the posixmodule issue. Indeed, calling
So I'm not clear if the given example is doing the right thing here or not, but I suspect the first thing the tasks should do is take the main interpreter's GIL if not already holding it. (And the call to […])
I'm guessing that's the problem. Judging by the comment, it looks like the relevant code is at Lines 2261 to 2263 in bffed80:
I could see this being a bug, though. I'll wait for Eric to clarify. |
Well this definitely brings me some concern then, as the GIL can be made optional in 3.13 😅 |
Does this occur on the free-threaded build?
Sorry for the delay. Had a busy work week. I'll gather this info this weekend, though the code seems to indicate "yes" from a glance. |
@ZeroIntensity I can confirm the issue still happens when the GIL is disabled in 3.13 and 3.14 HEAD |
For now, I think I'm going to write this off as user-error, as […]
How can those work if the GIL is disabled (and thus can't be engaged) on 3.13 and later, and each new interpreter has its own GIL created? You cannot call […]
AFAIK, […]

At a glance, the drop-in code would be:

void execute () {
std::vector<std::jthread> tasks { };
tasks.reserve(MAX_STATES);
for (auto count = 0zu; count < tasks.capacity(); count++) {
std::println("Generating thread state {}", count);
tasks.emplace_back([count] {
PyGILState_STATE gil = PyGILState_Ensure(); // Make sure we can call the API
if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status)) {
std::println("Failed to initialize thread state {}", count);
PyGILState_Release(gil);
return;
}
auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
auto globals = PyDict_New();
auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
auto result = PyEval_EvalCode(code, globals, globals);
Py_DecRef(result);
Py_DecRef(code);
Py_DecRef(globals);
Py_EndInterpreter(state);
state = nullptr;
PyGILState_Release(gil); // Release the GIL (only on non-free-threaded builds)
});
}
}

There needs to be a […]
It shouldn't require any of that. The interpreter variable is local to the thread. The need for
(emphasis mine) So, either the documentation is wrong, or Python's C API is wrong. Either way, there is a bug here. |
Yes, but how do you initialize that interpreter variable? I don't think […]. I think you're misinterpreting what I'm saying: you're right, you don't need to mess with the GIL when using the subinterpreter, but it seems (based on the code from […])
Yes, it is initialized; that's what […] is for. I'm very certain this has to do with memory corruption, and with memory blocks within the debug assertions not being locked or cleared out properly, which is leading to this invalid state. Why it only happens on Windows most likely has to do with the behavior of the MSVC runtime in some capacity, and some behavioral difference between Windows memory pages and the virtual allocator versus POSIX platforms. That's the best I've been able to intuit, as the crash is indeterminate in nature, though as I stated in an earlier comment […]
It would probably segfault instead, FWIW -- C API calls don't have an extra check for the interpreter or thread state. I don't see where in […]

Speculatively, the race could be caused here (Line 2251 in bffed80):

Trying to set this across multiple threads could be problematic, since there's no lock.
If any of those failures were happening there, this same behavior would be observable in a release build. This situation only happens when the […]
Ok, I think the fix here would be to add a lock.
Ah, sorry, I was overly-succinct there.
This leads to a failure in the situation where the configured allocator is not the default allocator used by release builds of Python, i.e. any one of the following: debug builds, […]

In this situation, if another interpreter (with its own GIL, or in a free-threading build) allocates or deallocates memory with the same domain during this period, the wrong method will be called, leading to a mismatch for that block of memory, and in debug builds, producing one of the two failure reports (either […])

Having looked this up, if I add […]

So I suspect this repro will work on Linux and macOS too, since at least the […]

3.12-onward repro, including release builds:

#include <cstdlib>
#include <print>
#include <thread>
#include <vector>
#include <Python.h>
// https://developercommunity.visualstudio.com/t/Please-implement-P0330R8-Literal-Suffix/10410860#T-N10539586
static constexpr size_t operator""_uz(unsigned long long n)
{
return size_t(n);
}
namespace
{
static thread_local inline PyThreadState* state = nullptr;
static inline constexpr auto MAX_STATES = 463;
static inline constexpr auto config = PyInterpreterConfig {
.use_main_obmalloc = 0,
.allow_fork = 0,
.allow_exec = 0,
.allow_threads = 0,
.allow_daemon_threads = 0,
.check_multi_interp_extensions = 1,
.gil = PyInterpreterConfig_OWN_GIL,
};
} /* nameless namespace */
void execute(PyInterpreterState* interp)
{
std::vector<std::jthread> tasks {};
tasks.reserve(MAX_STATES);
for (auto count = 0_uz; count < tasks.capacity(); count++)
{
std::println("Generating thread state {}", count);
tasks.emplace_back(
[count, interp]
{
auto tstate = PyThreadState_New(interp);
auto no_thread_state = PyThreadState_Swap(tstate);
assert(no_thread_state == nullptr);
if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status))
{
std::println("Failed to initialize thread state {}", count);
return;
}
// Swap back to the old thread state, clear it, and then switch back to our new state.
auto new_thread_state = PyThreadState_Swap(tstate);
assert(new_thread_state == state);
PyThreadState_Clear(tstate);
auto temp_thread_state = PyThreadState_Swap(state);
assert(temp_thread_state == tstate);
PyThreadState_Delete(tstate);
#if PY_VERSION_HEX >= 0x030d0000
Py_SetProgramName(L"");
#endif
auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
auto globals = PyDict_New();
auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
auto result = PyEval_EvalCode(code, globals, globals);
Py_DecRef(result);
Py_DecRef(code);
Py_DecRef(globals);
Py_EndInterpreter(state);
state = nullptr;
});
}
}
int main()
{
PyMem_SetupDebugHooks();
PyConfig config {};
PyConfig_InitIsolatedConfig(&config);
if (auto status = Py_InitializeFromConfig(&config); PyStatus_IsError(status))
{
std::println("Failed to initialize with isolated config: {}", status.err_msg);
return EXIT_FAILURE;
}
PyConfig_Clear(&config);
auto main_interpreter_state = PyInterpreterState_Get();
Py_BEGIN_ALLOW_THREADS;
execute(main_interpreter_state);
Py_END_ALLOW_THREADS;
Py_Finalize();
}
Linux-based repro (tested against Python 3.12.3 on Ubuntu 24.04 LTS under WSL2)

Required packages: gcc-14 g++-14 python3.12-dev python3-dev cmake

CMakeLists.txt:

cmake_minimum_required(VERSION 3.28.3)
project(463-interpreters LANGUAGES C CXX)
find_package(Python 3.12 EXACT REQUIRED COMPONENTS Development.Embed)
set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded$<$<CONFIG:Debug>:Debug>")
add_executable(${PROJECT_NAME})
target_sources(${PROJECT_NAME} PRIVATE main.cxx)
target_compile_features(${PROJECT_NAME} PRIVATE cxx_std_23)
target_precompile_headers(${PROJECT_NAME} PRIVATE <Python.h>)
target_link_libraries(${PROJECT_NAME} PRIVATE Python::Python)

main.cxx (unchanged from previous post):

#include <cstdlib>
#include <print>
#include <thread>
#include <vector>
#include <Python.h>
// https://developercommunity.visualstudio.com/t/Please-implement-P0330R8-Literal-Suffix/10410860#T-N10539586
static constexpr size_t operator""_uz(unsigned long long n)
{
return size_t(n);
}
namespace
{
static thread_local inline PyThreadState* state = nullptr;
static inline constexpr auto MAX_STATES = 463;
static inline constexpr auto config = PyInterpreterConfig {
.use_main_obmalloc = 0,
.allow_fork = 0,
.allow_exec = 0,
.allow_threads = 0,
.allow_daemon_threads = 0,
.check_multi_interp_extensions = 1,
.gil = PyInterpreterConfig_OWN_GIL,
};
} /* nameless namespace */
void execute(PyInterpreterState* interp)
{
std::vector<std::jthread> tasks {};
tasks.reserve(MAX_STATES);
for (auto count = 0_uz; count < tasks.capacity(); count++)
{
std::println("Generating thread state {}", count);
tasks.emplace_back(
[count, interp]
{
auto tstate = PyThreadState_New(interp);
auto no_thread_state = PyThreadState_Swap(tstate);
assert(no_thread_state == nullptr);
if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status))
{
std::println("Failed to initialize thread state {}", count);
return;
}
// Swap back to the old thread state, clear it, and then switch back to our new state.
auto new_thread_state = PyThreadState_Swap(tstate);
assert(new_thread_state == state);
PyThreadState_Clear(tstate);
auto temp_thread_state = PyThreadState_Swap(state);
assert(temp_thread_state == tstate);
PyThreadState_Delete(tstate);
#if PY_VERSION_HEX >= 0x030d0000
Py_SetProgramName(L"");
#endif
auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
auto globals = PyDict_New();
auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
auto result = PyEval_EvalCode(code, globals, globals);
Py_DecRef(result);
Py_DecRef(code);
Py_DecRef(globals);
Py_EndInterpreter(state);
state = nullptr;
});
}
}
int main()
{
PyMem_SetupDebugHooks();
PyConfig config {};
PyConfig_InitIsolatedConfig(&config);
if (auto status = Py_InitializeFromConfig(&config); PyStatus_IsError(status))
{
std::println("Failed to initialize with isolated config: {}", status.err_msg);
return EXIT_FAILURE;
}
PyConfig_Clear(&config);
auto main_interpreter_state = PyInterpreterState_Get();
Py_BEGIN_ALLOW_THREADS;
execute(main_interpreter_state);
Py_END_ALLOW_THREADS;
Py_Finalize();
}

Test command

Failure output:

...
Generating thread state 125
Generating thread state 126
munmap_chunk(): invalid pointer
Aborted

or

...
Generating thread state 149
Generating thread state 150
Debug memory block at address p=0x7f8e24078660: API '$'
11529496521045180416 bytes originally requested
The 7 pad bytes at p-7 are not all FORBIDDENBYTE (0xfd):
at p-7: 0x02 *** OUCH
at p-6: 0x00 *** OUCH
at p-5: 0x00 *** OUCH
at p-4: 0x00 *** OUCH
at p-3: 0x00 *** OUCH
at p-2: 0x00 *** OUCH
at p-1: 0x00 *** OUCH
Because memory is corrupted at the start, the count of bytes requested
may be bogus, and checking the trailing pad bytes may segfault.
The 8 pad bytes at tail=0xa0017f8e24078660 are
Generating thread state 151
Segmentation fault

So we can remove the OS-windows tag from this, since it also happens on Linux the same way, i.e. […]
Thanks! I'll see if I can come up with a way to reproduce this without embedding, because it's especially hard to debug core issues with embedded programs. So, it looks like the problem here is that some very-high-level embedding functions (such as […])
Ok, I was able to reproduce this, but unfortunately […]. That's probably why it's not thread-safe; we don't want to fix deprecated APIs. Though, maybe it's worth adding some extra locks (such as to the allocator) for […]
No. There's no interpreter state copying happening at the relevant part of the repro. Please reread my multiple descriptions of the precise bug, e.g. this summary, as it seems like you're getting stuck on the idea that this is some kind of data race or cross-thread memory corruption, somehow related specifically to new interpreter creation. That is not the case here. For example, in an earlier version of the 3.13 repro, all of the interpreters were already completely created and functional before the bug repro occurred, through the use of a semaphore.

The problematic code is […]

These problematic functions (by code inspection) could use a "stop-the-world" type lock (runtime lock plus more; the GIL-free build uses one of these to fully block all other threads for some non-threadsafe work like import module data updates), but that's probably a non-starter because most of these functions are only allowed to be used before or during interpreter creation, or during interpreter or runtime finalisation, i.e. they predate or postdate the Runtime, so they cannot rationally take a lock on/via it. And I'm pretty sure all such functions in the public API are labelled with "Do not call after Py_Initialize".

This code conflicts with any memory allocation from that […]

Interesting thought I had earlier today: […] So in this case, if you had a module that calls […]

For a 3.13 repro, I doubt any of the problematic functions are exposed to the Python API (because they make no sense to call from Python), so you'd need a module to call those for you as well, but holding the GIL.

Edit: The following (pathological) structure might also reproduce the issue in 3.13 without calling any deprecated APIs. It's 5am so it's pseudocode.
Several of the (non-public-API) problematic functions are used during initial import subsystem setup, which only happens during the main Py_Initialize, not sub-interpreter creation.
Yes. AFAIR, all the problematic functions are either deprecated, or not publicly exposed. To be clear, in 3.12, you don't need to call […]
Yes, that's why I said much earlier that there's probably nothing we can or should do for Python 3.13 onwards. (We could perhaps make the APIs that are marked "Do not call after Py_Initialize" […])
Sorry! This is quite a long thread with a number of different analyses getting passed around; it's hard to keep track.
Hmm, where do you see that? Looking at the source, […]

Again, I'm still hesitant to fix anything here, because the subinterpreter C APIs just aren't ready yet for public use in another thread. (Or at least, that's the impression of the situation I've gotten from Petr and Eric.)
Still a little lost on the compatibility between 3.12 and 3.13 here. What triggers this for 3.12?
Even if nothing is fixed on the code side for this, it would be really great if this "not ready for public consumption" was documented until they are considered ready. As of right now, there is no indication that the APIs are not for public use beyond a quick line about how it's not yet exposed to the Python language, but not all C APIs are exposed so that sort of line can't be relied on. 😔 (Also @TBBle thank you for doing so much legwork on this, I have so little free time right now that I've been only able to catch up on this thread once a week since mid september...) |
It is, pretty much.
The effect is normally a no-op in a release build because it's setting the same allocator that's already used by default, i.e. the function pointer value does not change. It becomes not-a-no-op if you've called […]

In debug builds, the […]
Once again, as of 3.13, I believe this is not related to subinterpreters. I believe we can repro the issue with […]

So I suspect I could repro this in Python 3.10, where […]

(Side-note: The English-language docs for […])
Yes. Specifically, […]
Ok, if I'm understanding this right, the issue is that […]
Yes, that's correct. And here's the subinterpreter-free, deprecated-function-free, pathological repro for 3.12 and 3.13 release builds. Also works in free-threaded mode, as it happens. Same CMake setup as before.

#include <thread>
#include <Python.h>
int main()
{
auto pre_config = PyPreConfig {};
PyPreConfig_InitIsolatedConfig(&pre_config);
pre_config.allocator = PYMEM_ALLOCATOR_DEBUG;
if (auto status = Py_PreInitialize(&pre_config); PyStatus_Exception(status))
{
Py_ExitStatusException(status);
}
auto allocthread = std::thread(
[]
{
while (true)
{
PyMem_RawFree(PyMem_RawMalloc(100));
};
});
auto config = PyConfig {};
PyConfig_InitIsolatedConfig(&config);
if (auto status = Py_InitializeFromConfig(&config); PyStatus_Exception(status))
{
PyConfig_Clear(&config);
Py_ExitStatusException(status);
}
PyConfig_Clear(&config);
return Py_FinalizeEx();
}

Even though I'm only calling […]
Hmm, I'm not certain what the best fix would be here. Maybe a per-thread raw allocator domain? |
Maybe, yeah. Initialising it would be a wrinkle though. We need […]

I actually think that big red box in the […]

Also, I have a slight new wrinkle to introduce that may or may not affect any resolution, as I found an error in an earlier statement. As it happens, […] However, in the following cases, […] the allocator during that block is still wrong.

So why does this reproduce in debug builds? I believe that […]

And in my new minimal repro, it's easy to see that the allocation thread made or freed an allocation between those two instructions, when the allocator was temporarily wrong. It's a much smaller window, but it still repros 100% for me here, so it's definitely hittable.

Screenshot and explanation of this case: Here you can see that the background thread's […] However, the thread which is in the "problematic range", as confirmed by the watch window, actually shows […] So that background thread must have called […]

So this case could be resolved by changing […] However, that wouldn't resolve any cases where the allocator in use is not the default allocator for that build, so it's not a complete solution, but it might be a good improvement.

Anyway, maybe we're at a point here where it's worth creating a new bug from my pathological case, to disentangle it from any discussion of subinterpreters or non-singular GILs. I've just unintentionally pulled an all-nighter on this, so I need to sleep on it first, but thoughts welcome.
I'm going to focus on the allocator issue for the time being; there's a lot of uncharted water in this, and I wouldn't be surprised if, like, 4 new issues are needed 😅 (If you would like to author PRs for the docs changes and whatnot, that would be great! You've clearly put much more time into this issue than I have.) Ok, regarding the allocation issue, I'm worried about breaking code here, because theoretically someone could be relying on all threads being affected by modifying the […]
I guess a fix could be to try and make the uses of […]
I couldn't think of any way to make […] Although that's both pretty messy, somewhat complicated, and may break existing non-crashing code.

There's probably a similar reproduction case (in release builds only) even without the […]

(UNTESTED)

#include <thread>
#include <Python.h>
int main()
{
auto pre_config = PyPreConfig {};
PyPreConfig_InitIsolatedConfig(&pre_config);
if (auto status = Py_PreInitialize(&pre_config); PyStatus_Exception(status))
{
Py_ExitStatusException(status);
}
auto a = PyMem_RawMalloc(100);
PyMem_SetupDebugHooks();
PyMem_RawFree(a);
return 0;
}

This is currently valid: […] As I recall, the only valid place to call […] However, this will reproduce the same bug, as […] So we can possibly fix both holes by noting that […]
We would also need to extend the case of "Python has finished initializing" to also include "or PyMem_Raw* APIs have been called".
A stronger change that closes this second hole would be to deprecate and remove […]
Speaking as a user here: I've given up on the […]
Obviously these were uncharted waters, but my reading of the subinterpreter API was that this was ready for multithreaded use. It's perhaps worth updating the documentation:
|
My idea was to make internal uses of […] A deprecation in favor of a config option is probably better, and then I guess updating the thread-safety notes about […]
It is ready, in the sense that we have all the necessary subinterpreter isolation ready, but my understanding of the problem is that we sort of don't have public APIs to really take advantage of it yet. I think my solution with manually attaching a thread state for the main interpreter and then switching to the subinterpreter was the valid way to do it--it's just not very pretty. Specifically, which parts of subinterpreters "did not work", and what needs to be documented better? As far as I can see, most of the people that know about subinterpreters are the ones working on them, so it would be good to get some feedback from users on what needs to be made clearer. |
I posted a C-only version of my pthreads code earlier on in this thread, but it must have been deleted (the thread got quite unwieldy, so maybe somebody tried to clean it up and deleted that post). I wrote the code using the C API documentation as a starting point; I believe @TBBle was commenting on it. My personal question would have been: is that code correct? I think having a pthreads template (pthreads being effectively the standard multithreading API for C) that works would be valuable. I've tried to dig out that code, and it's something like this (the idea is to run a tight loop from each subinterpreter to test for potential segfaults, etc.). I believe that in this particular instance I was able to call Py_Finalize (I wasn't able to call that in Cgo without a segfault):
|
At a glance, the bug there is calling […]
As I said earlier, I've essentially moved on from this at this stage, and I'm using OS-level processes to manage the interaction between my Go and Python code, and it's been 2x faster than working with subinterpreters and much easier to work with. Personally I'm happy to leave it at that. But again, speaking from the user perspective, I think having a working example in the documentation would have been helpful at the time (and by this I mean a full example: starting the threads, doing calculation and cleaning up, holding the GIL correctly, etc...). And I think the example should use the most standard multithreading library, like pthreads. I realise that the example will be quite long, and perhaps it will need to be linked from the documentation, but I think as it stands, it's essentially impossible to just read the documentation and come up with a correct example of multithreaded subinterpreter code. |
GitHub folds the middle of long issues; clicking this link should show it. What appears to be the necessary dance for holding the GIL to create an interpreter on the new thread is shown in #123134 (comment), everything from […]

So I believe something like this would work, but is not supported:

auto gilstate = PyGILState_Ensure();
assert(gilstate == PyGILState_UNLOCKED);
auto main_interpreter_state = PyThreadState_GetCurrent();
if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status))
{
std::println("Failed to initialize thread state {}", count);
// NOTE: Need to clean up the thread state we created earlier.
// See https://github.com/python/cpython/blob/main/Python/crossinterp.c#L1833-L1841
return;
}
// Swap back to the old thread state, clear it, and then switch back to our new state.
auto new_thread_state = PyThreadState_Swap(main_interpreter_state);
assert(new_thread_state == state);
PyGILState_Release(gilstate);
auto temp_thread_state = PyThreadState_Swap(state);
assert(temp_thread_state == NULL);
// Do stuff with your new interpreter

I've just realised […]

Exploring more: As it happens, the new module for exposing interpreter creation to Python drops the GIL before calling […] Anyway, as I just linked, there's a smaller and safer C API prefixed with […] I also note that […]
Yes, […]
It might be a bug. It's all private, so I doubt it's had the kind of stress testing like what we're doing here. Crossinterpreter stuff is all undocumented at the moment. If you read the discussion in my PR about the GIL and subinterpreters, the behavior that you see right now isn't necessarily intended, because it's untested and support is explicitly disallowed. |
I was only noting that if one puts up the code we got working here side by side with the implementation of […]. I'm aware that […]
To the best of my knowledge, at no point did taking the GIL or not before […] make a difference. That's also what I understood from reading the function, since before the point where it drops the GIL internally anyway, it does nothing but GIL-unneeded actions, as far as I recall. The problems we saw inside […]

Anyway, it may become technically required in the future, and an API that is formally more restricted than its implementation is the correct way 'round IMHO. I also don't want future discussions of this issue to get bogged down in unrelated API discussions, so being formally correct in repro cases is desirable.
Ok, @TBBle, I think it's time to get some PRs down. I don't know if there's anything that can be done about the […]
Bug report
Bug description:
Hello. While working on a small joke program, I found a possible memory corruption issue (it could also be a threading issue?) when using the Python C API in a debug-only build to quickly create, execute Python code in, and then destroy 463 subinterpreters. Before I post the code sample and the debug output: note that I'm using a somewhat unique build environment for a Windows developer.
When running the code sample I've attached at the bottom of this post, I am unable to get the exact same output each time, though the traceback does fire in the same location (Due to the size of the traceback I've not attached it, as it's about 10 MB of text for each thread). Additionally, I sometimes have to run the executable several times to get the error to occur. Lastly, release builds do not exhibit any thread crashes or issues as the debug assertions never fire or execute.
The error output seems to also halt in some cases, either because of stack failure or some other issue I was unable to determine and seemed to possibly be outside the scope of the issue presented here. I have the entire error output from one execution where I was able to save the output.
The output cut off after this, as the entire program crashed, taking my terminal with it 😅
You'll find the MRE code below. I've also added a minimal version of the CMakeLists.txt file I used so anyone can recreate the build (any warnings or additional settings I have do not affect whether the error occurs). The code appears to break inside of _PyObject_DebugDumpAddress, based on what debugging I was able to do with WinDbg.

Important
std::jthread calls .join() on destruction, so all threads auto-join once the std::vector goes out of scope.

Additionally, this code exhibits the same behavior regardless of whether it is a thread_local or declared within the lambda passed to std::thread.
main.cxx
CMakeLists.txt
Command to build + run
CPython versions tested on:
3.12
Operating systems tested on:
Windows