Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyType_GetModuleByDef is really slow. #119663

Open
markshannon opened this issue May 28, 2024 · 10 comments
Open

PyType_GetModuleByDef is really slow. #119663

markshannon opened this issue May 28, 2024 · 10 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@markshannon
Copy link
Member

PyType_GetModuleByDef does a lot of work for what was a single memory read prior to sub-interpreters.

For example, #114682.

We should provide a more efficient way to get to the module state from the module def.

AFAICT, the function to get the module state is a pure function of the interpreter and the module def.
Since the module def already has a m_index field that is described as "The module's index into its interpreter's modules_by_index cache.", then a Py_GetModuleFromDef() function could be as simple as:

inline PyObject *
Py_GetModuleFromDef(PyModuleDef *def)
{
     return PyInterpreterState_Get()->modules_table[def->m_base.m_index];
}
@markshannon
Copy link
Member Author

@markshannon
Copy link
Member Author

Another thing that could improve performance is to attach the module state to the end of the module object, and save a pointer hop when accessing data in the module state.

@encukou
Copy link
Member

encukou commented May 29, 2024

Unfortunately m_index is used for PyState_AddModule/PyState_FindModule which

  • only work on single-phase init modules
  • assume an interpreter has up to one module object for each PyModuleDef, which is not necessarily true even for single-phase init. In practice PyState_FindModule & m_index work if you don't do anything too wonky, but it's far from sound.

Avoiding the pointer hop would make it infeasible to revive PEP-547, which assumes you can add state to an existing (but fresh) __main__ module. Perhaps it's worth the performance improvement; I don't think either of us would be a good judge for that.

@markshannon
Copy link
Member Author

Can you create several modules from the same PyModuleDef? That sounds like a bug to me, if you can.

I don't think there is any fundamental reason why m_index needs to be restricted to single-phase init modules. That's just how it currently works, AFAICT.
We could always add another index field, if we can't change it.

Avoiding the pointer hop would make it infeasible to revive PEP-547,

It would prevent the mechanism proposed in PEP 547, but we could achieve the goal of PEP 547 with a different mechanism.
But we can leave that for now, and speed up finding the module first.

@ericsnowcurrently
Copy link
Member

Can you create several modules from the same PyModuleDef? That sounds like a bug to me, if you can.

You can find an example in Modules/_testsinglephase.c.

Another case: loading the same module more than once (which isn't the same thing as importing multiple times or reloading). Each time an extension module is directly loaded, a new module is created.

(Naturally, that isn't strictly true for all modules. Single-phase init modules with m_size < 0 are only ever loaded once. Also, reloading a single-phase init module with m_size >-0 does actually load it from scratch, creating a new module.)

@markshannon
Copy link
Member Author

That test was added recently, and seems to be testing a weird corner case that I suspect no one uses.

For the vast majority of modules that want the sane behavior of one module per module def and want reasonable performance, we should provide that.

So, for backwards compatibility, we will need to leave PyType_GetModuleByDef as it is.
That doesn't prevent us from adding Py_GetModuleFromDef() for those modules that don't need or want to support reloading.

@encukou
Copy link
Member

encukou commented May 31, 2024

It's a new test, the behaviour is old.

If we add a module option that ensures the module can't be initialized more than once, it would make sense to disable finalization as well (or, delay it until interpreter shutdown). That would get us immortal singleton modules (per-interpreter), which does sound like a reasonable feature.

@neonene
Copy link
Contributor

neonene commented Jun 5, 2024

For example, #114682.

@mdboom If this issue is about micro-benchmarking, I would like to see the result of the main vs that with the diff (a) below applied, on the pi_cdecimal test in Modules/_decimal/tests/bench.py with additional iterations.

As to the macro-benchmarks, I'm not sure how impactful the PyType_GetModuleByDef is. The same goes for the _decimal pi test case, once class D(C.Decimal): pass is used instead of D = C.Decimal.


(a) interpreter state access
diff --git a/Include/internal/pycore_interp.h b/Include/internal/pycore_interp.h
index 86dada5..c20e59e 100644
--- a/Include/internal/pycore_interp.h
+++ b/Include/internal/pycore_interp.h
@@ -234,6 +234,7 @@ struct _is {
     // more comments.
     struct _obmalloc_state *obmalloc;

+    void *decimal_state;
     PyObject *audit_hooks;
     PyType_WatchCallback type_watchers[TYPE_MAX_WATCHERS];
     PyCode_WatchCallback code_watchers[CODE_MAX_WATCHERS];
diff --git a/Modules/_decimal/_decimal.c b/Modules/_decimal/_decimal.c
index 94a2cc2..f1e5224 100644
--- a/Modules/_decimal/_decimal.c
+++ b/Modules/_decimal/_decimal.c
@@ -115,9 +115,7 @@ typedef struct {
 static inline decimal_state *
 get_module_state(PyObject *mod)
 {
-    decimal_state *state = _PyModule_GetState(mod);
-    assert(state != NULL);
-    return state;
+    return (decimal_state *)PyInterpreterState_Get()->decimal_state;
 }

 static struct PyModuleDef _decimal_module;
@@ -125,18 +123,13 @@ static struct PyModuleDef _decimal_module;
 static inline decimal_state *
 get_module_state_by_def(PyTypeObject *tp)
 {
-    PyObject *mod = PyType_GetModuleByDef(tp, &_decimal_module);
-    assert(mod != NULL);
-    return get_module_state(mod);
+    return (decimal_state *)PyInterpreterState_Get()->decimal_state;
 }

 static inline decimal_state *
 find_state_left_or_right(PyObject *left, PyObject *right)
 {
-    PyObject *mod = _PyType_GetModuleByDef2(Py_TYPE(left), Py_TYPE(right),
-                                            &_decimal_module);
-    assert(mod != NULL);
-    return get_module_state(mod);
+    return (decimal_state *)PyInterpreterState_Get()->decimal_state;
 }


@@ -5836,6 +5829,7 @@ static int minalloc_is_set = 0;
 static int
 _decimal_exec(PyObject *m)
 {
+    PyInterpreterState_Get()->decimal_state = _PyModule_GetState(m);
     PyObject *numbers = NULL;
     PyObject *Number = NULL;
     PyObject *collections = NULL;
(b) global state access (attached just in case)
diff --git a/Modules/_decimal/_decimal.c b/Modules/_decimal/_decimal.c
index 94a2cc2..de10e73 100644
--- a/Modules/_decimal/_decimal.c
+++ b/Modules/_decimal/_decimal.c
@@ -112,12 +112,14 @@ typedef struct {
     PyCFunction _py_float_as_integer_ratio;
 } decimal_state;

+
+decimal_state global_state;
+
+
 static inline decimal_state *
 get_module_state(PyObject *mod)
 {
-    decimal_state *state = _PyModule_GetState(mod);
-    assert(state != NULL);
-    return state;
+    return &global_state;
 }

 static struct PyModuleDef _decimal_module;
@@ -125,18 +127,13 @@ static struct PyModuleDef _decimal_module;
 static inline decimal_state *
 get_module_state_by_def(PyTypeObject *tp)
 {
-    PyObject *mod = PyType_GetModuleByDef(tp, &_decimal_module);
-    assert(mod != NULL);
-    return get_module_state(mod);
+    return &global_state;
 }

 static inline decimal_state *
 find_state_left_or_right(PyObject *left, PyObject *right)
 {
-    PyObject *mod = _PyType_GetModuleByDef2(Py_TYPE(left), Py_TYPE(right),
-                                            &_decimal_module);
-    assert(mod != NULL);
-    return get_module_state(mod);
+    return &global_state;
 }


@@ -6156,7 +6153,7 @@ decimal_free(void *module)

 static struct PyModuleDef_Slot _decimal_slots[] = {
     {Py_mod_exec, _decimal_exec},
-    {Py_mod_multiple_interpreters, Py_MOD_PER_INTERPRETER_GIL_SUPPORTED},
+    {Py_mod_multiple_interpreters, Py_MOD_MULTIPLE_INTERPRETERS_NOT_SUPPORTED},
     {Py_mod_gil, Py_MOD_GIL_NOT_USED},
     {0, NULL},
 };
@@ -6165,7 +6162,7 @@ static struct PyModuleDef _decimal_module = {
     PyModuleDef_HEAD_INIT,
     .m_name = "decimal",
     .m_doc = doc__decimal,
-    .m_size = sizeof(decimal_state),
+    .m_size = 0,
     .m_methods = _decimal_methods,
     .m_slots = _decimal_slots,
     .m_traverse = decimal_traverse,

@neonene
Copy link
Contributor

neonene commented Aug 26, 2024

Two PyObject_GC_Track() have been added to _decimal.c, where the telco benchmarks on my Windows PGO builds have shown insignificant slowdowns by 0%-2%. However, I can no longer see actual performance improvements by using PyInterpreterState_Get() there instead.

@vstinner
Copy link
Member

Why not modifying _decimal PyDecContextObject to store directly the module state? I would avoid have go through ctx => type => module => module state, knowning that getting the module from a type is "slow".

@JacobCoffee JacobCoffee added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Sep 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs)
Projects
None yet
Development

No branches or pull requests

6 participants