_decimal: pi benchmark 31-45% slower compared to 3.9 #114682
Can you try
The figures are almost the same: around 30% (--enable-gil) and 43% (--disable-gil).
The base slowdown of 30% is of course due to the module state changes. For comparison, and to show that the i5-8600K is not inherently underperforming with threads: the ideal situation is implemented in libmpdec++, which has a C++11 thread-local context that is also consulted for every inline operation. The C++11 TLS adds just 4% overhead compared to pure C (where the context is passed directly to the functions). This would of course be quite difficult to achieve in CPython. But the thread-local context may not even be the primary culprit: in Martin von Loewis' first module state implementation, the basic slowdown for _decimal was 20%, and a good part of that was due to the heap types.
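As a Python-level illustration of the thread-local context model discussed above (this is a stdlib sketch, not libmpdec++ itself): CPython's decimal module already gives each thread its own context, so precision set in one thread is invisible to another, which is the behavior the C++11 `thread_local` achieves with only ~4% overhead:

```python
import decimal
import threading

# Each thread gets its own decimal context: setting the precision in one
# worker does not affect the other. This mirrors the thread-local context
# that libmpdec++ implements with C++11 thread_local.
results = {}

def worker(name, prec):
    decimal.getcontext().prec = prec          # thread-local context
    results[name] = decimal.Decimal(1) / decimal.Decimal(7)

threads = [
    threading.Thread(target=worker, args=("low", 5)),
    threading.Thread(target=worker, args=("high", 20)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results["low"])   # 5 significant digits
print(results["high"])  # 20 significant digits
```

The main thread's context is untouched by either worker, which is exactly the isolation property being discussed.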
The wording was not ideal: the key point is that C++ achieves this without a cached context, which is of course now in the module state.
@encukou, any thoughts on this?
(cpython/Include/internal/pycore_moduleobject.h, lines 32 to 35 at 1aec064)
OTOH,
@ericsnowcurrently Can a module state be accessed correctly with the following?
If it is not invalid, it would be useful for functions that currently have no way to get the module state. A
Multiple interpreters can run in the same thread, so a thread-local may be ambiguous in that case. As to storing a reference to the module state on the type object, that shouldn't be necessary. Every heap type has a
The only caveat is that
FWIW, the common case where
I suppose another problem case is with callbacks that don't provide a way to pass the module through. I'm not sure that's a common issue, though. Back to the _decimal module: I haven't looked at what usage patterns are resulting in such a significant slowdown.
Yes; heap types that are not base classes can go the fast route via
It is likely that 15 percentage points can be shaved off just by doing the same as in
AFAICS, we can trim away a lot of the code in cpython/Include/internal/pycore_typeobject.h, lines 100 to 109 at 15f6f04.
Also (but not performance relevant), the
FTR: I did a quick proof-of-concept for some low-hanging fruit, and I already got a
So, here's my suggested approach: let's move the relevant parts of bench.py over to pyperformance. That way it is easier to keep an eye on performance regressions like this; the faster-cpython team is alert to sudden changes in a particular benchmark. I'll try to tear out as many
Thank you for getting to this so fast!
Perhaps you could migrate bench.py over to pyperformance? For now, I've used this patch for local benchmarking with pyperf[^1]:

```diff
diff --git a/Modules/_decimal/tests/bench.py b/Modules/_decimal/tests/bench.py
index 640290f2ec..6b6117f108 100644
--- a/Modules/_decimal/tests/bench.py
+++ b/Modules/_decimal/tests/bench.py
@@ -11,9 +11,14 @@
 from functools import wraps
 from test.support.import_helper import import_fresh_module

+import pyperf
+
+
 C = import_fresh_module('decimal', fresh=['_decimal'])
 P = import_fresh_module('decimal', blocked=['_decimal'])

+runner = pyperf.Runner()
+
 #
 # NOTE: This is the pi function from the decimal documentation, modified
 # for benchmarking purposes. Since floats do not have a context, the higher
@@ -81,33 +86,20 @@ def wrapper(*args, **kwargs):
     return _increase_int_max_str_digits

 def test_calc_pi():
-    print("\n# ======================================================================")
-    print("# Calculating pi, 10000 iterations")
-    print("# ======================================================================\n")
-
     to_benchmark = [pi_float, pi_decimal]
     if C is not None:
         to_benchmark.insert(1, pi_cdecimal)

     for prec in [9, 19]:
-        print("\nPrecision: %d decimal digits\n" % prec)
         for func in to_benchmark:
-            start = time.time()
             if C is not None:
                 C.getcontext().prec = prec
             P.getcontext().prec = prec
-            for i in range(10000):
-                x = func()
-            print("%s:" % func.__name__.replace("pi_", ""))
-            print("result: %s" % str(x))
-            print("time: %fs\n" % (time.time()-start))
+            name = f"{func.__name__}, precision: {prec} decimal digits"
+            runner.bench_func(name, func)

 @increase_int_max_str_digits(maxdigits=10000000)
 def test_factorial():
-    print("\n# ======================================================================")
-    print("# Factorial")
-    print("# ======================================================================\n")
-
     if C is not None:
         c = C.getcontext()
         c.prec = C.MAX_PREC
@@ -147,4 +139,4 @@ def test_factorial():
 if __name__ == "__main__":
     test_calc_pi()
-    test_factorial()
+    #test_factorial()
```

Footnotes
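For readers without pyperf installed, a rough stdlib-only sketch of the same measurement idea using timeit (the `pi_decimal` body mirrors the recipe in bench.py; the harness itself is my own and far less rigorous than pyperf):

```python
import timeit
from decimal import Decimal, getcontext

def pi_decimal():
    """Compute pi with the series used in the decimal docs recipe."""
    D = Decimal
    lasts, t, s, n, na, d, da = D(0), D(3), D(3), D(1), D(0), D(0), D(24)
    while s != lasts:
        lasts = s
        n, na = n + na, na + 8
        d, da = d + da, da + 32
        t = (t * n) / d
        s += t
    return s

# Time 1000 calls at 9 digits of precision, as in the benchmark.
getcontext().prec = 9
elapsed = timeit.timeit(pi_decimal, number=1000)
print(f"pi_decimal (prec=9): {elapsed:.3f}s for 1000 calls")
```

Unlike pyperf, timeit does no warmup, calibration, or statistical analysis, so this is only good for coarse comparisons.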
@skrah Do you have a significant regression test case where subclasses are frequently used? 3.12 (single-phase init/static types) was almost as slow as main in the given test for me once
@neonene Your test is reasonable! Subclasses are so slow within 3.9 that
The slowdown for subclasses within main is something like 4.5 times for float and 1.9 times for decimal. That seems to suggest, as I mentioned earlier, that certain things are better optimized for base class
But in general, compared to 3.9, subclasses themselves seem to have gotten faster in main, so all comparisons are a bit messy.
Regarding the module-state-access after

```c
PyObject *module = PyType_GetModule(type);
if (!module || _PyModule_GetDef(module) != module_def) {
    PyErr_Clear();
    // Subclasses can ignore the overhead for now? Otherwise,
    // the type should cache the module state (refcount-free) with the def.
    module = PyType_GetModuleByDef(type, module_def);
}
module_state *state = _PyModule_GetState(module);
```

Currently, a faster version of
In 2018, I wrote a change to use the FASTCALL calling convention, which made the telco benchmark 22% faster. But Stefan Krah was the module maintainer and was against the idea, so I gave up. See issue gh-73487.
If the PyType_GetModuleByDef() overhead (compared to the previous version, which didn't use it) is too high, we can directly cache important data (such as
Yes, I already suggested it; we do this in the sqlite3 extension module with great success.
I see Serhiy's post points to a bpo post where Stefan disapproved of applying Argument Clinic only. Well, I used Argument Clinic extensively in my patch for speeding up
That would be using
That is an extraordinary comment from Mr. Stinner. It is entirely irrelevant to this issue:
In short, bringing up a heavily editorialized version of the actual events just distracts from the problems at hand.
I withdraw from this discussion; I am no longer working on this issue.
The cache in a known object is mainly used to reduce function arguments in the call chains, right? Indeed, many
I think a cache in the type instance can be used in the tp slot functions, where
I think we can measure/consider the boost by
I reacted to a comment about METH_METHOD. For me, a common way to use METH_METHOD is to use Argument Clinic with
@neonene Thanks for doing this! AC does not seem to affect the number methods (
I cannot reproduce the stated 22% improvement for the telco benchmark (telco.py full) either; I get something around 10%. If you want to work on a patch without AC, I can review it. The strategy suggested earlier of caching the state in the decimal object seems to be the most fruitful (intuitively; the actual patch will require some experimentation). Generally, it would be good to have a reference module in Python core that does not use AC. AC is unused/not available in the vast majority of C extensions. Many modules in the scientific ecosystem may want to use module state and vectorcall, but not AC.
I take that back.
```python
def pi():
    import _decimal
    D = _decimal.Decimal
    for i in range(10000):
        lasts, t, s, n, na, d, da = D(0), D(3), D(3), D(1), D(0), D(0), D(24)
        while s != lasts:
            lasts = s
            n, na = n + na, na + 8
            d, da = d + da, da + 32
            t = (t * n) / d
            s += t
```

Module state access related? functions:
I'm getting a smaller speedup by using
@ericsnowcurrently, is there a reason why
Well, same as Erlend for me: I unsubscribe. I'm not interested in working on an issue with such a tone in the discussion.
I'm not aware of any reason. It isn't a GC type, right?
Yes, up to
I now found some explanations in PEP 630, but I still do not understand it:
How would one create a cycle?
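To make the cycle question concrete, here is a pure-Python analog (the class and attribute names are my own, and Python classes are always GC-tracked, which is precisely what a non-GC C heap type is not, so the equivalent cycle in C would never be collected):

```python
import gc
import weakref

class Widget:          # stands in for a heap type
    pass

# The class dict references an instance, and the instance references the
# class back via __class__: a reference cycle only the GC can break.
Widget.sample = Widget()
probe = weakref.ref(Widget.sample)

del Widget             # drop the last external reference to the class
gc.collect()           # the type<->instance cycle is collected here
print(probe())         # the instance is gone
```

No instance attribute was needed: the cycle runs through the type object itself, which is why heap types that can participate in such cycles need GC support.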
Ah, there are definitely some subtleties involved with GC types, especially when switching from a non-GC type. I don't recall the specifics, but I'm sure @encukou does. |
It might be more interesting to start with static (non-GC) types: these are “immortal” (statically allocated), but they do have a bunch of dynamically allocated, interpreter-specific data (e.g. subclasses, weakrefs). Getting this data properly cleaned up requires rather special hacks, appropriate for core types that really are created/destroyed with the interpreter itself, rather than extensions loaded on demand (see code around
This is a generic statement. A class with no
PEP 573 and
(lines 5526 to 5534 at 20eaf4d)
To make
I agree that
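For intuition, a pure-Python sketch (my own simplification; CPython's real implementation is C code that compares `PyModuleDef` pointers, not module names) of the MRO walk that `PyType_GetModuleByDef` performs:

```python
import sys
import decimal

def get_module_by_def(tp, module_name):
    """Walk tp's MRO and return the module of the first base defined
    in module_name -- a name-based stand-in for PyType_GetModuleByDef."""
    for base in tp.__mro__:
        mod = sys.modules.get(base.__module__)
        if mod is not None and mod.__name__ == module_name:
            return mod
    raise TypeError(f"no base of {tp.__name__!r} comes from {module_name!r}")

class MyDecimal(decimal.Decimal):   # a subclass defined outside 'decimal'
    pass

# The subclass itself is not from 'decimal', so the lookup must walk one
# step up the MRO -- this extra walking is the overhead discussed above.
print(get_module_by_def(MyDecimal, "decimal"))
```

For the base type the walk terminates at the first MRO entry; for deep subclass hierarchies it grows with the number of bases visited, which is why caching the result has been suggested.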
FYI: skrah has been banned from Python spaces. Quoting details from my message posted on a duplicate-ish issue:

"""Stefan was informed of this privately via email and re-offered the opportunity to apologize for past actions and to accept our code of conduct going forwards, per the original 2020 ban's terms and notification. We have not received a reply.

I'm posting this here as we view replying in the same forum where issues recently surfaced a reasonable way to notify past, present, and potential future participants of an incident response. This says nothing for or against the validity of this issue. A new issue should be opened if people wish to proceed. I'm locking the comments on this one.

For information on what the GitHub concept of a ban means, read the GitHub org ban docs.

-- The Python Steering Council, 2024 edition"""
Bug report

Bug description:

The pi benchmark in Modules/_decimal/tests/bench.py is up to 31% slower with --enable-gil and up to 45% slower with --disable-gil, compared to Python 3.9 as the baseline. Change the number of iterations to 100000 for more stable benchmarks. This is heavily OS- and hardware-dependent; the worst case cited above was observed on a six-core i5-8600K.

CPython versions tested on:
CPython main branch

Operating systems tested on:
Linux

Linked PRs
- METH_METHOD calling convention in _decimal (#115196)
- _decimal (#115401)