-
-
Notifications
You must be signed in to change notification settings - Fork 30.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Quadratic time internal base conversions #90716
Comments
Our internal base conversion algorithms between power-of-2 and non-power-of-2 bases are quadratic time, and that's been annoying forever ;-) This applies to int<->str and int<->decimal.Decimal conversions. Sometimes the conversion is implicit, like when comparing an int to a Decimal. For example:
I gave up after waiting for over 8 hours, and the computation apparently can't be interrupted. In contrast, using the function in the attached todecstr.py gets the result in under a minute: >>> a = 1 << 1000000000
>>> s = todecstr(a)
>>> len(s)
301029996 That builds an equal decimal.Decimal in a "clever" recursive way, and then just applies str to _that_. That's actually a best case for the function, which gets major benefit from the mountains of trailing 0 bits. A worst case is all 1-bits, but that still finishes in under 5 minutes: >>> a = 1 << 1000000000
>>> s2 = todecstr(a - 1)
>>> len(s2)
301029996
>>> s[-10:], s2[-10:]
('1787109376', '1787109375') A similar kind of function could certainly be written to convert from Decimal to int much faster, but it would probably be less effective. These things avoid explicit division entirely, but fat multiplies are key, and Decimal implements a fancier * algorithm than Karatsuba. Not for the faint of heart ;-) |
Is this similar to https://bugs.python.org/issue3451 ? |
Dennis, partly, although that was more aimed at speeding division, while the approach here doesn't use division at all. However, thinking about it, the implementation I attached doesn't actually for many cases (it doesn't build as much of the power tree in advance as may be needed). Which I missed because all the test cases I tried had mountains of trailing 0 or 1 bits, not mixtures. So I'm closing this anyway, at least until I can dream up an approach that always works. Thanks! |
Changed the code so that inner() only references one of the O(log log n) powers of 2 we actually precomputed (it could get lost before if |
The test case here is a = (1 << 100000000) - 1, a solid string of 100 million 1 bits. The goal is to convert to a decimal string. Methods: native: str(a) numeral: the Python numeral() function from bpo-3451's div.py after adapting to use the Python divmod_fast() from the same report's fast_div.py. todecstr: from the Python file attached to this report. gmp: str() applied to gmpy2.mpz(a). Timings: native: don't know; gave up after waiting over 2 1/2 hours. So there's room for improvement ;-) But here's the thing: I've lost count of how many times someone has whipped up a pure-Python implementation of a bigint algorithm that leaves CPython in the dust. And they're generally pretty easy in Python. But then they die there, because converting to C is soul-crushing, losing the beauty and elegance and compactness to mountains of low-level details of memory-management, refcounting, and checking for errors after every tiny operation. So a new question in this endless dilemma: _why_ do we need to convert to C? Why not leave the extreme cases to far-easier to write and maintain Python code? When we're cutting runtime from hours down to minutes, we're focusing on entirely the wrong end to not settle for 2 minutes because it may be theoretically possible to cut that to 1 minute by resorting to C. (*) I hope this algorithm tickles you by defying expectations ;-) It essentially stands |
Addendum: the "native" time (for built in str(a)) in the msg above turned out to be over 3 hours and 50 minutes. |
Somebody pointed me to V8's implementation of str(bigint) today: https://github.com/v8/v8/blob/main/src/bigint/tostring.cc They say that they can compute str(factorial(1_000_000)) (which is 5.5 million decimal digits) in 1.5s: https://twitter.com/JakobKummerow/status/1487872478076620800 As far as I understand the code (I suck at C++) they recursively split the bigint into halves using % 10^n at each recursion step, but pre-compute and cache the divisors' inverses. |
The factorial of a million is much smaller than the case I was looking at. Here are rough timings on my box, for computing the decimal string from the bigint (and, yes, they all return the same string): native: 475 seconds (about 8 minutes) "They recursively split the bigint into halves using % 10^n at each recursion step". That's the standard trick for "output" conversions. Beyond that, there are different ways to try to use "fat" multiplications instead of division. The recursive splitting all on its own can help, but dramatic speedups need dramatically faster multiplication. todecstr treats it as an "input" conversion instead, using the |
Worth pointing this out since it doesn't seem widely known: "input" base conversions are _generally_ faster than "output" ones. Working in the destination base (or a power of it) is generally simpler. In the math.factorial(1000000) example, it takes CPython more than 3x longer for str() to convert it to base 10 than for int() to reconstruct the bigint from that string. Not an O() thing (they're both quadratic time in CPython today). |
Hi, why is this issue closed? What needs to be done to make it through? |
It's closed because nobody appears to be both willing and able to pursue it. But it's on the edge regardless. In general, any number of huge-int algorithms could be greatly speeded, but that's a massive undertaking. It's why GMP exists, to push such things as far as possible, regardless of implementation complexity, effort, or bulk. Very capable GMP bindings for Python are already available ( As the timings here suggest, GMP will generally be faster than anything we may do anyway, because GMP never reaches a point where its authors say "good enough already". That said, I expect the single most valuable bigint speedup CPython could implement would be to bigint division, along the lines of gh-47701. That could indirectly give major speed boosts to bigint->str and bigint modular |
Why do you need fast division? Why not just implement str to int convertion using this
style of d&c? This would be simple to implement and would only require multiplication. If If multiplication is done using Karatsuba ( Since Python currently uses Karatsuba for its big ints multiplication, this simple str to int convertion algorithm would run in |
Yep, even simplest divide and conquer with native multiplication would already provide significant speed-up. |
Also if you want something slightly smarter, since 10 is
|
I made an implementation of the basic d&q algorithm to test it out pow5 = [5]
while len(pow5) <= 22:
pow5.append(pow5[-1] * pow5[-1])
def str_to_int(s):
def _str_to_int(l, r):
if r - l <= 3000:
return int(s[l:r])
lg_split = (r - l - 1).bit_length() - 1
split = 1 << lg_split
return ((_str_to_int(l, r - split) * pow5[lg_split]) << split) + _str_to_int(r - split, r)
return _str_to_int(0, len(s)) Running this locally, Clearly |
I also tried out For 40000 digits str_to_int with GMP takes 0.000432 s. So what Python actually needs is faster big int multiplication. With that you'd get a really fast string to int converter for free. |
All the earlier timing examples here were of int->str, not of str->int. For the latter case, which you're looking at, splitting a decimal string by a power of 10 is indeed trivial (conceptually - but by the time you finish coding it all in C, with all the memory-management, refcounting, and error-checking cruft debugged, not so much 😉). As already noted, the report's But faster native bigint division could nevertheless buy a major speedup for int->str, as already demonstrated near the start here by showing timings for the And it could also buy major speedups for modular And it would, tautologically, speed bigint division too, which is a bottleneck all on its own in some apps. |
It would be useful to have practical examples of actual Python applications that would benefit from any of: Including if they've adopted a third party library for this math and would no longer have any need for that if improved. Where "high performance" means more than constant-time faster than what CPython Bigint's really feel like a neat computer science toy most of the time. We've got them, but what actually uses integers larger than 64 or 128 bits? Similarly, once numbers get huge, what is the point of converting them to and from decimal? Decimal is for humans and humans don't readily comprehend precise numbers that large let alone type them correctly. |
"Practical applications" isn't really the name of this game 😉. Most complaints come from more-or-less newbies, who, e.g., play around in an interactive shell, and are surprised to find that sometimes the shell seems to freeze (but is really just waiting for a quadratic-time, or worse, bigint operation to finish). People who know what they're doing find many uses for big bigints, but more of a theoretical, research, educational, or recreational bent. Like, algorithm design, combinatorics, number theory, or contributing to the wonderful On-Line Encyclopedia of Integer Sequences. They also come as incidental side effects of using power tools, like in a symbolic math package computing symbolic power series or (for related reasons) symbolic high derivatives (integer coefficients and/or denominators often grow at an exponential rate across terms). Or they can be almost the whole banana, such as scaling an ill-conditioned float matrix to use exact bigint arithmetic instead to compute an exact matrix inverse or determinant (where sticking to floats the result could be pure noise due to compounding rounding errors). Not all that long ago, a number of people worked on adding "gonzo" optimizations to |
One good example of using big integers are math people using Python for some quick calculations. When you use functions like factorial, it is easy to get pretty big integers. I also think that math people generally prefer decimal numbers over hex. Another example would be working with fractions. Just adding fractional numbers makes their denominators and numerators grow really quickly. I also don't think fractions support hex. Both reading and printing requires decimal as far as I can tell. For example you can do The examples I've given above is the type of scripts that wont appear in things like Google's codebase, even if they are relatively common use cases. Another important thing to note is that programmers trust the built in functions in Python to just work. That is the fundamental reason why |
I think Tim's suggestion of switching to a Python implementation of a more efficient algorithm for large inputs would be a good approach. It is reasonable to not want to maintain a complicated high-performance algorithm written in C. This would be an interesting project for a student or someone wanting a challenge. Calling into Python from C is not too hard to do. Use a C-API to import a Python module and call a function inside of it. |
Because I won't repeat yet again that ambition has forever been the fatal enemy of making real progress visible to CPython users here. |
Thanks - but sheesh 😉. The original entry in this report noted that Making the change does help a little, but, as you noted, tends to get lost in the noise on "really big" inputs. It cuts some tens of thousands of digits off the smallest value at which the |
Notes on SSA and base conversion:
For an example of the last, I got much better str_to_int-adapted-to-use-SSA behavior after adding this to the start of if x1.bit_length() <= 10000 or x2.bit_length() <= 10000:
p = x1 * x2
return shift(p, M, 0) if M is not None else p By pursuing small things like that by hand for each different digit-string-length tried, it was easy enough to contrive a So:
|
I think this problem can be solved in the same way as for GMP although perhaps done more simply. Just run a bunch of timings and choose thresholds like: if N < 500:
n = 64
elif N < 1000:
n =128
# etc. That's how GMP used to work. The way GMP does it now is more complicated but that change in GMP is only credited with a smallish factor improvement. The paper describing that and other improvements is here: The appropriate comparison when considering if it is worth using something like the SSA implementation above is not really with GMP but rather with the existing CPython implementation. In other words the bar to meet is that the code should be correct and then its speed should be not worse in any case and should be significantly better in at least some cases (to justify making any change at all).
Fair enough. Well if Neal's PR currently gives the fastest |
Add Python implementations of certain longobject.c functions. These use asymptotically faster algorithms that can be used for operations on integers with many digits. In those cases, the performance overhead of the Python implementation is not significant since the asymptotic behavior is what dominates runtime. Functions provided by this module should be considered private and not part of any public API. Co-author: Tim Peters <tim.peters@gmail.com> Co-author: Mark Dickinson <dickinsm@gmail.com> Co-author: Bjorn Martinsson
I like I'm just curious about |
Yeah, I think it could be removed now since it is not very useful. |
Ok, I created PR #99063 to remove _pylong._DEBUG flag. |
To debug the _pylong module, it's trivial to add this code again locally. There is not need to keep it in Python releases.
* Properly decref on _pylong import error. * Improve the error message on _pylong TypeError. * Tie the return value comments together. These are minor followups to issues not caught among the reviewers on python#96673.
* Properly decref on _pylong import error. * Improve the error message on _pylong TypeError. * Fix the assertion error in pydebug builds to be a TypeError. * Tie the return value comments together. These are minor followups to issues not caught among the reviewers on #96673.
Fix validated by: $ ./python -m test -R 3:3 test_int Tests result: SUCCESS
Hello, I have written a fast O(n lg n) integer multiplication code in Rust, using number-theoretic transform (NTT) with 64-bit primes. Code is available here. I ran some benchmarks, and surprisingly, it seems to outperform GMP 6.2.1 on my computer (Ryzen 7 2700X). The explanations for implementation detail can be found in the pull request I have opened in the rust-num/num-bigint repository. There I have attached a graph comparing the speed of this new code with GMP. Importantly, the code does not use any inline assemblies or architecture-specific intrinsics. Although some integer operations unavailable in portable C have been used, it should be possible to replace them with a few if-else statements at the expense of performance hit at most 2-4x (or much less, depending on the compiler optimization), which is suboptimal but still much better than the current CPython implementation. Note that the code correctly handles unbalanced multiplication according to the strategy explained here. Also, the padding is minimized (at most 7%) by using non-power-of-two radices. Although CPython currently uses 30-bit digits internally, this can be repacked into 64-bit words, ran through the new code, and transformed back to 30-bit digits. Since this transformation takes O(n) while the multiplication takes O(n lg n), there should be only minimal overhead for repacking. There should be no license issue since I took care to use only MIT-licensed codes. I'd like to port the code to C if there is enough interest for adopting my implementation in CPython. Please let me know. Thanks. |
Fantastic!
I am not a CPython core developer myself but I would personally be very interested to see CPython have something like this. As you rightly imply the balance for CPython is best served by having a simple, portable implementation that does not need to have state of the art performance but does have good asymptotic complexity to avoid "surprising slowness" (essentially the root of the security concerns that precipitated this thread). I expect that the Python core development team would prefer for this to be something maintained outside of CPython e.g. like a library that CPython could depend on. That would also benefit many other projects as well because I think that there is a wide need for at least reasonable large integer performance that is more portable and has a more liberal license than GMP. Would you consider maintaining this as an independent C library?
I wonder if this is related to compiling directly for your local CPU rather than making a generic binary that is redistributable. One of the most important features of GMP is that it can bundle many different hand-written assembly versions for different CPU architectures in a "fat" build ( (Probably this last point about fat builds is not relevant for CPython.) |
This issue is about non-binary base conversions more so that multiplication. Regardless, it is ideal for that kind of thing to be a Rust and/or C library of its own. If we wanted to use it for anything we could depend upon that or vendor our own copy. |
The way that large integer algorithms usually work is that pretty much all operations that are not trivially |
Yeah, it would be of course fine to wrap the code into an external library.
I have tested the same executable on two more platforms (i5-9400 and Ryzen 5 7500F). The relative speed advantage stayed the same for the multiplication of very large integers, although below a few million bits GMP was 30-40% faster on those platforms. I'm not sure whether gmpy2 used architecture-specific routines; I just copied the Python benchmark code above. |
It seems the num-bigint crate attaches to Python 3 quite well with PyO3. The following image shows the performance of the new multiplication implementation on Ryzen 7 2700X, averaged over 10 runs. For small numbers, the num-bigint crate uses naive, Karatsuba, and Toom-3 before switching to number-theoretic transform. The cross-over point over the CPython native multiplication (version 3.11.3) is around 3,000 bits. The binding I used is very simple. Further optimizations may be possible. use pyo3::prelude::*;
use num_bigint::BigInt;
#[pyfunction]
fn mul(x: BigInt, y: BigInt) -> BigInt {
&x * &y
}
#[pymodule]
fn num_bigint_pyo3(_py: Python, m: &PyModule) -> PyResult<()> {
m.add_function(wrap_pyfunction!(mul, m)?)?;
Ok(())
} import num_bigint_pyo3
a, b = 10**100000, 9**100000
x = num_bigint_pyo3.mul(a, b)
assert x == a*b |
Here are the str -> int conversion timings with the new multiplication routine.
Codeimport sys
sys.set_int_max_str_digits(0)
import num_bigint_pyo3
from decimal import *
from time import time
from random import choices
import numpy as np
setcontext(Context(prec=MAX_PREC, Emax=MAX_EMAX, Emin=MIN_EMIN))
pow5 = [5]
while len(pow5) <= 23:
pow5.append(num_bigint_pyo3.mul(pow5[-1], pow5[-1]))
def str_to_int_new_mul(s):
def _str_to_int(l, r):
if r - l <= 3000:
return int(s[l:r])
lg_split = (r - l - 1).bit_length() - 1
split = 1 << lg_split
return (num_bigint_pyo3.mul(_str_to_int(l, r - split), pow5[lg_split]) << split) + _str_to_int(r - split, r)
return _str_to_int(0, len(s))
def str_to_int(s):
def _str_to_int(l, r):
if r - l <= 3000:
return int(s[l:r])
lg_split = (r - l - 1).bit_length() - 1
split = 1 << lg_split
return ((_str_to_int(l, r - split) * pow5[lg_split]) << split) + _str_to_int(r - split, r)
return _str_to_int(0, len(s))
def str_to_int_using_decimal(s, bits=1024):
d = Decimal(s)
div = Decimal(2) ** bits
divs = []
while div <= d:
divs.append(div)
div *= div
digits = [d]
for div in reversed(divs):
digits = [x
for digit in digits
for x in divmod(digit, div)]
if not digits[0]:
del digits[0]
b = b''.join(int(digit).to_bytes(bits//8, 'big')
for digit in digits)
return int.from_bytes(b, 'big')
# Benchmark against int() on random 1-million digits string
funcs = [int, str_to_int_using_decimal, str_to_int, str_to_int_new_mul]
s = ''.join(choices('123456789', k=1_000_000))
expect = None
for f in funcs:
t_list = []
for _ in range(10):
t = time()
result = f(s)
t_elapsed = time() - t # f.__name__)
if expect is None:
expect = result
assert result == expect
t_list.append(t_elapsed)
mean, std = np.mean(t_list), np.std(t_list)
print("{0}: {1:.9f} ± {2:.9f} sec".format(f.__name__.ljust(30), mean, std)) |
It looks very nice to me. The question is what you would propose to do going on from here. If the suggestion is that CPython might depend on this (even in an optional way) then there are many details that would need to be worked out before that could be considered. |
Yes, I'd love to see CPython's multiplication improved asymptotically. It would also be great to improve other arithmetic operations too, but as a first step, I think it would be appropriate to limit the scope to multiplication. Currently the binding through PyO3 receives the raw bits of the integer through _PyLong_ToByteArray and sends it back to Python via _PyLong_FromByteArray. I think that mechanism can still work in the C code, which might help lessen the maintenance burden since the existing APIs don't need to be modified substantially. On the other hand, more deliberation is needed to port the primitive operations from Rust to C. Rust provides portable primitives such as add-with-carry and 128bit integer. Although it is possible to access the higher 64bit of a 64 x 64 multiplication in most C compilers, this would not be portable. Switching to plain portable C (without compiler intrinsics) is possible but it comes with performance degradations. For memory allocation, one call of malloc (and a corresponding call of free) would suffice. The temporary buffer has a size linear in the input integer and doesn't need to be a PyObject instance. Please let me know of other details that need to be worked out. Thanks! |
Is there anything left to do in this issue or it can be closed? |
It's hard to tell because of the length and complexity of the discussion, but |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: