Replies: 1 comment 6 replies
-
The speedup starts at 2.4x for an array size of 1 and then diminishes toward 1 as the arrays grow. This is what I would expect, as the proportional cost of dispatching is reduced.
-
I'm starting to wonder what it would take to extend adaptive quickening to a deeply-polymorphic data structure library such as Numpy. Some of this started in discussions with @markshannon so I won't take all the credit/blame :)
In some sense, this is a further generalization of the original idea proposed in "Allow custom `BINARY_OP`". Unlike that proposal, which is still based on Python types, a general "dispatch to specialized function calls based on the Python types of the arguments" wouldn't be very helpful for Numpy. Numpy almost exclusively uses a single Python type, `ndarray`, and most function calls dispatch to a specific C function implementation based on the `ndarray`'s `dtype` (a C-like numeric type, but it could also be a `struct`), `stride`, and `shape`. This mapping is very specific, and something only Numpy can really address, but it has similar properties in that, for the same call site, it would be typical to see the same `dtype`/`stride`/`shape` triple appearing over and over again, so calculating how to perform the computation each time is redundant work.
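To put the "same triple over and over" claim in concrete terms, here is a tiny Python-level illustration. It relies only on the public `dtype`/`shape`/`strides` attributes and is not how Numpy's C-level dispatch actually works:

```python
import numpy as np

# Even when the array objects differ on every call, the (dtype, shape, strides)
# triple seen at a given call site usually does not, so a dispatch decision
# keyed on it could be computed once and reused.
keys = set()
for _ in range(1000):
    a = np.random.rand(8)          # a fresh array each iteration...
    b = np.random.rand(8)
    keys.add((a.dtype, a.shape, a.strides, b.dtype, b.shape, b.strides))
print(len(keys))                   # ...but only 1 distinct dispatch key
```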
It's probably not worth discussing implementations yet; however, here are my hand-wavy assumptions about what might be possible:

- A new `PyMethodDef` type called `METH_SPECIALIZABLE`, which has extra parameters to handle a new quickening API. There would also need to be some metadata to specify how many cache entries the method needs, so space could be allocated in the bytecode.
- When the `CALL_ADAPTIVE` bytecode specializes and finds a `METH_SPECIALIZABLE` callable, it passes an argument to the method to tell it to also specialize itself based on the current arguments. The method could fill in a buffer of cache entries, one of which, in most cases, would be a function pointer to a more specialized call (sketched below).
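To make that a bit less hand-wavy, here is a rough, purely illustrative Python model of the proposed control flow. None of these names or hooks exist: `METH_SPECIALIZABLE` and `CALL_ADAPTIVE` are only being imitated by ordinary Python classes, the warm-up threshold of 8 is arbitrary, and the "specialized" entry point is just `np.multiply` again rather than a pre-resolved inner loop:

```python
import numpy as np

class SpecializableMultiply:
    """Plays the role of a C method declared with the hypothetical METH_SPECIALIZABLE."""
    cache_entries_needed = 1   # metadata: how many cache slots to reserve in the bytecode

    def __call__(self, a, b):
        # Unspecialized path: the full dispatch work happens on every call.
        return np.multiply(a, b)

    def specialize(self, cache, a, b):
        # Asked by the adaptive call site to specialize on the current arguments:
        # record a guard key plus a "function pointer" to a faster entry point.
        key = (a.dtype, a.shape, a.strides, b.dtype, b.shape, b.strides)
        cache[0] = (key, np.multiply)  # real code would store a pre-resolved inner loop


class AdaptiveCallSite:
    """Plays the role of a CALL_ADAPTIVE instruction and its inline cache."""
    def __init__(self, func):
        self.func, self.calls = func, 0
        self.cache = [None] * func.cache_entries_needed

    def call(self, a, b):
        entry = self.cache[0]
        if entry is not None:
            key, fast = entry
            if key == (a.dtype, a.shape, a.strides, b.dtype, b.shape, b.strides):
                return fast(a, b)              # quickened path: guard passed
            self.cache[0] = None               # guard failed: de-specialize
        self.calls += 1
        if self.calls >= 8:                    # warm enough: ask the method to specialize
            self.func.specialize(self.cache, a, b)
        return self.func(a, b)                 # unspecialized path


site = AdaptiveCallSite(SpecializableMultiply())
x = np.arange(4.0)
for _ in range(10):
    site.call(x, x)                            # after 8 calls, hits the cached fast path
```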
Opportunity sizing:

I've run one simple experiment to see if this would be worth the effort. I hacked Numpy so that the results of all of the delegation within the `np.multiply` function are "cached" to a static variable after 8 calls, and from then on, only the core computation is performed each time (see my HACK-cache branch). Then I timed how long it takes to multiply two arrays of different sizes together:

This gives us a sort of "best case scenario" if the scheme outlined above is even possible. It obviously ignores the overhead of the specialization / unspecialization checks.
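For reference, the measurement described above looks roughly like the following; the array sizes and repeat counts are arbitrary, and this times stock `np.multiply` rather than the HACK-cache branch, which would have to be built separately for a comparison:

```python
import numpy as np
from timeit import timeit

# Time np.multiply across a range of array sizes to see where dispatch
# overhead dominates versus where the core computation dominates.
for n in [1, 10, 100, 1_000, 10_000, 100_000]:
    a = np.random.rand(n)
    b = np.random.rand(n)
    reps = 10_000
    t = timeit(lambda: np.multiply(a, b), number=reps)
    print(f"size {n:>7}: {t / reps * 1e9:8.1f} ns per call")
```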
This seems really promising for small arrays where the dispatch calculation dominates. These are an important class of operations -- "scalar" arrays are really common in real-world code. It certainly would be reasonable for Numpy to not quicken when the arrays are larger than a certain threshold.
The next experiment I hope to run is to take some real-world benchmarks and see how often call sites are non-polymorphic as a way to estimate how often we could expect specific call sites to be quickened.