Replies: 1 comment 6 replies
-
The speedup starts at 2.4x for an array size of 1 and then diminishes toward 1 as the arrays grow. This is what I would expect, as the proportional cost of dispatching is reduced.
-
I'm starting to wonder what it would take to extend adaptive quickening to a deeply-polymorphic data structure library such as Numpy. Some of this started in discussions with @markshannon so I won't take all the credit/blame :)
In some sense, this is a further generalization of the original idea proposed in "Allow custom `BINARY_OP`". Unlike that proposal, which is still based on Python types, a general "dispatch to specialized function calls based on the Python types of the arguments" wouldn't be very helpful for Numpy. Numpy almost exclusively uses a single Python type, `ndarray`, and most function calls dispatch to a specific C function implementation based on the `ndarray`'s `dtype` (a C-like numeric type, but it could also be a `struct`), `stride`, and `shape`. This mapping is very specific, and something only Numpy can really address, but it has similar properties in that, for the same call site, it would be typical to see the same `dtype`/`stride`/`shape` triple appearing over and over again, so calculating how to perform the computation each time is redundant work.
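To put the "same triple over and over" claim in concrete terms, here is a tiny Python-level illustration. It relies only on the public `dtype`/`shape`/`strides` attributes and is not how Numpy's C-level dispatch actually works:

```python
import numpy as np

# Even when the array objects differ on every call, the (dtype, shape, strides)
# triple seen at a given call site usually does not, so a dispatch decision
# keyed on it could be computed once and reused.
keys = set()
for _ in range(1000):
    a = np.random.rand(8)          # a fresh array each iteration...
    b = np.random.rand(8)
    keys.add((a.dtype, a.shape, a.strides, b.dtype, b.shape, b.strides))
print(len(keys))                   # ...but only 1 distinct dispatch key
```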
It's probably not worth discussing implementations yet; however, here are my hand-wavy assumptions about what might be possible:

- A new `PyMethodDef` type called `METH_SPECIALIZABLE`, which has extra parameters to handle a new quickening API. There would also need to be some metadata to specify how many cache entries the method needs, so space could be allocated in the bytecode.
- When the `CALL_ADAPTIVE` bytecode specializes and finds a `METH_SPECIALIZABLE` callable, it passes an argument to the method to tell it to also specialize itself based on the current arguments. The method could fill in a buffer of cache entries, one of which, in most cases, would be a function pointer to a more specialized call (sketched below).
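To make that a bit less hand-wavy, here is a rough, purely illustrative Python model of the proposed control flow. None of these names or hooks exist: `METH_SPECIALIZABLE` and `CALL_ADAPTIVE` are only being imitated by ordinary Python classes, the warm-up threshold of 8 is arbitrary, and the "specialized" entry point is just `np.multiply` again rather than a pre-resolved inner loop:

```python
import numpy as np

class SpecializableMultiply:
    """Plays the role of a C method declared with the hypothetical METH_SPECIALIZABLE."""
    cache_entries_needed = 1   # metadata: how many cache slots to reserve in the bytecode

    def __call__(self, a, b):
        # Unspecialized path: the full dispatch work happens on every call.
        return np.multiply(a, b)

    def specialize(self, cache, a, b):
        # Asked by the adaptive call site to specialize on the current arguments:
        # record a guard key plus a "function pointer" to a faster entry point.
        key = (a.dtype, a.shape, a.strides, b.dtype, b.shape, b.strides)
        cache[0] = (key, np.multiply)  # real code would store a pre-resolved inner loop


class AdaptiveCallSite:
    """Plays the role of a CALL_ADAPTIVE instruction and its inline cache."""
    def __init__(self, func):
        self.func, self.calls = func, 0
        self.cache = [None] * func.cache_entries_needed

    def call(self, a, b):
        entry = self.cache[0]
        if entry is not None:
            key, fast = entry
            if key == (a.dtype, a.shape, a.strides, b.dtype, b.shape, b.strides):
                return fast(a, b)              # quickened path: guard passed
            self.cache[0] = None               # guard failed: de-specialize
        self.calls += 1
        if self.calls >= 8:                    # warm enough: ask the method to specialize
            self.func.specialize(self.cache, a, b)
        return self.func(a, b)                 # unspecialized path


site = AdaptiveCallSite(SpecializableMultiply())
x = np.arange(4.0)
for _ in range(10):
    site.call(x, x)                            # after 8 calls, hits the cached fast path
```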
Opportunity sizing:

I've run one simple experiment to see if this would be worth the effort. I hacked Numpy so that the results of all of the delegation within the `np.multiply` function are "cached" to a static variable after 8 calls, and from then on, only the core computation is performed each time (see my HACK-cache branch). Then I timed how long it takes to multiply two arrays of different sizes together:

This gives us a sort of "best case scenario" if the scheme outlined above is even possible. It obviously ignores the overhead of the specialization / unspecialization checks.
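For reference, the measurement described above looks roughly like the following; the array sizes and repeat counts are arbitrary, and this times stock `np.multiply` rather than the HACK-cache branch, which would have to be built separately for a comparison:

```python
import numpy as np
from timeit import timeit

# Time np.multiply across a range of array sizes to see where dispatch
# overhead dominates versus where the core computation dominates.
for n in [1, 10, 100, 1_000, 10_000, 100_000]:
    a = np.random.rand(n)
    b = np.random.rand(n)
    reps = 10_000
    t = timeit(lambda: np.multiply(a, b), number=reps)
    print(f"size {n:>7}: {t / reps * 1e9:8.1f} ns per call")
```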
This seems really promising for small arrays where the dispatch calculation dominates. These are an important class of operations -- "scalar" arrays are really common in real-world code. It certainly would be reasonable for Numpy to not quicken when the arrays are larger than a certain threshold.
The next experiment I hope to run is to take some real-world benchmarks and see how often call sites are non-polymorphic as a way to estimate how often we could expect specific call sites to be quickened.