-
-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support vectorcall protocol #390
Conversation
2f6531c
to
9ad4d3c
Compare
@hodgestar left some comments in the IRC channel. I'm posting them here for documentation: Every time I look at the old C API for it, I go "arg" a lot. It feels more like a perfomance hack that got exposed than an API. However, I'm also not sure what to do about it. @fangerer Do you have an important / good example use case for the per-instance vector call? What would prevent people who want per-instance calls from just adding their own C function pointer to their struct and doing it themselves? It feels like we know that a better way is to have our "argument clinic-esque" API for JITs and similar, but that is a lot of work. :/ Maybe a goal for now is to be sure we can replace the implementation of vectorcall in HPy with the argument clinic APIs later without breaking compatibility. Would it be possible to remove |
@hodgestar: Here are my answers:
I don't have a real world example. I think the idea would be that you can have specialized call func impls depending on the object's data. The PEP says: "Another source of inefficiency in the tp_call convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created."
Sure and sounds good to me since it makes it very clear.
I'm not sure if we even need to take caution concerning compatibility with arg clinic. I think there are two aspects:
Or did I misunderstand your comment? |
The author of nanobind asks for this in CPython stable ABI in here: https://discuss.python.org/t/ideas-for-forward-compatible-and-fast-extension-libraries-in-python-3-12. IIRC he mentioned somewhere that nanobind uses/can use this for all functions (my possibly wrong understanding: every function is a separate object with vectorcall).
I think there is one more thing to vectorcall (and again, maybe I just misunderstand it :-)). Citing from PEP-590: Another source of inefficiency in the tp_call convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created. For a class cls, at least one intermediate object is created for each call in the sequence type.call, cls.new, cls.init. |
I can give one example from nanobind: its function object dispatches calls to C++ using either a simple implementation (only positional arguments supported) or a complex implementation (with handling of default values, keyword arguments, variable argument count, etc.) that is significantly slower. When a new function object is created, it sets the appropriate vector call dispatcher based on the properties of the function. |
This has conflicts. Does it replace #251? |
@mattip: No, this PR is basically about supporting something like As far as I got from the discussions here and in the dev calls: we are still not sure if this is the way to go. I still think the changes in this PR make sense because of following reasons:
Anyway, before merging this, I would like to have more feedback. In particular from @antocuni .
They would be easy to resolve if we decide to merge this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general it looks very good to me, thanks for doing this work!
And thanks for the many and detailed comments, they really helped to understand what's going on.
I think there are some details to be discussed though; I raised some concerns also in the inline comments below.
The biggest question IMHO is: do we really need separate slots for HPy_tp_call
and HPy_tp_vectorcall_default
? The two signatures are already very similar: if I understand correctly the only difference is that HPyFunc_keywords
take a full dictionary of keywords, while HPyFunc_vectorcallfunc
takes a list of keyword names.
We could "tweak" the existing HPyFunc_KEYWORDS
calling convention to be compatible with vectorcall: in that way, the end user would just implement HPy_tp_call
, and we would automatically use tp_vectorcall
under the hood.
If we do in that direction, the only "real" difference between supporting vectorcall or not is the ability of setting a per-object function.
hpy/devel/include/hpy/hpydef.h
Outdated
typedef struct { | ||
cpy_vectorcallfunc cpy_trampoline; | ||
HPyFunc_vectorcallfunc impl; | ||
} HPyVectorcall; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need a special struct for this instead of reusing HPySlot
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because in the manual case, you need to include this struct (I've renamed it to HPyCallFunction
) in the type's struct like this:
typedef struct {
int member0;
int member1;
// ...
HPyCallFunction callfunc;
// ...
}
In this case, the struct shouldn't be larger than necessary which would be the case if we use HPyDef
. For consistency, I'm also using HPyCallFunction *
in HPyVectorcall_Set
(name of this function is subject to be changed). We could use a slot definition there but that seems to be more confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, got it and it makes a lot of sense.
The only drawback which I see is that by doing this we are wasting a pointer "forever": HPyCallFunction
contains a pointer to the cpy_trampoline, which is needed right now to support CPython, but it's not needed at all by alternative implementations and it might not be needed even by CPython itself in the future (in case they decide to support hpy natively).
Note that it's slightly different than the HPySlot
case, because HPySlot
s are in fixed number (and stored in static data), while this impose an extra cost to the runtime objects.
However, I don't really know how to solve this issue, so I suppose it is fine to keep as is for now, and see whether we can improve it later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, that's a good point. We don't really need the both function pointers in general.
I think about how we can reduce that to void sizeof(void *)
slot but yes, let's merge this PR and do it in a follow-up (since the PR is already pretty large).
Not done yet but pushed if people are interested on the progress and to trigger the tests. |
a9b8453
to
05cb57c
Compare
…et__' was specified
Big update on the PR. I've addressed most of @antocuni 's points. Here is a summary:
Some other remarks:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good job! LGTM :)
Resolves #389 .
This PR enables HPy extensions to implement the vectorcall protocol for HPy types.
To be clear, this is about providing the possibility to implement the vectorcall protocol on the receiver side (i.e. on the type) as in PEP 590.
In contrast to the C API, I tried to make the common case as easy as possible.
Implementing the vectorcall protocol in HPy is now (in the common case) very simple and looks like this:
That's it.
As you might notice, this means that each instance of the type will automatically use the same vectorcall function implementation but one important improvement of PEP 590 is that you can have different (maybe specialized) vectorcall function implementations per object.
In order to provide this flexibility, I've introduced API function
HPyVectorcall_Set
that allows to set an arbitrary vectorcall function on an object. For example:As indicated in the above example,
HPyVectorcall_Set
is meant to be used in the object constructor but there is no restriction when it can be used.Macro
HPyVectorcall_FUNCTION
is encouraged to be used since it generates the appropriate CPython trampoline and fills the requiredHPyVectorcall
struct.Some more explanation
HPyDef_VECTORCALL(SYM)
is an alias forHPyDef_SLOT(SYM, HPy_tp_vectorcall_default)
. So, this just defines an HPy-specific slotHPy_tp_vectorcall_default
which is the default vectorcall function that will be used for all objects.If
ctx_Type_FromSpec
recognized this slot, following happens behind the scenes:vectorcallfunc
) will be added (at the end) to the CPython object. This increases the basic size (bysizeof(vectorcallfunc)
). It is appended to the object because otherwise the*_AsStruct
calls would return an incorrect pointer.Py_TPFLAGS_HAVE_VECTORCALL
is set automatically__vectorcalloffset__
will be added to the C API slots automatically (using the offset of the hidden field).HPy_tp_new
, we assume thatHPy_New
will be used for allocation which will take care of writing the default vectorcall function pointer to the object (see ctx_type.c:1408).HPy_tp_new
is not provided, we wrap the inheritedtp_new
function withhpyobject_new
(see ctx_type.c:265) which takes care of that.Restrictions
tp_call
function and delegate to the vectorcall impl.HPyDef_VECTORCALL
with var objects (i.e. whereitemsize > 0
). Right now, this is not a big restriction since HPy does not really support them. Howver, it is still possible to do the manual way similar to how it is done in the C API: add a fieldHPyVectorcall vectorcall
to the type's struct, define member__vectorcalloffset__
, set flagHPy_TPFLAGS_HAVE_VECTORCALL
, ...). This is also covered by a test.Misc
From a performance point of view, object creation should not be significantly slower (compared to CPython's vectorcall API) because if (1) the vectorcall protocol is not implemented, we just do an additional type flag check, and if (2) the protocol is implemented, we might do an additional write to the hidden field in case the user overwrites the default function.
I still did not write documentation about that. I will do in a follow-up PR.