We need to design an API to build str and bytes objects.
Before reading this issue, it is strongly suggested to read the document which summarizes the current CPython API for doing that and how they are used in the real world.
General goals
Whatever API we come up with, it needs to meet the following goals:
must have 0 or almost 0 overhead on CPython in CPython-ABI mode
must offer a clear and easy way to implement all the real-world usage patterns described here
must be easy and reasonable to implement for alternative implementations
In this issue, I'll use HPyBytes_* and HPyStr_*. See #213 for more discussion about naming.
Raw-buffer vs opaque API
There are many real-world patterns in which it is necessary to offer a raw buffer interface, so we need to provide it. However, we need to decide whether we want to always expose the raw buffer or only when the user explicitly needs/requires it.
Always exposing the raw buffer makes the API smaller and simpler to use.
Exposing it only upon request might allow for a more efficient implementation. For example, GraalPython could maybe use a java.lang.StringBuilder?
Open question: can PyPy/GraalPython/etc. optimize things better in the non-buffer case?
Proposal 1: "functional interface"
This is probably the most similar to the current API. The implementation on CPython is straightforward since HPyBytesBuilder and HPyStrBuilder can just be thin wrappers around PyBytesObject* and PyUnicodeObject*:
The biggest drawback is that at the time of _New, the implementation does not know yet whether the user will request a raw buffer or not.
Moreover, for the unicode case you would need different versions of HPyStr_GetBuffer, one for each kind: HPyStr_GetBuffer_UCS{1,2,4}, and you get undefined behavior if you call the wrong one (this problem also exists in the current CPython API, of course).
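To make the drawback concrete, here is a minimal, self-contained sketch of the functional style. It is not the HPy API: SketchBytesBuilder and the sketch_bytes_* functions are hypothetical names, and a plain heap buffer stands in for whatever the implementation would wrap.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a bytes builder. On CPython the real thing
   could wrap a PyBytesObject*; here it is just a heap buffer. */
typedef struct {
    size_t size;
    char *data;
} SketchBytesBuilder;

/* Allocate a builder for exactly `size` bytes (zero-filled).
   Returns NULL on allocation failure. */
static SketchBytesBuilder *sketch_bytes_new(size_t size) {
    SketchBytesBuilder *b = malloc(sizeof *b);
    if (b == NULL) return NULL;
    b->data = calloc(size ? size : 1, 1);
    if (b->data == NULL) { free(b); return NULL; }
    b->size = size;
    return b;
}

/* Opaque write: bounds-checked, returns 0 on success and -1 on error. */
static int sketch_bytes_write_char(SketchBytesBuilder *b, size_t i, char c) {
    if (i >= b->size) return -1;
    b->data[i] = c;
    return 0;
}

/* Raw-buffer access. Only at this point does the implementation learn
   that the caller wants a C-level buffer, which is too late to pick a
   different internal representation in sketch_bytes_new. */
static char *sketch_bytes_get_buffer(SketchBytesBuilder *b) {
    return b->data;
}

static void sketch_bytes_free(SketchBytesBuilder *b) {
    free(b->data);
    free(b);
}
```

Because sketch_bytes_new cannot tell which of the two access styles will follow, it has to allocate a real C buffer in every case.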
Proposal 2: type-safe API
For the vast majority of real-world usage of PyUnicode_New, the maxchar and consequently the kind are fixed and known at compile time. We could take advantage of that and propose an API which is more type-safe at the C level, e.g.:
In this proposal, the C type of HPyStrBuilder is always the same but it provides several different constructors: if you construct the builder with HPyStrBuilder_New you can only use the opaque API to write content into it. If you need a raw-buffer interface, you can use one of the other constructors, which also return a pointer to a buffer of the correct type.
Some additional notes:
1. For the _ASCII, _UCS1, etc. cases, you don't need to specify maxchar: it is already implicit.
2. The implementation knows in advance whether it needs to provide a C-level buffer or not.
3. It makes it impossible to use a raw-buffer interface if you don't know the kind in advance.
Luckily, there is only one real world usage of the pattern described in point 3 above, inside PyICU. A stripped-down version of the code is this:
The code above could be easily rewritten in the following way, which is only slightly more inconvenient. In this example, HPyStr_KIND is a macro which pre-computes the kind given the max_char:
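A minimal sketch of this idea, under loud assumptions: the names below (SketchStrBuilder, sketch_str_new_ucs*, SKETCH_STR_KIND) are hypothetical and are not the HPy API, and malloc-backed storage stands in for the real string object. The kind thresholds follow the usual CPython convention (1 byte per code point below 0x100, 2 bytes below 0x10000, 4 bytes otherwise).

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical analogue of the HPyStr_KIND macro mentioned above:
   bytes per code point needed for a given maximum code point. */
#define SKETCH_STR_KIND(maxchar) \
    ((maxchar) < 0x100 ? 1 : (maxchar) < 0x10000 ? 2 : 4)

/* One builder type for every kind; only the constructors differ. */
typedef struct {
    size_t length;  /* number of code points */
    int kind;       /* bytes per code point: 1, 2 or 4 */
    void *data;
} SketchStrBuilder;

static SketchStrBuilder *sketch_str_alloc(size_t length, int kind) {
    SketchStrBuilder *b = malloc(sizeof *b);
    if (b == NULL) return NULL;
    b->data = calloc(length ? length : 1, (size_t)kind);
    if (b->data == NULL) { free(b); return NULL; }
    b->length = length;
    b->kind = kind;
    return b;
}

/* Each constructor hands back a buffer pointer of the matching C type,
   so writing 16-bit units into a UCS1 buffer needs an explicit cast. */
static SketchStrBuilder *sketch_str_new_ucs1(size_t length, uint8_t **buf) {
    SketchStrBuilder *b = sketch_str_alloc(length, 1);
    *buf = b ? (uint8_t *)b->data : NULL;
    return b;
}

static SketchStrBuilder *sketch_str_new_ucs2(size_t length, uint16_t **buf) {
    SketchStrBuilder *b = sketch_str_alloc(length, 2);
    *buf = b ? (uint16_t *)b->data : NULL;
    return b;
}

static SketchStrBuilder *sketch_str_new_ucs4(size_t length, uint32_t **buf) {
    SketchStrBuilder *b = sketch_str_alloc(length, 4);
    *buf = b ? (uint32_t *)b->data : NULL;
    return b;
}

static void sketch_str_free(SketchStrBuilder *b) {
    free(b->data);
    free(b);
}
```

Code that only learns max_char at run time would switch on SKETCH_STR_KIND(max_char) and call the matching constructor in each branch, which is the slightly more verbose pattern described above.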
Proposal 3: "buffer inside a struct"
Instead of returning the builder and the buffer separately, we could pack everything inside a struct. E.g.:
typedef struct {
    // on CPython, _private contains the result of
    // PyBytes_FromStringAndSize(NULL, size); on alternative impl, it
    // contains a pointer to whatever the impl wants, similar to
    // HPyTupleBuilder&co.
    HPy_ssize_t _private;
    char *buffer;
} HPyBytesBuilder;

HPyBytesBuilder b1, b2;

// opaque builder: no raw buffer is exposed
HPyBytesBuilder_Init(ctx, &b1, size);
assert(b1.buffer == NULL);
HPyBytesBuilder_WriteChar(ctx, b1, 0, 'H');

// raw-buffer builder: write directly into the buffer
HPyBytesBuilder_InitWithBuffer(ctx, &b2, size);
b2.buffer[0] = 'H';
In the example above I used _Init and _InitWithBuffer, but we could also turn it into a flag.
The biggest advantage is that it scales very well for the unicode case:
In this scenario, the user needs to take care of not accessing the wrong field of the union, but this is the same problem that they have now with PyUnicode_{1,2,4}BYTE_DATA.
The HPyStrBuilder struct also makes it very convenient to access the kind, without having to do a function call.
If we want to go one step further, we could remove the union and make ucs{1,2,4} real fields, one of them pointing to the buffer and the others being NULL: this way, if you try to access the wrong buffer by mistake, you get a nice segfault.
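The "real fields" variant can be sketched as follows. This is an illustration under assumptions, not the proposed HPy struct: SketchStrBuf and its functions are hypothetical names, and the kind is derived from maxchar using the usual CPython thresholds.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    size_t length;   /* number of code points */
    int kind;        /* 1, 2 or 4, readable without a function call */
    uint8_t *ucs1;   /* exactly one of these three is non-NULL */
    uint16_t *ucs2;
    uint32_t *ucs4;
} SketchStrBuf;

/* Initialise the struct for `length` code points with the given maximum
   code point. Returns 0 on success, -1 on allocation failure. */
static int sketch_strbuf_init(SketchStrBuf *b, size_t length, uint32_t maxchar) {
    void *mem;
    b->length = length;
    b->kind = maxchar < 0x100 ? 1 : maxchar < 0x10000 ? 2 : 4;
    b->ucs1 = NULL;
    b->ucs2 = NULL;
    b->ucs4 = NULL;
    mem = calloc(length ? length : 1, (size_t)b->kind);
    if (mem == NULL) return -1;
    /* Only the field matching the kind points at the buffer. */
    switch (b->kind) {
    case 1: b->ucs1 = mem; break;
    case 2: b->ucs2 = mem; break;
    default: b->ucs4 = mem; break;
    }
    return 0;
}

/* Release the buffer; free(NULL) is a no-op for the unused fields. */
static void sketch_strbuf_clear(SketchStrBuf *b) {
    free(b->ucs1);
    free(b->ucs2);
    free(b->ucs4);
    b->ucs1 = NULL;
    b->ucs2 = NULL;
    b->ucs4 = NULL;
}
```

With this layout, writing through the wrong field dereferences NULL, which typically crashes immediately on common platforms instead of silently corrupting the string. The cost is two extra pointers per builder compared with the union.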
Fixed-size vs growing buffers
All the proposals above assume that the size of the string is known in advance: there is no support for growing buffers.
To build strings whose final size is not known in advance, some real-world code uses PyUnicode_Join. In theory, alternative implementations like PyPy could provide a faster alternative, if we expose the proper API.
Open questions:
Do we want to support this use case?
Does it need to be integrated with the builders which we are designing here, or should it be a completely different API?
Are we aware of any existing C extension which would benefit from such a feature, i.e. which builds growing strings in C in performance-critical code?
pickle actually contains what could be an excellent use case for a resizable builder. _Pickler_Write accumulates binary data into a bytes object, resizing it as required. (The frame part is unrelated, it's part of the pickle format.) In file-dump mode, the accumulated data is periodically written out; otherwise it collects the full pickle.
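A resizable builder of this kind could look roughly like the sketch below. It is not how _Pickler_Write is implemented and not a proposed HPy API: SketchGrowBuf and its functions are hypothetical, and the doubling growth policy is just one common choice.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *data;
    size_t length;    /* bytes written so far */
    size_t capacity;  /* bytes allocated */
} SketchGrowBuf;

static void sketch_grow_init(SketchGrowBuf *b) {
    b->data = NULL;
    b->length = 0;
    b->capacity = 0;
}

/* Append n bytes, growing the allocation geometrically when needed.
   Returns 0 on success, -1 on overflow or allocation failure. */
static int sketch_grow_write(SketchGrowBuf *b, const void *src, size_t n) {
    if (n == 0) return 0;
    if (n > b->capacity - b->length) {
        size_t needed = b->length + n;
        size_t new_cap = b->capacity ? b->capacity : 16;
        char *p;
        if (needed < b->length) return -1;  /* size_t overflow */
        while (new_cap < needed) {
            if (new_cap > SIZE_MAX / 2) { new_cap = needed; break; }
            new_cap *= 2;
        }
        p = realloc(b->data, new_cap);
        if (p == NULL) return -1;  /* old buffer is still valid */
        b->data = p;
        b->capacity = new_cap;
    }
    memcpy(b->data + b->length, src, n);
    b->length += n;
    return 0;
}

static void sketch_grow_clear(SketchGrowBuf *b) {
    free(b->data);
    sketch_grow_init(b);
}
```

Doubling the capacity keeps the amortized cost of each append constant, at the price of over-allocating; a final "build" step would then shrink or copy the data into the resulting bytes object.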
It's not necessarily in performance-critical code, but Cython uses an inlined/optimised version of PyUnicode_Join for implementing F-strings.