Replies: 4 comments 1 reply
-
What does
-
Ah, I see.
-
After offline discussions and some comments made here, I initially thought that it would be better to close this issue down. I'm not doing this, because I still want to continue investigating possible optimizations for small strings -- maybe just in a different way. The main advantages of this specific optimization are:
Since the buffer follows the object header in reality, this specific optimization isn't in fact as useful as I initially thought it would be, at least as-is, especially now with the work I was unaware of to remove the

This is probably possible to do today, it has some benefits, and it's ultimately something I'd like to work on. However, I have another idea that might be useful for all short-lived objects, strings or not. But, for this, I need to put on my mad scientist lab coat; hold on a minute.

OK, so, there's a fifth point I haven't listed above: in languages such as C++, where objects can be constructed on the stack, small string objects can avoid the malloc round trip altogether while maintaining the same API surface. Even with good memory allocators that maintain thread-local arenas to avoid locking, it's very unlikely that a dynamic memory allocator will ever beat something that's just "subtract a value from a register and you're done".

This doesn't really apply to Python as-is, of course, but here's a thought that needs some maturing: track the allocation size of all short-lived objects in non-quickened code objects, and, in quickened code objects, use this information to build an arena inside a call frame and freely carve space from there -- without even taking the GIL for all the short-lived allocations/deallocations. (Going one step further, one could even elide all the deallocation overhead if we know that, after a particular instruction, no more allocations from this arena will be performed. But we're in diminishing-returns territory at this point.)

I imagine this would pair really well with the small string optimization suggested in this discussion (point #4 specifically). Of course, the allocation size tracker is kinda hand-wavy at the moment -- I still need to leave this percolating for a while -- but I wanted to open this idea for discussion.
-
One trick that I've seen elsewhere and seems to work really well is to store a string in some struct similar to this one:
This would mean that, on 32-bit architectures, strings up to 7 bytes would be stored inline in the string object; on 64-bit architectures, strings up to 15 bytes would be stored in the same manner.
The cleverness of this scheme is that, instead of considering the second field as the actual length, we consider it the `remaining_length`: when that's zero, it doubles as the terminating NUL, making it possible to pass the string to C standard library functions that expect NUL-terminated strings. The buffer length ends up being some expression like `flags & IS_SMALL ? sizeof(small.buffer) - small.remaining_length : large.length`.

We already have a lot of space for flags in the unicode string object; maybe this is something we should evaluate too? Especially for ASCII strings, used for identifiers and such, this should yield quite a bit of memory savings and reduce a lot of pointer chasing.
Thoughts?