ctl::string small-string optimization #1199

mrdomino · 2024-06-06T07:04:24Z

A small-string optimization is a way of reusing inline storage space for sufficiently small strings, rather than allocating them on the heap. The current approach takes after an old Facebook string class: it reuses the highest-order byte for flags and small-string size, in such a way that a maximally-sized small string will have its last byte zeroed, making it a null terminator for the C string.

The only flag we have is the highest-order bit, which says whether the string is big (set) or small (cleared). Most of the logic switches on the value of this bit; e.g. data() returns big()->p if it's set, else small()->buf if it's cleared. For a small string, the capacity is always fixed at sizeof(string) - 1 bytes; we store the length in the last byte, but we store it as the number of remaining bytes of capacity, so that at max size, the last byte will read zero and serve as our null terminator.
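To make the "remaining capacity doubles as the terminator" trick concrete, here is a minimal sketch of the small case only. It assumes a 24-byte string (so a small capacity of 23) and has no heap path at all; the class name, member names, and the low-seven-bits mask are illustrative assumptions, not the code in ctl/string.h.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical small-only string: 24 bytes, no heap path. Not the ctl::string source.
class SmallOnlyString {
    static constexpr std::size_t kSize = 24;        // assumed sizeof(string)
    static constexpr std::size_t kMax = kSize - 1;  // fixed small capacity
    // Bytes 0..22 hold characters; byte 23 holds the big flag (high bit)
    // plus the remaining capacity (low seven bits).
    unsigned char bytes_[kSize];

  public:
    SmallOnlyString() {
        bytes_[0] = '\0';
        bytes_[kMax] = static_cast<unsigned char>(kMax);  // empty: 23 bytes remain, flag clear
    }
    bool is_big() const { return bytes_[kMax] & 0x80; }
    std::size_t size() const { return kMax - (bytes_[kMax] & 0x7f); }
    const char* c_str() const { return reinterpret_cast<const char*>(bytes_); }

    // Returns false when full; a real string would switch to heap storage here.
    bool push_back(char c) {
        std::size_t n = size();
        if (n == kMax) return false;
        bytes_[n] = static_cast<unsigned char>(c);
        bytes_[kMax] = static_cast<unsigned char>(kMax - (n + 1));
        if (n + 1 < kMax) bytes_[n + 1] = '\0';  // otherwise byte 23 already reads 0
        return true;
    }
};

inline void sso_demo() {
    SmallOnlyString s;
    for (int i = 0; i < 23; ++i) s.push_back('a');
    // At max size the length byte is zero, so it is also the null terminator.
    assert(s.size() == 23 && s.c_str()[23] == '\0');
}
```

Storing the remaining capacity rather than the length is what lets a full small string use all 23 bytes for characters without a separate terminator byte.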

Morally speaking, our class's storage is a union over two POD C structs. For now I gravitated towards a slightly more obtuse approach: the string class itself contains a blob of the right size, and we alias that blob's pointer for the two structs, taking some care not to run afoul of object lifetime rules in C++. If anyone wants to improve on this, contributions are welcome.

This commit also introduces the ctl::__ namespace. It can't be legally spelled by library users, and serves as our version of boost's "detail".

We introduced a string::swap function, and we now use that in operator=. operator= now takes its argument by value, so we never need to check for the case where the pointers are equal and can just swap the entire store of the argument with our own, leaving the C++ destructor to free our old storage afterwards.
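For readers unfamiliar with the pattern, here is a minimal sketch of assign-by-value plus swap on a toy heap-only string. The class `Str` and all of its members are hypothetical and much simpler than ctl::string; the point is only the shape of operator=.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>

// Hypothetical heap-only string used to illustrate the pattern; not ctl::string.
class Str {
    char* p_ = nullptr;
    std::size_t n_ = 0;

  public:
    Str() = default;
    explicit Str(const char* s) : p_(new char[std::strlen(s) + 1]), n_(std::strlen(s)) {
        std::memcpy(p_, s, n_ + 1);
    }
    Str(const Str& o) : p_(o.p_ ? new char[o.n_ + 1] : nullptr), n_(o.n_) {
        if (p_) std::memcpy(p_, o.p_, n_ + 1);
    }
    Str(Str&& o) noexcept : p_(o.p_), n_(o.n_) { o.p_ = nullptr; o.n_ = 0; }
    ~Str() { delete[] p_; }

    void swap(Str& o) noexcept {
        std::swap(p_, o.p_);
        std::swap(n_, o.n_);
    }

    // The argument is copied or moved into `o` by the caller. We take its
    // storage and hand it ours, which o's destructor then frees.
    Str& operator=(Str o) noexcept {
        swap(o);
        return *this;
    }

    const char* c_str() const { return p_ ? p_ : ""; }
    std::size_t size() const { return n_; }
};

inline void assign_demo() {
    Str a("hello"), b("world!");
    a = b;             // copy-constructs the parameter, then swaps
    a = Str("moved");  // the temporary becomes the parameter; no copy
    Str& alias = a;
    a = alias;         // self-assignment needs no pointer-equality check
    assert(std::strcmp(a.c_str(), "moved") == 0);
}
```

One operator= covers both copy and move assignment, at the cost of always constructing a temporary even when the existing capacity could have been reused.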

There are probably still a few places where our capacity is slightly off and we grow too fast, although there don't appear to be any where we are too slow. I will leave these to be fixed in future changes.

TODO:

tests are currently segfaulting
think about operator string_view (thought about it - it's still only 1.3ns to make one)
~~maybe migrate to POD anonymous union~~ (i like this way better)
benchmark and see if this is even worth it
__ namespace is documented, at least here
we are probably incorrectly setting size in a few places
explain why assign-by-value and "swapperator", at least here


ctl/string.cc

ctl/string.h

jart · 2024-06-06T07:45:52Z

ctl/string.h

+  private:
+    inline bool isbig() const noexcept
+    {
+        return *(__builtin_launder(blob) + __::sso_max) & 0x80;


Nice. Is there one of these for std::move and std::forward? Also, the last time I got truly autistic about C++ was back in 2012. Could you give me a two-sentence tutorial on what this launder does?

move and forward are both trivial plain C++ functions; there is no reason we couldn't write a brief ctl::move and ctl::forward of our own.

E.g. capnproto's libkj does this:

https://github.com/capnproto/capnproto/blob/5faf3144b5e1b0fbbe80992f7e3a890a472c82a3/c%2B%2B/src/kj/common.h#L678-L687
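As a rough illustration of how small these can be, here is a sketch of both functions without any standard headers. The `sketch` namespace, the hand-rolled `remove_reference`, and the `kind`/`relay` helpers are all hypothetical; this is neither libkj's code nor anything that exists in ctl. It also omits the second (rvalue) overload that std::forward provides.

```cpp
#include <cassert>

namespace sketch {  // hypothetical namespace, standing in for something like ctl

template <typename T> struct remove_reference { using type = T; };
template <typename T> struct remove_reference<T&> { using type = T; };
template <typename T> struct remove_reference<T&&> { using type = T; };

// Unconditionally casts its argument to an rvalue reference.
template <typename T>
constexpr typename remove_reference<T>::type&& move(T&& t) noexcept {
    return static_cast<typename remove_reference<T>::type&&>(t);
}

// Restores the value category encoded in T: T = U& yields U&, T = U yields U&&.
template <typename T>
constexpr T&& forward(typename remove_reference<T>::type& t) noexcept {
    return static_cast<T&&>(t);
}

}  // namespace sketch

// Overload pair used only to observe which value category arrives.
inline int kind(const int&) { return 0; }  // lvalue
inline int kind(int&&) { return 1; }       // rvalue

template <typename T>
int relay(T&& x) {
    return kind(sketch::forward<T>(x));
}

inline void forward_demo() {
    int i = 1;
    assert(relay(i) == 0);               // lvalue stays an lvalue
    assert(relay(2) == 1);               // rvalue stays an rvalue
    assert(kind(sketch::move(i)) == 1);  // move always yields an rvalue
}
```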

std::launder is different. It is an escape hatch from the compiler's ability to make crazy assumptions based on the object lifetime rules of C++. I have not carefully studied this in a bit, but my understanding is this: if, for example, you have a byte that is part of a union where one arm is const and the other isn't, and you change it through the non-const arm, then unless you call launder when referencing the const arm, the compiler can just assume that the value did not change and not bother to look.

Since you obviously can't implement this in terms of the actual C++ language itself, there is going to be a __builtin_launder or something that your STL provider is going to collaborate with your compiler author to use.

Basically the smooth-brained rule to learn is "if you're in a situation where you're using the same backing memory for two different C++ objects, then you need to use launder on your pointers so the compiler doesn't do something crazy."
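A minimal sketch of that rule, assuming C++17 for std::launder (the diff above uses the compiler's __builtin_launder directly). The `Blob` class, its fixed 8-byte size, and the restriction to trivially destructible types are simplifying assumptions for illustration; this is not how ctl::string lays out its storage.

```cpp
#include <cassert>
#include <new>

// Hypothetical 8-byte slot that can hold different objects over time.
class Blob {
    alignas(8) unsigned char bytes_[8];

  public:
    // Starts the lifetime of a new T in the blob, ending that of any previous
    // occupant. Sketch only: T is assumed to be trivially destructible.
    template <typename T>
    void put(T v) {
        static_assert(sizeof(T) <= 8 && alignof(T) <= 8, "T must fit the blob");
        ::new (static_cast<void*>(bytes_)) T(v);
    }

    // bytes_ names the array, not the T living inside it, so the cast result
    // is laundered before it is dereferenced.
    template <typename T>
    T* as() {
        return std::launder(reinterpret_cast<T*>(bytes_));
    }
};

inline void launder_demo() {
    Blob b;
    b.put<int>(7);
    assert(*b.as<int>() == 7);
    b.put<float>(1.5f);  // same bytes, now a different object
    assert(*b.as<float>() == 1.5f);
}
```

Calling `as<T>()` for a type that is not currently stored in the blob is still undefined behavior; launder only makes the pointer to the live object usable.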

Thank you. This information has much more clarity than Google results.

jart · 2024-06-06T08:00:34Z

I'm liking this so far. I'm glad to see Facebook's alpha coming to the Cosmopolitan codebase. Since we're replacing simple code with intelligent code, I'd like to see a benchmark too that demonstrates the advantage. Please don't use my legacy benchmark macros. Something simple and less frameworky like this should do:

#define BENCH(ITERATIONS, WORK_PER_RUN, CODE) \
    do { \
        struct timespec start = timespec_real(); \
        for (int i = 0; i < ITERATIONS; ++i) { \
            asm volatile("" ::: "memory"); \
            CODE; \
        } \
        long long work = WORK_PER_RUN * ITERATIONS; \
        double nanos = (timespec_tonanos(timespec_sub(timespec_real(), start)) + work - 1) / (double)work; \
        printf("%10g ns %2dx %s\n", nanos, ITERATIONS, #CODE); \
    } while (0)
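The macro above depends on Cosmopolitan's timespec helpers. For trying the same measurement elsewhere, a portable variant might look like the sketch below. `bench_ns` is a hypothetical helper, std::string stands in for ctl::string, and the `asm volatile` barrier is GCC/Clang syntax copied from the macro.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <string>

// Hypothetical helper: runs `work` `iterations` times and returns the
// average wall-clock nanoseconds per iteration.
template <typename F>
double bench_ns(int iterations, F work) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        asm volatile("" ::: "memory");  // compiler barrier, as in the macro
        work();
    }
    std::chrono::duration<double, std::nano> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / iterations;
}

inline void bench_demo() {
    double ns = bench_ns(100000, [] { std::string s("hello world"); });
    assert(ns >= 0.0);
    std::printf("%10g ns %s\n", ns, "std::string s(\"hello world\")");
}
```

Unlike the macro, this cannot print the benchmarked code automatically, so the label has to be passed or printed by hand.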


mrdomino · 2024-06-06T23:57:11Z

master:

   17.9248 ns 10000000x { ctl::string s; s.append("hello "); s.append("world"); }
   4.10612 ns 1000000x { ctl::string s; for (int i = 0; i < 8; ++i) { s.append('a'); } }
   3.94862 ns 1000000x { ctl::string s; for (int i = 0; i < 16; ++i) { s.append('a'); } }
   3.72796 ns 1000000x { ctl::string s; for (int i = 0; i < 23; ++i) { s.append('a'); } }
   3.83362 ns 1000000x { ctl::string s; for (int i = 0; i < 32; ++i) { s.append('a'); } }
    13.607 ns 1000000x { ctl::string s("hello world"); }
    12.453 ns 1000000x { ctl::string s2(s); }
     17.64 ns 1000000x { ctl::string s("hello world"); ctl::string s2(std::move(s)); }
    24.125 ns 1000000x { ctl::string s("hello world"); ctl::string s2(s); }
    12.818 ns 1000000x { ctl::string s(23, 'a'); }
     13.22 ns 1000000x { ctl::string s(24, 'a'); }

ctl-sso:

   10.4207 ns 10000000x { ctl::string s; s.append("hello "); s.append("world"); }
     3.938 ns 1000000x { ctl::string s; for (int i = 0; i < 8; ++i) { s.append('a'); } }
   4.50244 ns 1000000x { ctl::string s; for (int i = 0; i < 16; ++i) { s.append('a'); } }
   4.83043 ns 1000000x { ctl::string s; for (int i = 0; i < 23; ++i) { s.append('a'); } }
     4.833 ns 1000000x { ctl::string s; for (int i = 0; i < 32; ++i) { s.append('a'); } }
     7.472 ns 1000000x { ctl::string s("hello world"); }
     6.844 ns 1000000x { ctl::string s2(s); }
      8.18 ns 1000000x { ctl::string s("hello world"); ctl::string s2(std::move(s)); }
    13.086 ns 1000000x { ctl::string s("hello world"); ctl::string s2(s); }
      6.89 ns 1000000x { ctl::string s(23, 'a'); }
    18.017 ns 1000000x { ctl::string s(24, 'a'); }


mrdomino added 2 commits June 5, 2024 23:57

less (?) horrible set_big_capacity

23d489c

jart reviewed Jun 6, 2024


Tests pass

5f4aceb

github-actions bot added the testing label Jun 6, 2024

mrdomino added 5 commits June 6, 2024 08:28

clang-format

ec96f72

whoops

9154a57

mask big()->c

f279754

Clearly we aren't exercising the capacity increase logic very hard yet.

longer test string

731bdf7

Tweak reserve

2d55834

It clamps to size() + 1, not just size(), i.e. we reserve the null byte. We also check the exact requested capacity against the sso_max before we do the (??) alignment (??) stuff.

mrdomino mentioned this pull request Jun 7, 2024

ctl::string benchmarking code #1200

Merged

jart approved these changes Jun 7, 2024


jart marked this pull request as ready for review June 7, 2024 00:07

mrdomino merged commit 8b3e368 into jart:master Jun 7, 2024
6 checks passed

mrdomino deleted the ctl-sso branch June 7, 2024 00:50

mrdomino added a commit that referenced this pull request Jun 7, 2024

Remove fence comments addressed in #1199

7391cfa

Had not pushed this from my local branch. :(

mrdomino added a commit that referenced this pull request Jun 7, 2024

Minor small-string errata from #1199

03b476f

These commits were sitting on a local branch that I neglected to push before merging. :(

* Use memcpy for string::reserve
* Remove fence comments

mrdomino added a commit that referenced this pull request Jun 7, 2024

One more SSO erratum from #1199

f3effcb

Making a string_view from a string appears to take about 1.3ns no matter what. 100% definitely no point deviating from the STL API over that.

mrdomino changed the title ~~wip ctl::string small-string optimization~~ ctl::string small-string optimization Jun 7, 2024

mrdomino removed the testing label Jun 7, 2024
