ctl::string small-string optimization #1199

Merged
merged 8 commits into from
Jun 7, 2024

Collaborator

@mrdomino mrdomino commented Jun 6, 2024

A small-string optimization is a way of reusing inline storage space for sufficiently small strings, rather than allocating them on the heap. The current approach takes after an old Facebook string class: it reuses the highest-order byte for flags and small-string size, in such a way that a maximally-sized small string will have its last byte zeroed, making it a null terminator for the C string.

The only flag we have is the highest-order bit, which says whether the string is big (set) or small (cleared). Most of the logic switches on the value of this bit; e.g. data() returns big()->p if it's set, else small()->buf if it's cleared. For a small string, the capacity is always fixed at sizeof(string) - 1 bytes; we store the length in the last byte, but we store it as the number of remaining bytes of capacity, so that at max size the last byte reads zero and serves as our null terminator.
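
To make the "remaining capacity" trick concrete, here is a minimal sketch of the small-string half only. It is not the code in this PR: `SmallStr`, `kMax`, and the 24-byte size are assumptions chosen for illustration, and the heap ("big") path is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical illustration; names and sizes are assumptions, not ctl::string.
struct SmallStr {
    static constexpr std::size_t kMax = 23;  // assumed: 24-byte object, last byte is metadata
    unsigned char blob[kMax + 1];            // blob[kMax]: top bit = "big" flag, rest = bytes remaining

    SmallStr() : blob{} { blob[kMax] = kMax; }  // empty: all kMax bytes remain, blob[0] is the terminator

    bool isbig() const { return (blob[kMax] & 0x80) != 0; }  // never set in this sketch
    std::size_t size() const { return kMax - (blob[kMax] & 0x7f); }
    const char* c_str() const { return reinterpret_cast<const char*>(blob); }

    // Appends one character; returns false when the inline buffer is full
    // (a real implementation would switch to heap storage here).
    bool push_back(char c) {
        std::size_t n = size();
        if (n == kMax) return false;
        blob[n] = static_cast<unsigned char>(c);
        blob[kMax] = static_cast<unsigned char>(kMax - (n + 1));
        if (n + 1 < kMax) blob[n + 1] = 0;  // at n + 1 == kMax the metadata byte is already zero
        return true;
    }
};
```

At exactly `kMax` characters the metadata byte holds zero remaining bytes with the flag bit clear, so it is a valid null terminator and no extra byte is spent on one.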

Morally speaking, our class's storage is a union over two POD C structs. For now I gravitated towards a slightly more roundabout approach: the string class itself contains a blob of the right size, and we alias that blob's pointer for the two structs, taking some care not to run afoul of object lifetime rules in C++. If anyone wants to improve on this, contributions are welcome.

This commit also introduces the ctl::__ namespace. Identifiers containing a double underscore are reserved for the implementation, so library users can't legally spell it; it serves as our version of boost's "detail".

We introduced a string::swap function, and we now use that in operator=. operator= now takes its argument by value, so we never need to check for the case where the pointers are equal and can just swap the entire store of the argument with our own, leaving the C++ destructor to free our old storage afterwards.
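
For readers unfamiliar with the pattern, here is a minimal sketch of assign-by-value plus swap on a toy heap-only class. `Buf` and its members are hypothetical names for illustration; the actual ctl::string members differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>

// Hypothetical toy class; not ctl::string.
class Buf {
    char* p_ = nullptr;
    std::size_t n_ = 0;

  public:
    Buf() = default;
    explicit Buf(const char* s) : p_(new char[std::strlen(s) + 1]), n_(std::strlen(s)) {
        std::memcpy(p_, s, n_ + 1);
    }
    Buf(const Buf& o) : p_(o.p_ ? new char[o.n_ + 1] : nullptr), n_(o.n_) {
        if (p_) std::memcpy(p_, o.p_, n_ + 1);
    }
    Buf(Buf&& o) noexcept : p_(o.p_), n_(o.n_) {
        o.p_ = nullptr;
        o.n_ = 0;
    }
    ~Buf() { delete[] p_; }

    void swap(Buf& o) noexcept {
        std::swap(p_, o.p_);
        std::swap(n_, o.n_);
    }

    // The argument is already a copy (or a moved-from temporary), so there is
    // no self-assignment check: swap, and let o's destructor free the old storage.
    Buf& operator=(Buf o) noexcept {
        swap(o);
        return *this;
    }

    const char* c_str() const { return p_ ? p_ : ""; }
};
```

One operator covers both copy and move assignment: the by-value parameter is built with the copy or move constructor depending on what the caller passes.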

There are probably still a few places where our capacity is slightly off and we grow too fast, although there don't appear to be any where we are too slow. I will leave these to be fixed in future changes.

TODO:

  • tests are currently segfaulting
  • think about operator string_view (thought about it - it's still only 1.3ns to make one)
  • maybe migrate to POD anonymous union (i like this way better)
  • benchmark and see if this is even worth it
  • __ namespace is documented, at least here
  • we are probably incorrectly setting size in a few places
  • explain why assign-by-value and "swapperator", at least here

  private:
    inline bool isbig() const noexcept
    {
        return *(__builtin_launder(blob) + __::sso_max) & 0x80;
Owner

Nice. Is there one of these for std::move and std::forward? Also, the last time I got truly autistic about C++ was back in 2012. Could you give me a two-sentence tutorial on what this launder does?

Collaborator Author

@mrdomino mrdomino Jun 6, 2024

move and forward are both trivial plain C++ functions; there is no reason we couldn't write a brief ctl::move and ctl::forward of our own.

E.g. capnproto's libkj does this:

https://github.com/capnproto/capnproto/blob/5faf3144b5e1b0fbbe80992f7e3a890a472c82a3/c%2B%2B/src/kj/common.h#L678-L687
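
As a rough idea of how small these are, here is a sketch of freestanding versions. The `sketch` namespace and the local `remove_reference` are assumptions for illustration; this is not code from this PR or from libkj.

```cpp
#include <cassert>

namespace sketch {  // hypothetical namespace, not ctl

template <typename T> struct remove_reference { using type = T; };
template <typename T> struct remove_reference<T&> { using type = T; };
template <typename T> struct remove_reference<T&&> { using type = T; };

// Unconditionally casts its argument to an rvalue reference.
template <typename T>
constexpr typename remove_reference<T>::type&& move(T&& t) noexcept {
    return static_cast<typename remove_reference<T>::type&&>(t);
}

// Casts back to the value category encoded in T (lvalue if T is U&, else rvalue).
template <typename T>
constexpr T&& forward(typename remove_reference<T>::type& t) noexcept {
    return static_cast<T&&>(t);
}
template <typename T>
constexpr T&& forward(typename remove_reference<T>::type&& t) noexcept {
    return static_cast<T&&>(t);
}

}  // namespace sketch

// Helpers that report which overload a forwarded argument selects.
inline int which(int&) { return 1; }
inline int which(int&&) { return 2; }

template <typename T>
int relay(T&& t) {
    return which(sketch::forward<T>(t));
}
```

Both functions are pure casts with no runtime cost, which is why a library can supply its own without compiler support.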

std::launder is different. It is an escape hatch from the assumptions the compiler is allowed to make based on the object lifetime rules of C++. I have not carefully studied this in a while, but my understanding is that, for example, if you have a byte that is part of a union where one arm is const and the other isn't, and you change it through the non-const arm, then unless you call launder when referencing the const arm, the compiler can just assume that the value did not change and not bother to look.

Since you obviously can't implement this in terms of the actual C++ language itself, there is going to be a __builtin_launder or something that your STL provider is going to collaborate with your compiler author to use.

Basically the smooth-brained rule to learn is "if you're in a situation where you're using the same backing memory for two different C++ objects, then you need to use launder on your pointers so the compiler doesn't do something crazy."
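
A minimal sketch of that rule, using the standard `std::launder` from `<new>` (C++17) rather than the `__builtin_launder` builtin. `Tag` and `replace_and_read` are hypothetical names for illustration.

```cpp
#include <cassert>
#include <new>

// Hypothetical type with a const member, so the compiler would otherwise be
// entitled to assume `id` never changes for a given object.
struct Tag {
    const int id;
};

inline int replace_and_read() {
    alignas(Tag) unsigned char blob[sizeof(Tag)];

    new (blob) Tag{1};  // first object living in the blob
    new (blob) Tag{2};  // a second object reuses the same storage

    // blob is an array of unsigned char; launder yields a pointer to the Tag
    // object that currently lives there, so the read sees the new value.
    return std::launder(reinterpret_cast<Tag*>(blob))->id;
}
```

`Tag` is trivially destructible, so reusing the storage without an explicit destructor call is fine in this sketch.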

Owner

Thank you. This information has much more clarity than Google results.

Owner

jart commented Jun 6, 2024

I'm liking this so far. I'm glad to see Facebook's alpha coming to the Cosmopolitan codebase. Since we're replacing simple code with intelligent code, I'd like to see a benchmark too that demonstrates the advantage. Please don't use my legacy benchmark macros. Something simple and less frameworky like this should do:

#define BENCH(ITERATIONS, WORK_PER_RUN, CODE) \
    do { \
        struct timespec start = timespec_real(); \
        for (int i = 0; i < (ITERATIONS); ++i) { \
            asm volatile("" ::: "memory"); \
            CODE; \
        } \
        long long work = (long long)(WORK_PER_RUN) * (ITERATIONS); \
        double nanos = (timespec_tonanos(timespec_sub(timespec_real(), start)) + work - 1) / (double)work; \
        printf("%10g ns %2dx %s\n", nanos, (ITERATIONS), #CODE); \
    } while (0)

Clearly we aren't exercising the capacity increase logic very hard yet.
It clamps to size() + 1, not just size(), i.e. we reserve the null byte.
We also check the exact requested capacity against the sso_max before we
do the (??) alignment (??) stuff.
Collaborator Author

mrdomino commented Jun 6, 2024

master:

   17.9248 ns 10000000x { ctl::string s; s.append("hello "); s.append("world"); }
   4.10612 ns 1000000x { ctl::string s; for (int i = 0; i < 8; ++i) { s.append('a'); } }
   3.94862 ns 1000000x { ctl::string s; for (int i = 0; i < 16; ++i) { s.append('a'); } }
   3.72796 ns 1000000x { ctl::string s; for (int i = 0; i < 23; ++i) { s.append('a'); } }
   3.83362 ns 1000000x { ctl::string s; for (int i = 0; i < 32; ++i) { s.append('a'); } }
    13.607 ns 1000000x { ctl::string s("hello world"); }
    12.453 ns 1000000x { ctl::string s2(s); }
     17.64 ns 1000000x { ctl::string s("hello world"); ctl::string s2(std::move(s)); }
    24.125 ns 1000000x { ctl::string s("hello world"); ctl::string s2(s); }
    12.818 ns 1000000x { ctl::string s(23, 'a'); }
     13.22 ns 1000000x { ctl::string s(24, 'a'); }

ctl-sso:

   10.4207 ns 10000000x { ctl::string s; s.append("hello "); s.append("world"); }
     3.938 ns 1000000x { ctl::string s; for (int i = 0; i < 8; ++i) { s.append('a'); } }
   4.50244 ns 1000000x { ctl::string s; for (int i = 0; i < 16; ++i) { s.append('a'); } }
   4.83043 ns 1000000x { ctl::string s; for (int i = 0; i < 23; ++i) { s.append('a'); } }
     4.833 ns 1000000x { ctl::string s; for (int i = 0; i < 32; ++i) { s.append('a'); } }
     7.472 ns 1000000x { ctl::string s("hello world"); }
     6.844 ns 1000000x { ctl::string s2(s); }
      8.18 ns 1000000x { ctl::string s("hello world"); ctl::string s2(std::move(s)); }
    13.086 ns 1000000x { ctl::string s("hello world"); ctl::string s2(s); }
      6.89 ns 1000000x { ctl::string s(23, 'a'); }
    18.017 ns 1000000x { ctl::string s(24, 'a'); }

@jart jart marked this pull request as ready for review June 7, 2024 00:07
@mrdomino mrdomino merged commit 8b3e368 into jart:master Jun 7, 2024
6 checks passed
@mrdomino mrdomino deleted the ctl-sso branch June 7, 2024 00:50
mrdomino added a commit that referenced this pull request Jun 7, 2024
Had not pushed this from my local branch. :(
mrdomino added a commit that referenced this pull request Jun 7, 2024
These commits were sitting on a local branch that I neglected to push
before merging. :(

* Use memcpy for string::reserve

* Remove fence comments
mrdomino added a commit that referenced this pull request Jun 7, 2024
Making a string_view from a string appears to take about 1.3ns no matter
what. 100% definitely no point deviating from the STL API over that.
@mrdomino mrdomino changed the title wip ctl::string small-string optimization ctl::string small-string optimization Jun 7, 2024
@mrdomino mrdomino removed the testing label Jun 7, 2024
3 participants