Improve field count typical case performance #120

runer112 · 2023-01-18T22:44:19Z

The tightest upper bound one can specify on the number of fields in a struct is `sizeof(type) * CHAR_BIT`, so this was previously used as the upper bound when performing a binary search for the field count. This bound is extremely loose for a typical large struct, which is more likely to contain a relatively small number of relatively large fields than the other way around. A binary search range multiple orders of magnitude larger than necessary wouldn't be a significant issue if each test were cheap, but it isn't: testing a field count of N costs O(N) memory and time. As a result, the initial few steps of the binary search may be prohibitively expensive.

The primary optimization introduced by these changes is to use unbounded binary search, a.k.a. exponential search, instead of the typically loosely bounded binary search. This produces a tight upper bound (within 2x) on the field count to then perform the binary search with.

As an upside of this change, the compiler-specific limit placed on the upper bound on the field count to stay within compiler limits could be removed.
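
For illustration, here is a minimal sketch of exponential search followed by binary search over a generic predicate. It is not the code from this PR: `initializable_up_to_5`, `binary_search_count`, and `exponential_search_count` are hypothetical names, and the predicate merely simulates "can `T` be initialized from `n` arguments?" for a struct with 5 fields.

```cpp
#include <cstddef>

// Hypothetical stand-in for the real compile-time probe: "can T be
// initialized from n arguments?" A struct with 5 fields is simulated.
struct initializable_up_to_5 {
    constexpr bool operator()(std::size_t n) const { return n <= 5; }
};

// Largest n in [lo, hi) for which pred(n) holds, assuming pred(lo) is
// true and pred(hi) is false.
template <class Pred>
constexpr std::size_t binary_search_count(Pred pred, std::size_t lo, std::size_t hi) {
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (pred(mid)) lo = mid; else hi = mid;
    }
    return lo;
}

// Exponential search: double the candidate until the probe fails, then
// binary search inside the last [hi / 2, hi) window, which is at most
// a factor of 2 wide.
template <class Pred>
constexpr std::size_t exponential_search_count(Pred pred) {
    if (!pred(1)) return 0;
    std::size_t hi = 2;
    while (pred(hi)) hi *= 2;
    return binary_search_count(pred, hi / 2, hi);
}

static_assert(exponential_search_count(initializable_up_to_5{}) == 5, "");
```

In this sketch the largest value ever probed is 8, regardless of how large the simulated type would be in bytes.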

runer112 · 2023-01-18T22:55:06Z

This issue was originally noticed when some source files in a large project seemed to be consuming a suspiciously large amount of memory (and also time). After some digging, the culprit was eventually nailed down as field count detection.

Before these changes, compiling one such source file with clang 14 peaked at 1.5 GB of memory usage. After these changes, compiling the same file peaked at 617 MB. These numbers include the memory usage of compiling everything else as well, which is why the after-fix number is still relatively large. The memory usage attributable just to field counting probably went from something like 1 GB to something at least one or two orders of magnitude less.

apolukhin · 2023-01-19T13:36:13Z

The PR fails on msvc-14.1 with the error "c1xx: fatal error C1060: compiler is out of heap space".

I'd recommend trying another approach: just change detect_fields_count_greedy to fill a bool init_succeeded[Last]. This could be done in one go, by getting an index sequence from Last and applying it to a function that returns true/false depending on whether construction succeeds. After that, a simple loop in constexpr could finish the job.

Something like that:

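template <class T, std::size_t... Indexes>
constexpr std::size_t detect_fields_count_greedy(std::index_sequence<Indexes...>) {
    constexpr std::size_t Last = sizeof...(Indexes);
    // is_aggr_initable<T, N>() is this sketch's probe: can T be initialized from N arguments?
    bool init_succeeded[Last] = { is_aggr_initable<T, Indexes>()... };
    for (std::size_t i = Last - 1; i > 0; --i) if (init_succeeded[i]) return i;
    return 0;
}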

runer112 · 2023-01-24T00:13:51Z

In the last CI run, 15 tasks failed with a compiler is out of heap space error. With the jobs running in parallel, it's hard to determine which tasks failed due to their own excessive memory usage and which were well-behaved, but a victim of running when another task consumed all the available memory. I pushed a temporary commit that I believe should disable testing in parallel. Could you please approve the CI build? Hopefully this will give me enough info to be able to diagnose the real issue, after which I can revert the CI config change.

Regarding your suggestion, maybe I'm not understanding it correctly, but I don't see how it would help with performance. The crux of the performance issue is that the check of whether a type is constructible with N arguments, enable_if_constructible_helper_t (SFINAE), costs O(N) memory and time. In your suggestion, I believe this would make initializing init_succeeded cost O(Last^2) time and O(Last) memory.

My proposed changes aim to minimize the overall cost by minimizing the sum of all N checked. In asymptotic performance analysis, they effectively replace sizeof(T) with just the result (the field count), which may be substantially smaller. Compared to the current code (don't trust the current comments):

| Case | Memory before | Memory after | Time before | Time after |
| --- | --- | --- | --- | --- |
| `T` is an array | O(1) | O(1) | O(1) | O(1) |
| `T` is default-constructible | O(sizeof(T)) | O(result) | O(sizeof(T) * log(sizeof(T))) | O(result * log(result)) |
| `T` is not default-constructible | O(sizeof(T)) | O(result) | O(sizeof(T)^2) | O(result^2) |

runer112 · 2023-01-24T22:12:28Z

The excessive compile time/memory usage issues that were causing failures have been fixed.

The central issue was that the static_assert preconditions didn't actually prevent the compiler from trying to expand the field counting templates, which sometimes expanded indefinitely. This was fixed by dummying the field counting dispatch to do basically nothing (count the number of fields in an int[1] instead, which is trivially 1) if any precondition is not met.
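
A minimal sketch of that dispatch idea, under stated assumptions: `count_fields_impl`, `counted_type_t`, and `count_fields` are hypothetical names, the single precondition shown is only an example, and `count_fields_impl` is a trivial stand-in for the real (expensive) search so the snippet stays self-contained.

```cpp
#include <cstddef>
#include <type_traits>

// Stand-in for the expensive field-count search. In this sketch it only
// reports an array extent; the real search would instantiate many templates.
template <class T>
constexpr std::size_t count_fields_impl() {
    return std::extent<T>::value;
}

// If any precondition fails, hand the counter a harmless dummy type
// (int[1], whose count is trivially 1) instead of T.
template <class T, bool PreconditionsMet>
using counted_type_t = std::conditional_t<PreconditionsMet, T, int[1]>;

template <class T>
constexpr std::size_t count_fields() {
    constexpr bool ok = !std::is_reference<T>::value;  // example precondition
    static_assert(ok, "count_fields: T must not be a reference");
    // Even when the static_assert fires, the line below only ever
    // instantiates the counter for int[1], so nothing expands further.
    return count_fields_impl<counted_type_t<T, ok>>();
}

static_assert(std::is_same<counted_type_t<int&, false>, int[1]>::value, "");
static_assert(count_fields_impl<counted_type_t<int&, false>>() == 1, "");
static_assert(count_fields<int[3]>() == 3, "");
```

The point of the design is that the failed `static_assert` still produces its diagnostic, while the work done after it is bounded.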

runer112 · 2023-01-30T21:44:06Z

@apolukhin Is there anything more you'd like me to address?

The AppVeyor build succeeds now, and does so roughly 5% faster than before despite there being 3 new tests.

apolukhin · 2023-02-03T09:18:39Z

Sorry, I misread you the first time.

So here's how I see your changes (please correct me if I'm wrong):

T is default-constructible: you do an exponential search for the upper bound of the field count. It takes log(fields_count) + 1 steps. After that you do a binary search that takes log(log(fields_count) + 1) steps. The final complexity is log(fields_count) + 1 + log(log(fields_count) + 1).

The current implementation starts the binary search from sizeof(type) * CHAR_BIT. However, that CHAR_BIT multiplication is not necessary: we do not need to know the exact count of bitfields. Instead, the binary search could work with sizeof(type)+1, and if we get the maximum value, then the type has bitfields.

Here comes the math. Your algorithm is better when
log(sizeof(type)+1) > log(fields_count) + 1 + log(log(fields_count) + 1)

which is
log(sizeof(type)+1) > log(fields_count) + log(2) + log(log(fields_count) + 1)

which is
sizeof(type)+1 > fields_count*2 * (log(fields_count) + 1)

sizeof(type) is equal to fields_count*avg_field_size. It gives us

fields_count*avg_field_size+1 > fields_count*2 * (log(fields_count) + 1)

Which is
avg_field_size+1/fields_count > 2 * log(fields_count) + 2

For fields count 16 the avg_field_size should be about 10 to make your algorithm better. For fields count 256 the avg_field_size should be about 18 to make your algorithm better.
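
The two figures above can be reproduced with a small sketch. It assumes the logarithm is base 2 (which matches the quoted values) and drops the negligible 1/fields_count term; `log2_floor` and `break_even_avg_field_size` are hypothetical helper names. Like the derivation it follows, it counts search steps only.

```cpp
#include <cstddef>

// Integer base-2 logarithm, rounded down (n must be > 0).
constexpr std::size_t log2_floor(std::size_t n) {
    std::size_t result = 0;
    while (n > 1) { n /= 2; ++result; }
    return result;
}

// Right-hand side of the final inequality: 2 * log(fields_count) + 2.
// The average field size has to exceed this for the step count of
// exponential search to come out ahead.
constexpr std::size_t break_even_avg_field_size(std::size_t fields_count) {
    return 2 * log2_floor(fields_count) + 2;
}

static_assert(break_even_avg_field_size(16) == 10, "");
static_assert(break_even_avg_field_size(256) == 18, "");
```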

For aggregates of ints, chronos, pointers or size_ts the existing algorithm performs better. For aggregates of strings and vectors your algorithm performs better than the existing one.

I'd rather call it a tie. But the CHAR_BIT multiplication should be removed.

T is not default-constructible: your approach is definitely superior. I'm worried about the cases when the whole type is not aggregate initializable, because in that case I think your algorithm would run until the RAM is exhausted and no diagnostic would be provided. That is probably the reason why the GitHub CI fails.

I'm also worried about compiler idiosyncrasies. Not all the compilers are listed in CI, so I'd rather stick to the existing, well tested algorithm, if it does not make a noticeable difference.

Here's the plan:

remove the CHAR_BIT multiplication
for the second case: do the linear search for first non default constructible T, and then use the existing binary search with sizeof(T)+1 upper limit

runer112 · 2023-02-03T23:33:02Z

Your understanding of the approach used in these changes is correct: exponential search followed by binary search. However, your performance analysis only counts "steps". Critically, it does not factor in the cost of checking at each step whether the type is constructible with N arguments, which is O(N) memory and time.

Factoring this in to the default-constructible case, the exponential search worst case costs O(1 + 2 + 4 + ... + fields_count + 2 * fields_count) = O(4 * fields_count) to establish bounds separated only by a factor of 2. The best case costs O(1 + 2 + 4 + ... + fields_count + 1) = O(2 * fields_count).

Without exponential search and ignoring the possibility of bitfields, we start with a binary search over [0, sizeof(T)]. To reach bounds separated only by a factor of 2, in the best case of an average field size in [1, 2] bytes, this only requires one check costing O(sizeof(T) / 2). This ranges from O(field_count / 2) to O(field_count), which is 1/4 to 1/2 of exponential search's best case cost.

However, the favorable comparison fades rather quickly. Once the average field size passes 4 bytes, reaching bounds separated only by a factor of 2 requires (at least) three checks costing O(sizeof(T) / 2 + sizeof(T) / 4 + sizeof(T) / 8) = O(7/8 * sizeof(T)). The field count must be less than sizeof(T) / 4, so in terms of field count, the cost is at least O(7/8 * 4 * field_count) = O(7/2 * field_count). This is already nearly the worst case cost of exponential search, and it only gets worse for larger average field sizes.

In the worst case of a single field, the cost is O(sizeof(T) / 2 + sizeof(T) / 4 + sizeof(T) / 8 + ... + 1) = O(sizeof(T)). This is sizeof(T) / 4 times exponential search's worst case cost. The cost factor is nearly unbounded as it simply grows with sizeof(T).

Considering that the best case cost factor of using binary search only is 1/4 and the worst case cost factor is nearly unbounded, I believe that that alone should be enough reason to use exponential search. Additionally, it seems that exponential search may be faster in "average" use, as evidenced by the AppVeyor builds with these changes being roughly 5% faster than builds without. And as a reminder, really poor performance cases with binary search aren't just a theoretical possibility that won't happen in practice. I went through the effort to make and propose these changes specifically because I ran into such a poorly performing case where these changes made a huge difference (#120 (comment)).
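
To make the comparison concrete, here is a toy cost model. It is only a sketch under the assumption that probing a field count of N costs exactly N units; `binary_cost`, `bounded_cost`, and `exponential_cost` are hypothetical names, and unlike the analysis above the model also charges for the binary search that follows the doubling phase.

```cpp
#include <cstddef>

// Sum of all N probed by a binary search between lo and hi, where a probe
// of N is assumed to cost N units and succeeds iff N <= fields.
constexpr std::size_t binary_cost(std::size_t fields, std::size_t lo, std::size_t hi) {
    std::size_t cost = 0;
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        cost += mid;
        if (mid <= fields) lo = mid; else hi = mid;
    }
    return cost;
}

// Binary search only, over [0, sizeof(T) + 1), ignoring bitfields.
constexpr std::size_t bounded_cost(std::size_t fields, std::size_t size_of_t) {
    return binary_cost(fields, 0, size_of_t + 1);
}

// Doubling probes 1, 2, 4, ... until one fails, then binary search
// inside the final window.
constexpr std::size_t exponential_cost(std::size_t fields) {
    std::size_t cost = 1;  // probe N = 1
    if (fields < 1) return cost;
    std::size_t hi = 2;
    cost += hi;
    while (hi <= fields) { hi *= 2; cost += hi; }
    return cost + binary_cost(fields, hi / 2, hi);
}

// Example: 4 fields in a 4096-byte type.
static_assert(exponential_cost(4) < bounded_cost(4, 4096), "");
```

In this model the bounded search pays for its first probe at 2048 alone, while the whole exponential search never probes above 8.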

Regarding the latest CI failure, I haven't had much time to look into it yet. I was hoping that it wouldn't be necessary to do this because it feels unclean, but perhaps it's wise to add a sanity check to the exponential search to halt it if it somehow exceeds sizeof(T) * CHAR_BIT.

apolukhin · 2023-02-04T08:22:22Z

> Critically, it does not factor in the cost of checking at each step whether the type is constructible with N arguments, which is O(N) memory and time.

It should not take O(N). std::make_index_sequence was a bottleneck in so many cases, that the compiler developers made it an intrinsic (for example https://github.com/microsoft/STL/blob/73924c1920af92899f7582cd904ea819b9db35bc/stl/inc/type_traits#L34). So getting the indexes is O(1), variadic pack expansion is also close to O(1), ubiq_*ref_constructor is not a template and its constructors do not consume time/memory. As a result the check should take about O(1).

Please, check that your compiler supports the builtin and that it is properly detected in https://github.com/boostorg/pfr/blob/28bd7f541f7a7632f4fe52e30d06a121e7cb1f65/include/boost/pfr/detail/make_integer_sequence.hpp

runer112 · 2023-02-07T15:46:45Z

Even if std::make_index_sequence/std::make_integer_sequence is implemented intrinsically, the cost is not constant. The cost increases with the size of the sequence.

Here's a demonstration of recent versions of clang, gcc, and msvc all running out of resources (whether self-limited or host-limited time or memory) trying to evaluate std::make_integer_sequence<int, 10000000>(): https://godbolt.org/z/KqhfxbK4z. Dropping to 1 million, I see clang and gcc succeed, but only after multiple seconds. Dropping to 100 thousand, I see msvc finally succeed, and clang and gcc succeed in less than a second.

runer112 · 2024-09-05T16:23:28Z

I noticed that someone else (@rressi-at-globus) appears to have discovered these changes and found the changes to be useful enough to make their own attempt to get these changes upstreamed with #179. I resynchronized my branch with develop to hopefully allow this PR to proceed, now with third-party corroboration that these changes are beneficial.

runer112 · 2024-09-05T19:37:48Z

I re-ran the tests that I previously ran in #120 (comment) to once again demonstrate that the cost of std::make_index_sequence increases with the size of the sequence, now with more recent versions of clang, gcc, and msvc just in case something got optimized. My findings are the same as before: the cost of std::make_index_sequence increases with the size of the sequence; I suspect O(N).

https://godbolt.org/z/PKqes5h1M: std::make_index_sequence<10'000'000> gave me the same result across multiple retries for all 3 compilers tested: failure due to resource exhaustion.
https://godbolt.org/z/9jEGEG776: std::make_index_sequence<1'000'000> gave me best-case results of clang completing in 0.9 seconds, gcc completing in 2.6 seconds, and msvc failing due to resource exhaustion.
https://godbolt.org/z/PKqes5h1M: std::make_index_sequence<100'000> gave me best-case results of clang completing in 0.3 seconds, gcc completing in 0.5 seconds, and msvc completing in 1.3 seconds.

runer112 · 2024-09-05T23:38:43Z

@apolukhin: Here are direct responses to your comment, #120 (comment). I apologize for not doing this earlier.

> I'm worried about the cases when the whole type is not aggregate initializable, because in that case I think your algorithm would run until the RAM is exhausted and no diagnostic would be provided.

You were right, there was a problem. I added tests for a default-constructible type accepting 0 or more arguments and a non-default-constructible type accepting 1 or more arguments in e80f7aa. These tests should now pass with the changes from dd1ae1c.

> I'd rather stick to the existing, well tested algorithm, if it does not make a noticeable difference

It does make a noticeable difference for large types.

Practical evidence:

Theoretical evidence:

> Here's the plan:
>
> remove the CHAR_BIT multiplication
>
> for the second case: do the linear search for first non default constructible T, and then use the existing binary search with sizeof(T)+1 upper limit

As you're aware, the current implementation performs binary search with an upper bound of sizeof(T) * CHAR_BIT. For large types, the first few iterations could carry a significant cost, perhaps even resulting in resource exhaustion. This could maybe warrant removing the multiplication by CHAR_BIT at the cost of losing support for types with dense bitfields (average field size < 1 byte).

The proposed implementation would perform binary search with an upper bound that is no more than twice the field count. This would not carry any unnecessary significant cost. In this regard, removing the multiplication by CHAR_BIT should not be warranted.

There is, however, a preexisting case of concern: large types that are not aggregate-initializable in environments that do not provide std::is_aggregate. With both the existing and the proposed implementation, this may result in searching all the way up to sizeof(T) * CHAR_BIT just to eventually encounter resource exhaustion or produce an error stating that the type is not aggregate-initializable. But std::is_aggregate is required by the C++17 standard, environments without it will only become rarer with time, and using a non-aggregate type with pfr seems to be a misuse to begin with. So with the proposed implementation, removing the multiplication by CHAR_BIT only to mitigate poor performance in the case of misuse with a dated C++ environment, while also at the cost of losing support for types with dense bitfields, doesn't seem worth it to me. But if I've forgotten to account for another problematic case, or if you otherwise disagree and still want the multiplication by CHAR_BIT to be removed, let me know.

rressi-at-globus · 2024-09-06T06:30:08Z

I have exported this contribution as a patch for vcpkg's port of this component.

The patch is currently targeting boost-pfr version 1.85.0, and is under review here:

[boost-pfr] fields_count.hpp: fix compilation OOM issue microsoft/vcpkg#40823

apolukhin · 2024-09-13T12:44:28Z

Still fails the tests

runer112 · 2024-09-13T19:37:56Z

I'm looking at the output of the failed runs now, but there seems to be a lot of unstructured text to comb through... Any tips as to what to look at?

runer112 · 2024-09-13T20:02:53Z

After my best attempts to search for failures in the log files, I could not spot any test failures, but I did spot a couple of Boost Inspection Report problems. These should now be addressed. Hopefully this was all that was resulting in jobs failing?

coveralls · 2024-09-19T07:12:02Z

Pull Request Test Coverage Report for Build 10855330620

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 100.0%

Totals
Change from base Build 10537504895:	0.0%
Covered Lines:	407
Relevant Lines:	407

💛 - Coveralls

runer112 · 2024-09-20T01:24:01Z

Bleh, one more failing CI job. It's late for me today, but I can investigate that tomorrow.

Also, it appears that the coverage report didn't work because I merged develop back into this branch rather than rebasing. Would you like me to rebase instead?

runer112 · 2024-09-20T18:57:18Z

I have implemented a workaround for an MSVC issue that caused failures in windows (msvc-14.3, 20,latest, 64, windows-2022, -j1). It's hard for me to tell from the output if this was the only issue, but I'm hopeful that it was.

runer112 · 2024-09-20T23:12:49Z

I've made additional improvements that should remove any unnecessary trial initializations of array types. This means that the cost of calculating the field count of an array is now constant, as was probably initially intended. This is tested here:

https://github.com/runer112/pfr/blob/cbc57cc28758c36c77b3fbe4e4ef7fb8e98f5ae8/test/core/run/huge_count.cpp#L53
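
As a rough illustration of why an array can be handled in constant cost, the count can be read straight from the type instead of being discovered by trial initialization. This sketch is not the PR's implementation: `array_field_count` is a hypothetical name, and it assumes a one-dimensional array whose field count equals its extent.

```cpp
#include <cstddef>
#include <type_traits>

// Assumption for this sketch: a one-dimensional array's field count is its
// extent, which the type itself provides without any probing.
template <class T>
constexpr std::size_t array_field_count() {
    static_assert(std::is_array<T>::value, "T must be an array type");
    return std::extent<T>::value;
}

static_assert(array_field_count<int[1]>() == 1, "");
static_assert(array_field_count<int[1000000]>() == 1000000, "");
```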

I've also finally added testing of the core improvement that this PR provides, which is that the cost of calculating the field count of an object is not related to the object's size:

https://github.com/runer112/pfr/blob/cbc57cc28758c36c77b3fbe4e4ef7fb8e98f5ae8/test/core/run/huge_count.cpp#L55-L56

I've tested the changes manually on a couple of the test files with a few different compiler configurations, but this is nowhere near as much testing as CI performs. It's possible that a CI run will reveal one or two issues that I did not find, but I'm hopeful that it won't. Either way, I appreciate you approving CI runs.

runer112 marked this pull request as ready for review January 18, 2023 22:57

runer112 added 2 commits January 23, 2023 18:45

Update performance evaluation

aac7c36

runer112 added 3 commits January 24, 2023 12:42

Add complete type precondition and condition evaluation on preconditions

096445f

Restore parallel testing

de5d6f4

Test fields_count with incomplete types

28bd7f5

runer112 added 3 commits February 10, 2023 13:43

Add more field count tests

e80f7aa

Oops

ca3ca70

Prevent unbounded field count test growth

dd1ae1c

This could happen for a type with a constructor accepting a parameter pack. This also prevents unbounded growth in case something goes wrong with the logic and something should have already stopped (or never started).

This was referenced Sep 5, 2024

[boost-pfr] fields_count.hpp: fix compilation OOM issue microsoft/vcpkg#40823

Open

Improve field count typical case performance #179

Closed

Merge branch 'develop' into opt-field-count

ea57b67

Update copyright years

5b2b1cc

Address violation of Boost min/max guidelines

170338c

Address unacceptable character '+' in file name

63374ef

runer112 added 6 commits September 20, 2024 18:31

Prevent unnecessary triggering of last-resort assertion

a6d0c5f

Address MSVC issue

27a5c6e

Correct wording: constructible -> initializable

465ff1b

Skip inheritance check for non-classes

5546762

Makes the evaluation of the field count of huge arrays not result in excessive compiler resource utilization.

Skip hand-made aggregate check for non-classes

5e655d4

Makes the evaluation of the field count of huge arrays not result in excessive compiler resource utilization.

Test field count for huge types

cbc57cc

runer112 force-pushed the opt-field-count branch from 09ef2fb to cbc57cc Compare September 20, 2024 22:54
