Refactor to improve code size, fix GCC 12 #802
base: dev
- Fix the state structs to use unsigned char, update names
- Extract the algorithm steps into inline subroutines
- Fix a theoretical integer overflow bug
- Reroll XXH64 on 32-bit (it is going to run like crap anyway)
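For readers following along, here is a minimal sketch of what "extract the algorithm steps into inline subroutines" can look like for an XXH32-style round. The helper names (`demo_round32`, `demo_consume_stripes`, `demo_read_le32`) are made up for illustration and are not the identifiers used in this PR; only the two constants are the published XXH32 primes.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PRIME32_1 0x9E3779B1U
#define DEMO_PRIME32_2 0x85EBCA77U

static inline uint32_t demo_rotl32(uint32_t x, int r)
{
    return (x << r) | (x >> (32 - r));
}

/* Endian-independent 32-bit little-endian load. */
static inline uint32_t demo_read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* One XXH32-style round, written once and reused everywhere. */
static inline uint32_t demo_round32(uint32_t acc, uint32_t lane)
{
    acc += lane * DEMO_PRIME32_2;
    acc  = demo_rotl32(acc, 13);
    acc *= DEMO_PRIME32_1;
    return acc;
}

/* Consumes every whole 16-byte stripe and returns a pointer to the first
 * unconsumed byte. A one-shot hash and a streaming update could both call
 * this instead of each carrying its own copy of the loop. */
static inline const unsigned char *
demo_consume_stripes(uint32_t acc[4], const unsigned char *p, size_t len)
{
    while (len >= 16) {
        acc[0] = demo_round32(acc[0], demo_read_le32(p));
        acc[1] = demo_round32(acc[1], demo_read_le32(p + 4));
        acc[2] = demo_round32(acc[2], demo_read_le32(p + 8));
        acc[3] = demo_round32(acc[3], demo_read_le32(p + 12));
        p   += 16;
        len -= 16;
    }
    return p;
}
```

The point of the shape is that the loop body exists in exactly one place, so the compiler decides how many copies end up in the binary rather than the source dictating it.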
It's been long enough
- Add `XXH_32BIT_OPT` which is enabled for targets without 64-bit arithmetic
- Add a 64-bit only inline hint, apply it to the functions that need them
- TODO benchmark them more
- Tail call XXH3_64bits_update instead of inlining XXH3_update
- Reroll midrange on 32-bit (ARM is the only one that doesn't choke on the 128-bit multiplies)
- Extract the copy-pasted midrange finalization code in XXH128
XXH3_consumeStripes can now process multiple blocks, greatly simplifying the logic and making it resemble XXH32 and XXH64. This also reduces code size by a solid amount.
- XXH_SIZE_OPT now, instead of disabling inline hints, turns off *most* inline hints
- Things like acc512 and utilities remain force inline
- XXH_FORCE_INLINE now always force inlines for functions that need it.
Not the best solution, but not force inlining the accumulate functions is more of a problem than forcing GCC to -O3.
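A rough sketch of the two-tier inline hint idea described in these commits, assuming a GCC/Clang-style compiler. All `DEMO_*` names are hypothetical stand-ins and the real macro definitions in the PR may differ; `__OPTIMIZE_SIZE__` is the macro GCC and Clang predefine under `-Os`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size-optimization switch, defaulting to on under -Os. */
#ifndef DEMO_SIZE_OPT
#  ifdef __OPTIMIZE_SIZE__
#    define DEMO_SIZE_OPT 1
#  else
#    define DEMO_SIZE_OPT 0
#  endif
#endif

/* Tier 1: always force inlined, for hot loops and tiny utilities. */
#if defined(__GNUC__) || defined(__clang__)
#  define DEMO_FORCE_INLINE static __inline__ __attribute__((always_inline, unused))
#elif defined(_MSC_VER)
#  define DEMO_FORCE_INLINE static __forceinline
#else
#  define DEMO_FORCE_INLINE static
#endif

/* Tier 2: force inlined normally, left to the compiler when optimizing for size. */
#if DEMO_SIZE_OPT > 0
#  define DEMO_INLINE static
#else
#  define DEMO_INLINE DEMO_FORCE_INLINE
#endif

/* Core step: too small and too hot to ever pay for a call. */
DEMO_FORCE_INLINE uint32_t demo_mix(uint32_t acc, uint32_t lane)
{
    return (acc ^ lane) * 2654435761U;
}

/* Finalization: runs once per hash, so a call is acceptable under -Os. */
DEMO_INLINE uint32_t demo_finalize(uint32_t acc)
{
    acc ^= acc >> 15;
    return acc * 2246822519U;
}

static uint32_t demo_hash(const uint32_t *lanes, size_t n)
{
    uint32_t acc = 0;
    size_t i;
    for (i = 0; i < n; i++) acc = demo_mix(acc, lanes[i]);
    return demo_finalize(acc);
}
```

The toy hash itself is meaningless; what matters is that only the second tier reacts to the size switch, so the hot path keeps its codegen in every build mode.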
d54cf26
to
e3776e2
Seems like some apt repos are down which is causing some things to fail.
TIL C90 doesn't have SIZE_MAX.
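For context, `SIZE_MAX` arrived with C99's `<stdint.h>`. A common C90-compatible fallback relies on the guaranteed modular conversion of `-1` to an unsigned type. The names `DEMO_SIZE_MAX` and `demo_add_would_overflow` below are illustrative only and are not taken from the PR.

```c
#include <stddef.h>

/* Use the standard macro when some header already provided it,
 * otherwise fall back to the largest value size_t can hold. */
#ifdef SIZE_MAX
#  define DEMO_SIZE_MAX SIZE_MAX
#else
#  define DEMO_SIZE_MAX ((size_t)-1)
#endif

/* Overflow-safe test for "would a + b wrap around?" that never
 * performs the wrapping addition itself. */
static int demo_add_would_overflow(size_t a, size_t b)
{
    return a > DEMO_SIZE_MAX - b;
}
```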
I also have an idea that seems to get full performance (at least on large scale, I'll have to test the latency) with only 2 copies of This manages to get
```c
typedef struct {
    XXH_ALIGN_MEMBER(64, xxh_u64 acc[8]);
} XXH3_acc_t;

XXH_FORCE_INLINE XXH3_acc_t
XXH3_hashLong_internal_loop_impl(...)
{
    XXH3_acc_t acc = { XXH3_INIT_ACC };
    // normal internal_loop stuff
    return acc;
}

XXH_NO_INLINE XXH3_acc_t
XXH3_hashLong(const xxh_u8* XXH_RESTRICT input, size_t len,
              const xxh_u8* XXH_RESTRICT secret, size_t secretSize)
{
    return XXH3_hashLong_internal_loop_impl(input, len, secret, secretSize, XXH3_accumulate,
                                            XXH3_scrambleAcc);
}

#if XXH_SIZE_OPT <= 0
/* Constant folded for the default secret size */
XXH_NO_INLINE XXH3_acc_t
XXH3_hashLong_defaultSecretSize(const xxh_u8* XXH_RESTRICT input, size_t len,
                                const xxh_u8* XXH_RESTRICT secret, size_t secretSize)
{
    (void) secretSize;
    return XXH3_hashLong_internal_loop_impl(input, len, secret, sizeof(XXH3_kSecret), XXH3_accumulate,
                                            XXH3_scrambleAcc);
}
#else
/* Don't constant fold on size opt */
# define XXH3_hashLong_defaultSecretSize XXH3_hashLong
#endif
```
Returning a struct like this is functionally identical to memcpying to an out pointer on the ABI level, but the compiler knows more about it and can optimize it better. (Just having an output pointer without an inlining is 40% slower). I would have to clean it up, rename it, and test it some more though, so I would probably keep this to another PR. Edit: This could be a step towards making dispatching on x86 standard as well since the overridden functions are now only 2-3 (also the feature test is 99% logging - it normally compiles to like 50 instructions).
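The snippet above is not compilable on its own, so here is a self-contained toy with the same shape, in case it helps to experiment with the idea: one force-inlined generic loop that returns its accumulators by value, plus two non-inlined entry points, one of which pins the block length to a compile-time constant. Everything named `demo_*` or `DEMO_*` is invented for this sketch, and the "hash" is a trivial byte sum rather than XXH3.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#  define DEMO_FORCE_INLINE static __inline__ __attribute__((always_inline, unused))
#  define DEMO_NO_INLINE    static __attribute__((noinline, unused))
#else
#  define DEMO_FORCE_INLINE static
#  define DEMO_NO_INLINE    static
#endif

#define DEMO_DEFAULT_BLOCK 4

typedef struct { uint64_t acc[4]; } demo_acc_t;
typedef void (*demo_step_f)(uint64_t *acc, const unsigned char *p);

/* Stand-in for an accumulate kernel: reads 4 bytes per call. */
static void demo_step_scalar(uint64_t *acc, const unsigned char *p)
{
    int i;
    for (i = 0; i < 4; i++) acc[i] += p[i];
}

/* Generic loop. Once inlined into a wrapper, `step` (and possibly
 * `blockLen`) are constants the optimizer can fold. */
DEMO_FORCE_INLINE demo_acc_t
demo_loop_impl(const unsigned char *in, size_t len, size_t blockLen, demo_step_f step)
{
    demo_acc_t r = {{1, 2, 3, 4}};
    size_t i;
    if (blockLen < 4) return r;          /* the kernel needs 4 readable bytes */
    for (i = 0; i + blockLen <= len; i += blockLen) step(r.acc, in + i);
    return r;                            /* returned by value, no out pointer */
}

/* Copy 1: block length only known at run time. */
DEMO_NO_INLINE demo_acc_t
demo_hash_generic(const unsigned char *in, size_t len, size_t blockLen)
{
    return demo_loop_impl(in, len, blockLen, demo_step_scalar);
}

/* Copy 2: block length constant folded. */
DEMO_NO_INLINE demo_acc_t
demo_hash_default(const unsigned char *in, size_t len)
{
    return demo_loop_impl(in, len, DEMO_DEFAULT_BLOCK, demo_step_scalar);
}
```

Both entry points must agree whenever the run-time block length equals the default, which is an easy property to check when refactoring this kind of code.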
How do the rest of these changes look aside from that (which since it would affect dispatching is a huge change that will need some more testing)? The changes shouldn't affect 64-bit codegen that much (although XXH3_update may be worth some benchmarking) The biggest changes codegen-wise are 32-bit not being forced to inline and unroll ugly functions as much (which have more latency than a call frame/loop and aren't worth the code size increase) and |
I'm going to leave the hashLong rework to another PR because that would need reworking dispatching, and that would be a perfect time to redo that mess.
I'm trying to get a sense of the scope of the changes in this PR. To rephrase it :
If that's the correct scope, then there are a number of measurements to do and publish to document this PR. Longer term, I wonder if there is something that could / should be done to ensure these properties are preserved in the future. Otherwise, it would be easy for a future PR to inflate the binary size without anyone noticing it. |
I made a series of measurements, leading to surprising results (at least for me), when comparing this Let's start with
So that's the first surprise : the changes are beneficial to binary sizes for the performance mode, but when it comes to the size optimized Now let's have a look to performance. The scenario measured is the default
To begin with, at So, some surprises, but all in all, this PR has positives. Now, it was also mentioned that this PR is supposed to be even more useful for 32-bit mode, so let's try that. Once again, compiler is
The first thing to note is how big these binary sizes are, compared to Now, what about performance ?
Once again, It's difficult to draw conclusions for this compilation exercise. There are other axes that could be analyzed. For example, add an One potential idea : is it possible to break this PR into smaller stages ? Maybe some of these stages produce uncontroversial benefits that could be quickly merged ? Maybe others require pondering some trade-offs (which can still be acceptable) ? Finally, maybe there are useful learnings that could be used to improve other parts of the PR ? Also, I would recommend rebasing this PR on top of
Since the focus of this PR is on binary size, I wanted to have a more in-depth look at these evolutions. Let's start this exercise with
| algo | target | compiler | flags | dev size | pr802 size | lib size diff |
|---|---|---|---|---|---|---|
| XXH32 | x64 | gcc v9.4.0 | -O3 | 16576 | 16556 | -20 |
| XXH32 | x64 | gcc v9.4.0 | -O2 | 16616 | 16616 | 0 |
| XXH32 | x64 | gcc v9.4.0 | -Os | 16600 | 16664 | +64 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -O3 | 15456 | 15456 | 0 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -O2 | 15488 | 15488 | 0 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -Os | 15440 | 15532 | +92 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -O3 | 33934 | 33934 | 0 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -O2 | 33934 | 33934 | 0 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -Os | 33966 | 34014 | +48 |
A few interesting details here.
To begin with, we know that performance remains largely unaffected for `XXH32`, staying at `6.9 GB/s` on the `i7-9700k` no-turbo target cpu, whatever the compilation mode and flags. The story is essentially the same on `M1`, with `XXH32` delivering consistently `7.0 GB/s`, and a small (unimportant) degradation at `pr802` with `-Os` setting.
We also note that the library size differences between modes and commits remain extremely small, which is consistent.
Two remarkable trends are worth mentioning though :
- Modes expected to produce smaller binary sizes actually produce larger binary sizes. This trend is clearer for `pr802`. At the end of the story, the differences are not large, but that still goes against expectations.
- `pr802` is especially worse at binary size for the `-Os` mode. This is also the reverse of my initial expectations.
It could be that binary size differences are in fact unrelated to `XXH32` proper, and due to minor differences in other generic symbols still compiled in the library, such as `XXH_versionNumber()` for example.
In conclusion, when it comes to `XXH32`, this `pr802` is roughly neutral.
Now, onto `XXH64`:
| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH64 | x64 | gcc v9.4.0 | -O3 | 17072 | 17072 | 0 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | -O2 | 17016 | 17016 | 0 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | -Os | 17040 | 17144 | +104 | 13.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -O3 | 19840 | 19872 | +32 | 3.1 -> 2.9 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -O2 | 19900 | 15924 | -3976 | 2.8 -> 2.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -Os | 19852 | 15916 | -3936 | 2.7 -> 2.6 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -O3 | 34414 | 34414 | 0 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -O2 | 34414 | 34414 | 0 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -Os | 34478 | 34718 | +304 | 14.0 -> 13.7 GB/s |
Some learnings:
In `x64` mode, the library size difference is very small compared to the `XXH32`-only library. Less than ~0.5KB is added. This is a similar story for `aarch64`. This seems to show that `XXH64` is compact enough, at all compilation settings.
As we already know, speed remains unaffected, at `13.7 GB/s` on `i7-9700k`. It is only slightly degraded at `-Os` with `pr802` on `M1 Pro`, still offering roughly double `XXH32` speed.
Unfortunately, `pr802` doesn't offer any binary size savings. Actually, `-Os` is slightly bigger than `dev`.
On `x86` 32-bit mode, there is a more substantial binary size difference, and `dev` branch requires > +4 KB to add `XXH64`. This suggests that the `XXH64` algorithm gets expanded significantly. Probably some loop unrolling.
Binary size is improved by `pr802`, showing that it's possible to add `XXH64` with a budget of roughly ~0.5KB, similar to `x64` mode.
We also know that `XXH64` performance is much lower in 32-bit mode, due to reliance on 64-bit multiply operations.
And unfortunately, `pr802` makes this situation slightly worse.
This would probably remain an acceptable trade-off when it saves almost ~4 KB of binary size, but strangely, at `-O3`, there is no such binary size saving, yet the performance difference is still very much present (checked several times).
It's unclear to me why there is a (small) performance difference between `dev` and `pr802` in 32-bit `x86` mode.
It's also unclear to me why binary size savings seem ineffective at `-O3` specifically, while `-O2` seems to benefit from it.
In conclusion, when it comes to `XXH64`, this `pr802` is slightly negative for performance. There is an interesting binary size saving on `x86` at `-O2` and `-Os` settings, that could potentially compensate the small speed degradation. But for `64-bit` modes, there is no positive to report.
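To make the "reliance on 64-bit multiply operations" point concrete, here is a sketch of roughly what a 32-bit target has to do for each 64-bit multiply when only the low 64 bits of the product are needed. The function name `demo_mul64_via32` is hypothetical, and real compilers may schedule this differently; it is only meant to show where the extra work comes from.

```c
#include <stdint.h>

/* Low 64 bits of a * b using only 32-bit wide inputs:
 * one widening 32x32->64 multiply plus two 32x32->32 multiplies. */
static uint64_t demo_mul64_via32(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t lo_lo = (uint64_t)a_lo * b_lo;       /* full 64-bit partial product */
    uint32_t cross = a_lo * b_hi + a_hi * b_lo;   /* only its low 32 bits survive */

    /* a_hi * b_hi would land entirely above bit 63, so it is dropped. */
    return lo_lo + ((uint64_t)cross << 32);
}
```

An XXH64 round multiplies twice per 64-bit lane, so every lane costs several 32-bit multiplies plus the carry handling, which is consistent with the large speed gap between the `x64` and `-m32` rows.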
`XXH128`:
For this last one, we can simply re-employ results from previous measurement exercise.
| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -O3 | 76392 | 60080 | -16312 | 20.1 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -O2 | 51936 | 43912 | -8024 | 14.0 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -Os | 32208 | 36600 | +4392 | 12.6 -> 14.9 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -O3 | 111860 | 75436 | -36424 | 5.7 -> 5.6 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -O2 | 70924 | 46968 | -23956 | 2.8 -> 2.7 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -Os | 34876 | 34920 | +44 | 2.2 -> 2.4 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -O3 | 69678 | 53246 | -16432 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -O2 | 69678 | 53246 | -16432 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -Os | 37198 | 37198 | 0 | 10.9 GB/s |
To begin with, it's clear that library size increases significantly with `XXH3` variants. Sure, there are 2 variants, divided into a number of range strategies, plus streaming capabilities, ok. Yet, I'm nonetheless surprised that it requires so much. The `-Os` compilation mode is the only one that seems in the "acceptable" range.
Here, `pr802` brings pretty big binary size savings, at `-O2` and especially `-O3`.
For `-Os` though, it doesn't. It actually increases size for the `x64` mode.
`-O3` remains an important compilation flag for best performance of `XXH128` on `gcc` : speed drops very significantly with `-O2` and `-Os`.
On `clang`, the picture is slightly different : speed is exactly the same at `-O3` and `-O2`. But at `-Os`, the performance drop is very steep.
The performance impact of `pr802` is all over the place, from neutral, to slightly negative, to positive.
The most interesting aspect is seeing `-Os` mode receiving a speed boost on `gcc` + intel cpu.
For `x64`, this speed boost is significant enough to beat `-O2`, which is really nice since binary is also smaller.
For `x86`, the speed boost is not as important, but still a hefty +10% compared to `dev`.
So there's probably something worth investigating there.
To be continued : add more targets, more compilers, etc.
The previous analysis was using Dynamic Library to compare generated binary sizes. It was assumed that this wouldn't make much difference compared to Static Libraries, with maybe just a constant factor added. This is relatively difficult to make sense of, and difficult to decide, so I left both analyses in the thread. We can continue from the Static or the Dynamic library size analysis, depending on preference. Since the focus of this PR is on binary size, I wanted to have a more in-depth look at these evolutions. Let's start this exercise with
| algo | target | compiler | flags | dev size | pr802 size | lib size diff |
|---|---|---|---|---|---|---|
| XXH32 | x64 | gcc v9.4.0 | -O3 | 4268 | 4404 | +136 |
| XXH32 | x64 | gcc v9.4.0 | -O2 | 4108 | 4228 | +120 |
| XXH32 | x64 | gcc v9.4.0 | -Os | 3780 | 3836 | +56 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -O3 | 4412 | 4460 | +48 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -O2 | 4200 | 4248 | +48 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -Os | 3594 | 3590 | -4 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -O3 | 3416 | 3432 | +16 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -O2 | 3400 | 3416 | +16 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -Os | 3392 | 3432 | +40 |
We know that `XXH32` speed remains largely unaffected, staying at `6.9 GB/s` on the `i7-9700k` no-turbo target cpu, whatever the compilation mode and flags. The story is essentially the same on `M1`, with `XXH32` delivering consistently `7.0 GB/s`, and a small (unimportant) degradation at `pr802` with `-Os` setting.
The library size differences between modes and commits remain small, which is unsurprising.
But we can also notice that `pr802` impact is generally negative for binary size.
Not by a large amount, therefore it's not a big deal.
Since there is no corresponding speed benefit, `pr802` looks globally slightly negative for the `XXH32`-only scenario.
But as mentioned, this is minor, and can probably be ignored if there is a much larger benefit for other scenarios.
Now, onto `XXH64`:
| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH64 | x64 | gcc v9.4.0 | -O3 | 7526 | 7886 | +360 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | -O2 | 7126 | 7446 | +320 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | -Os | 6326 | 6406 | +80 | 13.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -O3 | 10638 | 9866 | -772 | 3.1 -> 2.9 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -O2 | 9110 | 7814 | -1296 | 2.8 -> 2.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -Os | 8048 | 6472 | -1576 | 2.7 -> 2.6 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -O3 | 5896 | 5912 | +16 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -O2 | 5832 | 5848 | +16 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -Os | 5848 | 6128 | +280 | 14.0 -> 13.7 GB/s |
Some learnings:
As we already know, in 64-bit modes (`x64`, `aarch64`), speed remains unaffected, at `13.7 GB/s` on `i7-9700k`. It is only slightly degraded at `-Os` with `pr802` on `M1 Pro`, still offering roughly double `XXH32` speed.
Unfortunately, `pr802` doesn't offer any binary size savings for this scenario. It is actually slightly bigger than `dev`.
On `x86` 32-bit mode, there is a more substantial binary size difference, and `dev` branch requires 2x binary size budget to add `XXH64`. This suggests that the `XXH64` algorithm gets expanded significantly. Probably some loop unrolling.
Binary size is improved by `pr802`, showing that it's possible to add `XXH64` with a budget roughly similar to `x64` mode.
We also know that `XXH64` performance is much lower in 32-bit mode, due to reliance on 64-bit multiply operations.
And unfortunately, `pr802` makes this situation slightly worse.
This would probably remain an acceptable trade-off when it saves binary size, but strangely, at `-O3`, binary size savings are not great, yet the performance difference is very much present (checked multiple times).
It's unclear to me why there is a (small) performance difference between `dev` and `pr802` in 32-bit `x86` mode.
It's also unclear to me why binary size savings are less effective at `-O3` specifically, while `-O2` seems to benefit more from it.
In conclusion, when it comes to `XXH64`, this `pr802` seems slightly negative for performance. There is an interesting binary size saving on `x86`, especially at `-O2` and `-Os` settings, that could potentially compensate the small speed degradation. But for `64-bit` modes, there is unfortunately no positive outcome to report.
`XXH128`:
Let's redo binary size measurements, using `static` library size instead.
| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -O3 | 67232 | 53200 | -14032 | 20.1 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -O2 | 40880 | 36864 | -4016 | 14.0 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -Os | 19752 | 20432 | +680 | 12.6 -> 14.9 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -O3 | 107280 | 73174 | -34106 | 5.7 -> 5.6 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -O2 | 60796 | 39188 | -21608 | 2.8 -> 2.7 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -Os | 25152 | 23278 | -1878 | 2.2 -> 2.4 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -O3 | 50456 | 44408 | -6048 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -O2 | 48768 | 42248 | -6520 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -Os | 21600 | 24336 | +2736 | 10.9 GB/s |
To begin with, it's clear that library size increases significantly with `XXH3` variants. Sure, there are 2 variants, divided into a number of range strategies, plus streaming capabilities, ok. Yet, I'm nonetheless surprised that it requires so much. The `-Os` compilation mode is the only one that lies in the "acceptable" range.
Here, `pr802` brings pretty big binary size savings, at `-O2` and especially `-O3`.
For `-Os` though, it's less clear, and can actually increase size.
`-O3` remains an important compilation flag for best performance of `XXH128` on `gcc` : speed drops very significantly with `-O2` and `-Os`.
On `clang`, the picture is slightly different : speed is exactly the same at `-O3` and `-O2`. But at `-Os`, the performance drop is very steep.
The performance impact of `pr802` is all over the place, from neutral, to slightly negative, to positive.
The most interesting aspect is seeing `-Os` mode receiving a speed boost on `gcc` + intel cpu.
For `x64`, this speed boost is significant enough to beat `-O2`, which is really nice since binary size is also smaller.
For `x86`, the speed boost is not as important, but still a hefty +10% compared to `dev`.
So there's probably something worth investigating there.
Note : some results are sometimes surprising. Ensuring that `pr802` is based on top of `dev` is also important to ensure this comparison analysis doesn't produce wrong conclusions.
To be continued : add more targets, more compilers, etc.
Sure, I can do that. Sorry I adhd'd again 😵💫 I think that the isolated changes that are not going to be controversial are:
The future changes could be
Yes. That is expected. On 64-bit targets, the size decrease on The size increase on This change is because The performance issues were noticeable more on GCC ARMv4, where it outlined Although I will have to look into fixing the GCC x86 performance issues. This might just be from the
The way it is force inlined makes there be 8. You can easily see the size of each function if you compile with The short hashing code (XXH3_[64,128]bits_internal) is deceptively heavy on 32-bit. When compiled with However, arguably, there is some benefit to this, as the short hash is more sensitive to latency, and the constant propagation of Additionally, as I mentioned before, the way hashLong is set up causes inline clones as well, which afaik doesn't seem to be beneficial now that the codepaths are identical. Another note is that on x86, the 64->128-bit multiply is a lot of instructions because of how disruptive the |
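As background for the remark about the 64->128-bit multiply being expensive on 32-bit x86, this is the classic portable decomposition into four 32x32->64 partial products. It is a generic textbook-style sketch with hypothetical names (`demo_u128`, `demo_mul64to128`), not a copy of the library's implementation.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } demo_u128;

/* Full 128-bit product of two 64-bit values using only 32x32->64 multiplies. */
static demo_u128 demo_mul64to128(uint64_t a, uint64_t b)
{
    uint64_t lo_lo = (a & 0xFFFFFFFFU) * (b & 0xFFFFFFFFU);
    uint64_t hi_lo = (a >> 32)         * (b & 0xFFFFFFFFU);
    uint64_t lo_hi = (a & 0xFFFFFFFFU) * (b >> 32);
    uint64_t hi_hi = (a >> 32)         * (b >> 32);

    /* This sum cannot overflow 64 bits: the largest possible value is
     * (2^32 - 1) + (2^32 - 1) + (2^32 - 1)^2 = 2^64 - 1. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFU) + lo_hi;

    demo_u128 r;
    r.hi = (hi_lo >> 32) + (cross >> 32) + hi_hi;
    r.lo = (cross << 32) | (lo_lo & 0xFFFFFFFFU);
    return r;
}
```

Four multiplies and a chain of dependent adds per call is a lot to inline and unroll, which fits the observation that the short hashing code is deceptively heavy on 32-bit targets.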
The first thing I'll do is find the minimum amount of |
I have a rough draft of the hashLong refactor that is designed to be dispatched, but |
I believe there were already several fixes merged for the |
The "fix However, the PR also lists "improve code size" as an objective. |
I compared dev ( 258351b ) and this PR ( 98be0b9 ) Note that this report contains all XXH algorithms.
We can still see some size difference in x86 mode. CSV
Procedure
```sh
cd
git clone https://github.com/Cyan4973/xxHash.git xxhash-pr-802
cd xxhash-pr-802
mkdir report
export CC="gcc-11"
git checkout dev
export CH=$(git --no-pager log -1 --format="%h")
export CFLAGS="-O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
git pull origin pull/802/head
git checkout FETCH_HEAD
git log -1
export CH=$(git --no-pager log -1 --format="%h")
export CFLAGS="-O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
# Show CSV
cat report/*
```
As for code size, c6dc92f also reduces it and may implement more robust dispatch functionality.
I should go back and finish that. I kinda got distracted by another project 😅
Ideally, I would like to publish release I presume completing this objective is not trivial, and therefore should rather be expected as a potential contribution to a later version ? (either |
Looking over the code I think it just needs some clean up, documentation, and benchmarking.
This PR has remained inactive in "draft" mode for a long period of time. The real question is whether this is abandoned, in which case, it would be appropriate to close it.
Overview
Kinda a dump of (mostly) related changes I've been sitting on and never bothered to commit.
This is a set of changes to:
- `-O3`
- `-O3` could get well over 100 kB `.text`
- `hashLong` constant propagation can be reduced
- `-Os`/`XXH_SIZE_OPT` performance
These haven't been micro-benchmarked yet. The "what is worth inlining" is a rough estimate, and I want to go through the various recent compilers to find what works best.
Major changes:
- `-O3` to work around the `-Og` bug. Fixes GCC-12 fails to inline XXX_FORCED_INLINE functions, XXH3_accumulate_512_sse2(), and XXH3_scrambleAcc_sse2() with -Og flag. #800, Pb when building with gcc-12 -Og #720
- `__OPTIMIZE__`, `!__NO_INLINE__` and `XXH_SIZE_OPT <= 0` (`!__OPTIMIZE_SIZE__`), so `-O2` and `-Og` both apply
- `XXH_INLINE`: Acts like existing `XXH_FORCE_INLINE` where it is disabled when `XXH_SIZE_OPT` is used.
- `XXH_FORCE_INLINE` now actually force inlines unconditionally (when inline hints are on) and is used on XXH3 and XXH32's core loops and utilities.
- `-Os` performance with only a small code size overhead
- `XXH_32BIT_OPT`, `XXH_INLINE_64BIT`: These are used to reduce code bloat on 32-bit
- `clang armv4t -O3` code size by about half, now being <100 kB
- `XXH32` and `XXH64` have been rewritten to reuse the algorithm steps
- `XXH3_update` has been rewritten to be much easier to understand and much smaller.
- `XXH3_128bits_update` now tail calls `XXH3_64bits_update`, greatly reducing code size at the cost of one branch
Minor changes:
- `-O2` on AVX2, that bug has been fixed.
- `XXH32_state_t` and `XXH64_state_t` have some fields renamed and now use `unsigned char [N]` for their buffer
- `XXH_OLD_NAMES` has been removed, it's been long enough