Optimize TrieLookupTable #18405

Closed
mergeconflict opened this issue Oct 4, 2021 · 15 comments
@mergeconflict
Member

@mattklein123 put some work, many years ago, into optimizing HeaderMapImpl for header lookups that are constant-time in terms of the total number of headers. More recently, @ramaraochavali extracted the Trie code into common for use in Redis.

I was poking around a bit, looking at some performance data from our Envoy deployments at Google, and found that we spend about 1-2% of our CPU time on the worker threads in Envoy::TrieLookupTable::find, called from Envoy::Http::HeaderMapImpl::StaticLookupTable::lookup. If we can squeeze a bit more performance out of this, it's worthwhile for us, and I think I may have found a way to do so.

The current TrieLookupTable implementation is sparse and uncompressed: each TrieEntry contains an array with 256 entries, keyed by a single character. We can think of find() as being worst-case O(n) in terms of the number of characters in the string_view key: we traverse entries until we either know the trie doesn't contain our key, or we've found it. Of course, any lookup implementation will have O(n) worst case time complexity in the length of the key (for example, if we just used a hash map, we have to look at all the characters to compute a hash), but the repeated pointer traversal may add a significant constant factor. Pointer traversal is more expensive than accessing sequential bytes in CPU cache.
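For concreteness, here's a minimal sketch of the kind of sparse, uncompressed 256-way trie described above (illustrative only; names and details differ from Envoy's actual TrieLookupTable):

```cpp
#include <array>
#include <memory>
#include <string_view>

// One node per prefix: a value slot plus one child pointer per possible byte.
template <class Value> struct TrieEntry {
  Value value_{}; // set only on nodes that terminate a key
  std::array<std::unique_ptr<TrieEntry>, 256> entries_;
};

template <class Value> class TrieLookupTable {
public:
  void add(std::string_view key, Value value) {
    TrieEntry<Value>* node = &root_;
    for (unsigned char c : key) {
      if (!node->entries_[c]) {
        node->entries_[c] = std::make_unique<TrieEntry<Value>>();
      }
      node = node->entries_[c].get();
    }
    node->value_ = std::move(value);
  }

  // One pointer chase per input byte: cheap to bail out early on a miss, but
  // cache-unfriendly on long keys, which is the constant factor at issue here.
  Value find(std::string_view key) const {
    const TrieEntry<Value>* node = &root_;
    for (unsigned char c : key) {
      node = node->entries_[c].get();
      if (node == nullptr) {
        return Value{}; // miss: no entry with this prefix
      }
    }
    return node->value_;
  }

private:
  TrieEntry<Value> root_;
};
```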

So I did a bit of digging in the literature and found a couple data structures for this purpose that are more cache-friendly:

  • Cache-conscious string hash tables: a fast, unordered hash optimized for strings.
  • HAT-Trie: an ordered prefix trie that is compressed to achieve similar lookup performance to the cache-conscious hash table, with lower memory consumption.

There are some promising benchmarking results in these papers, but I haven't yet done any benchmarking in the context of Envoy to compare the current implementation against them (or, for that matter, against absl::flat_hash_map). Matt previously pointed out that we probably need to flesh out our existing benchmarks more. I know @adisuissa has done some work in this area before as well, so we'll chat and figure out the right thing to do.

@mergeconflict mergeconflict added bug triage Issue requires triage labels Oct 4, 2021
@jmarantz
Contributor

jmarantz commented Oct 4, 2021

Also consider perfect hash-tables for a set of keys known at compile-time: https://www.gnu.org/software/gperf/manual/gperf.html
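For illustration, a rough sketch of how gperf could be wired in, with all file names, class names, and commands here hypothetical (gperf's C++ output mode generates a class with a static in_word_set() lookup function):

```cpp
// known_headers.gperf (hypothetical input file):
//   %language=C++
//   %define class-name KnownHeaders
//   %%
//   content-type
//   content-length
//   cookie
//   date
//   %%
//
// Generated at build time with something like:
//   gperf known_headers.gperf > known_headers.inc
#include <cstddef>

#include "known_headers.inc" // hypothetical generated file defining KnownHeaders

// in_word_set() returns a pointer to the matched keyword, or nullptr if the
// key is not in the compile-time set.
bool isKnownHeader(const char* name, size_t len) {
  return KnownHeaders::in_word_set(name, len) != nullptr;
}
```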

@mattklein123 mattklein123 added area/http area/perf help wanted Needs help! and removed bug triage Issue requires triage labels Oct 4, 2021
@jmarantz
Contributor

jmarantz commented Oct 4, 2021

FWIW sample gperf usage in a C++ system built with bazel:

It'd be worth just doing the easiest thing first, though, and trying absl::flat_hash_map.
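A minimal sketch of what that swap could look like, with the add/find interface loosely modeled on the existing trie (an assumption for illustration, not Envoy's actual code):

```cpp
#include <string>
#include <string_view>
#include <utility>

#include "absl/container/flat_hash_map.h"

// Sketch of a hash-based replacement for TrieLookupTable. absl's default
// string hasher and equality are transparent, so find() can take a
// std::string_view without materializing a temporary std::string.
template <class Value> class HashLookupTable {
public:
  void add(std::string_view key, Value value) {
    map_.insert_or_assign(std::string(key), std::move(value));
  }

  // Returns the mapped value, or a default-constructed Value on a miss,
  // mirroring the trie's find() semantics.
  Value find(std::string_view key) const {
    auto it = map_.find(key); // heterogeneous lookup, no allocation
    return it == map_.end() ? Value{} : it->second;
  }

private:
  absl::flat_hash_map<std::string, Value> map_;
};
```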

@mattklein123
Member

> It'd be worth just doing the easiest thing first, though, and trying absl::flat_hash_map.

Agree this is worth trying. This trie code originates from 5+ years ago, well before we were using abseil, and in response to some silly horse race benchmarks with nginx. There are likely better ways of doing it than this simple trie.

@adisuissa
Contributor

+1 on first trying absl::flat_hash_map, which I think is a very simple change to make.
As for the kind of benchmarks, IIUC most of the header names have the same prefix, so we should aim for tests that capture this scenario. Not sure if there's a known distribution of header name lengths, but I guess we should test various lengths between 4 and 128.

@jmarantz
Contributor

jmarantz commented Oct 4, 2021

RE "same prefix": can you give examples?

I was thinking of very common things like "Date", "Cache-Control", "Content-Type", "Cookie".

There's also the common case where a bunch of custom headers are in use and they all fail the builtin-header lookup. Failing that lookup should be fast too.

@adisuissa
Contributor

I meant the ones that use the same prefix (e.g., "x-envoy-"). I guess some of the others can also be considered as sharing a prefix (e.g., those that start with "content-" or "grpc-").

@jmarantz
Contributor

jmarantz commented Oct 4, 2021

That dynamic-ish prefix makes it less elegant to use something like gperf, but my suspicion is that with absl::flat_hash_map we could get some benefit without having to think too much about a hierarchical prefix lookup first. It'd be best to get some benchmarks.

@mergeconflict
Member Author

Alright, I did some basic benchmarking of a few different implementations, including the "HAT-Trie" I linked above. I will try to present the benchmark code and the results in some reasonable format here tomorrow, but for now I'll summarize: absl::flat_hash_map seems like a good choice. Probably somewhere between 10x and 100x faster than the current implementation in my environment, depending on the number of entries and the length of the keys.

@mergeconflict
Member Author

mergeconflict commented Oct 5, 2021

Ok, I wrote two different benchmarks: one to measure the performance of hits in the lookup table, and the other to measure misses. For both benchmarks, I have two variables (a sketch of the benchmark loop follows the list):

  • The number of characters in each key (from 4 to 128 in powers of two)
  • The number of entries in the lookup table (from 10 to 10,000 in powers of ten)
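For reference, here's a rough sketch of the shape of the hit benchmark (assuming Google Benchmark; the actual harness, names, and key generation may differ):

```cpp
#include <random>
#include <string>
#include <vector>

#include "absl/container/flat_hash_map.h"
#include "benchmark/benchmark.h"

// Generate `count` random lowercase keys of the given length, with a fixed
// seed for reproducibility. Duplicates are possible but harmless here.
static std::vector<std::string> makeRandomKeys(size_t count, size_t length) {
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> dist('a', 'z');
  std::vector<std::string> keys;
  for (size_t i = 0; i < count; ++i) {
    std::string key(length, '\0');
    for (auto& c : key) c = static_cast<char>(dist(rng));
    keys.push_back(std::move(key));
  }
  return keys;
}

static void bmHashHit(benchmark::State& state) {
  const auto keys = makeRandomKeys(state.range(0), state.range(1));
  absl::flat_hash_map<std::string, int> map;
  for (const auto& key : keys) map[key] = 1;
  for (auto _ : state) {
    for (const auto& key : keys) {
      benchmark::DoNotOptimize(map.find(key)); // every lookup is a hit
    }
  }
}
// Entries from 10 to 10,000 in powers of ten; key lengths from 4 to 128 in
// powers of two, matching the variables described above.
BENCHMARK(bmHashHit)->ArgsProduct({{10, 100, 1000, 10000}, {4, 8, 16, 32, 64, 128}});
BENCHMARK_MAIN();
```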

The strings I'm populating the lookup tables with are just randomly generated garbage; they aren't necessarily representative of the actual key names listed in source/common/http/headers.h (I should probably do another test using those). I ran these benchmarks over four different implementations including the existing trie, and it turns out the only alternative worth mentioning (at least in my exploration so far) is absl::flat_hash_map. Here are the comparative results:

Hit Performance

| Entries | Key Length | Trie Lookup Time (ns) | Hash Lookup Time (ns) | Ratio (trie / hash) |
| ------: | ---------: | --------------------: | --------------------: | ------------------: |
| 10 | 4 | 28.0 | 73.6 | 0.38 |
| 100 | 4 | 295 | 691 | 0.43 |
| 1000 | 4 | 8877 | 7212 | 1.23 |
| 10000 | 4 | 260530 | 76495 | 3.41 |
| 10 | 8 | 39.9 | 72.9 | 0.55 |
| 100 | 8 | 801 | 700 | 1.14 |
| 1000 | 8 | 34995 | 7142 | 4.90 |
| 10000 | 8 | 1381132 | 76563 | 18.04 |
| 10 | 16 | 132 | 77.5 | 1.70 |
| 100 | 16 | 4204 | 775 | 5.42 |
| 1000 | 16 | 180587 | 7568 | 23.86 |
| 10000 | 16 | 4628613 | 83042 | 55.73 |
| 10 | 32 | 505 | 127 | 3.98 |
| 100 | 32 | 23937 | 1282 | 18.67 |
| 1000 | 32 | 975744 | 16037 | 60.84 |
| 10000 | 32 | 14549676 | 236820 | 61.44 |
| 10 | 64 | 1287 | 175 | 7.35 |
| 100 | 64 | 71941 | 1652 | 43.54 |
| 1000 | 64 | 3345193 | 19955 | 167.62 |
| 10000 | 64 | 38746744 | 293240 | 132.13 |
| 10 | 128 | 4580 | 238 | 19.24 |
| 100 | 128 | 311047 | 2492 | 124.82 |
| 1000 | 128 | 9527341 | 29152 | 326.76 |
| 10000 | 128 | 101177372 | 384726 | 262.99 |

Interpretation: A hit represents the worst-case performance of the trie, iterating over all characters in the key. The longer the key, the worse the performance in comparison to the hash. I think in the case of request headers, the map will have something like 50 keys, with lengths mostly somewhere between 10 and 30 characters, so if all lookups were hits, I'd guess we'd see about a 5x performance improvement from switching from the trie to the hash.

Miss Performance

| Entries | Key Length | Trie Lookup Time (ns) | Hash Lookup Time (ns) | Ratio (hash / trie) |
| ------: | ---------: | --------------------: | --------------------: | ------------------: |
| 10 | 4 | 10.9 | 44.1 | 4.05 |
| 100 | 4 | 162 | 503 | 3.10 |
| 1000 | 4 | 3075 | 5484 | 1.78 |
| 10000 | 4 | 92834 | 64733 | 0.70 |
| 10 | 8 | 11.9 | 44.4 | 3.73 |
| 100 | 8 | 177 | 537 | 3.03 |
| 1000 | 8 | 3138 | 5428 | 1.73 |
| 10000 | 8 | 93181 | 59400 | 0.64 |
| 10 | 16 | 11.4 | 53.1 | 4.66 |
| 100 | 16 | 152 | 590 | 3.88 |
| 1000 | 16 | 3004 | 5970 | 1.99 |
| 10000 | 16 | 92202 | 64936 | 0.70 |
| 10 | 32 | 11.9 | 79.7 | 6.70 |
| 100 | 32 | 154 | 912 | 5.92 |
| 1000 | 32 | 2591 | 9458 | 3.65 |
| 10000 | 32 | 102143 | 98768 | 0.97 |
| 10 | 64 | 12.0 | 98.5 | 8.21 |
| 100 | 64 | 156 | 1120 | 7.18 |
| 1000 | 64 | 2911 | 11745 | 4.03 |
| 10000 | 64 | 116389 | 135023 | 1.16 |
| 10 | 128 | 11.2 | 128 | 11.43 |
| 100 | 128 | 160 | 1450 | 9.06 |
| 1000 | 128 | 3067 | 14793 | 4.82 |
| 10000 | 128 | 121476 | 165870 | 1.37 |

Interpretation: A miss within the first couple characters of the key represents the best-case performance of the trie, where it gets to bail out before looking at the whole key. So you'll notice that the trie lookup time in this table is nearly constant with respect to the key length. Using the same assumptions as above, if all lookups were misses, I'd guess we'd see about a 3x performance degradation by changing from trie to hash.

Summary

Okay, this isn't the 10-100x improvement I thought I was seeing last night. I had neglected to test the performance of misses at first, which is a major omission. And I realized it's not really useful to compare the performance of a lookup table with 1k-10k entries, since I only really care about our usage in the static header map. I'll do a bit more benchmarking using the actual header names instead of random strings, that should help narrow down what the performance impact might be. But if we have more hits than misses — which I think is expected assuming users aren't making heavy use of nonstandard headers — then it could still be an easy win.

I guess my question for Envoy folks is, assuming we wanted to try this out, how would we go about it? The two things that come to mind for me are:

  • We should allow users to choose an implementation (maybe in the HCM config proto?)
  • We should maybe expose some metrics for header lookup hit rate and latency

That would allow users to make decisions based on actual traffic.

@jmarantz
Contributor

jmarantz commented Oct 5, 2021

I feel like 100 entries and 16 chars is the most relevant example, where it's 5x better to use hash for hits, and 3x better to use tries for misses. This is looking closer to a wash if we expect half the headers to not be builtin http ones.

It'd be better to look at sets of http headers specifically.

@mattklein123
Member

Whatever we do I would really rather not add an extra knob for this. I would maybe make it runtime configurable for testing or just change it.

@jmarantz
Contributor

jmarantz commented Oct 5, 2021

Matt -- do you have a sense of what percentage of headers seen by Envoys (including response & request) are in the O(1) set?

I could imagine that traffic to/from browsers largely uses known headers, but internal API traffic may use all sorts of non-standard headers.

I was poking around httparchive.org, which I thought in the past had given me queryable access to a ton of internet web data including response headers, but I couldn't figure out how to do it anymore -- now it's just predefined queries.

@mattklein123
Member

> Matt -- do you have a sense of what percentage of headers seen by Envoys (including response & request) are in the O(1) set?

Sorry no clue, as you say I'm sure it very much depends on the workload.

@mattklein123
Member

(Honestly, I would say, test deploy the switch to flat_hash_map at Google and if it looks better across various workloads just make the change upstream.)

@mergeconflict
Member Author

I redid my benchmarks using all the headers listed in Envoy::Http::HeaderValues, hardcoding x-envoy as the prefix where applicable. That's 75 entries, which isn't exactly right, since both request and response headers are mixed in there, but it's probably close enough. I also included gperf per @jmarantz's suggestion. Results:

--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
testTrieHit         1683 ns         1683 ns       413600
testHashHit          769 ns          769 ns       909578
testGperfHit         400 ns          400 ns      1747820
testTrieMiss        81.9 ns         81.9 ns      8545283
testHashMiss         483 ns          483 ns      1449202
testGperfMiss        128 ns          128 ns      5443172

Using absl::flat_hash_map is not a clear win: it looks like only about a 2x improvement for hits (saving about 10 ns per header, on my desktop), but 6x worse for misses. Using gperf is better, as Josh predicted: more like a 4x improvement (about 17 ns per header) for hits, but still about 2x worse for misses.

So I'm going to drop work on this, and close the issue. If somebody wants to pick it up, feel free to reopen.
