Optimize TrieLookupTable #18405

Closed
mergeconflict opened this issue Oct 4, 2021 · 15 comments
@mergeconflict
Member

@mattklein123 put some work, many years ago, into optimizing HeaderMapImpl for header lookups that are constant-time in terms of the total number of headers. More recently, @ramaraochavali extracted the Trie code into common for use in Redis.

I was poking around a bit, looking at some performance data from our Envoy deployments at Google, and found that we spend about 1-2% of our CPU time on the worker threads in Envoy::TrieLookupTable::find, called from Envoy::Http::HeaderMapImpl::StaticLookupTable::lookup. If we can squeeze a bit more performance out of this, it's worthwhile for us, and I think I may have found a way to do so.

The current TrieLookupTable implementation is sparse and uncompressed: each TrieEntry contains an array with 256 entries, keyed by a single character. We can think of find() as being worst-case O(n) in terms of the number of characters in the string_view key: we traverse entries until we either know the trie doesn't contain our key, or we've found it. Of course, any lookup implementation will have O(n) worst case time complexity in the length of the key (for example, if we just used a hash map, we have to look at all the characters to compute a hash), but the repeated pointer traversal may add a significant constant factor. Pointer traversal is more expensive than accessing sequential bytes in CPU cache.
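For concreteness, here's a minimal sketch of the kind of sparse, uncompressed 256-way trie described above (illustrative only; names and details differ from Envoy's actual TrieLookupTable):

```cpp
#include <array>
#include <memory>
#include <string_view>

// One node per prefix: a value slot plus one child pointer per possible byte.
template <class Value> struct TrieEntry {
  Value value_{}; // set only on nodes that terminate a key
  std::array<std::unique_ptr<TrieEntry>, 256> entries_;
};

template <class Value> class TrieLookupTable {
public:
  void add(std::string_view key, Value value) {
    TrieEntry<Value>* node = &root_;
    for (unsigned char c : key) {
      if (!node->entries_[c]) {
        node->entries_[c] = std::make_unique<TrieEntry<Value>>();
      }
      node = node->entries_[c].get();
    }
    node->value_ = std::move(value);
  }

  // One pointer chase per input byte: cheap to bail out early on a miss, but
  // cache-unfriendly on long keys, which is the constant factor at issue here.
  Value find(std::string_view key) const {
    const TrieEntry<Value>* node = &root_;
    for (unsigned char c : key) {
      node = node->entries_[c].get();
      if (node == nullptr) {
        return Value{}; // miss: no entry with this prefix
      }
    }
    return node->value_;
  }

private:
  TrieEntry<Value> root_;
};
```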

So I did a bit of digging in the literature and found a couple data structures for this purpose that are more cache-friendly:

  • Cache-conscious string hash tables: a fast, unordered hash optimized for strings.
  • HAT-Trie: an ordered prefix trie that is compressed to achieve similar lookup performance to the cache-conscious hash table, with lower memory consumption.

There are some promising benchmarking results in these papers, but I haven't yet done any benchmarking in the context of Envoy to compare the current implementation against them (or, for that matter, against absl::flat_hash_map). Matt previously pointed out that we probably need to flesh out our existing benchmarks more. I know @adisuissa has done some work in this area before as well, so we'll chat and figure out the right thing to do.

@mergeconflict mergeconflict added bug triage Issue requires triage labels Oct 4, 2021
@jmarantz
Contributor

jmarantz commented Oct 4, 2021

Also consider perfect hash-tables for a set of keys known at compile-time: https://www.gnu.org/software/gperf/manual/gperf.html
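For illustration, a rough sketch of how gperf could be wired in, with all file names, class names, and commands here hypothetical (gperf's C++ output mode generates a class with a static in_word_set() lookup function):

```cpp
// known_headers.gperf (hypothetical input file):
//   %language=C++
//   %define class-name KnownHeaders
//   %%
//   content-type
//   content-length
//   cookie
//   date
//   %%
//
// Generated at build time with something like:
//   gperf known_headers.gperf > known_headers.inc
#include <cstddef>

#include "known_headers.inc" // hypothetical generated file defining KnownHeaders

// in_word_set() returns a pointer to the matched keyword, or nullptr if the
// key is not in the compile-time set.
bool isKnownHeader(const char* name, size_t len) {
  return KnownHeaders::in_word_set(name, len) != nullptr;
}
```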

@mattklein123 mattklein123 added area/http area/perf help wanted Needs help! and removed bug triage Issue requires triage labels Oct 4, 2021
@jmarantz
Contributor

jmarantz commented Oct 4, 2021

FWIW sample gperf usage in a C++ system built with bazel:

It'd be worth just doing the easiest thing first, though, and trying absl::flat_hash_map.
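A minimal sketch of what that swap could look like, with the add/find interface loosely modeled on the existing trie (an assumption for illustration, not Envoy's actual code):

```cpp
#include <string>
#include <string_view>
#include <utility>

#include "absl/container/flat_hash_map.h"

// Sketch of a hash-based replacement for TrieLookupTable. absl's default
// string hasher and equality are transparent, so find() can take a
// std::string_view without materializing a temporary std::string.
template <class Value> class HashLookupTable {
public:
  void add(std::string_view key, Value value) {
    map_.insert_or_assign(std::string(key), std::move(value));
  }

  // Returns the mapped value, or a default-constructed Value on a miss,
  // mirroring the trie's find() semantics.
  Value find(std::string_view key) const {
    auto it = map_.find(key); // heterogeneous lookup, no allocation
    return it == map_.end() ? Value{} : it->second;
  }

private:
  absl::flat_hash_map<std::string, Value> map_;
};
```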

@mattklein123
Member

> It'd be worth just doing the easiest thing first, though, and trying absl::flat_hash_map.

Agree this is worth trying. This trie code originates from 5+ years ago, well before we were using abseil, and in response to some silly horse race benchmarks with nginx. There are likely better ways of doing it than this simple trie.

@adisuissa
Contributor

+1 on first trying absl::flat_hash_map, which I think is a very simple change to make.
As for the kind of benchmarks, IIUC most of the header names have the same prefix, so we should aim for tests that capture this scenario. Not sure if there's a known distribution of header name lengths, but I guess we should test various lengths between 4 and 128.

@jmarantz
Contributor

jmarantz commented Oct 4, 2021

RE "same prefix": can you give examples?

I was thinking of very common things like "Date", "Cache-Control", "Content-Type", "Cookie".

There's also the common case where a bunch of custom headers are in use and they all fail the builtin-header lookup. Failing that lookup should be fast too.

@adisuissa
Contributor

I meant the ones that use the same prefix (e.g., "x-envoy-"). I guess some of the others can also be considered as sharing a prefix (e.g., those that start with "content-" or "grpc-").

@jmarantz
Contributor

jmarantz commented Oct 4, 2021

That dynamic-ish prefix makes it less elegant to use something like gperf, but my suspicion is that with absl::flat_hash_map we could get some benefit without having to think too much about a hierarchical prefix lookup first. It'd be best to get some benchmarks.

@mergeconflict
Member Author

Alright, I did some basic benchmarking of a few different implementations, including the "HAT-Trie" I linked above. I will try to present the benchmark code and the results in some reasonable format here tomorrow, but for now I'll summarize: absl::flat_hash_map seems like a good choice. Probably somewhere between 10x and 100x faster than the current implementation in my environment, depending on the number of entries and the length of the keys.

@mergeconflict
Member Author

mergeconflict commented Oct 5, 2021

Ok, I wrote two different benchmarks: one to measure the performance of hits in the lookup table, and the other to measure misses. For both benchmarks, I have two variables (a sketch of the benchmark loop follows the list):

  • The number of characters in each key (from 4 to 128 in powers of two)
  • The number of entries in the lookup table (from 10 to 10,000 in powers of ten)
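For reference, here's a rough sketch of the shape of the hit benchmark (assuming Google Benchmark; the actual harness, names, and key generation may differ):

```cpp
#include <random>
#include <string>
#include <vector>

#include "absl/container/flat_hash_map.h"
#include "benchmark/benchmark.h"

// Generate `count` random lowercase keys of the given length, with a fixed
// seed for reproducibility. Duplicates are possible but harmless here.
static std::vector<std::string> makeRandomKeys(size_t count, size_t length) {
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> dist('a', 'z');
  std::vector<std::string> keys;
  for (size_t i = 0; i < count; ++i) {
    std::string key(length, '\0');
    for (auto& c : key) c = static_cast<char>(dist(rng));
    keys.push_back(std::move(key));
  }
  return keys;
}

static void bmHashHit(benchmark::State& state) {
  const auto keys = makeRandomKeys(state.range(0), state.range(1));
  absl::flat_hash_map<std::string, int> map;
  for (const auto& key : keys) map[key] = 1;
  for (auto _ : state) {
    for (const auto& key : keys) {
      benchmark::DoNotOptimize(map.find(key)); // every lookup is a hit
    }
  }
}
// Entries from 10 to 10,000 in powers of ten; key lengths from 4 to 128 in
// powers of two, matching the variables described above.
BENCHMARK(bmHashHit)->ArgsProduct({{10, 100, 1000, 10000}, {4, 8, 16, 32, 64, 128}});
BENCHMARK_MAIN();
```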

The strings I'm populating the lookup tables with are just randomly generated garbage; they aren't necessarily representative of the actual key names listed in source/common/http/headers.h (I should probably do another test using those). I ran these benchmarks over four different implementations including the existing trie, and it turns out the only alternative worth mentioning (at least in my exploration so far) is absl::flat_hash_map. Here are the comparative results:

Hit Performance

| Entries | Key Length | Trie Lookup Time (ns) | Hash Lookup Time (ns) | Ratio (trie / hash) |
| ------: | ---------: | --------------------: | --------------------: | ------------------: |
| 10 | 4 | 28.0 | 73.6 | 0.38 |
| 100 | 4 | 295 | 691 | 0.43 |
| 1000 | 4 | 8877 | 7212 | 1.23 |
| 10000 | 4 | 260530 | 76495 | 3.41 |
| 10 | 8 | 39.9 | 72.9 | 0.55 |
| 100 | 8 | 801 | 700 | 1.14 |
| 1000 | 8 | 34995 | 7142 | 4.90 |
| 10000 | 8 | 1381132 | 76563 | 18.04 |
| 10 | 16 | 132 | 77.5 | 1.70 |
| 100 | 16 | 4204 | 775 | 5.42 |
| 1000 | 16 | 180587 | 7568 | 23.86 |
| 10000 | 16 | 4628613 | 83042 | 55.73 |
| 10 | 32 | 505 | 127 | 3.98 |
| 100 | 32 | 23937 | 1282 | 18.67 |
| 1000 | 32 | 975744 | 16037 | 60.84 |
| 10000 | 32 | 14549676 | 236820 | 61.44 |
| 10 | 64 | 1287 | 175 | 7.35 |
| 100 | 64 | 71941 | 1652 | 43.54 |
| 1000 | 64 | 3345193 | 19955 | 167.62 |
| 10000 | 64 | 38746744 | 293240 | 132.13 |
| 10 | 128 | 4580 | 238 | 19.24 |
| 100 | 128 | 311047 | 2492 | 124.82 |
| 1000 | 128 | 9527341 | 29152 | 326.76 |
| 10000 | 128 | 101177372 | 384726 | 262.99 |

Interpretation: A hit represents the worst-case performance of the trie, iterating over all characters in the key. The longer the key, the worse the performance in comparison to the hash. I think in the case of request headers, the map will have something like 50 keys, with lengths mostly somewhere between 10 and 30 characters, so if all lookups were hits, I'd guess we'd see about a 5x performance improvement from switching from the trie to the hash.

Miss Performance

| Entries | Key Length | Trie Lookup Time (ns) | Hash Lookup Time (ns) | Ratio (hash / trie) |
| ------: | ---------: | --------------------: | --------------------: | ------------------: |
| 10 | 4 | 10.9 | 44.1 | 4.05 |
| 100 | 4 | 162 | 503 | 3.10 |
| 1000 | 4 | 3075 | 5484 | 1.78 |
| 10000 | 4 | 92834 | 64733 | 0.70 |
| 10 | 8 | 11.9 | 44.4 | 3.73 |
| 100 | 8 | 177 | 537 | 3.03 |
| 1000 | 8 | 3138 | 5428 | 1.73 |
| 10000 | 8 | 93181 | 59400 | 0.64 |
| 10 | 16 | 11.4 | 53.1 | 4.66 |
| 100 | 16 | 152 | 590 | 3.88 |
| 1000 | 16 | 3004 | 5970 | 1.99 |
| 10000 | 16 | 92202 | 64936 | 0.70 |
| 10 | 32 | 11.9 | 79.7 | 6.70 |
| 100 | 32 | 154 | 912 | 5.92 |
| 1000 | 32 | 2591 | 9458 | 3.65 |
| 10000 | 32 | 102143 | 98768 | 0.97 |
| 10 | 64 | 12.0 | 98.5 | 8.21 |
| 100 | 64 | 156 | 1120 | 7.18 |
| 1000 | 64 | 2911 | 11745 | 4.03 |
| 10000 | 64 | 116389 | 135023 | 1.16 |
| 10 | 128 | 11.2 | 128 | 11.43 |
| 100 | 128 | 160 | 1450 | 9.06 |
| 1000 | 128 | 3067 | 14793 | 4.82 |
| 10000 | 128 | 121476 | 165870 | 1.37 |

Interpretation: A miss within the first couple characters of the key represents the best-case performance of the trie, where it gets to bail out before looking at the whole key. So you'll notice that the trie lookup time in this table is nearly constant with respect to the key length. Using the same assumptions as above, if all lookups were misses, I'd guess we'd see about a 3x performance degradation by changing from trie to hash.

Summary

Okay, this isn't the 10-100x improvement I thought I was seeing last night. I had neglected to test the performance of misses at first, which is a major omission. And I realized it's not really useful to compare the performance of a lookup table with 1k-10k entries, since I only really care about our usage in the static header map. I'll do a bit more benchmarking using the actual header names instead of random strings, that should help narrow down what the performance impact might be. But if we have more hits than misses — which I think is expected assuming users aren't making heavy use of nonstandard headers — then it could still be an easy win.

I guess my question for Envoy folks is, assuming we wanted to try this out, how would we go about it? The two things that come to mind for me are:

  • We should allow users to choose an implementation (maybe in the HCM config proto?)
  • We should maybe expose some metrics for header lookup hit rate and latency

That would allow users to make decisions based on actual traffic.

@jmarantz
Contributor

jmarantz commented Oct 5, 2021

I feel like 100 entries and 16 chars is the most relevant example, where it's 5x better to use hash for hits, and 3x better to use tries for misses. This is looking closer to a wash if we expect half the headers to not be builtin http ones.

It'd be better to look at sets of http headers specifically.

@mattklein123
Member

Whatever we do I would really rather not add an extra knob for this. I would maybe make it runtime configurable for testing or just change it.

@jmarantz
Contributor

jmarantz commented Oct 5, 2021

Matt -- do you have a sense of what percentage of headers seen by Envoys (including response & request) are in the O(1) set?

I could imagine that traffic to/from browsers largely uses known headers, but internal API traffic may use all sorts of non-standard headers.

I was poking around httparchive.org, which I thought in the past had given me queryable access to a ton of internet web data including response headers, but I couldn't figure out how to do it anymore -- now it's just predefined queries.

@mattklein123
Member

> Matt -- do you have a sense of what percentage of headers seen by Envoys (including response & request) are in the O(1) set?

Sorry no clue, as you say I'm sure it very much depends on the workload.

@mattklein123
Member

(Honestly, I would say, test deploy the switch to flat_hash_map at Google and if it looks better across various workloads just make the change upstream.)

@mergeconflict
Member Author

I redid my benchmarks using all the headers listed in Envoy::Http::HeaderValues, hardcoding x-envoy as the prefix where applicable. That's 75 entries, which isn't exactly right, since both request and response headers are mixed in there, but it's probably close enough. I also included gperf per @jmarantz's suggestion. Results:

--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
testTrieHit         1683 ns         1683 ns       413600
testHashHit          769 ns          769 ns       909578
testGperfHit         400 ns          400 ns      1747820
testTrieMiss        81.9 ns         81.9 ns      8545283
testHashMiss         483 ns          483 ns      1449202
testGperfMiss        128 ns          128 ns      5443172

Using absl::flat_hash_map is not a clear win: it looks like only about a 2x improvement for hits (saving about 10 ns per header, on my desktop), but 6x worse for misses. Using gperf is better, as Josh predicted: more like a 4x improvement (about 17 ns per header) for hits, but still about 2x worse for misses.

So I'm going to drop work on this, and close the issue. If somebody wants to pick it up, feel free to reopen.
