perf: reduce the memory usage of LC Trie construction #4117
Conversation
Rather than preallocating a very large buffer to hold the LC Trie nodes the way the reference implementation from the original Nilsson & Karlsson paper did, the trie construction now starts out with a size based on the number of prefixes and lets the `std::vector` manage any needed reallocation during trie construction. In addition to using less memory during trie construction, this new approach ended up being faster:

```
Benchmark                       Time           CPU Iterations
-------------------------------------------------------------------------
Before: BM_LcTrieConstruct  26631971 ns   26522462 ns         26
After:  BM_LcTrieConstruct   3654633 ns    3646942 ns        189
```

Signed-off-by: Brian Pane <bpane@pinterest.com>
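For illustration, here is a minimal sketch of the allocation-strategy change described above. The node type, the preallocation size, and the helper names are placeholders, not the actual `lc_trie.h` code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder node type standing in for the real LC Trie node.
struct LcNode {
  uint32_t value = 0;
};

// Old approach (mirroring the reference implementation): preallocate a large
// worst-case buffer regardless of how many prefixes were supplied.
std::vector<LcNode> buildWithPreallocation(std::size_t /*num_prefixes*/) {
  std::vector<LcNode> trie(1u << 20); // illustrative worst-case size
  // ... fill in nodes by index during construction ...
  return trie;
}

// New approach: start from the prefix count and let std::vector grow on
// demand only when construction actually needs more nodes.
std::vector<LcNode> buildIncrementally(std::size_t num_prefixes) {
  std::vector<LcNode> trie;
  trie.reserve(num_prefixes); // initial hint based on the input size
  // ... resize as node positions are assigned during construction ...
  return trie;
}
```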
@PiotrSikora @rshriram @ccaraman this PR is a follow-up to #3980 that aims to further reduce the temporary memory footprint during trie construction.
source/common/network/lc_trie.h
Outdated
// Build the trie_.
trie_.reserve(ip_prefixes_.size());
From my experiments, it seems that the majority of the CPU savings here come from using `reserve()` and not from reducing the number of items. Since you've also added protection against overflow, is there any reason not to use `2 * ip_prefixes_.size() - 1` to avoid reallocations? That should still be small enough.
The number of elements needed in the vector can be bigger than `2 * ip_prefixes_.size()`, depending on the `fill_factor`. I think `size_t(ip_prefixes_.size() / fill_factor)` will work.
}

trie_.resize(position + 1);
`reserve()` gives an additional 5% performance boost.
Thanks! I'll switch to `reserve`.
But you didn't...
Because I realized it would cause a buffer write overflow...
How come? If `trie_.reserve(position + 1)` doesn't increase capacity, then how come `trie_.reserve(size_t(ip_prefixes_.size() / fill_factor_))` does?
They both have the potential to increase capacity. But what's needed in order for `trie_[position]` to be valid isn't an increase in `trie_`'s capacity, but rather an increase in its size. The `reserve` method doesn't increase the size, whereas `resize` does.
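A minimal standalone snippet (not Envoy code) illustrating the distinction being made here:

```cpp
#include <vector>

int main() {
  std::vector<int> v;

  v.reserve(8); // capacity() >= 8, but size() is still 0
  // v[5] = 1;  // would be undefined behavior: index 5 is beyond size()

  v.resize(8);  // size() == 8, new elements are value-initialized
  v[5] = 1;     // valid: index 5 is now within size()

  return 0;
}
```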
Right, it's a `>= size()` check and not a `>= capacity()` check. Thanks!
If you are really trying to go for micro/max perf you could probably switch to a C style array here, but probably not worth it right now.
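For reference, a hedged sketch of the fixed-buffer alternative being suggested, assuming a hypothetical `LcNode` type and a known upper bound on the node count:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

struct LcNode {
  uint32_t value = 0;
};

// A heap-allocated fixed array avoids std::vector's size/capacity bookkeeping,
// but gives up automatic growth and bounds management.
std::unique_ptr<LcNode[]> allocateTrieNodes(std::size_t max_nodes) {
  return std::make_unique<LcNode[]>(max_nodes); // value-initialized elements
}
```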
@@ -15,6 +17,16 @@ std::unique_ptr<Envoy::Network::LcTrie::LcTrie<std::string>> lc_trie_nested_pref

namespace Envoy {

static void BM_LcTrieConstruct(benchmark::State& state) {
Can you add benchmarks for construct with:
- nested prefixes (1420μs before vs 980μs after, i.e. ~30% faster),
- 0.0.0.0/0 (390μs before vs 500ns after, i.e. ~800x faster)?

A sketch of the 0.0.0.0/0 case follows below.
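A rough sketch of what the 0.0.0.0/0 (wildcard) construction benchmark could look like, modeled on the existing `BM_LcTrieConstruct`; the include path, the `CidrRange::create` helper, and the `LcTrie` constructor signature are assumed to match the surrounding benchmark file:

```cpp
#include <string>
#include <utility>
#include <vector>

#include "benchmark/benchmark.h"
#include "common/network/lc_trie.h" // assumed include path, as in the existing file

static void BM_LcTrieConstructMinimal(benchmark::State& state) {
  // A single tag covering the whole IPv4 space.
  std::vector<std::pair<std::string, std::vector<Envoy::Network::Address::CidrRange>>> tags;
  tags.emplace_back("tag_1", std::vector<Envoy::Network::Address::CidrRange>{
                                 Envoy::Network::Address::CidrRange::create("0.0.0.0/0")});

  for (auto _ : state) {
    Envoy::Network::LcTrie::LcTrie<std::string> trie(tags);
    benchmark::DoNotOptimize(trie);
  }
}
BENCHMARK(BM_LcTrieConstructMinimal);
```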
Before:
```
Benchmark                        Time           CPU Iterations
-------------------------------------------------------------------------
BM_LcTrieConstruct        26580552 ns   26462148 ns         27
BM_LcTrieConstructNested  27428018 ns   27263654 ns         26
BM_LcTrieConstructMinimal 21800259 ns   21758937 ns         32
```

After:
```
-------------------------------------------------------------------------
BM_LcTrieConstruct         3616441 ns    3613198 ns        192 (7x speedup)
BM_LcTrieConstructNested   5053845 ns    5046373 ns        134 (5x)
BM_LcTrieConstructMinimal     4400 ns       4393 ns     159035 (4,900x)
```

Signed-off-by: Brian Pane <bpane@pinterest.com>
}

trie_.resize(position + 1);
But you didn't...
source/common/network/lc_trie.h
Outdated
uint32_t next_free_index = 1;
buildRecursive(0u, 0u, ip_prefixes_.size(), 0u, next_free_index);

// The value of next_free_index is the final size of the trie_.
trie_.resize(next_free_index);
trie_.shrink_to_fit();
I don't think that I agree with removing those 2 lines... We still need to shrink it to release the memory, don't we?
In the default case where `fill_ratio` is 0.5, it arguably doesn't make sense to release the memory here: the `capacity` of the buffer at this point can potentially be up to 2x the used `size`, but that's true of most vectors if the `std::vector` implementation uses a typical power-of-two allocator internally. For smaller values of `fill_ratio`, though, reclaiming space here might be a win. I'll try reenabling the shrink-to-fit operation here to find out if it negatively impacts performance in the common case (by requiring a reallocation + memcpy).
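A small illustration (using a plain `std::vector<int>` rather than the actual trie node type) of the trim-then-shrink step being discussed:

```cpp
#include <cstddef>
#include <vector>

// Drop unused tail elements and then ask the vector to release its slack
// capacity. shrink_to_fit() is a non-binding request and may trigger a
// reallocation plus copy, which is the cost weighed in the comment above.
void trimAndRelease(std::vector<int>& trie, std::size_t used_count) {
  trie.resize(used_count);
  trie.shrink_to_fit();
}
```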
source/common/network/lc_trie.h
Outdated
// Build the trie_.
trie_.reserve(size_t(ip_prefixes_.size() / fill_factor_));
The `size_t` constructor looks bizarre, but I'm not a C++ expert. @ccaraman / @mattklein123 could you take a look?
It's fine, could also use a static cast.
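For completeness, the two equivalent spellings being discussed, with placeholder parameters standing in for the `ip_prefixes_` and `fill_factor_` members:

```cpp
#include <cstddef>
#include <vector>

void reserveTrie(std::vector<int>& trie, std::size_t num_prefixes, double fill_factor) {
  // Functional-style cast, as written in the diff above.
  trie.reserve(std::size_t(num_prefixes / fill_factor));
  // Equivalent explicit static_cast, as suggested in the review.
  trie.reserve(static_cast<std::size_t>(num_prefixes / fill_factor));
}
```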
static void BM_LcTrieConstructMinimal(benchmark::State& state) {
  std::vector<std::pair<std::string, std::vector<Envoy::Network::Address::CidrRange>>> tags;
  tags.emplace_back(std::pair<std::string, std::vector<Envoy::Network::Address::CidrRange>>(
Could you move it to a global variable to match the others and to avoid recreating this for every run? Also, could you add `BM_LcTrieLookupMinimal`?
…pace after construction

Latest results:
```
Benchmark                        Time           CPU Iterations
-------------------------------------------------------------------------
BM_LcTrieConstruct         3623593 ns    3621521 ns        188
BM_LcTrieConstructNested   5187460 ns    5182913 ns        103
BM_LcTrieConstructMinimal     5328 ns       5191 ns     153615
```

Based on these benchmark results, reclaiming space after construction _does_ hurt construction speed, although the trie construction is still at least 4x faster and uses many megabytes less memory than it did prior to this PR.

Signed-off-by: Brian Pane <bpane@pinterest.com>
Signed-off-by: Brian Pane <bpane@pinterest.com>
9dfc9fd
* origin/master: (38 commits)
  test: add tests for corner-cases around sending requests before run() starts or after run() ends. (envoyproxy#4114)
  perf: reduce the memory usage of LC Trie construction (envoyproxy#4117)
  test: moving redundant code in websocket_integration_test to utilities (envoyproxy#4127)
  test: make YamlLoadFromStringFail less picky about error msg. (envoyproxy#4141)
  rbac: add rbac network filter. (envoyproxy#4083)
  fuzz: route lookup and header finalization fuzzer. (envoyproxy#4116)
  Set content-type and content-length (envoyproxy#4113)
  fault: use FractionalPercent for percent (envoyproxy#3978)
  test: Fix inverted exact match logic in IntegrationTcpClient::waitForData() (envoyproxy#4134)
  Added cluster_name to load assignment config for static cluster (envoyproxy#4123)
  ssl: refactor ContextConfig to use TlsCertificateConfig (envoyproxy#4115)
  syscall: refactor OsSysCalls for deeper errno latching (envoyproxy#4111)
  thrift_proxy: fix oneway bugs (envoyproxy#4025)
  Do not crash when converting YAML to JSON fails (envoyproxy#4110)
  config: allow unknown fields flag (take 2) (envoyproxy#4096)
  Use a jittered backoff strategy for handling HdsDelegate stream/connection failures (envoyproxy#4108)
  bazel: use GCS remote cache (envoyproxy#4050)
  Add thread local cache of overload action states (envoyproxy#4090)
  Added TCP healthcheck capabilities to the HdsDelegate (envoyproxy#4079)
  secret: add secret provider interface and use it for TlsCertificates (envoyproxy#4086)
  ...

Signed-off-by: Snow Pettersen <snowp@squareup.com>
Description:
Rather than preallocating a very large buffer to hold the LC Trie nodes the way the reference implementation from the original Nilsson & Karlsson paper did, the trie construction now starts out with a size based on the number of prefixes and lets the `std::vector` manage any needed reallocation during trie construction.
In addition to using less memory during trie construction, this new approach ended up being faster (see the benchmark results above).
Risk Level: medium
Testing: unit tests plus new speed benchmark test
Docs Changes: N/A
Release Notes: N/A
Signed-off-by: Brian Pane bpane@pinterest.com