perf: reduce the memory usage of LC Trie construction #4117

Merged (6 commits, Aug 14, 2018)

Conversation

brian-pane
Contributor

Description:
Rather than preallocating a very large buffer to hold the LC Trie nodes, as the reference implementation from the original Nilsson & Karlsson paper did, trie construction now starts with a size based on the number of prefixes and lets `std::vector` handle any reallocation needed along the way.
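A minimal sketch of the sizing strategy described above (the `LcNode`/`Prefix` types, the free function, and the parameter names are placeholders rather than Envoy's actual `LcTrie` internals):

```
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder types standing in for the real LC Trie internals.
struct LcNode { uint32_t data = 0; };
struct Prefix {};

// Sketch: reserve based on the number of prefixes (scaled by the fill factor)
// instead of preallocating a huge fixed-size buffer, grow with resize() only
// when a node slot is actually written, and trim the slack at the end.
std::vector<LcNode> buildTrieSketch(const std::vector<Prefix>& ip_prefixes, double fill_factor) {
  std::vector<LcNode> trie;
  trie.reserve(static_cast<size_t>(ip_prefixes.size() / fill_factor));

  size_t next_free_index = 1;
  // ... the recursive build would run here; before writing node `position`
  // it makes sure the vector is large enough:
  size_t position = 0;
  if (position >= trie.size()) {
    trie.resize(position + 1); // std::vector handles any reallocation.
  }

  trie.resize(next_free_index); // Drop the unused tail...
  trie.shrink_to_fit();         // ...and release the extra capacity.
  return trie;
}
```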

In addition to using less memory during trie construction, this new approach
ended up being faster:

```
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
Before:
BM_LcTrieConstruct                  26631971 ns   26522462 ns         26

After:
BM_LcTrieConstruct                   3654633 ns    3646942 ns        189
```

Risk Level: medium
Testing: unit tests plus new speed benchmark test
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Brian Pane <bpane@pinterest.com>

@brian-pane
Contributor Author

@PiotrSikora @rshriram @ccaraman this PR is a follow-up to #3980 that aims to further reduce the temporary memory footprint during trie construction.

Brian Pane added 2 commits August 13, 2018 15:32
Signed-off-by: Brian Pane <bpane@pinterest.com>
Signed-off-by: Brian Pane <bpane@pinterest.com>
// Build the trie_.
trie_.reserve(ip_prefixes_.size());
Contributor

From my experiments, it seems that the majority of the CPU savings here comes from using reserve() rather than from reducing the number of items.

Since you've also added protection against overflow, is there any reason not to use 2 * ip_prefixes_.size() - 1 to avoid reallocations? That should still be small enough.

Contributor Author

The number of elements needed in the vector can be bigger than 2 * ip_prefixes_.size(), depending on the fill_factor. I think size_t(ip_prefixes_.size() / fill_factor) will work.
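A quick numeric check of that point, using hypothetical values (1,000 prefixes and a fill factor of 0.25, i.e. below the default of 0.5):

```
#include <cstddef>
#include <iostream>

int main() {
  const size_t num_prefixes = 1000; // hypothetical prefix count
  const double fill_factor = 0.25;  // smaller than the default of 0.5

  const size_t fixed_guess = 2 * num_prefixes - 1;                             // 1999
  const size_t scaled_guess = static_cast<size_t>(num_prefixes / fill_factor); // 4000

  // With a fill factor below 0.5 the scaled estimate exceeds 2n - 1, which is
  // why the reservation is derived from ip_prefixes_.size() / fill_factor_.
  std::cout << fixed_guess << " vs " << scaled_guess << "\n";
  return 0;
}
```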

}

trie_.resize(position + 1);
Contributor

reserve() gives an additional 5% performance boost.

Contributor Author

Thanks! I'll switch to reserve.

Contributor

But you didn't...

Contributor Author

Because I realized it would cause a buffer write overflow...

Contributor

How come? If trie_.reserve(position + 1) doesn't increase capacity, then how come trie_.reserve(size_t(ip_prefixes_.size() / fill_factor_)) does?

Contributor Author

They both have the potential to increase capacity. But what's needed in order for trie_[position] to be valid isn't an increase in trie_'s capacity, but rather an increase in its size. The reserve method doesn't increase the size, whereas resize does.
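A small self-contained illustration of the size-versus-capacity distinction being made here:

```
#include <cassert>
#include <vector>

int main() {
  std::vector<int> v;

  v.reserve(8);          // Capacity grows, but size() stays 0,
  assert(v.size() == 0); // so indexing v[5] would still be out of bounds.

  v.resize(8);           // Size (and capacity, if needed) grows,
  v[5] = 42;             // making this element a valid write target.
  assert(v.size() == 8);
  return 0;
}
```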

Contributor

Right, it's >= size() and not >= capacity() check. Thanks!

Member

If you are really trying to go for micro/max perf, you could probably switch to a C-style array here, but it's probably not worth it right now.

@@ -15,6 +17,16 @@ std::unique_ptr<Envoy::Network::LcTrie::LcTrie<std::string>> lc_trie_nested_pref

namespace Envoy {

static void BM_LcTrieConstruct(benchmark::State& state) {
Contributor

@PiotrSikora Aug 13, 2018

Can you add benchmarks for construct with:

- nested prefixes (1420μs before vs 980μs after, i.e. ~30% faster),
- 0.0.0.0/0 (390μs before vs 500ns after, i.e. ~800x faster).

Before:
```
-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_LcTrieConstruct                  26580552 ns   26462148 ns         27
BM_LcTrieConstructNested            27428018 ns   27263654 ns         26
BM_LcTrieConstructMinimal           21800259 ns   21758937 ns         32
```

After:
```
-------------------------------------------------------------------------
BM_LcTrieConstruct                   3616441 ns    3613198 ns        192 (7x speedup)
BM_LcTrieConstructNested             5053845 ns    5046373 ns        134 (5x)
BM_LcTrieConstructMinimal               4400 ns       4393 ns     159035 (4,900x)
```
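A hypothetical sketch of the requested 0.0.0.0/0 construction benchmark, modeled on the snippets quoted elsewhere in this thread; the exact `CidrRange::create` overload, the tag name, and the surrounding includes are assumptions:

```
// Assumes the headers already pulled in by the existing benchmark file.
static void BM_LcTrieConstructMinimal(benchmark::State& state) {
  // A single catch-all prefix, so the trie is as small as it can get.
  std::vector<std::pair<std::string, std::vector<Envoy::Network::Address::CidrRange>>> tags;
  tags.emplace_back("tag_1", std::vector<Envoy::Network::Address::CidrRange>{
                                 Envoy::Network::Address::CidrRange::create("0.0.0.0/0")});

  for (auto _ : state) {
    Envoy::Network::LcTrie::LcTrie<std::string> trie(tags);
    benchmark::DoNotOptimize(trie);
  }
}
BENCHMARK(BM_LcTrieConstructMinimal);
```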

Signed-off-by: Brian Pane <bpane@pinterest.com>

uint32_t next_free_index = 1;
buildRecursive(0u, 0u, ip_prefixes_.size(), 0u, next_free_index);

// The value of next_free_index is the final size of the trie_.
trie_.resize(next_free_index);
trie_.shrink_to_fit();
Contributor

I don't think I agree with removing those 2 lines... We still need to shrink it to release the memory, don't we?

Contributor Author

In the default case where the fill factor is 0.5, it arguably doesn't make sense to release the memory here: the capacity of the buffer at this point can be up to 2x the used size, but that's true of most vectors anyway if the std::vector implementation uses a typical power-of-two allocator internally. For smaller values of the fill factor, though, reclaiming space here might be a win. I'll try re-enabling the shrink-to-fit operation here to find out whether it negatively impacts performance in the common case (by requiring a reallocation + memcpy).
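For reference, a standalone example of the trade-off being weighed here: `shrink_to_fit()` releases the slack, but typically does so by reallocating and copying the elements:

```
#include <iostream>
#include <vector>

int main() {
  std::vector<int> v;
  v.reserve(1000); // Generous up-front reservation.
  v.resize(600);   // Only 600 slots ended up being used.

  std::cout << "before shrink: size=" << v.size() << " capacity=" << v.capacity() << "\n";
  v.shrink_to_fit(); // Non-binding request; most implementations reallocate
                     // to ~size() and copy/move the elements over.
  std::cout << "after shrink:  size=" << v.size() << " capacity=" << v.capacity() << "\n";
  return 0;
}
```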

// Build the trie_.
trie_.reserve(size_t(ip_prefixes_.size() / fill_factor_));
Contributor

The size_t constructor looks bizarre, but I'm not a C++ expert. @ccaraman / @mattklein123 could you take a look?

Member

It's fine; you could also use a static_cast.
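A minimal, self-contained spelling of that suggestion (the function and parameter names below are placeholders):

```
#include <cstddef>
#include <vector>

struct LcNode {};

// Same reservation, written with static_cast instead of a functional-style cast.
void reserveNodes(std::vector<LcNode>& trie, size_t num_prefixes, double fill_factor) {
  trie.reserve(static_cast<size_t>(num_prefixes / fill_factor));
}
```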


static void BM_LcTrieConstructMinimal(benchmark::State& state) {
std::vector<std::pair<std::string, std::vector<Envoy::Network::Address::CidrRange>>> tags;
tags.emplace_back(std::pair<std::string, std::vector<Envoy::Network::Address::CidrRange>>(
Contributor

Could you move it to a global variable to match the others and to avoid recreating it for every run?

Also, could you add BM_LcTrieLookupMinimal?
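A hypothetical sketch of both suggestions, following the file-scope pattern the benchmark file already uses; the lookup and address-parsing helpers assumed here (`getData`, `parseInternetAddress`) may not match the real API exactly:

```
// Constructed once before the benchmarks run (e.g. alongside the other
// file-scope tries), so the setup cost is not repeated on every iteration.
static std::unique_ptr<Envoy::Network::LcTrie::LcTrie<std::string>> lc_trie_minimal;

static void BM_LcTrieLookupMinimal(benchmark::State& state) {
  // Assumed lookup shape: resolve one address against the single 0.0.0.0/0 tag.
  auto address = Envoy::Network::Utility::parseInternetAddress("10.0.0.1");
  for (auto _ : state) {
    benchmark::DoNotOptimize(lc_trie_minimal->getData(address));
  }
}
BENCHMARK(BM_LcTrieLookupMinimal);
```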

…pace after construction

Latest results
```
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
BM_LcTrieConstruct                   3623593 ns    3621521 ns        188
BM_LcTrieConstructNested             5187460 ns    5182913 ns        103
BM_LcTrieConstructMinimal               5328 ns       5191 ns     153615
```

Based on these benchmark results, reclaiming space after construction _does_ hurt construction speed, although the trie construction is still at least 4x faster and uses many megabytes less memory than it did prior to this PR.

Signed-off-by: Brian Pane <bpane@pinterest.com>
PiotrSikora previously approved these changes Aug 14, 2018
mattklein123 previously approved these changes Aug 14, 2018

Signed-off-by: Brian Pane <bpane@pinterest.com>
brian-pane dismissed stale reviews from mattklein123 and PiotrSikora via 9dfc9fd August 14, 2018 04:23
mattklein123 merged commit 3527f77 into envoyproxy:master Aug 14, 2018
snowp added a commit to snowp/envoy that referenced this pull request Aug 14, 2018
* origin/master: (38 commits)
  test: add tests for corner-cases around sending requests before run() starts or after run() ends. (envoyproxy#4114)
  perf: reduce the memory usage of LC Trie construction (envoyproxy#4117)
  test: moving redundant code in websocket_integration_test to utilities (envoyproxy#4127)
  test: make YamlLoadFromStringFail less picky about error msg. (envoyproxy#4141)
  rbac: add rbac network filter. (envoyproxy#4083)
  fuzz: route lookup and header finalization fuzzer. (envoyproxy#4116)
  Set content-type and content-length (envoyproxy#4113)
  fault: use FractionalPercent for percent (envoyproxy#3978)
  test: Fix inverted exact match logic in IntegrationTcpClient::waitForData() (envoyproxy#4134)
  Added cluster_name to load assignment config for static cluster (envoyproxy#4123)
  ssl: refactor ContextConfig to use TlsCertificateConfig (envoyproxy#4115)
  syscall: refactor OsSysCalls for deeper errno latching (envoyproxy#4111)
  thrift_proxy: fix oneway bugs (envoyproxy#4025)
  Do not crash when converting YAML to JSON fails (envoyproxy#4110)
  config: allow unknown fields flag (take 2) (envoyproxy#4096)
  Use a jittered backoff strategy for handling HdsDelegate stream/connection failures (envoyproxy#4108)
  bazel: use GCS remote cache (envoyproxy#4050)
  Add thread local cache of overload action states (envoyproxy#4090)
  Added TCP healthcheck capabilities to the HdsDelegate (envoyproxy#4079)
  secret: add secret provider interface and use it for TlsCertificates (envoyproxy#4086)
  ...

Signed-off-by: Snow Pettersen <snowp@squareup.com>
brian-pane deleted the lc_trie_memory branch May 25, 2019 21:19