
Batch implementation with timer #6452

Merged

Conversation

FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor

Envoy supports Redis pipelined commands, but does not batch commands. By batching, I mean that when commands (pipelined or not) are sent through Envoy, Envoy should hold them rather than send them upstream to Redis immediately; once a certain number of commands has accumulated, it writes them all to the connection buffer for a particular upstream Redis in one go. See #6365 for more details and discussion.

This PR implements a timer-based solution to the problem. The algorithm is as follows (a sketch follows the list):

  • If the buffer is currently empty when adding to it, set the flush timer for the configured timeout.
  • If the buffer is non-empty, do nothing; the timer is already set.
  • If the buffer exceeds the configured maximum size, flush immediately, cancelling the timer if set.
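
A minimal sketch of that logic (paraphrased from this PR's approach; encoder_buffer_, encoder_, and bufferFlushTimeoutInMs() are illustrative names, not necessarily the exact implementation):

// Sketch only: buffer the encoded request, arm the flush timer on the first
// write into an empty buffer, and flush immediately once the size threshold
// is reached.
void ClientImpl::makeRequest(const RespValue& request) {
  const bool empty_buffer = encoder_buffer_.length() == 0;
  encoder_->encode(request, encoder_buffer_);
  if (encoder_buffer_.length() >= config_.maxBufferSizeBeforeFlush()) {
    // Size threshold reached: flush now (this also cancels a pending timer).
    flushBufferAndResetTimer();
  } else if (empty_buffer) {
    // First command into an empty buffer: arm the flush timer.
    flush_timer_->enableTimer(config_.bufferFlushTimeoutInMs());
  }
  // Otherwise the buffer already has data and the timer is already armed.
}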

Batching defaults to off in the config (zero batch size, zero timer), so the default behavior should match the old code path. Note that there is the extra overhead of a timer check in ClientImpl::flushBufferAndResetTimer() even with a zero batch size. This may add slight overhead, but I have not been able to characterize the performance degradation in any meaningful way*. If there is degradation, or if reviewers prefer, I can add a third if statement in client_impl.cc, i.e. if (config_.maxBufferSizeBeforeFlush() == 0), so that we write without a timer check; a sketch of that guard follows.
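
A sketch of that optional guard (hypothetical, not part of the PR as written; connection_ is an illustrative member name):

// Hypothetical fast path at the top of the write logic: with batching
// disabled, write straight through and never touch the flush timer.
if (config_.maxBufferSizeBeforeFlush() == 0) {
  connection_->write(encoder_buffer_, false);
  return;
}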

Besides updating unit tests, I've added 3 specific cases (see the expectation sketch after the list):

  • Batching is a no-op with default settings
  • Batching waits for the timer to fire when enabled (buffer size > request size)
  • The buffer flushes, and the flush timer is cancelled, when buffer size < sum(request sizes)
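
For the third case, the expectations look roughly like this (a sketch assuming the mocked flush_timer_ seen in the snippet quoted later in this thread): the buffer fills before the timer fires, so the flush path finds the timer still enabled and cancels it.

// Sketch: a size-triggered flush should cancel the still-armed timer.
EXPECT_CALL(*flush_timer_, enabled()).WillOnce(Return(true));
EXPECT_CALL(*flush_timer_, disableTimer());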

*I have done some performance evaluation on my local VM: batching is definitely faster with a single Redis behind Envoy. However, real-world tests against our redis-benchmark machines in AWS will have to wait until later this week.

@mattklein123
Member

@HenryYYang for first pass.

@mattklein123 mattklein123 self-assigned this Apr 1, 2019
Contributor

@HenryYYang HenryYYang left a comment


Looks good, please add an entry to version history and the redis proxy configuration doc.

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

@HenryYYang, yeah, I saw the version history file (docs/root/intro/version_history.rst) in your PR when I was reviewing it, so I'll add a note there.


// Pretend the flush timer fires
EXPECT_CALL(*flush_timer_, enabled()).WillOnce(Return(false));
; // timer is disabled, as it has already fired


@mattklein123 this was the python formatter. Should I flag this as an issue?

Member


What do you mean?

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

I suppose I'll rebase onto the tip of mainline master again; that fixed coverage, but broke asan :|

Member

@mattklein123 mattklein123 left a comment


I think you have some merge issues here. Can you take a look and make sure the diff is what you expect? Also, is this ready for a full review? I thought you mentioned that you wanted to do some testing first? Thank you!

/wait

@@ -46,6 +46,12 @@ message RedisProxy {
// * '{user1000}.following' and '{user1000}.followers' **will** be sent to the same upstream
// * '{user1000}.following' and '{user1001}.following' **might** be sent to the same upstream
bool enable_hashtagging = 2;

// Maximum size of buffer before flush is triggered
Member


Can you specify what happens if this is unset? Same below?

@@ -59,6 +59,9 @@ Version history
* redis: added :ref:`success and error stats <config_network_filters_redis_proxy_per_command_stats>` for commands.
* redis: migrate hash function for host selection to `MurmurHash2 <https://sites.google.com/site/murmurhash>`_ from std::hash. MurmurHash2 is compatible with std::hash in GNU libstdc++ 3.4.20 or above. This is typically the case when compiled on Linux and not macOS.
* redis: added :ref:`latency_in_micros <envoy_api_field_config.filter.network.redis_proxy.v2.RedisProxy.latency_in_micros>` to specify the redis commands stats time unit in microseconds.
* redis: added
Member


Please merge master and move these docs to the 1.11.0 section.

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

This is not ready for a full review; I am attempting to do some performance testing in AWS first.

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

OK, I've had a chance to test in AWS. The results are less promising than hoped. Perhaps my understanding of the rpc-perf script is imperfect, but I'm only seeing around an 8% throughput improvement (the rate metric below), though there is a big improvement in p99 latency.

TODO: also add notes on the behavior when these settings are unset, per Matt's earlier comment.

Test

Procedure

Using an appropriate config file for Envoy (depending on whether we are batching or not, i.e. whether we add the batch params):

$ bazel-bin/source/exe/envoy-static -c [CONFIG] --service-cluster redisbenchmark-staging-iad --service-node nico-vm -l critical
$ redis-cli -h 172.31.43.35 FLUSHALL (or )
$ ./rpc-perf -s 127.0.0.1:6380 --config basic.toml -p redis | python3 process_rpc_perf_output.py

Basic TOML configuration

request-timeout = 200
connect-timeout = 500
threads = [Threads]
connections = [Connections]
duration = 20
windows = 5
protocol = "redis"
tcp-nodelay = true
ipv4 = true
ipv6 = true
database = 0

[[workload]]
name = "get"
method = "get"
rate = [Request Rate]
  [[workload.parameter]]
  style = "random"
  size = 64
  regenerate = true

EC2 -> EC2

First, we compare batching performance at different request rates against another EC2 instance running Redis.

Test 1

threads 5
connections 5
request rate 100000
flush_buffer length 1024
flush_buffer timeout 3ms

Non-batch

rates: 15212.6 rps
success: 100 %
p50: 5625.6 ms
p99: 28655.2 ms

Batch

rates: 17425 rps
success: 100 %
p50: 5599.6 ms
p99: 10559 ms

Test 2

threads 5
connections 5
request rate 300000
flush_buffer length 1024
flush_buffer timeout 3ms

Non-batch

rates: 13620 rps
success: 100 %
p50: 6936.8 ms
p99: 21326.6 ms

Batch

rates: 14822 rps
success: 100 %
p50: 5355.6 ms
p99: 21689 ms

EC2 -> Elasticache

Test 1

threads 5
connections 5
request rate 100000
flush_buffer length 1024
flush_buffer timeout 3ms

Non-batch

rates: 15294.2 rps
success: 100 %
p50: 5479.8 ms
p99: 27688.6 ms

Batch

rates: 16511.2 rps
success: 100 %
p50: 6444.6 ms
p99: 8201.2 ms

Test 2

threads 20
connections 20
request rate 100000
flush_buffer length 1024
flush_buffer timeout 3ms

Non-batch

rates: 14373.8 rps
success: 100 %
p50: 23098.4 ms
p99: 78087.4 ms

Batch

rates: 14660.8 rps
success: 100 %
p50: 23511.2 ms
p99: 74195 ms

@mattklein123
Member

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg so do you want me to review this? Or are you still working on it? If so, PTAL at CI.

/wait

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

@mattklein123 sure, let's see if my changes fix the CI issues. Mitch's redirection work came in with the merge and broke the builds of some tests. However, if time is tight I'd prefer a review on #6446, which has been updated per your comments.

Originally I had planned on waiting for Henry and doing more testing, but I think the above results are pretty straightforward, and a little perf bump is not unwelcome.

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Member

@mattklein123 mattklein123 left a comment


Thanks, looks good modulo small comments. Can you add an integration test with buffering and flushing enabled?

/wait

source/extensions/filters/network/common/redis/client.h

// Pretend the flush timer fires
EXPECT_CALL(*flush_timer_, enabled()).WillOnce(Return(false));
; // timer is disabled, as it has already fired
Member


What do you mean?

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

re // timer is disabled, as it has already fired and what it means: I cannot add a reply to that thread for whatever reason. What I meant by that line is this: inside the timer callback, we check whether the timer is enabled and, if so, cancel it; this handles the case where we fill the buffer before the timer fires. Thus we expect the timer.enabled() call to return false inside this callback if the timer has already fired. I'll expand the comment to explain this; a sketch of the flush path follows.
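
A sketch of the two flush paths, with member names paraphrased rather than copied from the PR:

// Called both from the timer callback and from the size-triggered flush.
// If the timer fired, enabled() returns false here; if the buffer filled
// first, the timer is still armed and must be cancelled before writing.
void ClientImpl::flushBufferAndResetTimer() {
  if (flush_timer_->enabled()) {
    flush_timer_->disableTimer();
  }
  connection_->write(encoder_buffer_, false);
}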

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
…lure

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

@mattklein123 ready for review (your comments addressed, integration test added).

Note that asan/tsan have failed on unrelated tests (//test/integration:hds_integration_test on one build, //test/integration:http_timeout_integration_test on another), so maybe there is some integration test flakiness going on? Perhaps related to the Google outage?

Member

@mattklein123 mattklein123 left a comment


Nice, looks great, just a few nits. Yeah, there are some flakes in master right now. You can always use the /retest command to try to deflake.

/wait

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

@mattklein123 fixed things according to your comments; all checks now pass on latest master.

Member

@mattklein123 mattklein123 left a comment


Awesome work!

@mattklein123 mattklein123 merged commit dc3467a into envoyproxy:master Apr 18, 2019
mpuncel added a commit to mpuncel/envoy that referenced this pull request Apr 19, 2019
* master: (26 commits)
  docs: update docs to recommend /retest repokitteh command (envoyproxy#6655)
  http timeout integration test: wait for 15s for upstream reset (envoyproxy#6646)
  access log: add response code details to the access log formatter (envoyproxy#6626)
  build: add ppc build badge to README (envoyproxy#6629)
  Revert dispatcher stats (envoyproxy#6649)
  Batch implementation with timer (envoyproxy#6452)
  fault filter: reset token bucket on data start (envoyproxy#6627)
  event: update libevent dependency to fix race condition (envoyproxy#6637)
  examples: standardize docker-compose version and yaml extension (envoyproxy#6613)
  quiche: Implement SpdyUnsafeArena using SpdySimpleArena (envoyproxy#6612)
  router: support customizable retry back-off intervals (envoyproxy#6568)
  api: create OpenRCA service proto file (envoyproxy#6497)
  ext_authz: option for clearing route cache of authorized requests (envoyproxy#6503)
  build: update jinja to 2.10.1. (envoyproxy#6623)
  tools: check spelling in pre-push hook (envoyproxy#6631)
  security: blameless postmortem template. (envoyproxy#6553)
  Implementing Endpoint lease for ClusterLoadAssigment (envoyproxy#6477)
  add HTTP integration tests exercising timeouts (envoyproxy#6621)
  event: fix DispatcherImplTest::InitializeStats flake (envoyproxy#6619)
  Add tag extractor for RDS route config name (envoyproxy#6618)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>