
Batch implementation with timer #6452

Merged

Conversation

FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor

Envoy supports Redis pipelined commands, but does not batch commands. By batching, I mean that when commands (pipelined or not) are sent through Envoy, Envoy should hold them rather than send them upstream to Redis immediately; once a certain number of commands has accumulated, it writes them all to the connection buffer for a particular upstream Redis in one go. See #6365 for more details and discussion.

This PR implements a timer-based solution to the problem. The algorithm is as follows (a sketch follows the list):

  • If the buffer is currently empty when adding to it, set the flush timer for the configured timeout.
  • If the buffer is non-empty, do nothing; the timer is already set.
  • If the buffer exceeds the configured maximum size, flush immediately, cancelling the timer if set.
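
A minimal sketch of that logic (paraphrased from this PR's approach; encoder_buffer_, encoder_, and bufferFlushTimeoutInMs() are illustrative names, not necessarily the exact implementation):

// Sketch only: buffer the encoded request, arm the flush timer on the first
// write into an empty buffer, and flush immediately once the size threshold
// is reached.
void ClientImpl::makeRequest(const RespValue& request) {
  const bool empty_buffer = encoder_buffer_.length() == 0;
  encoder_->encode(request, encoder_buffer_);
  if (encoder_buffer_.length() >= config_.maxBufferSizeBeforeFlush()) {
    // Size threshold reached: flush now (this also cancels a pending timer).
    flushBufferAndResetTimer();
  } else if (empty_buffer) {
    // First command into an empty buffer: arm the flush timer.
    flush_timer_->enableTimer(config_.bufferFlushTimeoutInMs());
  }
  // Otherwise the buffer already has data and the timer is already armed.
}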

Batching defaults to off in the config (zero batch size, zero timer), so the default behavior should match the old code path. Note that there is the extra overhead of a timer check in ClientImpl::flushBufferAndResetTimer() even with a zero batch size. This may add slight overhead, but I have not been able to characterize the performance degradation in any meaningful way*. If there is degradation, or if reviewers prefer, I can add a third if statement in client_impl.cc, i.e. if (config_.maxBufferSizeBeforeFlush() == 0), so that we write without a timer check; a sketch of that guard follows.
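
A sketch of that optional guard (hypothetical, not part of the PR as written; connection_ is an illustrative member name):

// Hypothetical fast path at the top of the write logic: with batching
// disabled, write straight through and never touch the flush timer.
if (config_.maxBufferSizeBeforeFlush() == 0) {
  connection_->write(encoder_buffer_, false);
  return;
}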

Besides updating unit tests, I've added 3 specific cases (see the expectation sketch after the list):

  • Batching is a no-op with default settings
  • Batching waits for the timer to fire when enabled (buffer size > request size)
  • The buffer flushes, and the flush timer is cancelled, when buffer size < sum(request sizes)
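
For the third case, the expectations look roughly like this (a sketch assuming the mocked flush_timer_ seen in the snippet quoted later in this thread): the buffer fills before the timer fires, so the flush path finds the timer still enabled and cancels it.

// Sketch: a size-triggered flush should cancel the still-armed timer.
EXPECT_CALL(*flush_timer_, enabled()).WillOnce(Return(true));
EXPECT_CALL(*flush_timer_, disableTimer());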

*I have done some performance evaluation on my local VM: batching is definitely faster with a single Redis behind Envoy. However, real-world tests against our redis-benchmark machines in AWS will have to wait until later this week.

@mattklein123
Member

@HenryYYang for first pass.

@mattklein123 mattklein123 self-assigned this Apr 1, 2019
Contributor

@HenryYYang HenryYYang left a comment


Looks good, please add an entry to version history and the redis proxy configuration doc.

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

@HenryYYang, yeah, I saw the version history file (docs/root/intro/version_history.rst) in your PR when I was reviewing it, so I'll add a note there.


// Pretend the flush timer fires
EXPECT_CALL(*flush_timer_, enabled()).WillOnce(Return(false));
; // timer is disabled, as it has already fired


@mattklein123 this was the python formatter. Should I flag this as an issue?

Member


What do you mean?

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

I suppose I'll rebase onto the tip of mainline master again; that fixed coverage, but broke asan :|

Member

@mattklein123 mattklein123 left a comment


I think you have some merge issues here. Can you take a look and make sure the diff is what you expect? Also, is this ready for a full review? I thought you mentioned that you wanted to do some testing first? Thank you!

/wait

@@ -46,6 +46,12 @@ message RedisProxy {
// * '{user1000}.following' and '{user1000}.followers' **will** be sent to the same upstream
// * '{user1000}.following' and '{user1001}.following' **might** be sent to the same upstream
bool enable_hashtagging = 2;

// Maximum size of buffer before flush is triggered
Member


Can you specify what happens if this is unset? Same below?

@@ -59,6 +59,9 @@ Version history
* redis: added :ref:`success and error stats <config_network_filters_redis_proxy_per_command_stats>` for commands.
* redis: migrate hash function for host selection to `MurmurHash2 <https://sites.google.com/site/murmurhash>`_ from std::hash. MurmurHash2 is compatible with std::hash in GNU libstdc++ 3.4.20 or above. This is typically the case when compiled on Linux and not macOS.
* redis: added :ref:`latency_in_micros <envoy_api_field_config.filter.network.redis_proxy.v2.RedisProxy.latency_in_micros>` to specify the redis commands stats time unit in microseconds.
* redis: added
Member


Please merge master and move these docs to the 1.11.0 section.

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

This is not ready for a full review; I am attempting to do some performance testing in AWS first.

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

OK, I've had a chance to test in AWS. The results are less promising than hoped. Perhaps my understanding of the rpc-perf script is imperfect, but I'm only seeing around an 8% throughput improvement (the rate metric below), though there is a big improvement in p99 latency.

TODO: also add notes on the behavior when these settings are unset, per Matt's earlier comment.

Test

Procedure

Using an appropriate config file for Envoy (depending on whether we are batching or not, i.e. whether we add the batch params):

$ bazel-bin/source/exe/envoy-static -c [CONFIG] --service-cluster redisbenchmark-staging-iad --service-node nico-vm -l critical
$ redis-cli -h 172.31.43.35 FLUSHALL (or )
$ ./rpc-perf -s 127.0.0.1:6380 --config basic.toml -p redis | python3 process_rpc_perf_output.py

Basic TOML configuration

request-timeout = 200
connect-timeout = 500
threads = [Threads]
connections = [Connections]
duration = 20
windows = 5
protocol = "redis"
tcp-nodelay = true
ipv4 = true
ipv6 = true
database = 0

[[workload]]
name = "get"
method = "get"
rate = [Request Rate]
  [[workload.parameter]]
  style = "random"
  size = 64
  regenerate = true

EC2 -> EC2

First, we compare batching performance at different request rates against another EC2 instance running Redis.

Test 1

threads 5
connections 5
request rate 100000
flush_buffer length 1024
flush_buffer timeout 3ms

Non-batch

rates: 15212.6 rps
success: 100 %
p50: 5625.6 ms
p99: 28655.2 ms

Batch

rates: 17425 rps
success: 100 %
p50: 5599.6 ms
p99: 10559 ms

Test 2

threads 5
connections 5
request rate 300000
flush_buffer length 1024
flush_buffer timeout 3ms

Non-batch

rates: 13620 rps
success: 100 %
p50: 6936.8 ms
p99: 21326.6 ms

Batch

rates: 14822 rps
success: 100 %
p50: 5355.6 ms
p99: 21689 ms

EC2 -> Elasticache

Test 1

threads 5
connections 5
request rate 100000
flush_buffer length 1024
flush_buffer timeout 3ms

Non-batch

rates: 15294.2 rps
success: 100 %
p50: 5479.8 ms
p99: 27688.6 ms

Batch

rates: 16511.2 rps
success: 100 %
p50: 6444.6 ms
p99: 8201.2 ms

Test 2

threads 20
connections 20
request rate 100000
flush_buffer length 1024
flush_buffer timeout 3ms

Non-batch

rates: 14373.8 rps
success: 100 %
p50: 23098.4 ms
p99: 78087.4 ms

Batch

rates: 14660.8 rps
success: 100 %
p50: 23511.2 ms
p99: 74195 ms

@mattklein123
Member

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg so do you want me to review this? Or are you still working on it? If so, PTAL at CI.

/wait

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

@mattklein123 sure, let's see if my changes fix the CI issues. Mitch's redirection work came in with the merge and broke the builds of some tests. However, if time is tight I'd prefer a review on #6446, which has been updated per your comments.

Originally I had planned on waiting for Henry and doing more testing, but I think the above results are pretty straightforward, and a little perf bump is not unwelcome.

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Member

@mattklein123 mattklein123 left a comment


Thanks, looks good modulo small comments. Can you add an integration test with buffering and flushing enabled?

/wait

source/extensions/filters/network/common/redis/client.h

// Pretend the flush timer fires
EXPECT_CALL(*flush_timer_, enabled()).WillOnce(Return(false));
; // timer is disabled, as it has already fired
Member


What do you mean?

@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

re // timer is disabled, as it has already fired and what it means: I cannot add a reply to that thread for whatever reason. What I meant by that line is this: inside the timer callback, we check whether the timer is enabled and, if so, cancel it; this handles the case where we fill the buffer before the timer fires. Thus we expect the timer.enabled() call to return false inside this callback if the timer has already fired. I'll expand the comment to explain this; a sketch of the flush path follows.
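
A sketch of the two flush paths, with member names paraphrased rather than copied from the PR:

// Called both from the timer callback and from the size-triggered flush.
// If the timer fired, enabled() returns false here; if the buffer filled
// first, the timer is still armed and must be cancelled before writing.
void ClientImpl::flushBufferAndResetTimer() {
  if (flush_timer_->enabled()) {
    flush_timer_->disableTimer();
  }
  connection_->write(encoder_buffer_, false);
}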

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
…lure

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

@mattklein123 ready for review (your comments addressed, integration test added).

Note that asan/tsan have failed on unrelated tests (//test/integration:hds_integration_test on one build, //test/integration:http_timeout_integration_test on another), so maybe there is some integration test flakiness going on? Perhaps related to the Google outage?

Member

@mattklein123 mattklein123 left a comment


Nice, looks great, just a few nits. Yeah, there are some flakes in master right now. You can always use the /retest command to try to deflake.

/wait

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
Signed-off-by: Nicolas Flacco <nflacco@lyft.com>
@FAYiEKcbD0XFqF2QK2E4viAHg8rMm2VbjYKdjTg
Contributor Author

@mattklein123 fixed things according to your comments; all checks now pass on latest master.

Member

@mattklein123 mattklein123 left a comment


Awesome work!

@mattklein123 mattklein123 merged commit dc3467a into envoyproxy:master Apr 18, 2019
mpuncel added a commit to mpuncel/envoy that referenced this pull request Apr 19, 2019
* master: (26 commits)
  docs: update docs to recommend /retest repokitteh command (envoyproxy#6655)
  http timeout integration test: wait for 15s for upstream reset (envoyproxy#6646)
  access log: add response code details to the access log formatter (envoyproxy#6626)
  build: add ppc build badge to README (envoyproxy#6629)
  Revert dispatcher stats (envoyproxy#6649)
  Batch implementation with timer (envoyproxy#6452)
  fault filter: reset token bucket on data start (envoyproxy#6627)
  event: update libevent dependency to fix race condition (envoyproxy#6637)
  examples: standardize docker-compose version and yaml extension (envoyproxy#6613)
  quiche: Implement SpdyUnsafeArena using SpdySimpleArena (envoyproxy#6612)
  router: support customizable retry back-off intervals (envoyproxy#6568)
  api: create OpenRCA service proto file (envoyproxy#6497)
  ext_authz: option for clearing route cache of authorized requests (envoyproxy#6503)
  build: update jinja to 2.10.1. (envoyproxy#6623)
  tools: check spelling in pre-push hook (envoyproxy#6631)
  security: blameless postmortem template. (envoyproxy#6553)
  Implementing Endpoint lease for ClusterLoadAssigment (envoyproxy#6477)
  add HTTP integration tests exercising timeouts (envoyproxy#6621)
  event: fix DispatcherImplTest::InitializeStats flake (envoyproxy#6619)
  Add tag extractor for RDS route config name (envoyproxy#6618)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>