
Remove unnecessary locking in encoder #14877

Status: Closed · wants to merge 2 commits

Conversation

@tjungblu (Contributor)

Hey folks,

While working on vectorized writes in bbolt, we found a very excessive number of futex syscalls. Going through the hot path of the WAL, I found unnecessary locking on the encoder that is already fully covered by the wal.mu mutex.

Tests are passing fine for me, but I would appreciate a close look from two of you on this. In local testing, this improves throughput by up to 20%.
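
Roughly, the pattern looks like this (a minimal, hypothetical sketch with simplified stand-in types, not the actual wal package code):

package walsketch

import "sync"

// encoder is a simplified stand-in for the WAL record encoder.
type encoder struct {
	mu  sync.Mutex // removed by this change: every caller already holds WAL.mu
	buf []byte
}

func (e *encoder) encode(rec []byte) {
	// Previously the encoder locked its own mutex here:
	//   e.mu.Lock()
	//   defer e.mu.Unlock()
	// All call paths already hold WAL.mu, so this lock protected nothing extra
	// and only added lock/unlock (futex) traffic on the hot write path.
	e.buf = append(e.buf, rec...)
}

// WAL is a simplified stand-in for the write-ahead log.
type WAL struct {
	mu  sync.Mutex // serializes every public WAL operation
	enc *encoder
}

// Save sketches a public entry point: it takes WAL.mu before touching the
// encoder, which is what makes the encoder-internal mutex redundant.
func (w *WAL) Save(rec []byte) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.enc.encode(rec)
}

Removing the encoder-internal mutex drops one lock/unlock pair, and the associated futex traffic, per record on the hot write path.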


The encoder is already protected by the wal.mu on all paths. Removing this lock yields up to 20% more throughput when measured with etcdctl check perf.

Signed-off-by: Thomas Jungblut tjungblu@redhat.com

@ahrtr (Member) left a comment:

Great finding. Thank you @tjungblu

@ptabor (Contributor) commented Dec 4, 2022

@tjungblu, could you please post the repro of the tests that demonstrates the improvement?

For example, use this tool (https://github.com/etcd-io/etcd/tree/main/tools/rw-heatmaps), which should plot an improvement chart for different key-value pair sizes as well as read-to-write ratios.

@tjungblu (Contributor, Author) commented Dec 5, 2022

I've already been running the perf test for 3h or so. I'll supply the numbers and charts once it has finished with both the PR and the current 3.6 HEAD. Is there a way to parallelize this? Looking at the current CSV, it's running a pretty elaborate grid search.

As for the original "up to 20%" estimate, I was only running a simple etcdctl check perf --load=xl against both versions.

@tjungblu (Contributor, Author) commented Dec 5, 2022

OK, I had to cut the benchmark down a bit; it was just too elaborate:

diff --git a/tools/rw-heatmaps/rw-benchmark.sh b/tools/rw-heatmaps/rw-benchmark.sh
index abd63db1a..c08d7826d 100755
--- a/tools/rw-heatmaps/rw-benchmark.sh
+++ b/tools/rw-heatmaps/rw-benchmark.sh
@@ -2,11 +2,11 @@
 
 #set -x
 
-RATIO_LIST="1/128 1/8 1/4 1/2 2/1 4/1 8/1 128/1"
-VALUE_SIZE_POWER_RANGE="8 14"
-CONN_CLI_COUNT_POWER_RANGE="5 11"
-REPEAT_COUNT=5
-RUN_COUNT=200000
+RATIO_LIST="1/4 1/2 2/1 4/1 8/1"
+VALUE_SIZE_POWER_RANGE="8 10"
+CONN_CLI_COUNT_POWER_RANGE="5 8"
+REPEAT_COUNT=3
+RUN_COUNT=2000

It also looks like plot_data.py is a little outdated and won't work with the latest matplotlib and Python 3.11. Nevertheless, here are the plots:

[write throughput heatmap]

[read throughput heatmap]

Ping me on Slack if you need the raw results; as of a couple of days ago, our GDrive no longer allows sharing via link outside the organization. 🔐

@tjungblu (Contributor, Author) commented Dec 5, 2022

I just did a quick second run; the write results are somewhat different:

[write throughput heatmap]

I believe the large number of clients is simply starving etcd. I'll play around with the values a little and benchmark this again in the cloud tomorrow with a separate disk.

@ptabor (Contributor) commented Dec 6, 2022

Thank you @tjungblu. I'm a little confused.

The charts show requests/sec. Looking at your pictures, main peaks at 2400 QPS, while PR-14877 reaches only 2000. That would suggest the PR decreases throughput, but I have no idea how removing a lock could decrease performance. Are you sure the main and PR-14877 CSV files weren't mixed up?

If the results are the other way around, that would be a huge improvement.

Commits added to the PR:

  • "The encoder is already protected by the wal.mu on all paths. Removing this lock yields up to 20% more throughput when measured with etcdctl check perf." (Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>)
  • "All public functions of the WAL should be covered by the mutex." (Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>)
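
The second commit spells out the invariant the first one relies on. Roughly (a hypothetical sketch that continues the simplified WAL type from the sketch above, not the real wal.go):

// Sync sketches another public entry point: every exported WAL method takes
// w.mu itself before doing any work.
func (w *WAL) Sync() error {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.syncLocked()
}

// syncLocked is an internal helper: the caller must hold w.mu. Keeping that
// invariant on every public function is what lets the encoder and other
// internals stay lock-free.
func (w *WAL) syncLocked() error {
	// flush w.enc.buf to disk here (omitted in this sketch)
	return nil
}
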
@tjungblu reopened this Dec 6, 2022
@tjungblu (Contributor, Author) commented Dec 6, 2022

I'm testing this again on two VMs on GCP in parallel to ensure they don't run the same version somehow. Will come back with the results once it's done.

@ptabor (Contributor) commented Dec 6, 2022

I'm testing this again on two VMs on GCP in parallel to ensure they don't run the same version somehow. Will come back with the results once it's done.

Thank you. For predictable runs on GCP, I would recommend:

  • running on a C2, C2D, or T2D instance family.
  • optionally, running both tests serially on the same instance, so that the risk of landing on a busier or more distant host is less likely to skew the results.

@tjungblu (Contributor, Author) commented Dec 6, 2022

Sounds good. I'm running on a c2-standard-4 right now with 100 GB of SSD persistent disk. I believe four cores are not enough for the current set of clients, so I might reduce this to 2-16 clients for another set of runs; let's look at the data when it comes in.

optionally, running both tests serially on the same instance, so that the risk of landing on a busier or more distant host is less likely to skew the results.

I hope the first run finishes soon, so that I can then run the other binary on the respective other host.

@tjungblu (Contributor, Author) commented Dec 6, 2022

Below is the first run up to R/W ratio 8.0; I'm not getting much wiser from these graphs either. I'm running with the binaries/hosts reversed overnight to see if the results are comparable.

[write throughput heatmap]

[read throughput heatmap]

@tjungblu (Contributor, Author) commented Dec 7, 2022

Here are the results from the overnight run. I adjusted the test regimen a little with:

diff --git a/tools/rw-heatmaps/rw-benchmark.sh b/tools/rw-heatmaps/rw-benchmark.sh
index abd63db1a..be49cb518 100755
--- a/tools/rw-heatmaps/rw-benchmark.sh
+++ b/tools/rw-heatmaps/rw-benchmark.sh
@@ -3,10 +3,10 @@
 #set -x
 
 RATIO_LIST="1/128 1/8 1/4 1/2 2/1 4/1 8/1 128/1"
-VALUE_SIZE_POWER_RANGE="8 14"
-CONN_CLI_COUNT_POWER_RANGE="5 11"
+VALUE_SIZE_POWER_RANGE="8 12"
+CONN_CLI_COUNT_POWER_RANGE="1 4"
 REPEAT_COUNT=5
-RUN_COUNT=200000
+RUN_COUNT=20000
 
 KEY_SIZE=256
 KEY_SPACE_SIZE=$((1024 * 64))

basically reducing the number of clients significantly to account for the host machine's 4 cores and cutting the 2^14 top end of the value sizes.

[write throughput heatmap]

[read throughput heatmap]

@tjungblu (Contributor, Author)

Is anything still blocking this merge?
FYI, I'm OOO from the 15th until the new year.

@ahrtr (Member) commented Dec 13, 2022

ping @ptabor

@ahrtr (Member) commented Jan 15, 2023

@tjungblu I do not see much performance gain in this PR. Also, I don't find the heatmap image very clear. Could you use #15060 to generate an HTML page and export a couple of images? See the example below:

$ go run plot_data.go ./example/main.csv ./example/dev.csv

@tjungblu (Contributor, Author)

Could you use #15060 to generate an HTML page and export a couple of images? See the example below:

Let me see if I can find the data again; otherwise I'll do another run. How did you run the benchmarks?

@ahrtr (Member) commented Jan 16, 2023

Let me see if I can find the data again; otherwise I'll do another run. How did you run the benchmarks?

I did not run the benchmark at all; I just got the CSV files from #13045 (comment).

I did not change anything in the benchmark test; I just changed the way the data is plotted in #15060.

@tjungblu (Contributor, Author)

@ahrtr thankfully I found the CSVs. However, I'm getting an error while running your tool:

${binary} -legend main,pr a4c6d1bbc_main/main-2.csv 751f1eb50_pr14877/pr-2.csv

Failed to load data file(s): failed to read csv file "a4c6d1bbc_main/main-2.csv", error: record on line 3: wrong number of fields

Data looks like this:

type,ratio,conn_size,value_size,iter1,iter2,iter3,iter4,iter5,comment
PARAM,,,,,,,,,"key_size=256,key_space_size=65536,backend_size=21474836480,range_limit=100,commit=751f1eb50"
DATA,.0078,2,256,13.0118:1872.8309,13.0258:1874.7251,12.6731:1823.9990,12.7353:1832.9397,12.8543:1850.0930
DATA,.0078,4,256,19.3010:2777.8891,19.1221:2752.1469,18.6790:2688.2388,18.6536:2684.7521,18.6719:2687.3951

@ahrtr
Copy link
Member

ahrtr commented Jan 17, 2023

It's because the data file's format isn't correct per the CSV spec: the DATA rows have one column fewer than the first and second rows.

$ diff data_old.csv  data_new.csv 
3,4c3,4
< DATA,.0078,2,256,13.0118:1872.8309,13.0258:1874.7251,12.6731:1823.9990,12.7353:1832.9397,12.8543:1850.0930
< DATA,.0078,4,256,19.3010:2777.8891,19.1221:2752.1469,18.6790:2688.2388,18.6536:2684.7521,18.6719:2687.3951
---
> DATA,.0078,2,256,13.0118:1872.8309,13.0258:1874.7251,12.6731:1823.9990,12.7353:1832.9397,12.8543:1850.0930,
> DATA,.0078,4,256,19.3010:2777.8891,19.1221:2752.1469,18.6790:2688.2388,18.6536:2684.7521,18.6719:2687.3951,

After adding a comma at the end of each DATA line, it works now.
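
For reference, here is a minimal sketch of why the parser complains, assuming the plotting tool uses Go's standard encoding/csv reader (the example data is abbreviated):

package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	// The header and PARAM rows have 10 columns, but the DATA row has only 9
	// because the trailing "comment" column is missing.
	in := "type,ratio,conn_size,value_size,iter1,iter2,iter3,iter4,iter5,comment\n" +
		"PARAM,,,,,,,,,\"key_size=256\"\n" +
		"DATA,.0078,2,256,13.0:1872.8,13.0:1874.7,12.6:1823.9,12.7:1832.9,12.8:1850.0\n"

	r := csv.NewReader(strings.NewReader(in))
	// FieldsPerRecord defaults to 0: the first record fixes the expected field
	// count, and every later record with a different count is rejected.
	_, err := r.ReadAll()
	fmt.Println(err) // prints something like: record on line 3: wrong number of fields

	// Appending a trailing comma (an empty comment column) to the DATA rows,
	// or setting r.FieldsPerRecord = -1, makes the file parse.
}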

[R/W benchmark chart (RW Ratio 0.0078, Value Size 256)]

@tjungblu (Contributor, Author)

Not sure why the columns mismatch in the PARAM line; probably a bug in the bash script that runs the bench. After removing the line, I get this resulting HTML:

https://github.com/tjungblu/etcd/blob/pr_14877_result/rw_benchmark.html

[R/W benchmark chart (RW Ratio 0.0078, Value Size 256)]

which is faster in many cases. A suggestion for your tool: split the read and write throughput into two axes/plots; they are on totally different scales and hard to browse otherwise.

@ahrtr (Member) commented Jan 17, 2023

Not sure why the columns mismatch in the PARAM line; probably a bug in the bash script that runs the bench.

It's indeed a bug in rw-benchmark.sh; I just fixed it in #15060.

A suggestion for your tool: split the read and write throughput into two axes/plots.

It's very easy to hide or show each line by clicking the right side legend.

@ahrtr (Member) commented Jan 17, 2023

It's strange that your PR sometimes decreases the performance.

@tjungblu (Contributor, Author)

It's very easy to hide or show each line by clicking the right side legend.

That's fair; it's apparently equally easy to just add a new axis:
go-echarts/go-echarts#204 (comment)

@tjungblu (Contributor, Author)

It's strange that your PR sometimes decreases the performance.

I still think there's a starvation issue when running on such a low-SKU VM in GCP. I'm not sure how other benchmarks have been carried out so far; from the load the benchmark adds to a cluster, it must have been a fairly beefy machine.

@ahrtr (Member) commented Jan 17, 2023

That's fair; it's apparently equally easy to just add a new axis:
go-echarts/go-echarts#204 (comment)

We can improve the way the data is plotted after #15060 is merged; let's discuss that in a separate session.

In #15060:
[R/W benchmark chart (RW Ratio 2, Value Size 2048)]

In go-echarts/go-echarts#204 (comment):
[screenshot of a dual-Y-axis chart]

@tjungblu (Contributor, Author) commented Feb 1, 2023

Sorry for the radio silence. I've been doing another run with a 30-core machine on GCP; the difference is small, which leads me to believe it didn't use the right binary again. I'm re-running it for the PR as we speak; the whole run takes two days, so I'll get back by Friday.

Here's the result with the rw heatmap:
[write throughput heatmap]

Here's the website generated from Ben's PR:
https://github.com/tjungblu/etcd/blob/pr_14877_result/rw_benchmark.html

@ahrtr (Member) commented Feb 1, 2023

The results are almost the same.

[screenshot of the benchmark chart]

@tjungblu (Contributor, Author) commented Feb 8, 2023

OK, I finally got the run with the PR (rebased onto the latest main) to finish.

[write throughput heatmap]

Here's the website generated with Ben's PR:
https://github.com/tjungblu/etcd/blob/pr_14877_result/rw_benchmark.html

and the raw data is here: https://github.com/tjungblu/etcd/tree/pr_14877_result/rw_benchmark

Since this is yet another run with different and odd results, I'll drop this case here.
I'm not sure why the results vary, but I also don't have much time on my hands to dig into what's going on.

@stale (bot) commented May 21, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale bot added the stale label May 21, 2023
@stale bot closed this Jun 18, 2023