
Remove unnecessary locking in encoder #14877

Status: Closed · wants to merge 2 commits

Conversation

@tjungblu (Contributor)

Hey folks,

While working on vectorized writes in bbolt, we found a very excessive number of futex syscalls. Going through the hot path of the WAL, I found unnecessary locking on the encoder that is already fully covered by the wal.mu mutex.

Tests are passing fine for me, but I would appreciate a close look from two of you on this. In local testing, this improves throughput by up to 20%.
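
Roughly, the pattern looks like this (a minimal, hypothetical sketch with simplified stand-in types, not the actual wal package code):

package walsketch

import "sync"

// encoder is a simplified stand-in for the WAL record encoder.
type encoder struct {
	mu  sync.Mutex // removed by this change: every caller already holds WAL.mu
	buf []byte
}

func (e *encoder) encode(rec []byte) {
	// Previously the encoder locked its own mutex here:
	//   e.mu.Lock()
	//   defer e.mu.Unlock()
	// All call paths already hold WAL.mu, so this lock protected nothing extra
	// and only added lock/unlock (futex) traffic on the hot write path.
	e.buf = append(e.buf, rec...)
}

// WAL is a simplified stand-in for the write-ahead log.
type WAL struct {
	mu  sync.Mutex // serializes every public WAL operation
	enc *encoder
}

// Save sketches a public entry point: it takes WAL.mu before touching the
// encoder, which is what makes the encoder-internal mutex redundant.
func (w *WAL) Save(rec []byte) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.enc.encode(rec)
}

Removing the encoder-internal mutex drops one lock/unlock pair, and the associated futex traffic, per record on the hot write path.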


The encoder is already protected by the wal.mu on all paths. Removing this lock yields up to 20% more throughput when measured with etcdctl check perf.

Signed-off-by: Thomas Jungblut tjungblu@redhat.com

@ahrtr (Member) left a comment:

Great finding. Thank you @tjungblu

@ptabor (Contributor) commented Dec 4, 2022

@tjungblu, could you please post the repro of the tests that demonstrates the improvement?

For example, use this tool (https://github.com/etcd-io/etcd/tree/main/tools/rw-heatmaps), which should plot an improvement chart for different key-value pair sizes as well as read-to-write ratios.

@tjungblu (Contributor, Author) commented Dec 5, 2022

I've already been running the perf test for 3h or so. I'll supply the numbers and charts once it has finished with both the PR and the current 3.6 HEAD. Is there a way to parallelize this? Looking at the current CSV, it's running a pretty elaborate grid search.

As for the original "up to 20%" estimate, I was only running a simple etcdctl check perf --load=xl against both versions.

@tjungblu (Contributor, Author) commented Dec 5, 2022

OK, I had to cut the benchmark down a bit; it was just too elaborate:

diff --git a/tools/rw-heatmaps/rw-benchmark.sh b/tools/rw-heatmaps/rw-benchmark.sh
index abd63db1a..c08d7826d 100755
--- a/tools/rw-heatmaps/rw-benchmark.sh
+++ b/tools/rw-heatmaps/rw-benchmark.sh
@@ -2,11 +2,11 @@
 
 #set -x
 
-RATIO_LIST="1/128 1/8 1/4 1/2 2/1 4/1 8/1 128/1"
-VALUE_SIZE_POWER_RANGE="8 14"
-CONN_CLI_COUNT_POWER_RANGE="5 11"
-REPEAT_COUNT=5
-RUN_COUNT=200000
+RATIO_LIST="1/4 1/2 2/1 4/1 8/1"
+VALUE_SIZE_POWER_RANGE="8 10"
+CONN_CLI_COUNT_POWER_RANGE="5 8"
+REPEAT_COUNT=3
+RUN_COUNT=2000

It also looks like plot_data.py is a little outdated and won't work with the latest matplotlib and Python 3.11. Nevertheless, here are the plots:

[write throughput heatmap]

[read throughput heatmap]

Ping me on Slack if you need the raw results; as of a couple of days ago, our GDrive no longer allows sharing via link outside the organization. 🔐

@tjungblu (Contributor, Author) commented Dec 5, 2022

I just did a quick second run; the write results are somewhat different:

[write throughput heatmap]

I believe the large number of clients is simply starving etcd. I'll play around with the values a little and benchmark this again in the cloud tomorrow with a separate disk.

@ptabor (Contributor) commented Dec 6, 2022

Thank you @tjungblu. I'm a little confused.

The charts show requests/sec. Looking at your pictures, main peaks at 2400 QPS, while PR-14877 reaches only 2000. That would suggest the PR decreases throughput, but I have no idea how removing a lock could decrease performance. Are you sure the main and PR-14877 CSV files weren't mixed up?

If the results are the other way around, that would be a huge improvement.

Commits added to the PR:

  • "The encoder is already protected by the wal.mu on all paths. Removing this lock yields up to 20% more throughput when measured with etcdctl check perf." (Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>)
  • "All public functions of the WAL should be covered by the mutex." (Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>)
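
The second commit spells out the invariant the first one relies on. Roughly (a hypothetical sketch that continues the simplified WAL type from the sketch above, not the real wal.go):

// Sync sketches another public entry point: every exported WAL method takes
// w.mu itself before doing any work.
func (w *WAL) Sync() error {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.syncLocked()
}

// syncLocked is an internal helper: the caller must hold w.mu. Keeping that
// invariant on every public function is what lets the encoder and other
// internals stay lock-free.
func (w *WAL) syncLocked() error {
	// flush w.enc.buf to disk here (omitted in this sketch)
	return nil
}
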
@tjungblu reopened this Dec 6, 2022
@tjungblu (Contributor, Author) commented Dec 6, 2022

I'm testing this again on two VMs on GCP in parallel to ensure they don't run the same version somehow. Will come back with the results once it's done.

@ptabor (Contributor) commented Dec 6, 2022

I'm testing this again on two VMs on GCP in parallel to ensure they don't run the same version somehow. Will come back with the results once it's done.

Thank you. For predictable runs on GCP, I would recommend:

  • running on a C2, C2D, or T2D instance family.
  • optionally, running both tests serially on the same instance, so that the risk of landing on a busier or more distant host is less likely to skew the results.

@tjungblu (Contributor, Author) commented Dec 6, 2022

Sounds good. I'm running on a c2-standard-4 right now with 100 GB of SSD persistent disk. I believe four cores are not enough for the current set of clients, so I might reduce this to 2-16 clients for another set of runs; let's look at the data when it comes in.

optionally, running both tests serially on the same instance, so that the risk of landing on a busier or more distant host is less likely to skew the results.

I hope the first run finishes soon, so that I can then run the other binary on the respective other host.

@tjungblu (Contributor, Author) commented Dec 6, 2022

Below is the first run up to R/W ratio 8.0; I'm not getting much wiser from these graphs either. I'm running with the binaries/hosts reversed overnight to see if the results are comparable.

[write throughput heatmap]

[read throughput heatmap]

@tjungblu (Contributor, Author) commented Dec 7, 2022

Here are the results from the overnight run. I adjusted the test regimen a little with:

diff --git a/tools/rw-heatmaps/rw-benchmark.sh b/tools/rw-heatmaps/rw-benchmark.sh
index abd63db1a..be49cb518 100755
--- a/tools/rw-heatmaps/rw-benchmark.sh
+++ b/tools/rw-heatmaps/rw-benchmark.sh
@@ -3,10 +3,10 @@
 #set -x
 
 RATIO_LIST="1/128 1/8 1/4 1/2 2/1 4/1 8/1 128/1"
-VALUE_SIZE_POWER_RANGE="8 14"
-CONN_CLI_COUNT_POWER_RANGE="5 11"
+VALUE_SIZE_POWER_RANGE="8 12"
+CONN_CLI_COUNT_POWER_RANGE="1 4"
 REPEAT_COUNT=5
-RUN_COUNT=200000
+RUN_COUNT=20000
 
 KEY_SIZE=256
 KEY_SPACE_SIZE=$((1024 * 64))

basically reducing the number of clients significantly to account for the host machine's 4 cores and cutting the 2^14 top end of the value sizes.

[write throughput heatmap]

[read throughput heatmap]

@tjungblu (Contributor, Author)

Is anything still blocking this merge?
FYI, I'm OOO from the 15th until the new year.

@ahrtr (Member) commented Dec 13, 2022

ping @ptabor

@ahrtr (Member) commented Jan 15, 2023

@tjungblu I do not see much performance gain in this PR. Also, I don't find the heatmap image very clear. Could you use #15060 to generate an HTML page and export a couple of images? See the example below:

$ go run plot_data.go ./example/main.csv ./example/dev.csv

@tjungblu (Contributor, Author)

Could you use #15060 to generate an HTML page and export a couple of images? See the example below:

Let me see if I can find the data again; otherwise I'll do another run. How did you run the benchmarks?

@ahrtr (Member) commented Jan 16, 2023

Let me see if I can find the data again; otherwise I'll do another run. How did you run the benchmarks?

I did not run the benchmark at all; I just got the CSV files from #13045 (comment).

I did not change anything in the benchmark test; I just changed the way the data is plotted in #15060.

@tjungblu (Contributor, Author)

@ahrtr thankfully I found the CSVs. However, I'm getting an error while running your tool:

${binary} -legend main,pr a4c6d1bbc_main/main-2.csv 751f1eb50_pr14877/pr-2.csv

Failed to load data file(s): failed to read csv file "a4c6d1bbc_main/main-2.csv", error: record on line 3: wrong number of fields

Data looks like this:

type,ratio,conn_size,value_size,iter1,iter2,iter3,iter4,iter5,comment
PARAM,,,,,,,,,"key_size=256,key_space_size=65536,backend_size=21474836480,range_limit=100,commit=751f1eb50"
DATA,.0078,2,256,13.0118:1872.8309,13.0258:1874.7251,12.6731:1823.9990,12.7353:1832.9397,12.8543:1850.0930
DATA,.0078,4,256,19.3010:2777.8891,19.1221:2752.1469,18.6790:2688.2388,18.6536:2684.7521,18.6719:2687.3951

@ahrtr
Copy link
Member

ahrtr commented Jan 17, 2023

It's because the data file's format isn't correct per the CSV spec: the DATA rows have one column fewer than the first and second rows.

$ diff data_old.csv  data_new.csv 
3,4c3,4
< DATA,.0078,2,256,13.0118:1872.8309,13.0258:1874.7251,12.6731:1823.9990,12.7353:1832.9397,12.8543:1850.0930
< DATA,.0078,4,256,19.3010:2777.8891,19.1221:2752.1469,18.6790:2688.2388,18.6536:2684.7521,18.6719:2687.3951
---
> DATA,.0078,2,256,13.0118:1872.8309,13.0258:1874.7251,12.6731:1823.9990,12.7353:1832.9397,12.8543:1850.0930,
> DATA,.0078,4,256,19.3010:2777.8891,19.1221:2752.1469,18.6790:2688.2388,18.6536:2684.7521,18.6719:2687.3951,

After adding a comma at the end of each DATA line, it works now.
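
For reference, here is a minimal sketch of why the parser complains, assuming the plotting tool uses Go's standard encoding/csv reader (the example data is abbreviated):

package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	// The header and PARAM rows have 10 columns, but the DATA row has only 9
	// because the trailing "comment" column is missing.
	in := "type,ratio,conn_size,value_size,iter1,iter2,iter3,iter4,iter5,comment\n" +
		"PARAM,,,,,,,,,\"key_size=256\"\n" +
		"DATA,.0078,2,256,13.0:1872.8,13.0:1874.7,12.6:1823.9,12.7:1832.9,12.8:1850.0\n"

	r := csv.NewReader(strings.NewReader(in))
	// FieldsPerRecord defaults to 0: the first record fixes the expected field
	// count, and every later record with a different count is rejected.
	_, err := r.ReadAll()
	fmt.Println(err) // prints something like: record on line 3: wrong number of fields

	// Appending a trailing comma (an empty comment column) to the DATA rows,
	// or setting r.FieldsPerRecord = -1, makes the file parse.
}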

[R/W benchmark chart (RW Ratio 0.0078, Value Size 256)]

@tjungblu (Contributor, Author)

Not sure why the columns mismatch in the PARAM line; probably a bug in the bash script that runs the bench. After removing the line, I get this resulting HTML:

https://github.com/tjungblu/etcd/blob/pr_14877_result/rw_benchmark.html

[R/W benchmark chart (RW Ratio 0.0078, Value Size 256)]

which is faster in many cases. A suggestion for your tool: split the read and write throughput into two axes/plots; they are on totally different scales and hard to browse otherwise.

@ahrtr (Member) commented Jan 17, 2023

Not sure why the columns mismatch in the PARAM line; probably a bug in the bash script that runs the bench.

It's indeed a bug in rw-benchmark.sh; I just fixed it in #15060.

A suggestion for your tool: split the read and write throughput into two axes/plots.

It's very easy to hide or show each line by clicking the right side legend.

@ahrtr (Member) commented Jan 17, 2023

It's strange that your PR sometimes decreases the performance.

@tjungblu (Contributor, Author)

It's very easy to hide or show each line by clicking the right side legend.

That's fair; it's apparently equally easy to just add a new axis:
go-echarts/go-echarts#204 (comment)

@tjungblu (Contributor, Author)

It's strange that your PR sometimes decreases the performance.

I still think there's a starvation issue when running on such a low-SKU VM in GCP. I'm not sure how other benchmarks have been carried out so far; from the load the benchmark adds to a cluster, it must have been a fairly beefy machine.

@ahrtr (Member) commented Jan 17, 2023

That's fair; it's apparently equally easy to just add a new axis:
go-echarts/go-echarts#204 (comment)

We can improve the way the data is plotted after #15060 is merged; let's discuss that in a separate session.

In #15060:
[R/W benchmark chart (RW Ratio 2, Value Size 2048)]

In go-echarts/go-echarts#204 (comment):
[screenshot of a dual-Y-axis chart]

@tjungblu (Contributor, Author) commented Feb 1, 2023

Sorry for the radio silence. I've been doing another run with a 30-core machine on GCP; the difference is small, which leads me to believe it didn't use the right binary again. I'm re-running it for the PR as we speak; the whole run takes two days, so I'll get back by Friday.

Here's the result with the rw heatmap:
[write throughput heatmap]

Here's the website generated from Ben's PR:
https://github.com/tjungblu/etcd/blob/pr_14877_result/rw_benchmark.html

@ahrtr (Member) commented Feb 1, 2023

The results are almost the same.

[screenshot of the benchmark chart]

@tjungblu (Contributor, Author) commented Feb 8, 2023

OK, I finally got the run with the PR (rebased onto the latest main) to finish.

[write throughput heatmap]

Here's the website generated with Ben's PR:
https://github.com/tjungblu/etcd/blob/pr_14877_result/rw_benchmark.html

and the raw data is here: https://github.com/tjungblu/etcd/tree/pr_14877_result/rw_benchmark

Since this is yet another run with different and odd results, I'll drop this case here.
I'm not sure why the results vary, but I also don't have much time on my hands to dig into what's going on.

@stale (bot) commented May 21, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale bot added the stale label May 21, 2023
@stale bot closed this Jun 18, 2023