Remove unnecessary locking in encoder #14877
Conversation
Great finding. Thank you @tjungblu
@tjungblu, could you please post the repro of the tests that demonstrates the improvement? For example, use this tool (https://github.com/etcd-io/etcd/tree/main/tools/rw-heatmaps), which should plot an improvement chart for different key-value pair sizes as well as read-to-write ratios.
I've already been running the perf test for 3h or so. I'll supply the numbers and charts once it's done with both the PR and the current 3.6 HEAD. Is there a way to parallelize this? Looking at the current CSV, it's running a pretty elaborate grid search. As for the original "up to 20%" estimate, I was only running a simple
OK, I had to cut the benchmark down a bit; it was just too elaborate:
It also looks like plot_data.py is a little outdated and won't work with the latest matplotlib and Python 3.11. Nevertheless, here are the plots. Ping me on Slack if you need the raw results; since a couple of days ago, our GDrive no longer allows sharing via link to the outside world. 🔐
Thank you @tjungblu. I'm a little confused. The charts are showing If the results are the opposite, that would be a huge improvement.
The encoder is already protected by the wal.mu on all paths. Removing this lock yields up to 20% more throughput when measured with etcdctl check perf. Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
all public functions of the WAL should be covered by the mutex. Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
I'm testing this again on two VMs on GCP in parallel, to ensure they don't somehow run the same version. I'll come back with the results once it's done.
Thank you. For predictable runs on GCP, I would recommend:
Sounds good. I'm running on a c2-standard-4 right now with 100 GB of SSD persistent disk. I believe the four cores are not enough for the current set of clients, so I might reduce this to 2-16 clients for another set of runs; let's look at the data when it comes.
I hope the first run finishes soon, so I can run the respective other binary on the other host.
Here are the results from the overnight run. I adjusted the test regimen a little:
basically reducing the number of clients significantly to account for the four cores of the host machine, and cutting the 2^14 top end of the data sizes.
Is anything still blocking this merge?
ping @ptabor
Let me see if I can find the data again; otherwise I'll do another run. How did you run the benchmarks?
I did not run the benchmark at all; I just got the CSV files from #13045 (comment). I didn't change anything in the benchmark test, I only changed the way the data is plotted in #15060.
@ahrtr I found the CSVs, thankfully. However, I'm receiving an error while running your tool:
Data looks like this:
It's because the data file's format doesn't conform to the CSV spec. Each data row has one fewer column than the first and second rows.
After adding a comma at the end of each line, it works.
Not sure why the columns mismatch in the PARAM line; probably a bug in the bash script that runs the bench. After removing the line, I get this resulting HTML: https://github.com/tjungblu/etcd/blob/pr_14877_result/rw_benchmark.html, which is faster in many cases. A suggestion for your tool: split the read and write throughput into two axes/plots; they seem to be on totally different scales and are hard to browse otherwise.
It's indeed a bug of the
It's very easy to hide or show each line by clicking the legend on the right side.
It's strange that your PR sometimes decreases the performance.
That's fair; it's apparently equally easy to just add a new axis:
I still think there's a starvation issue when running on such a low-SKU VM in GCP. I'm not sure how other benchmarks have been carried out so far; judging from the load it adds to a cluster, it must have been a fairly beefy machine.
We can improve the way the data is plotted after #15060 is merged. We can discuss it in a separate session. In #15060:
Sorry for the radio silence. I've been doing another run with a 30-core machine on GCP; the difference is small, which leads me to believe it didn't use the right binary again. I'm re-running it for the PR as we speak; the whole run takes two days, so I'll get back by Friday. Here's the result with the rw heatmap: here's the website generated from Ben's PR:
OK, I finally got the run with the PR (rebased onto latest main) to finish. Here's the website generated with Ben's PR: and the raw data here: https://github.com/tjungblu/etcd/tree/pr_14877_result/rw_benchmark. Since this is yet another run with different and weird results, I'll drop this case here.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Hey folks,
While working on vectorized writes in bbolt, we found a very excessive number of futex syscalls. I've been going through the hot path of the WAL and found an unnecessary lock on the encoder that is already perfectly covered by the wal.mu mutex. Tests pass fine for me, but I would appreciate another close look from two of you on this. Testing locally, this improves throughput by up to 20%.