Significant performance degradation after upgrading librdkafka from 0.11.0 to 1.1.0 #2509
Comments
How many partitions?
There is 1 partition.
That throughput sounds low even for the old version.
We are doing some processing when preparing the messages, which is why the throughput is low, but it is OK for us. The point is that when we do the same processing with the new librdkafka we see poor performance, as described.
Suggest cutting out your processing to troubleshoot this issue.
I'm currently busy with other tasks and cannot isolate rdkafka on its own in our product.
I was actually just about to submit an identical report, as we recently moved from 0.11.5 to 1.1.0 (with a patch) as a follow-up to #2492. I'm going to try to repro with …
OK, using the command …:
v0.11.5: averages ~20k msgs/s and uses ~180% CPU.
v1.1.0: averages a little over 17k msgs/s and still uses ~180% CPU.
In my prod task, I saw significantly slower produces (around 4x slower) and increased CPU usage (about 2x).
Here's a run with master (…), which looks to be a little over 21k msgs/s and uses only 170% CPU.
@edenhill - Do you do any performance tests when cutting releases? If so, are the numbers reported anywhere? They would be nice to look at when considering an upgrade. Edit: I guess there would have to be a lot of "flavors" (producer, consumer, message size, compression, etc.).
My issue ended up being that I was setting … As far as the slowdown in …, I will report back when I finish my rollout.
Alright, even setting the topic level config to …
We try hard not to have performance regressions, but since the Kafka protocol is growing and more features are being added, a slow decline in performance is natural, and is hopefully offset by the increase in compute power.
When doing performance testing, make sure to adjust linger.ms upwards to at least 5-10 ms to allow for proper batch accumulation.
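For reference, a minimal sketch (not taken from this thread) of how a benchmark client might raise the batching window with the librdkafka C API; the broker address is a placeholder and the 10 ms value simply follows the suggestion above. "linger.ms" is an alias for "queue.buffering.max.ms" in recent librdkafka versions; use the latter name on older releases.

```c
#include <librdkafka/rdkafka.h>
#include <stdio.h>

int main(void) {
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();

    /* Placeholder broker address for the sketch. */
    rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                      errstr, sizeof(errstr));

    /* Allow 10 ms of batch accumulation, as suggested above. */
    if (rd_kafka_conf_set(conf, "linger.ms", "10",
                          errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
        fprintf(stderr, "%s\n", errstr);
        return 1;
    }

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf,
                                  errstr, sizeof(errstr));
    if (!rk) {
        fprintf(stderr, "%s\n", errstr);
        return 1;
    }

    /* ... produce the benchmark messages here ... */

    rd_kafka_flush(rk, 10 * 1000);  /* wait for outstanding deliveries */
    rd_kafka_destroy(rk);
    return 0;
}
```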
In production we have … The problem with this performance degradation is that it's massive, and spread across our rather large cluster it amounts to a lot of additional resources. Unfortunately, we had to upgrade to get a bug fix we needed for another issue, and now we find ourselves a little stuck. To be clear, the issue is 2-fold:
I've also run this against 2 different Kafka clusters, one that supports Message V2 and one that doesn't, and it doesn't make a difference.
I tried the rdkafka performance tool on my environment, and here are the results for librdkafka 0.11.0 and for librdkafka 1.1.0, obtained with this command:
rdkafka_performance.exe -P -t my_topic -b eliyahum-kafka13.dockernet -a 1 -s 20 -X api.version.request=true -X message.max.bytes=1000000 -X queue.buffering.max.ms=1000 -X request.required.acks=1
To summarize: I've run the test several times, and it validates that on my environment librdkafka 0.11.0 is much faster than librdkafka 1.1.0.
On Linux with broker version 2.3.0 I get quite the opposite results:
librdkafka v0.11.0: …
librdkafka v1.1.0: …
Interesting.
I've now run the same test on my personal machine (Windows 10), which is a strong one (Intel i7-8550U CPU, 32 GB RAM). I still see that 1.1.0 is slower than 0.11.0, though not by the same factor: on the virtual machine it was ~0.5 slower, and on my personal machine it is ~0.8 slower. I think it is related to the platform, i.e. running librdkafka on Windows shows performance degradation. Here are my results for 1.1.0: … I've also noticed that the number of messages in the queue differs between 0.11.0 and 1.1.0.
I see the issue on Linux, so I don't think it's related to the platform. I think it's possible it's related to available CPU, since in my measurements it's CPU time that is significantly heavier in 1.1.0. I'm also using a 1.1.0 broker, which may have something to do with it. Tomorrow, I plan on testing it out with multiple produce threads, to more accurately reproduce my issue.
Can you please try out the latest master? It has a producer performance improvement that affects core congestion: e8b1c06
There is a big improvement (comparing master to 1.1.0), but 0.11.0 is still much faster. Here are the results:
0.11.0: …
master (a12b909): …
1.1.0: …
On my personal computer, which is a strong one, running Windows 10, I see similar results: master has improved performance compared to 1.1.0, but 0.11.0 is still winning by a wide margin.
0.11.0: …
master: …
1.1.0: …
I don't want to lose track of the fact that, aside from being slower, a significantly larger CPU load is present with newer versions. My previous tests were run in Docker on my Mac, so I decided to try one of our stage boxes, an otherwise idle 56-core RHEL7 box.
0.11.5: …
1.1.0: …
1.2.0-RC3: …
This shows that 0.11.5 uses significantly fewer CPU resources (look particularly at the system CPU time) than later versions, while also being faster.
I find these times very surprising, so from what I understand:
This seems to indicate that while newer versions double their application CPU time, they increase syscall time almost 30-fold. A profiler should be able to give some idea of where this time is being spent. Also, please specify how many CPU cores your test system has.
@edenhill - Sorry, I just edited my comment to include the information. The test was on a 56-core box, which is pretty much idle even when running these tests (since only 1 thread is producing). I will try to profile using valgrind, though if some of the CPU time is due to spin polling, then the difference in the ratio of network / CPU speeds might not show it.
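As a side note for anyone reproducing this measurement, here is a tiny sketch (not from this thread, Linux-only) that separates user vs. system CPU time around a produce loop with getrusage(); produce_messages() is a hypothetical placeholder for the actual benchmark body.

```c
/* Sketch: report user vs. sys CPU time for the current process around a
 * benchmark section. The regression discussed here shows up mostly as sys
 * time, so the split is the interesting number. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

static double tv_sec(struct timeval tv) {
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

int main(void) {
    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);

    /* produce_messages();  -- hypothetical placeholder for the produce loop */

    getrusage(RUSAGE_SELF, &after);
    printf("user: %.2fs  sys: %.2fs\n",
           tv_sec(after.ru_utime) - tv_sec(before.ru_utime),
           tv_sec(after.ru_stime) - tv_sec(before.ru_stime));
    return 0;
}
```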
Also note that we are still using broker 1.1.0.
I tried to use … @edenhill - Do you also see the additional CPU usage with the newer version?
@edenhill - I think I may be on to something. In …
I'm trying to dig through the code, but I'll admit that I'm not overly familiar with it. I suspect there are 2 interesting behaviors here:
OK, a little more info. When … In … In … I'm not entirely sure that I've got this straight, but this seems to cause poll to terminate whenever a single message is enqueued, even when we have thousands of messages available to fill up a message set. @edenhill, does this sound right?
(This is producing to a topic with 4 partitions on 3 brokers.)
v0.11.5: …
master/v1.2.0-RC3: …
master is faster, and definitely uses a lot more CPU: 48% vs 250%. Same test but with mutrace (measuring lock contention) and just 1M messages:
v0.11.5: …
master: …
We see that v0.11.5 does a lot more locking, but master has a lot more lock contention. I think your analysis pointing to the wakeup fd is correct: the wakeup is done by a write() to a pipe, and this would account for the huge increase in syscalls (sys time) we see on master, and since the write is performed while holding the queue lock [1] (and possibly the partition lock) it adds to, and explains, the extra contention.
[1]: the lock is held while calling write() to ensure that the fd is not closed (which is OK) and re-created (definitely not OK) between obtaining it and calling write().
I'll need to think through what the best solution is for this without causing any regression in throughput or latency.
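To make the pattern concrete, here is an illustrative sketch of the behavior described above; all names are hypothetical and do not reflect librdkafka's actual internals. Every enqueue performs a write() to the wakeup pipe while the queue lock is held, which means one syscall per produced message.

```c
/* Hypothetical illustration of a per-message fd wakeup under the queue lock. */
#include <pthread.h>
#include <unistd.h>

struct msg_queue {
    pthread_mutex_t lock;
    int             wakeup_fd;  /* write end of the pipe the poller sleeps on */
    int             qlen;
    /* ... message list ... */
};

static void queue_enqueue(struct msg_queue *q /*, message */) {
    pthread_mutex_lock(&q->lock);
    /* append the message and bump the queue length */
    q->qlen++;

    /* Wake the poller for every single message. The lock stays held so the
     * fd cannot be closed and re-created underneath the write(). */
    char b = 1;
    (void)write(q->wakeup_fd, &b, 1);

    pthread_mutex_unlock(&q->lock);
}
```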
Can't users still control the latency bounds by using …?
I tested a patch that changed this line to …
1.1.0 (no patch): …
1.1.0 (qlen>=100 patch): …
Which not only shows that it uses way less CPU, but is insanely faster (…).
Interesting! So one way to fix this might be to limit the number of fd-based wakeups to linger.ms; I'll make a patch and see how it flies.
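A rough sketch of that direction (hypothetical names again, not the actual patch): suppress further fd wakeups until linger.ms has elapsed since the last one, so the poller drains whatever has accumulated in one batch.

```c
/* Hypothetical rate-limited wakeup: at most one write() per linger interval. */
#include <pthread.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

struct msg_queue {
    pthread_mutex_t lock;
    int             wakeup_fd;
    int64_t         next_wakeup_us;  /* earliest time the next wakeup may fire */
    int64_t         linger_us;       /* configured linger.ms, in microseconds */
    int             qlen;
};

static int64_t now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

static void queue_enqueue(struct msg_queue *q /*, message */) {
    pthread_mutex_lock(&q->lock);
    /* append the message and bump the queue length */
    q->qlen++;

    int64_t now = now_us();
    if (now >= q->next_wakeup_us) {
        char b = 1;
        (void)write(q->wakeup_fd, &b, 1);
        /* No more wakeups until linger.ms has passed; everything queued in
         * the meantime is picked up by the poller in a single batch. */
        q->next_wakeup_us = now + q->linger_us;
    }
    pthread_mutex_unlock(&q->lock);
}
```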
Can you try out the ratewakeups branch?
Not quite as pronounced a difference as the above change (likely due to additional queueing ops still), but nearly twice the throughput for 1/4 the CPU!
master (1.2.0RC3, no patch): …
ratewakeups branch: …
Nice! Sys time is all gone.
Looks like ratewakeups has about 50% higher throughput than 0.11.5 and about equivalent CPU usage (total CPU is lower in ratewakeups due to shorter runtime!).
Also for comparison, I tried the patch with a …
Of course, this is testing with a thread publishing as quickly as possible. In real-world scenarios, there is likely a small amount of processing between calls, which might make these numbers differ. But all in all, I'm eager to test out this patch in my real-world case.
That's great news, thank you for all your work on this, Sean! 💯
We'll delay the upcoming v1.2.0 release to get this fix in, aiming to release on Monday instead of today.
Cool! Too bad you never get any issues like "Significantly better performance when upgrading from 1.1.0 to 1.2.0", huh?
Hi, I don't have time to read the entire thread here. Just to summarize: I understand the issue is fixed and the fix will be available in v1.2.0, which will be released soon. Right?
@Eliyahu-Machluf There is no workaround, you will need to wait for v1.2.0.
OK. Thanks.
@edenhill any news on the 1.2.0 release?
v1.2.0 was released a couple of weeks ago, but has a Windows GSSAPI blocker; v1.2.1 is due this week.
Description
We've upgraded our librdkafka version from 0.11.0 to 1.1.0, and while doing performance tests we noticed a major performance degradation: librdkafka 1.1.0 is about 50% slower than 0.11.0 for our scenario.
How to reproduce
We are running a produce session, producing 500,000 messages, and telling librdkafka to transfer them. When using librdkafka 1.1.0 it takes ~25 seconds; using 0.11.0 it takes ~12 seconds.
This is after we configured request.required.acks to be '1' (as we've seen the default was changed, and we want to compare the same configuration).
Checklist
- librdkafka version: 1.1.0
- Apache Kafka version: 2.0 (we've also tried it with 2.3)
- librdkafka client configuration: api.version.request=true request.required.acks=1 broker.version.fallback=0.9.0 message.max.bytes=1000000 queue.buffering.max.ms=1000 (applied programmatically in the sketch after this checklist)
- Operating system: Windows 10
- Logs (with debug=.. as necessary) from librdkafka
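For completeness, here is a hedged sketch (not part of the original report) of applying the configuration listed in the checklist programmatically, so that both librdkafka versions are compared with identical settings; the broker address is a placeholder.

```c
#include <librdkafka/rdkafka.h>
#include <stdio.h>
#include <stdlib.h>

/* Helper for the sketch: abort on any configuration error. */
static void set_or_die(rd_kafka_conf_t *conf, const char *key, const char *val) {
    char errstr[512];
    if (rd_kafka_conf_set(conf, key, val, errstr, sizeof(errstr)) !=
        RD_KAFKA_CONF_OK) {
        fprintf(stderr, "%s=%s: %s\n", key, val, errstr);
        exit(1);
    }
}

int main(void) {
    rd_kafka_conf_t *conf = rd_kafka_conf_new();
    set_or_die(conf, "bootstrap.servers", "localhost:9092"); /* placeholder broker */
    set_or_die(conf, "api.version.request", "true");
    set_or_die(conf, "request.required.acks", "1");
    set_or_die(conf, "broker.version.fallback", "0.9.0");
    set_or_die(conf, "message.max.bytes", "1000000");
    set_or_die(conf, "queue.buffering.max.ms", "1000");

    char errstr[512];
    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
    if (!rk) {
        fprintf(stderr, "%s\n", errstr);
        return 1;
    }

    /* ... produce the 500,000 test messages and time the run ... */

    rd_kafka_flush(rk, 30 * 1000);
    rd_kafka_destroy(rk);
    return 0;
}
```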