
Significant performance degradation after upgrading librdkafka from 0.11.0 to 1.1.0 #2509

Closed
Eliyahu-Machluf opened this issue Sep 5, 2019 · 48 comments

@Eliyahu-Machluf
Contributor

Description

We've upgraded our librdkafka version from 0.11.0 to 1.1.0, and while running performance tests we noticed a major performance degradation: librdkafka 1.1.0 is about 50% slower than 0.11.0 for our scenario.

How to reproduce

We are running a produce session, producing 500,000 messages, and telling librdkafka to transfer them. With librdkafka 1.1.0 this takes ~25 seconds; with 0.11.0 it takes ~12 seconds.
This is after we configured request.required.acks to '1' (we saw that the default changed, and we want to compare the same configuration). A minimal sketch of this kind of produce-and-flush loop is shown after the checklist below.

Checklist

  • librdkafka version (release number or git tag): 1.1.0
  • [x] Apache Kafka version: 2.0 (we've also tried 2.3)
  • librdkafka client configuration:
    api.version.request=true request.required.acks=1 broker.version.fallback=0.9.0 message.max.bytes=1000000 queue.buffering.max.ms=1000
  • Operating system: Windows 10
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • [x] Critical issue
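
For reference, here is a minimal sketch of the kind of timed produce-and-flush loop and client configuration described above, using librdkafka's C producer API. This is an illustrative reconstruction, not the reporter's actual code: the broker address, topic name and payload are placeholders, and request.required.acks is applied via the default topic configuration since it is historically a topic-level property (see the acks discussion later in this thread).

```c
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();
        rd_kafka_topic_conf_t *tconf = rd_kafka_topic_conf_new();

        /* Client configuration from the checklist above (broker address is a placeholder) */
        rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092", errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "api.version.request", "true", errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "broker.version.fallback", "0.9.0", errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "message.max.bytes", "1000000", errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "queue.buffering.max.ms", "1000", errstr, sizeof(errstr));

        /* request.required.acks is a topic-level property; apply it via the default topic conf */
        rd_kafka_topic_conf_set(tconf, "request.required.acks", "1", errstr, sizeof(errstr));
        rd_kafka_conf_set_default_topic_conf(conf, tconf); /* conf takes ownership of tconf */

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "Failed to create producer: %s\n", errstr);
                return 1;
        }

        const char *payload = "xxxxxxxxxxxxxxxxxxxx"; /* small message body */
        time_t start = time(NULL);

        for (int i = 0; i < 500000; i++) {
                /* Retry on a full local queue, serving delivery reports meanwhile */
                while (rd_kafka_producev(rk,
                                         RD_KAFKA_V_TOPIC("my_topic"),
                                         RD_KAFKA_V_VALUE((void *)payload, strlen(payload)),
                                         RD_KAFKA_V_MSGFLAGS(RD_KAFKA_MSG_F_COPY),
                                         RD_KAFKA_V_END) == RD_KAFKA_RESP_ERR__QUEUE_FULL)
                        rd_kafka_poll(rk, 10);
                rd_kafka_poll(rk, 0);
        }

        rd_kafka_flush(rk, 60 * 1000); /* wait for all outstanding deliveries */
        printf("500000 messages produced and delivered in %ld s\n",
               (long)(time(NULL) - start));

        rd_kafka_destroy(rk);
        return 0;
}
```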
@edenhill
Contributor

edenhill commented Sep 5, 2019

How many partitions?
What kind of machine are you running the producer on, and how many cores does it have?

@Eliyahu-Machluf
Contributor Author

There is 1 partition.
We are running the producer on a Windows 10 machine, which is a virtual machine (running on Hyper-V) with 8 virtual processors.
The host machine has 4 cores and 8 logical processors.

@edenhill
Contributor

edenhill commented Sep 5, 2019

That thruput sounds low even for the old version.
What is the CPU usage on the client and broker? Network load?

@Eliyahu-Machluf
Contributor Author

We do some processing when preparing the messages; that's why the throughput is low, but it is OK for us. The point is that when we do the same processing and use the new librdkafka, we see the poor performance described above.
The CPU usage is not a useful indicator, as our own message-preparation work also loads the CPU.

@edenhill
Contributor

edenhill commented Sep 5, 2019

Suggest cutting out your processing to troubleshoot this issue

@Eliyahu-Machluf
Contributor Author

I'm currently busy with other tasks and cannot isolate just rdkafka in our product.

@shanson7
Contributor

shanson7 commented Sep 5, 2019

I was actually just about to submit an identical report, as we recently moved from 0.11.5 to 1.1.0 (with a patch) as a follow-up to #2492. I'm going to try to repro with examples/rdkafka_performance.

@shanson7
Contributor

shanson7 commented Sep 5, 2019

OK, using the command ./examples/rdkafka_performance -P -s 40 -z snappy -t test -b kafka:9092 -a 1 (which pretty much reproduces my scenario of many small messages):

v0.11.5

% 20521 messages produced (820840 bytes), 20289 delivered (offset 0, 0 failed) in 1000ms: 20287 msgs/s and 0.81 MB/s, 0 produce failures, 232 in queue, snappy compression
% 41531 messages produced (1661240 bytes), 41240 delivered (offset 0, 0 failed) in 2000ms: 20618 msgs/s and 0.82 MB/s, 0 produce failures, 291 in queue, snappy compression
% 61335 messages produced (2453400 bytes), 61233 delivered (offset 0, 0 failed) in 3000ms: 20409 msgs/s and 0.82 MB/s, 0 produce failures, 103 in queue, snappy compression
% 78481 messages produced (3139240 bytes), 78319 delivered (offset 0, 0 failed) in 4000ms: 19578 msgs/s and 0.78 MB/s, 0 produce failures, 162 in queue, snappy compression
% 98303 messages produced (3932120 bytes), 98148 delivered (offset 0, 0 failed) in 5000ms: 19628 msgs/s and 0.79 MB/s, 0 produce failures, 155 in queue, snappy compression
% 118981 messages produced (4759240 bytes), 118832 delivered (offset 0, 0 failed) in 6000ms: 19804 msgs/s and 0.79 MB/s, 0 produce failures, 149 in queue, snappy compression
% 138039 messages produced (5521560 bytes), 137763 delivered (offset 0, 0 failed) in 7000ms: 19679 msgs/s and 0.79 MB/s, 0 produce failures, 276 in queue, snappy compression
% 157160 messages produced (6286400 bytes), 156945 delivered (offset 0, 0 failed) in 8000ms: 19617 msgs/s and 0.78 MB/s, 0 produce failures, 215 in queue, snappy compression
% 177533 messages produced (7101320 bytes), 177361 delivered (offset 0, 0 failed) in 9000ms: 19705 msgs/s and 0.79 MB/s, 0 produce failures, 172 in queue, snappy compression
% 196830 messages produced (7873200 bytes), 196640 delivered (offset 0, 0 failed) in 10000ms: 19662 msgs/s and 0.79 MB/s, 0 produce failures, 190 in queue, snappy compression

It averages ~20k msgs/s and uses ~180% CPU

v1.1.0

% 17113 messages produced (684520 bytes), 16877 delivered (offset 526462555, 0 failed) in 1000ms: 16876 msgs/s and 0.68 MB/s, 0 produce failures, 236 in queue, snappy compression
% 35145 messages produced (1405800 bytes), 35010 delivered (offset 526480688, 0 failed) in 2000ms: 17503 msgs/s and 0.70 MB/s, 0 produce failures, 135 in queue, snappy compression
% 52662 messages produced (2106480 bytes), 52497 delivered (offset 526498175, 0 failed) in 3000ms: 17498 msgs/s and 0.70 MB/s, 0 produce failures, 165 in queue, snappy compression
% 70181 messages produced (2807240 bytes), 69962 delivered (offset 526515640, 0 failed) in 4000ms: 17489 msgs/s and 0.70 MB/s, 0 produce failures, 219 in queue, snappy compression
% 87701 messages produced (3508040 bytes), 87560 delivered (offset 526533238, 0 failed) in 5000ms: 17510 msgs/s and 0.70 MB/s, 0 produce failures, 142 in queue, snappy compression
% 105440 messages produced (4217600 bytes), 105287 delivered (offset 526550965, 0 failed) in 6000ms: 17546 msgs/s and 0.70 MB/s, 0 produce failures, 153 in queue, snappy compression
% 122654 messages produced (4906160 bytes), 122411 delivered (offset 526568089, 0 failed) in 7000ms: 17486 msgs/s and 0.70 MB/s, 0 produce failures, 243 in queue, snappy compression
% 140041 messages produced (5601640 bytes), 139842 delivered (offset 526585520, 0 failed) in 8000ms: 17479 msgs/s and 0.70 MB/s, 0 produce failures, 199 in queue, snappy compression
% 155852 messages produced (6234080 bytes), 155668 delivered (offset 526601346, 0 failed) in 9001ms: 17294 msgs/s and 0.69 MB/s, 0 produce failures, 186 in queue, snappy compression
% 170970 messages produced (6838800 bytes), 170822 delivered (offset 526616500, 0 failed) in 10001ms: 17080 msgs/s and 0.68 MB/s, 0 produce failures, 148 in queue, snappy compression

It averages a little over 17k msgs/s and still uses ~180% CPU.

In my prod task, I saw significantly slower produce (around 4x slower) and increased CPU usage (about 2x).

@shanson7
Contributor

shanson7 commented Sep 5, 2019

Here's a run with master (1.2.0-RC3)

% 26183 messages produced (1047320 bytes), 25724 delivered (offset 526684879, 0 failed) in 1000ms: 25723 msgs/s and 1.03 MB/s, 0 produce failures, 459 in queue, snappy compression
% 46333 messages produced (1853320 bytes), 46300 delivered (offset 526705455, 0 failed) in 2001ms: 23135 msgs/s and 0.93 MB/s, 0 produce failures, 34 in queue, snappy compression
% 62428 messages produced (2497120 bytes), 62290 delivered (offset 526721445, 0 failed) in 3001ms: 20754 msgs/s and 0.83 MB/s, 0 produce failures, 138 in queue, snappy compression
% 84142 messages produced (3365680 bytes), 83946 delivered (offset 526743101, 0 failed) in 4001ms: 20979 msgs/s and 0.84 MB/s, 0 produce failures, 196 in queue, snappy compression
% 103988 messages produced (4159520 bytes), 103771 delivered (offset 526762926, 0 failed) in 5002ms: 20743 msgs/s and 0.83 MB/s, 0 produce failures, 221 in queue, snappy compression
% 126828 messages produced (5073120 bytes), 126722 delivered (offset 526785877, 0 failed) in 6002ms: 21110 msgs/s and 0.84 MB/s, 0 produce failures, 106 in queue, snappy compression
% 149660 messages produced (5986400 bytes), 149454 delivered (offset 526808609, 0 failed) in 7002ms: 21342 msgs/s and 0.85 MB/s, 0 produce failures, 206 in queue, snappy compression
% 172683 messages produced (6907320 bytes), 172353 delivered (offset 526831508, 0 failed) in 8002ms: 21536 msgs/s and 0.86 MB/s, 0 produce failures, 330 in queue, snappy compression
% 194450 messages produced (7778000 bytes), 194287 delivered (offset 526853442, 0 failed) in 9003ms: 21580 msgs/s and 0.86 MB/s, 0 produce failures, 163 in queue, snappy compression

Which looks to be a little over 21k msgs/s and uses only 170% CPU.

@shanson7
Contributor

shanson7 commented Sep 5, 2019

@edenhill - Do you do any performance tests when cutting releases? If so, are the numbers reported anywhere? It would be nice to have them to look at when considering an upgrade.

Edit: I guess there would have to be a lot of "flavors" (producer, consumer, message size, compression, etc)

@shanson7
Contributor

shanson7 commented Sep 5, 2019

My issue ended up being that I was setting acks on the producer, not the topic. Oddly, this didn't fail, it just didn't seem to work. I'm going to retry 1.1.0 with acks=1 for my task.

As far as the slowdown in ./examples/rdkafka_performance between versions, a co-worker tested them on a larger cluster and it seemed that 1.1.0 was slightly faster (when acks were set to 1), so it's possible that might be a red herring.

Will report back when I finish my rollout
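
To illustrate the gotcha described above with a hedged sketch (not shanson7's actual code): request.required.acks (alias acks) is historically a topic-level property in librdkafka, so setting it only on the global producer configuration may be accepted without error yet not take effect, depending on the librdkafka version and on whether an explicit topic configuration overrides the default. Applying it on the topic handle looks roughly like this:

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

/* Sketch: attach acks=1 to the topic handle used for producing, instead of
 * (or in addition to) the global producer conf. */
static rd_kafka_topic_t *open_topic_with_acks1(rd_kafka_t *rk, const char *name) {
        char errstr[512];
        rd_kafka_topic_conf_t *tconf = rd_kafka_topic_conf_new();

        if (rd_kafka_topic_conf_set(tconf, "request.required.acks", "1",
                                    errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
                fprintf(stderr, "topic conf: %s\n", errstr);

        /* The topic handle takes ownership of tconf */
        return rd_kafka_topic_new(rk, name, tconf);
}
```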

@shanson7
Contributor

shanson7 commented Sep 6, 2019

Alright, even setting the topic level config to acks=1 I am seeing way higher CPU usage and way lower throughput. I'm not sure what else I can try to see what's slowing everything down.

@edenhill
Contributor

edenhill commented Sep 7, 2019

We try hard not to have performance regressions, but since the Kafka protocol keeps growing and more features are being added, some slow decline in performance is natural; hopefully it is offset by the increase in compute power.

@edenhill
Contributor

edenhill commented Sep 7, 2019

When doing performance testing, make sure to adjust linger.ms upwards to at least 5-10ms to allow for proper batch accumulation.
Small batches lead to too much overhead.
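
As a small illustrative sketch of that advice (the 10 ms value is just an example; queue.buffering.max.ms is the property name used elsewhere in this thread, with linger.ms as its newer alias in recent librdkafka versions):

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

/* Sketch: give the producer 5-10 ms to accumulate batches when benchmarking. */
static void set_linger_for_benchmark(rd_kafka_conf_t *conf) {
        char errstr[512];
        if (rd_kafka_conf_set(conf, "queue.buffering.max.ms", "10",
                              errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
                fprintf(stderr, "%s\n", errstr);
}
```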

@shanson7
Contributor

shanson7 commented Sep 7, 2019

In production we have queue.buffering.max.ms=100 and we see most message sets are 10000 messages. So I don't think it's a small batch issue.

The problem with this performance degradation is that it's massive, and spread across our rather large cluster it amounts to a lot of additional resources.

Unfortunately, we had to upgrade to get a bug fix we needed for another issue and now we find ourselves a little stuck.

To be clear, the issue is 2-fold:

  1. The synchronous part of produce is significantly slower, meaning we can no longer keep up with our workload (cluster-wide we produce 6 million msgs/sec).
  2. The additional CPU usage means we need to acquire more hardware just to upgrade this library.

I've also run this against 2 different Kafka clusters, one that supports Message V2 and one that doesn't, and it makes no difference.

@Eliyahu-Machluf
Contributor Author

I tried the rdkafka performance tool in my environment, and here are the results:

For librdkafka 0.11.0:
rdkafka_performance.exe -P -t my_topic -b eliyahum-kafka13.dockernet -a 1 -s 20 -X api.version.request=true -X message.max.bytes=1000000 -X queue.buffering.max.ms=1000 -X request.required.acks=1
% Sending messages of size 20 bytes
% 500000 messages produced (10000000 bytes), 0 delivered (offset 0, 0 failed) in 1000ms: 0 msgs/s and 0.00 MB/s, 74 produce failures, 500000 in queue, no compression
% 1632715 messages produced (32654300 bytes), 1144707 delivered (offset 0, 0 failed) in 2000ms: 572353 msgs/s and 11.45 MB/s, 177 produce failures, 488008 in queue, no compression
% 3174518 messages produced (63490360 bytes), 2810275 delivered (offset 0, 0 failed) in 3000ms: 936758 msgs/s and 18.74 MB/s, 221 produce failures, 364244 in queue, no compression
% 4322647 messages produced (86452940 bytes), 3823616 delivered (offset 0, 0 failed) in 4000ms: 955904 msgs/s and 19.12 MB/s, 274 produce failures, 499031 in queue, no compression
% 5905364 messages produced (118107280 bytes), 5632328 delivered (offset 0, 0 failed) in 5000ms: 1126465 msgs/s and 22.53 MB/s, 349 produce failures, 273036 in queue, no compression
% 7244947 messages produced (144898940 bytes), 6744947 delivered (offset 0, 0 failed) in 6000ms: 1124157 msgs/s and 22.48 MB/s, 365 produce failures, 500000 in queue, no compression
% 7718276 messages produced (154365520 bytes), 7234947 delivered (offset 0, 0 failed) in 7000ms: 1033563 msgs/s and 20.67 MB/s, 447 produce failures, 483329 in queue, no compression
% 9149667 messages produced (182993340 bytes), 9131158 delivered (offset 0, 0 failed) in 8000ms: 1141394 msgs/s and 22.83 MB/s, 450 produce failures, 18509 in queue, no compression
% 9765859 messages produced (195317180 bytes), 9265859 delivered (offset 0, 0 failed) in 9000ms: 1029539 msgs/s and 20.59 MB/s, 518 produce failures, 500000 in queue, no compression
% 11094518 messages produced (221890360 bytes), 10718341 delivered (offset 0, 0 failed) in 10000ms: 1071834 msgs/s and 21.44 MB/s, 563 produce failures, 376177 in queue, no compression
% 12563353 messages produced (251267060 bytes), 12063353 delivered (offset 0, 0 failed) in 11000ms: 1096668 msgs/s and 21.93 MB/s, 569 produce failures, 500000 in queue, no compression
% 12893353 messages produced (257867060 bytes), 12403353 delivered (offset 0, 0 failed) in 12000ms: 1033612 msgs/s and 20.67 MB/s, 669 produce failures, 490000 in queue, no compression
% 14297603 messages produced (285952060 bytes), 14246149 delivered (offset 0, 0 failed) in 13000ms: 1095857 msgs/s and 21.92 MB/s, 683 produce failures, 51454 in queue, no compression
% 14826330 messages produced (296526600 bytes), 14326330 delivered (offset 0, 0 failed) in 14000ms: 1023309 msgs/s and 20.47 MB/s, 756 produce failures, 500000 in queue, no compression
% 16186189 messages produced (323723780 bytes), 16030246 delivered (offset 0, 0 failed) in 15000ms: 1068683 msgs/s and 21.37 MB/s, 790 produce failures, 155943 in queue, no compression
% 16846609 messages produced (336932180 bytes), 16346609 delivered (offset 0, 0 failed) in 16000ms: 1021663 msgs/s and 20.43 MB/s, 852 produce failures, 500000 in queue, no compression
% 18068203 messages produced (361364060 bytes), 17597609 delivered (offset 0, 0 failed) in 17000ms: 1035153 msgs/s and 20.70 MB/s, 938 produce failures, 470594 in queue, no compression
% 944 backpressures for 19382965 produce calls: 0.005% backpressure rate
% 19282567 messages produced (385651340 bytes), 19282567 delivered (offset 0, 0 failed) in 19047ms: 1012367 msgs/s and 20.25 MB/s, 944 produce failures, 0 in queue, no compression

For librdkafka 1.1.0:

rdkafka_performance.exe -P -t my_topic -b eliyahum-kafka13.dockernet -a 1 -s 20 -X api.version.request=true -X message.max.bytes=1000000 -X queue.buffering.max.ms=1000 -X request.required.acks=1
% Sending messages of size 20 bytes
% 806014 messages produced (16120280 bytes), 790000 delivered (offset 641645489, 0 failed) in 1000ms: 790000 msgs/s and 15.80 MB/s, 0 produce failures, 16014 in queue, no compression
% 1527021 messages produced (30540420 bytes), 1520000 delivered (offset 642375489, 0 failed) in 2000ms: 760000 msgs/s and 15.20 MB/s, 0 produce failures, 7021 in queue, no compression
% 2128733 messages produced (42574660 bytes), 2120000 delivered (offset 642975489, 0 failed) in 3000ms: 706665 msgs/s and 14.13 MB/s, 0 produce failures, 8733 in queue, no compression
% 2686567 messages produced (53731340 bytes), 2680000 delivered (offset 643535489, 0 failed) in 4000ms: 669994 msgs/s and 13.40 MB/s, 0 produce failures, 6567 in queue, no compression
% 3266046 messages produced (65320920 bytes), 3230000 delivered (offset 644085489, 0 failed) in 5000ms: 645995 msgs/s and 12.92 MB/s, 0 produce failures, 36046 in queue, no compression
% 3873850 messages produced (77477000 bytes), 3860000 delivered (offset 644715489, 0 failed) in 6002ms: 643106 msgs/s and 12.86 MB/s, 0 produce failures, 13850 in queue, no compression
% 4488769 messages produced (89775380 bytes), 4480000 delivered (offset 645335489, 0 failed) in 7002ms: 639803 msgs/s and 12.80 MB/s, 0 produce failures, 8769 in queue, no compression
% 5113808 messages produced (102276160 bytes), 5100000 delivered (offset 645955489, 0 failed) in 8002ms: 637329 msgs/s and 12.75 MB/s, 0 produce failures, 13808 in queue, no compression
% 5725249 messages produced (114504980 bytes), 5710000 delivered (offset 646565489, 0 failed) in 9002ms: 634293 msgs/s and 12.69 MB/s, 0 produce failures, 15249 in queue, no compression
% 6320369 messages produced (126407380 bytes), 6310000 delivered (offset 647165489, 0 failed) in 10002ms: 630864 msgs/s and 12.62 MB/s, 0 produce failures, 10369 in queue, no compression
% 6660000 messages produced (133331960 bytes), 6660000 delivered (offset 647515489, 0 failed) in 10622ms: 626964 msgs/s and 12.54 MB/s, 0 produce failures, 0 in queue, no compression

To summarize:
For 0.11.0 the average is about 1,000,000 msgs/second, while for 1.1.0 the average is about 650,000 msgs/second.

I've run the test several times and validated that, in my environment, librdkafka 0.11.0 really is much faster than librdkafka 1.1.0.

@edenhill
Contributor

edenhill commented Sep 8, 2019

On Linux with broker version 2.3.0 I get quite the opposite results:

librdkafka v0.11.0
% 10000000 messages produced (200000000 bytes), 10000000 delivered (offset 0, 0 failed) in 20964ms: 476992 msgs/s and 9.54 MB/s, 1523 produce failures, 0 in queue, no compression

librdkafka v1.1.0
% 10000000 messages produced (200000000 bytes), 10000000 delivered (offset 37850000, 0 failed) in 5762ms: 1735414 msgs/s and 34.71 MB/s, 0 produce failures, 0 in queue, no compression

@Eliyahu-Machluf
Contributor Author

Interesting.
Maybe it is related to the fact that I'm using a virtual machine as the client, and it is not a strong machine. Another possible cause might be the platform, as I run Windows and you run Linux.
@shanson7, can you run the same test on your machine with all the flags I used and share the results?

@Eliyahu-Machluf
Contributor Author

I've now run the same test on my personal machine (Windows 10), which is a strong one (Intel i7-8550U CPU, 32GB RAM). I still see that 1.1.0 is slower than 0.11.0, though not by the same factor: on the virtual machine 1.1.0 ran at roughly half the speed of 0.11.0, while on my personal machine it runs at roughly 0.8x.

I think it is related to the platform, i.e. running librdkafka on Windows shows the performance degradation.

here are my results:
for 0.11.0
% 51861975 messages produced (1037239500 bytes), 51361975 delivered (offset 0, 0 failed) in 76016ms: 675673 msgs/s and 13.51 MB/s, 3161 produce failures, 500000 in queue, no compression

for 1.1.0
% 29378813 messages produced (587576260 bytes), 29370000 delivered (offset 958568477, 0 failed) in 57003ms: 515233 msgs/s and 10.30 MB/s, 0 produce failures, 8813 in queue, no compression

I've also noticed that the number of messages in the queue differs between 0.11.0 and 1.1.0.
With 0.11.0 I see 500K messages in the queue, while with 1.1.0 there were only ~8K, and this held throughout the whole performance session: the 0.11.0 queue stayed large while the 1.1.0 queue stayed much smaller.

@shanson7
Contributor

shanson7 commented Sep 8, 2019

I see the issue on Linux, so I don't think it's related to the platform. I think it's possible it's related to available CPU, since in my measurements it's CPU time that is significantly heavier in 1.1.0.

I'm also using a 1.1.0 broker, which may have something to do with it.

Tomorrow, I plan on testing it out with multiple produce threads, to more accurately reproduce my issue.

@edenhill
Contributor

edenhill commented Sep 9, 2019

Can you please try out the latest master? It has a producer performance improvement that affects core congestion: e8b1c06

@Eliyahu-Machluf
Contributor Author

Eliyahu-Machluf commented Sep 9, 2019

There is a big improvement (comparing master to 1.1.0), but 0.11.0 is still much faster.

Here are the results:

0.11.0
% 49251056 messages produced (985021120 bytes), 48758536 delivered (offset 0, 0 failed) in 34000ms: 1434074 msgs/s and 28.68 MB/s, 5333 produce failures, 492520 in queue, no compression

master (a12b909)
% 36262600 messages produced (725252000 bytes), 36250000 delivered (offset 1509552908, 0 failed) in 36003ms: 1006835 msgs/s and 20.14 MB/s, 0 produce failures, 12600 in queue, no compression

1.1.0
% 25136731 messages produced (502734620 bytes), 25130000 delivered (offset 1584602179, 0 failed) in 34007ms: 738957 msgs/s and 14.78 MB/s, 0 produce failures, 6731 in queue, no compression

  • Tests were done on a virtual machine running Windows Server 2016

On my personal computer, which is a strong one running Windows 10, I see similar results: master has improved performance compared to 1.1.0, but 0.11.0 still wins by far.

0.11.0
% 58158945 messages produced (1163178900 bytes), 57658945 delivered (offset 0, 0 failed) in 38000ms: 1517340 msgs/s and 30.35 MB/s, 2620 produce failures, 500000 in queue, no compression

master
% 39420871 messages produced (788417420 bytes), 39410000 delivered (offset 1624242179, 0 failed) in 31004ms: 1271101 msgs/s and 25.42 MB/s, 12 produce failures, 10871 in queue, no compression

1.1.0
% 29749421 messages produced (594988420 bytes), 29730000 delivered (offset 1715084616, 0 failed) in 30003ms: 990882 msgs/s and 19.82 MB/s, 0 produce failures, 19421 in queue, no compression

@shanson7
Contributor

shanson7 commented Sep 9, 2019

I don't want to lose track of the fact that, aside from being slower, the newer versions also show a significantly larger CPU load.

My previous tests were run in Docker on my Mac, so I decided to try one of our stage boxes, an otherwise idle 56-core RHEL7 box.
I see this (output snipped in the middle):

0.11.5

time ./rdkafka_performance_0_11_5 -P -t perf_test -b kafka:9092 -a 1 -s 20 -X api.version.request=true -X message.max.bytes=1000000 -X queue.buffering.max.ms=1000 -X request.required.acks=1 -c 10000000
% Sending 10000000 messages of size 20 bytes
% 500000 messages produced (10000000 bytes), 0 delivered (offset 0, 0 failed) in 1000ms: 0 msgs/s and 0.00 MB/s, 73 produce failures, 500000 in queue, no compression
...SNIP...
% 10000000 messages produced (200000000 bytes), 9510001 delivered (offset 0, 0 failed) in 21707ms: 438098 msgs/s and 8.76 MB/s, 1214 produce failures, 490011 in queue, no compression
% 1214 backpressures for 10000000 produce calls: 0.012% backpressure rate
% 10000000 messages produced (200000000 bytes), 10000000 delivered (offset 0, 0 failed) in 21888ms: 456857 msgs/s and 9.14 MB/s, 1214 produce failures, 0 in queue, no compression

real	0m22.001s
user	0m15.070s
sys	0m1.687s

1.1.0

time ./rdkafka_performance_1_1_0 -P -t perf_test -b kafka:9092 -a 1 -s 20 -X api.version.request=true -X message.max.bytes=1000000 -X queue.buffering.max.ms=1000 -X request.required.acks=1 -c 10000000
% Sending 10000000 messages of size 20 bytes
% 370245 messages produced (7404900 bytes), 360000 delivered (offset 3233726, 0 failed) in 1000ms: 359999 msgs/s and 7.20 MB/s, 0 produce failures, 10245 in queue, no compression
...SNIP...
% 10000000 messages produced (200000000 bytes), 10000000 delivered (offset 4035185, 0 failed) in 25973ms: 385004 msgs/s and 7.70 MB/s, 0 produce failures, 0 in queue, no compression

real	0m25.978s
user	0m32.974s
sys	0m34.566s

1.2.0-RC3

time ./rdkafka_performance_1_2_0_RC -P -t perf_test -b kafka:9092 -a 1 -s 20 -X api.version.request=true -X message.max.bytes=1000000 -X queue.buffering.max.ms=1000 -X request.required.acks=1 -c 10000000
% Sending 10000000 messages of size 20 bytes
% 419598 messages produced (8391960 bytes), 360000 delivered (offset 4897776, 0 failed) in 1000ms: 359999 msgs/s and 7.20 MB/s, 0 produce failures, 59598 in queue, no compression
...SNIP...
% 10000000 messages produced (200000000 bytes), 9988591 delivered (offset 5701301, 0 failed) in 24013ms: 415960 msgs/s and 8.32 MB/s, 0 produce failures, 11409 in queue, no compression
% 10000000 messages produced (200000000 bytes), 10000000 delivered (offset 5701273, 0 failed) in 24040ms: 415963 msgs/s and 8.32 MB/s, 0 produce failures, 0 in queue, no compression

real	0m24.047s
user	0m31.017s
sys	0m33.820s

This shows that 0.11.5 uses significantly fewer CPU resources (look particularly at the system CPU time) than later versions, while also being faster.

Version             Avg Throughput   Wall Time   User Time   Sys Time   CPU % ((User+Sys)/Wall)
0.11.5              456857 msg/s     22.001s     15.07s      1.687s     76.2%
1.1.0               385004 msg/s     25.978s     32.974s     34.566s    260%
1.2.0-RC3 (master)  415963 msg/s     24.047s     31.017s     33.820s    270%

@edenhill
Contributor

edenhill commented Sep 9, 2019

I find these times very surprising, so from what I understand:

  • real = wallclock time
  • user = application CPU time, e.g., librdkafka
  • sys = kernel CPU time, e.g., syscalls

This seems to indicate that while newer versions roughly double their application CPU time, they increase the syscall time almost 30-fold.

A profiler should be able to give some ideas on where this time is being spent.

Also, please specify how many CPU cores your test system has.

@shanson7
Contributor

shanson7 commented Sep 9, 2019

@edenhill - Sorry, I just edited my comment to include that information. The test box has 56 cores and is pretty much idle even when running these tests (since only 1 thread is producing). I will try to profile using valgrind, though if some of the CPU usage is due to spin polling, the different ratio of network to CPU speed might not show it.

@shanson7
Contributor

shanson7 commented Sep 9, 2019

Also note that we are still using broker 1.1.0

@shanson7
Contributor

shanson7 commented Sep 9, 2019

I tried to use perf to profile, but I'm missing some symbols in the heavy CPU functions. Trying to figure out how to find them all.

@edenhill - Do you also see the additional CPU usage with the newer version?

@shanson7
Contributor

shanson7 commented Sep 9, 2019

@edenhill - I think I may be on to something. In 0.11.5, setting queue.buffering.max.ms=1000 uses far less CPU than setting queue.buffering.max.ms=999, even though the time difference is only 1 millisecond. I run with queue.buffering.max.ms=1000 in production. Is it possible that something is converting this value into seconds, so that values < 1000 become 0 and cause a spin loop?

@shanson7
Contributor

shanson7 commented Sep 9, 2019

I'm trying to dig through the code, but I'll admit that I'm not overly familiar with it. I suspect there are 2 interesting behaviors here:

  1. In 0.11.x, setting queue.buffering.max.ms <= 999 causes some sort of busy loop when producing.
  2. In 1.x, no matter what queue.buffering.max.ms is set to, we get the same busy looping.

@shanson7
Contributor

shanson7 commented Sep 10, 2019

OK, a little more info. When WAKEUPFD is enabled, I see that poll is frequently terminated after little to no sleep.

In 0.11.5 this happens here: https://github.com/edenhill/librdkafka/blob/v0.11.5/src/rdkafka_broker.c#L3821 when buffering_max_ms < socket_blocking_max_ms, which explains the behavior we are seeing.

In 1.1.0 this seems to always be enabled.

I'm not entirely sure that I've got this straight, but this seems to cause poll to terminate whenever a single message is enqueued, even when thousands more messages could still be added to fill up a message set. @edenhill does this sound right?

@edenhill
Contributor

edenhill commented Sep 10, 2019

(This is producing to a topic with 4 partitions on 3 brokers).

v0.11.5

$ $(which time) examples/rdkafka_performance -b $BROKERS -P -t test -s 20 -X message.max.bytes=1000000 -X queue.buffering.max.ms=1000 -X request.required.acks=1 -c 10000000
....
% 1461 backpressures for 10000000 produce calls: 0.015% backpressure rate
% 10000000 messages produced (200000000 bytes), 10000000 delivered (offset 0, 0 failed) in 22007ms: 454385 msgs/s and 9.09 MB/s, 1461 produce failures, 0 in queue, no compression

10.32user 0.34system 0:22.12elapsed 48%CPU (0avgtext+0avgdata 108404maxresident)k
0inputs+0outputs (0major+127839minor)pagefaults 0swaps

master/v1.2.0-RC3

$ $(which time) examples/rdkafka_performance -b $BROKERS -P -t test -s 20 -X message.max.bytes=1000000 -X queue.buffering.max.ms=1000 -X request.required.acks=1 -c 10000000
....
% 10000000 messages produced (200000000 bytes), 10000000 delivered (offset 4999710, 0 failed) in 16115ms: 620524 msgs/s and 12.41 MB/s, 0 produce failures, 0 in queue, no compression

21.86user 18.59system 0:16.12elapsed 250%CPU (0avgtext+0avgdata 20148maxresident)k
0inputs+0outputs (0major+14489minor)pagefaults 0swaps

master is faster, and definitely uses a lot more CPU: 48% vs 250%

Same test but with mutrace (measuring lock contention) and just 1M messages:

v0.11.5

mutrace: Showing 10 most contended mutexes:

rktp_lock total = 2999M

total=13500M locks

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
     494  4000245      236      150      330.207        0.000        0.041 Mx.--.  rkt_lock
     536       64       29       18       89.973        1.406       13.301 Wx...r  rk_lock
     495  1000030       37       14     4982.760        0.005        1.912 Wx...r  rkt_app_lock
     400   499885       16        9       72.445        0.000        2.032 Mx.--.  rkb_lock
     116   500552       81        8      179.584        0.000        0.045 Mx.--.  rktp_lock
     114  1001016       91        7       82.586        0.000        0.013 Mx.--.  rktp_lock
     125   500613       85        7      179.503        0.000        0.021 Mx.--.  rktp_lock
     515   249859       16        6       44.557        0.000        1.784 Mx.--.  rkb_avg
     134   999538       79        5       82.501        0.000        0.019 Mx.--.  rktp_lock
     574   250283       16        5       43.318        0.000        1.808 Mx.--.  rkb_avg
     ...      ...      ...      ...          ...          ...          ... ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC 

mutrace: Note that the flags column R is only valid in --track-rt mode!

mutrace: Total runtime is 9296.681 ms.

mutrace: Results for SMP with 8 processors.

master

mutrace: Showing 10 most contended mutexes:

rktp_lock total = 3137M
rkq_lock = 4030M

total=8354M locks

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
     228   883087   390582    46269      268.051        0.000        0.103 Mx.--.  rktp_lock
     213   880860   388376    45802      265.925        0.000        0.155 Mx.--.  rktp_lock
     360  1533036   753787    41006      636.886        0.000        0.101 Mx.--.  rkb_ops.rkq_lock
     223   688345   373975    10019      233.197        0.000        0.028 Mx.--.  rktp_lock
     216   686236   375309     9446      232.473        0.000        0.021 Mx.--.  rktp_lock
     398   750617   375289       46      355.629        0.000        0.035 Mx.--.  rkb_ops.rkq_lock
     377   747941   373953       45      355.663        0.000        0.030 Mx.--.  rkb_ops.rkq_lock
     211  1000143      231       33      182.393        0.000        1.023 Mx.--.  rk_rep.rkq_lock
     217  1000027       31       14     4141.142        0.004        1.881 Wx...r  rkt_lock
     392   187745       49        8       39.881        0.000        2.590 Mx.--.  rkb_lock
     ...      ...      ...      ...          ...          ...          ... ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC 

mutrace: Note that the flags column R is only valid in --track-rt mode!

mutrace: Total runtime is 8072.205 ms.

mutrace: Results for SMP with 8 processors.

We see that v0.11.5 does a lot more locking, but master has a lot more lock contention.

I think your analysis pointing to the wakeup fd is correct.
The fd-based wakeup (of the broker thread) kicks in when rktp_msgq is empty, but since the broker thread moves all of rktp_msgq to its own rktp_xmit_msgq on each broker wakeup, it is likely that this wakeup kicks in quite often.
As you've identified, on v0.11.5 the wakeup was only enabled if max.blocking.ms < socket.blocking.max.ms, while on master it is always enabled (to allow low latency).

The wakeup is done by a write() to a pipe, which would account for the huge increase in syscall (sys) time we see on master, and since the write is performed while holding the queue lock [1] (and possibly the partition lock), it adds to, and explains, the extra contention.

[1]: the lock is held while calling write() to ensure that the fd is not closed (which is ok) and re-created (definitely not ok) between obtaining it and calling write()

I'll need to think through what the best solution is for this without causing any regression in thruput or latency.
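
To make that concrete, here is a heavily simplified sketch of the pattern being described, with illustrative names only (this is not librdkafka's actual code): the producing application thread wakes the broker thread through a pipe fd whenever it enqueues onto an empty queue, and the write() is performed while the queue lock is held.

```c
#include <pthread.h>
#include <unistd.h>

struct msgq {
        pthread_mutex_t lock;
        int cnt;         /* messages currently queued */
        int wakeup_fd;   /* write end of the pipe the broker thread poll()s */
};

static void msgq_enqueue(struct msgq *q /*, const struct msg *m */) {
        pthread_mutex_lock(&q->lock);
        int was_empty = (q->cnt++ == 0);
        if (was_empty) {
                /* Wake the (possibly sleeping) broker thread. Because the
                 * broker thread drains the whole queue on every wakeup, the
                 * queue is empty again almost every time we get here, so
                 * nearly every produced message costs one write() syscall,
                 * performed while the queue lock is held. */
                char b = 1;
                (void)write(q->wakeup_fd, &b, 1);
        }
        pthread_mutex_unlock(&q->lock);
}
```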

@shanson7
Contributor

Can't users still control the latency bounds by using linger.ms? If they want very low latency, they can set that to a low number.

@shanson7
Contributor

shanson7 commented Sep 10, 2019

I tested a patch that changed this line to queue_len >= 100 and reran my tests:

1.1.0 (no patch)

time  ./rdkafka_performance_1_1_0 -P -t perf_test -bkafka:9092 -a 1 -s 20  -X queue.buffering.max.ms=1000 -c 10000000
...SNIP...
% 10000000 messages produced (200000000 bytes), 10000000 delivered (offset 50384443, 0 failed) in 25251ms: 396008 msgs/s and 7.92 MB/s, 0 produce failures, 0 in queue, no compression

real	0m25.257s
user	0m32.478s
sys	0m33.500s

1.1.0 (qlen>=100 patch)

time  ./rdkafka_performance-1.1.0.q100 -P -t perf_test -b kafka:9092 -a 1 -s 20  -X queue.buffering.max.ms=1000 -c 10000000
...SNIP...
% 10000000 messages produced (200000000 bytes), 10000000 delivered (offset 51228320, 0 failed) in 8120ms: 1231454 msgs/s and 24.63 MB/s, 0 produce failures, 0 in queue, no compression

real	0m8.126s
user	0m10.850s
sys	0m0.965s

This not only shows that the patched version uses way less CPU, but also that it is insanely faster (396008 msgs/s and 7.92 MB/s vs 1231454 msgs/s and 24.63 MB/s). I will caveat this with the fact that I did not test this solution for correctness.

@edenhill
Contributor

Interesting!

So one way to fix this might be to limit the rate of fd-based wakeups based on linger.ms; I'll make a patch and see how it flies.
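
A rough sketch of that idea, continuing the simplified model from the earlier comment (again illustrative only, not the actual ratewakeups patch): suppress the fd-based wakeup unless at least linger.ms has passed since the previous one, since the broker thread will wake up within linger.ms on its own anyway.

```c
#include <pthread.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

struct msgq {
        pthread_mutex_t lock;
        int cnt;
        int wakeup_fd;
        int64_t linger_us;       /* linger.ms / queue.buffering.max.ms in microseconds */
        int64_t next_wakeup_us;  /* earliest time another fd wakeup is allowed */
};

static int64_t now_us(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

static void msgq_enqueue_ratelimited(struct msgq *q /*, const struct msg *m */) {
        pthread_mutex_lock(&q->lock);
        q->cnt++;
        int64_t now = now_us();
        if (now >= q->next_wakeup_us) {
                char b = 1;
                (void)write(q->wakeup_fd, &b, 1);   /* at most one wakeup per linger interval */
                q->next_wakeup_us = now + q->linger_us;
        }
        /* Otherwise skip the syscall: the broker thread wakes within linger.ms
         * on its own timer, so batching improves and latency stays bounded. */
        pthread_mutex_unlock(&q->lock);
}
```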

@edenhill
Contributor

Can you try out the ratewakeups branch?

@shanson7
Contributor

shanson7 commented Sep 10, 2019

Not quite as pronounced a difference as the above change (likely due to additional queueing ops still), but nearly twice the throughput for a quarter of the CPU!

master (1.2.0RC3, no patch)

time  ./rdkafka_performance-1.2.0RC3 -P -t perf_test -b kafka:9092 -a 1 -s 20  -X queue.buffering.max.ms=1000 -c 10000000
...SNIP...
% 10000000 messages produced (200000000 bytes), 10000000 delivered (offset 53105365, 0 failed) in 26340ms: 379646 msgs/s and 7.59 MB/s, 0 produce failures, 0 in queue, no compression

real	0m26.347s
user	0m31.435s
sys	0m34.868s

ratewakeups branch

time  ./rdkafka_performance-1.2.0RC3.patched -P -t perf_test -b kafka:9092 -a 1 -s 20  -X queue.buffering.max.ms=1000 -c 10000000
...SNIP...
% 657 backpressures for 10000000 produce calls: 0.007% backpressure rate
% 10000000 messages produced (200000000 bytes), 10000000 delivered (offset 53921465, 0 failed) in 14460ms: 691518 msgs/s and 13.83 MB/s, 657 produce failures, 0 in queue, no compression

real	0m14.465s
user	0m12.679s
sys	0m0.568s

@edenhill
Contributor

Nice! Sys time is all gone.
The thruput numbers are better than v0.11.5, right?
How about CPU usage?
v0.11.5 vs ratewakeups

@shanson7
Contributor

shanson7 commented Sep 10, 2019

Looks like ratewakeups has about 50% higher throughput than 0.11.5 and about equivalent CPU usage (total CPU is lower in ratewakeups due to shorter runtime!)

@shanson7
Contributor

Also for comparison, I tried the patch with queue.buffering.max.ms=1 and it was even faster, at the cost of more CPU:

time  ./rdkafka_performance-1.2.0RC3.patched -P -t perf_test -b kafka:9092 -a 1 -s 20  -X queue.buffering.max.ms=1 -c 10000000
% Sending 10000000 messages of size 20 bytes
% 1718260 messages produced (34365200 bytes), 1715909 delivered (offset 54906121, 0 failed) in 1000ms: 1715909 msgs/s and 34.32 MB/s, 0 produce failures, 2351 in queue, no compression
% 3470034 messages produced (69400680 bytes), 3466887 delivered (offset 55032857, 0 failed) in 2000ms: 1733443 msgs/s and 34.67 MB/s, 0 produce failures, 3147 in queue, no compression
% 5217994 messages produced (104359880 bytes), 5215059 delivered (offset 55226671, 0 failed) in 3000ms: 1738353 msgs/s and 34.77 MB/s, 0 produce failures, 2935 in queue, no compression
% 6963723 messages produced (139274460 bytes), 6960888 delivered (offset 55365437, 0 failed) in 4000ms: 1740222 msgs/s and 34.80 MB/s, 0 produce failures, 2835 in queue, no compression
% 8724675 messages produced (174493500 bytes), 8723083 delivered (offset 55497546, 0 failed) in 5000ms: 1744616 msgs/s and 34.89 MB/s, 0 produce failures, 1592 in queue, no compression
% 10000000 messages produced (200000000 bytes), 10000000 delivered (offset 55617923, 0 failed) in 5729ms: 1745402 msgs/s and 34.91 MB/s, 0 produce failures, 0 in queue, no compression

real	0m5.733s
user	0m9.023s
sys	0m1.166s

@shanson7
Contributor

Of course, this is testing with a thread publishing as quickly as possible. In real-world scenarios there is likely a small amount of processing in between calls, which might make these numbers differ. But all in all, I'm eager to test out this patch in my real-world case.

@edenhill
Contributor

That's great news, thank you for all your work on this, Sean! 💯

@edenhill
Contributor

We'll delay the upcoming v1.2.0 release to get this fix in, aiming to release on Monday instead of today.

@shanson7
Contributor

Cool! Too bad you never get any issues like "Significantly better performance when upgrading from 1.1.0 to 1.2.0", huh?

@Eliyahu-Machluf
Contributor Author

Hi,

I don't have time to read the entire thread here. Just to summarize: I understand the issue is fixed and the fix will be available in v1.2.0, which will be released soon, right?
Is there a configuration item I can set in 1.1.0 to achieve the same fix?

@edenhill
Contributor

@Eliyahu-Machluf There is no workaround; you will need to wait for v1.2.0.

@Eliyahu-Machluf
Contributor Author

OK. Thanks

@gridaphobe
Contributor

@edenhill any news on the 1.2.0 release?

@edenhill
Contributor

edenhill commented Oct 7, 2019

v1.2.0 was released a couple of weeks ago, but it has a Windows GSSAPI blocker; v1.2.1 is due this week.
