Robustness and resiliency on Azure #3109

amotl · 2020-10-15T18:14:44Z

Hi there,

first things first: Thanks for the tremendous amount of work you are putting into librdkafka, @edenhill. You know who you are.

Introduction

This is not meant to be a specific bug report as we believe the issues we have been experiencing when using librdkafka for connecting to Azure Event Hubs have already been mitigated within librdkafka 1.5.2 and newer. In fact, they might not have been specific to Azure Event Hubs anyway but also might have tripped others when just running Apache Kafka or the Confluent Stack on Azure in general.

Instead, we wanted to share our findings as a wrapup and future reference for others looking for similar issues. In this manner, apologies for not completing the checklist. The issue can well be closed right away.

The topics are spanning the area of Azure networking (problems) in general, as well as things related to Kafka and Kubernetes.

So, here we go.

General research

Azure LB closing idle network connections

The problem here is that Azure network loadbalancing components silently drop idle network connections after 4 minutes.

The LB does not even bother to send RST packets to each of the communication partners, so client and server sockets will most probably try to reuse these dead connections.

In turn, services will be hitting individual socket timeouts or otherwise the Kernel will be doing retransmissions with backoff for another 15+ minutes until it considers the connection to be dead.

Quotes

TL;DR: Azure has a nasty artificial limitation that results in being unable to use long-lived TCP connections that have >= 4 minutes of radio silence at any given point.

They screwed it up so hard that when connection does timeout, they acknowledge the following TCP packets with an ok flag that makes the sender think “everything is okay - the data I sent was received succesfully”, which is 100 % unacceptable way to handle error conditions.

This caused me so much pain and loss of productive work time.

-- https://joonas.fi/2017/01/23/microsoft-azures-networking-is-fundamentally-broken/

Resources

Using Kafka and Event Hubs on Azure

Quotes

"The problem here is that the producer has two TCP connections that can go idle for > 4 mins - at that point, Azure load balancers close out the idle connections. The Kafka client is unaware that the connections have been closed so it attempts to send a batch on a dead connection, which times out, at which point retry kicks in."

-- https://stackoverflow.com/a/58385324

Magnus Edenhill:

Joakim, we've seen a couple of similar reports for users on Azure and we can't really provide an explanation, something stalls the request/response until it times out, and we believe this to be outside the client, so most likely something in Azure.
I recommend opening an issue with the Azure folks.

Joakim Blach Andersen:

I got an answer from Azure:
The service closes idle connections (idle here means no request received for 10 minutes). It happens even when tcp keep-alive is enabled because that config only keeps the connection open but does not generate protocol requests. In your case, you have only observed the error on the idle event hub, it may be related to the idle connection being closed and the client SDK does not handle that correctly (Kafka).

-- #2845

Magnus Edenhill:

We're trying to work around these Azure weak idle disconnects in the upcoming v1.5.0 release by reusing the least idle connection for things like metadata requests, which should keep that connection alive and not cause these idle disconnect request timeouts.

-- confluentinc/confluent-kafka-dotnet#1305 (comment)

Resources

With kind regards,
Andreas.

amotl · 2020-10-15T18:19:37Z

Hi again,

based on the findings outlined above, I compiled a list of suggestions for our team the other day, which I also wanted to share here.

With kind regards,
Andreas.

Proposals

Upgrade to librdkafka 1.5.0

Based on findings from others when running Kafka on Azure and especially when using Azure Event Hubs with librdkafka, Magnus Edenhill, the author of librdkafka, implemented some fixes between version 1.3.0 and 1.5.0.

Magnus Edenhill:

produce/consume hang after partition goes away and comes back which is fixed in v1.4.2

-- confluentinc/confluent-kafka-python#899 (comment)

Magnus Edenhill:

As for the idle connection closing, I'm suspecting that ~~their~~ [the Azure] load balancers are silently dropping the connection without sending an RST or FIN to the client, so there is no way for the client TCP stack to know the connection was closed until a request (or TCP keepalive) is sent.
If the connection was properly TCP-closed the client would get notified instantly of the close.

The fix in 0527457 will go into v1.5.0 which is scheduled for early June. It will prefer to use the least idle connection when selecting "any" broker connection for metadata refresh, etc. This in combination with a topic.metadata.refresh.interval.ms < the LB/Kafka idle connection timeout should hopefully minimize the number of these timeout-due-to-killed-connection errors.

Prefer least-idle connection when randomly selecting broker for metadata queries, etc. (#2845)

Reasons:
- this connection is most likely to function properly.
- allows truly idle connections to be killed by the broker's/LB's
  idle connection reaper.

This will fix issues like "Metadata request timed out" when the periodic
metadata refresh picks an idle (typically bootstrap) connection which
has exceeded the load-balancer's idle time (which silently kills the
connection without sending FIN/RST to the client).

-- via: #2845

Apply recommended settings for Kafka on Azure

Charles Culver:

After finding recommended configuration settings for Azure Event Hubs (Microsoft's own Kafka cloud implementation) here, we added the following settings to consumers and producers and are seeing success.

-- #2845 (comment)

Here are the recommended configurations for using Azure Event Hubs from Apache Kafka client applications:

amotl · 2020-10-15T18:23:04Z

Kernel TCP settings for Azure VMs

In the same spirit, I would also like to outline some recommended settings for mitigating the "Azure LB closing idle network connections" on the host level.

These settings probably might be applied to the nodes (VMs) as well as the Kubernetes PODs/containers.

net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 8

Resources

Azure closing idle network connections
TCP settings for Azure VMs by @wbuchwalter and @telmosampaio
TCP settings for Azure VMs from the MSDN forums

amotl · 2020-10-15T18:41:10Z

Following all of these observations, we planned to upgrade to librdkafka 1.5.0 back then. However, ...

There has been a regression in 1.5.0 regarding roundrobin crashes reported through #3024 and confluentinc/confluent-kafka-dotnet#1366 as well as confluentinc/confluent-kafka-dotnet#1369, which was designated as a critical issue.

@mhowlett reported on behalf of @wwarby (confluentinc/confluent-kafka-dotnet#1366 (comment)):

Not sure if mine is the same issue or related in any way, but I've just run into massive instability after upgrading some high throughput C# apps from v 1.4.4 to 1.5.0. Rolled back to 1.4.4 and it solved the problem.

In my case, I'm consuming data from topics that already exist and I don't have topic auto-creation enabled - I'm running multiple instances of the C# app, each instantiating a single consumer to join the consumer group.

As we haven't been sure whether we might run into this on our production systems, we considered to be better off including the fix from #3049. However, back then, the next official release was said to be 1.6.0 happening on October 15, 2020, so we decided to go for librdkafka-1.5.0-e320a2d making our systems happy so far.

Thank you so much!

Now, we are happy to see the original milestone 1.6.0 was repurposed to milestone 1.5.2 with a corresponding pre-release of v1.5.2-RC1 and will consider going for that within the next iteration.

P.S.: While going through this story, we also discovered other open issues about stalled/stuck consumers observed with 1.4.2 at #2933, #2944 and #3082. As far as we have been able to see, these might be related to changing IP addresses while performing rolling updates on Kafka instances within a Kubernetes cluster. Our thoughts on this: Just don't do that ;]!

amotl · 2020-10-15T21:25:01Z

JFYI; I also just shared this at Azure/azure-functions-kafka-extension#185.

edenhill · 2020-10-20T14:01:34Z

Wow, that's an amazing write-up and RCA, @amotl !
Thank you for sharing your findings with the community.

amotl · 2020-10-20T16:31:56Z

Dear Magnus,

thank you for your comment, appreciate it. Thanks also for just releasing librdkafka 1.5.2 GA incorporating all the recent improvements in this regard and beyond.

Keep up the spirit and with kind regards,
Andreas.

amotl · 2020-10-22T12:49:46Z

Hi again,

as I recognized it might not have become exactly obvious from my elaborations above, I wanted to add some more details here regarding the recommended configuration settings when running on Azure as also outlined at Azure/azure-functions-kafka-extension#187.

Microsoft published them to [1] in general, but I wanted to specifically address here that the configuration properties socket.keepalive.enable and metadata.max.age.ms are apparently crucial to get right within Azure environments in order to save users from running into the networking issues reported within this discussion.

Property	Recommended Values	Permitted Range	Notes
socket.keepalive.enable	true		Necessary if connection is expected to idle. Azure will close inbound TCP idle > 240,000 ms.
metadata.max.age.ms	~ 180000	< 240000	Can be lowered to pick up metadata changes sooner.

While these settings [1] are primarily dedicated to Azure Event Hubs, I believe they will also apply to all communications with vanilla Kafka server components, as the underlying networking infrastructure problem will be the same.

We are now running with these settings and are happy so far:

socket.keepalive.enable = true
metadata.max.age.ms = 180000

Thanks for listening and with kind regards,
Andreas.

[1] https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-configurations#librdkafka-configuration-properties

amotl · 2020-11-16T13:59:44Z

Hi again,

I quickly wanted to share some more observations on this topic. On our Azure environment, we still saw some partitions occasionally stalling on the consumer side, even after applying all of the mitigations outlined above. However, after just moving on to v1.6.0-PRE3, things appear to be really smooth now.

Kudos to @edenhill, @mhowlett and all people involved who added some recent improvements to librdkafka!

With kind regards,
Andreas.

edenhill · 2021-03-09T07:01:10Z

Added connections.max.idle.ms to allow the client to close idle connections before the LB does.

koushikchitta · 2021-08-31T16:56:01Z

@edenhill When is the probable date for 1.8 release ? waiting for "connections.max.idle.ms" change.

edenhill · 2021-08-31T18:30:04Z

It is soak testing now, if all goes well we'll release within a week.

koushikchitta · 2021-09-09T21:36:28Z

@edenhill Is V1.8.0-RC2 is the release candidate that is in soak testing?

edenhill · 2021-09-13T05:46:45Z

@koushikchitta Yep, we'll be releasing it this week.

robsonpeixoto · 2022-08-30T11:56:32Z

@edenhill, in some system is very important keep a low latency. Are there a option to renew the connection after connections.max.idle.ms, instead of only close the connection?

@amotl, your investigation was amazing, thanks!

justinmchase · 2024-02-07T21:04:46Z

Just to be clear, I am interpretting these notes to be the following client configuration, I would love to know if there are any other recommended settings to help connect to azure event hubs.

var config = new ClientConfig
{
    SaslMechanism = SaslMechanism.Plain,
    SecurityProtocol = SecurityProtocol.SaslSsl,
    SaslUsername = credentials.Username,
    SaslPassword = credentials.Password,
    BootstrapServers = $"{broker}:9093",

    // According to this documentation, these two settings are also critical for
    // azure environments
    // https://github.com/confluentinc/librdkafka/issues/3109#issuecomment-714471123
    SocketKeepaliveEnable = true,
    MetadataMaxAgeMs = 30000,

    // note: This needs to timeout sooner than the azure eventhub load balancer timeout
    // to avoid unexpected disconnects from the server after long periods of time
    // https://stackoverflow.com/questions/58010247/azure-eventhub-kafka-org-apache-kafka-common-errors-timeoutexception-for-some-of/58385324#58385324
    ConnectionsMaxIdleMs = ((60 * 4) - 30) * 1000 // 3m30s
};

amotl changed the title ~~Robustness and resiliency with Azure Event Hubs~~ Robustness and resiliency on Azure Oct 15, 2020

amotl mentioned this issue Oct 15, 2020

Upgrade to Confluent.Kafka 1.5.2 Azure/azure-functions-kafka-extension#185

Closed

TsuyoshiUshio mentioned this issue Oct 21, 2020

Upgrade 1.5.2 and improve debugging Azure/azure-functions-kafka-extension#186

Merged

amotl mentioned this issue Oct 21, 2020

Review default values for configuration settings Azure/azure-functions-kafka-extension#187

Closed

This was referenced Oct 22, 2020

Introduce "socket_keepalive_enable" configuration property aio-libs/aiokafka#670

Open

Official 1.5.2 release on PyPI confluentinc/confluent-kafka-python#973

Closed

This was referenced Nov 16, 2020

Upgrade to Confluent.Kafka 1.6.0 Azure/azure-functions-kafka-extension#193

Closed

Networking robustness and resiliency on Azure and beyond (AWS, GCP, AliCloud) crate/crate#10779

Closed

TsuyoshiUshio mentioned this issue Nov 20, 2020

GroupCoordinator: *.*.*.*:9092: 1 request(s) timed out: disconnect with Azure Backed Service Azure/azure-functions-kafka-extension#197

Open

edenhill added the GREAT REPORT label Jan 11, 2021

mhowlett mentioned this issue Mar 1, 2021

Is connections.max.idle.ms a Consumer Configuration? confluentinc/confluent-kafka-dotnet#1544

Closed

sorbra mentioned this issue Mar 5, 2021

REQTMOUT errors Azure/azure-event-hubs-for-kafka#139

Closed

14 tasks

edenhill closed this as completed Mar 9, 2021

edenhill reopened this Mar 9, 2021

jliunyu mentioned this issue Sep 3, 2021

Producer timing out confluentinc/confluent-kafka-go#666

Closed

7 tasks

nickwb mentioned this issue May 18, 2022

Consider supporting per-socket TCP Keep-Alive settings: TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT #3851

Open

robsonpeixoto mentioned this issue Aug 29, 2022

ssl_closed error kafka4beam/brod#530

Open

Gebumgar mentioned this issue Jan 11, 2023

Make connections.max.idle.ms configurable Azure/azure-functions-kafka-extension#405

Open

amotl mentioned this issue May 13, 2023

[udf] Unlock Lua for user-defined functions mqtt-tools/mqttwarn#669

Open

daniil-quix mentioned this issue Nov 27, 2023

Fix ssl timeout errors for Quix applications quixio/quix-streams#254

Merged

sshanks-kx mentioned this issue Sep 12, 2024

libkfk fails to connect back automatically to broker on azure cloud after certain inactivity time. KxSystems/kafka#119

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Robustness and resiliency on Azure #3109

Robustness and resiliency on Azure #3109

amotl commented Oct 15, 2020

amotl commented Oct 15, 2020

amotl commented Oct 15, 2020

amotl commented Oct 15, 2020 •

edited

Loading

amotl commented Oct 15, 2020 •

edited

Loading

edenhill commented Oct 20, 2020

amotl commented Oct 20, 2020

amotl commented Oct 22, 2020

amotl commented Nov 16, 2020 •

edited

Loading

edenhill commented Mar 9, 2021

koushikchitta commented Aug 31, 2021

edenhill commented Aug 31, 2021

koushikchitta commented Sep 9, 2021

edenhill commented Sep 13, 2021

robsonpeixoto commented Aug 30, 2022

justinmchase commented Feb 7, 2024 •

edited

Loading

Robustness and resiliency on Azure #3109

Robustness and resiliency on Azure #3109

Comments

amotl commented Oct 15, 2020

Introduction

General research

Azure LB closing idle network connections

Quotes

Resources

Using Kafka and Event Hubs on Azure

Quotes

Resources

amotl commented Oct 15, 2020

Proposals

Upgrade to librdkafka 1.5.0

Apply recommended settings for Kafka on Azure

amotl commented Oct 15, 2020

Kernel TCP settings for Azure VMs

Resources

amotl commented Oct 15, 2020 • edited Loading

amotl commented Oct 15, 2020 • edited Loading

edenhill commented Oct 20, 2020

amotl commented Oct 20, 2020

amotl commented Oct 22, 2020

amotl commented Nov 16, 2020 • edited Loading

edenhill commented Mar 9, 2021

koushikchitta commented Aug 31, 2021

edenhill commented Aug 31, 2021

koushikchitta commented Sep 9, 2021

edenhill commented Sep 13, 2021

robsonpeixoto commented Aug 30, 2022

justinmchase commented Feb 7, 2024 • edited Loading

amotl commented Oct 15, 2020 •

edited

Loading

amotl commented Oct 15, 2020 •

edited

Loading

amotl commented Nov 16, 2020 •

edited

Loading

justinmchase commented Feb 7, 2024 •

edited

Loading