Brooklin takes a long time to recover from errors in the destination kafka cluster, up to a complete halt of replication #675

Open
rantav opened this issue Dec 11, 2019 · 0 comments

rantav commented Dec 11, 2019

Subject of the issue

Brooklin takes a long time to recover from errors in the destination cluster. Recovery sometimes involves multiple rebalance cycles, with replication completely halted during that time.

I have run some tests, they are documented here: https://github.com/AppsFlyer/kafka-mirror-tester/blob/master/results-brooklin.md

Your environment

  • Operating System 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20) x86_64 Linux
  • Brooklin version 1.0.2
  • Java version openjdk version "1.8.0_212"
  • Kafka version 2.1.0
  • ZooKeeper version 3.4.10

Steps to reproduce

As described in this document https://github.com/AppsFlyer/kafka-mirror-tester/blob/master/results-brooklin.md, there are several scenarios in which this may happen.
One such scenario is restarting a broker in the destination cluster.
This results in errors for as long as the broker is down, which is understandable. But even long after the broker is back up, Brooklin continues to report errors, up to a complete halt of replication to the entire cluster (not only to the failed broker).

Expected behaviour

We expect Brooklin to recover gracefully and not halt replication during the rebalance cycle.
We expect to see just a single, hopefully short, rebalance. Instead we see multiple cycles that sometimes take quite long (10-15 minutes).

Actual behaviour

Brooklin takes a long time (10-15 minutes, sometimes more) to recover. During that time we see replication resume, then halt completely, then resume again, then halt again.
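
To make the halt/resume cycles above easier to quantify, here is a minimal sketch that extracts halt windows from throughput samples. It assumes you already collect `(timestamp_seconds, messages_replicated)` pairs from your own monitoring; the function name `find_halts`, the input shape, and the 60-second threshold are all hypothetical and not part of Brooklin or the linked test tool.

```python
from typing import List, Tuple


def find_halts(
    samples: List[Tuple[float, int]],
    min_duration: float = 60.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) windows where replication throughput was zero.

    samples: (timestamp_seconds, messages_replicated_in_interval) pairs,
             sorted by timestamp. This input format is an assumption.
    min_duration: shortest zero-throughput run, in seconds, to report.
    """
    halts: List[Tuple[float, float]] = []
    start = None      # timestamp of the first zero sample in the current run
    last_zero = None  # timestamp of the most recent zero sample
    for ts, count in samples:
        if count == 0:
            if start is None:
                start = ts
            last_zero = ts
        else:
            # The run ends at the first sample that shows traffic again.
            if start is not None and ts - start >= min_duration:
                halts.append((start, ts))
            start = None
    # A run that is still open when the data ends is reported up to the
    # last zero sample seen.
    if start is not None and last_zero - start >= min_duration:
        halts.append((start, last_zero))
    return halts
```

Counting the returned windows gives the number of halt cycles after a broker restart, and summing their lengths gives the total time replication was stopped.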

@rantav rantav added the bug label Dec 11, 2019