Test failures for console producer/consumer with min.insync.replicas=2 #116
@solsson Actually, I synced with latest recently and started getting the same test failures, which I didn't get with this repo as pulled on Nov 12, 2017. Not sure if it's related, but I have a Java consumer that now keeps getting the coordinator marked as dead.
Previously, the identical Java consumer worked fine. I haven't figured this out yet.
Looks bad. I'm sorry about leaving master relatively untested. Could be the cause of #114 too. I've been caught up not only in the holiday season but also in #101 (comment). I wrote these tests while working on new features, but now that they run constantly I find the readiness indication very useful. I will try to reproduce the above issues.
Hrm, even with the tests succeeding on the old d01c128, the yahoo kafka-manager addon is failing to do much of anything:
This is weird, since I can do this:
If I go to the actual kafka-manager (via dashboard
Let's assume that the issue(s?) is caused by min.insync.replicas=2; #108 and #114 were both caused by this change. I fail to reproduce the test issue, and Kafka Manager works fine too. @allquantor Did you upgrade an existing cluster, that used to have
The test can be improved.
Referenced commit: "which means we get the default number of replicas from broker config", for #116
I noticed when doing e059690 that the default min.insync.replicas is 1. Thus the tests were probably unaffected by the new default. The change might still have affected Kafka internals, as in #114.
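For anyone who wants to verify what their brokers actually run with, a minimal check (a sketch: the pod name and config path assume this repo's defaults and ConfigMap mount):

```
# Print the replication-related settings from a broker pod's config
kubectl -n kafka exec kafka-0 -- \
  grep -E 'min\.insync\.replicas|default\.replication\.factor' \
  /etc/kafka/server.properties
```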
I'm seeing the same issue as @StevenACoffman, but with the stock kafka-console-consumer, using it both inside and outside the cluster: the consumer is stuck in a loop marking a coordinator, resolved with id = max(int32), as dead.
@albamoro Any clues in broker logs?
None that I could see on the server side, but then I'm fairly new to them; I've upped the logging and been tailing kafka-request.log, to not much avail so far. I forgot to mention an important detail: the console works when using the old consumer, i.e., pointing to --zookeeper rather than --bootstrap-server. Client and server versions are aligned, as they are based on the same image. Client-wise, this is the block that keeps repeating; note that the node's id is lost between lines 2 and 3. I couldn't tell how relevant that is.
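For context, the two invocations being compared look roughly like this (a sketch; topic and service names are assumptions based on this repo's conventions):

```
# Old consumer: coordination and offsets go through Zookeeper
./bin/kafka-console-consumer.sh \
  --zookeeper zookeeper.kafka:2181 \
  --topic test-produce-consume --from-beginning

# New consumer: group coordination goes through a broker, which is
# where the "coordinator marked dead" loop shows up
./bin/kafka-console-consumer.sh \
  --bootstrap-server bootstrap.kafka:9092 \
  --topic test-produce-consume --from-beginning
```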
From the port numbers it looks like you get the "outside" listener (#78). I'm quite curious how that can happen in the tests. An ordering issue was resolved in 4c202f4. Might #99 (comment) be helpful to you too?
Same happens using the inside listener. I had attached the external client's logs because those are the only ones I've been able to configure for DEBUG logging. You can see below the console's output when used within my Minikube, v0.24.1. Re commit 4c202f4: my repo is pointing at bbed23e, so I presume the fix applies? I haven't checked #99 yet; I was planning on rolling back to early Nov, as @StevenACoffman had done, and working from there.
@solsson, just to confirm your comment above (#116 (comment)): changing min.insync.replicas=1 whilst retaining default.replication.factor=3 clears both issues in my setup, the failing produce-consume test and the console consumer being unable to use --bootstrap-server.
@albamoro I am confused by your last comment:
Does changing those values cause the failing test issue or fix the issue?
Yup, sorry, thought it might not be entirely clear :) Changing it makes the issues go away. I haven't checked the implications of the "acks" setting; I'm currently using "all" from my producers.
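To make the acks interplay concrete: with acks=all, a produce request is only acknowledged once min.insync.replicas members of the ISR have the record, so min.insync.replicas=2 rejects writes (NotEnoughReplicas) whenever the ISR shrinks to one. A sketch with assumed service and topic names:

```
# acks=1: only the leader has to persist the record before acknowledging
./bin/kafka-console-producer.sh \
  --broker-list bootstrap.kafka:9092 \
  --topic test-produce-consume \
  --producer-property acks=1

# acks=all: the record must reach min.insync.replicas in-sync replicas,
# otherwise the broker responds with NotEnoughReplicasException
./bin/kafka-console-producer.sh \
  --broker-list bootstrap.kafka:9092 \
  --topic test-produce-consume \
  --producer-property acks=all
```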
I have tested the min.insync.replicas setting against https://github.com/wurstmeister/kafka-docker on docker-compose and the results are the same, i.e. with min.insync.replicas=2 the console consumer fails when using --bootstrap-server rather than zookeeper. I don't understand how @solsson's local environment could pass the tests with ISR=2. Otherwise I'd think we are either missing a trick here with regard to the Kafka config, or maybe we are inheriting a bug from wurstmeister's Dockerfile, as I understand it is the basis of yolean's image?
So I have three different kubernetes clusters, and in each I have done:
What is weird is that I can install the outside services, and kafkacat works fine from my laptop through them.
@StevenACoffman With outside services, are you bootstrapping using the address of all three brokers?
Could it be that the bootstrap service (in combination with acks) is the culprit? It does introduce the kind of randomness that would explain different behavior in identical setups. If so, does this only affect Java clients? Have any of you spotted issues with the
In 3.0 we used a single broker for bootstrap, to avoid errors for single-node setups. In 2.1 we used the full address to three brokers.
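For reference, the two bootstrap styles discussed (a sketch; the per-broker host names assume the repo's headless "broker" service):

```
# 3.0 style: one service name, DNS picks a broker
--bootstrap-server bootstrap.kafka:9092

# 2.1 style: the full list of all three brokers
--bootstrap-server kafka-0.broker.kafka.svc.cluster.local:9092,kafka-1.broker.kafka.svc.cluster.local:9092,kafka-2.broker.kafka.svc.cluster.local:9092
```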
I'm just curious: why didn't this print some topic metadata? Will it do so on pod restart?
In my case, the kafkacat test always passed, regardless of the min.insync.replicas setting.
By the way, it's very handy to do this:
Then you can just run commands inside the cluster like:
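The original snippets didn't survive here; the idea can be sketched as a throwaway CLI pod on the cluster network (image tag and service names are assumptions):

```
# Start a disposable pod with the Kafka CLI tools inside the cluster
kubectl -n kafka run -it --rm kafka-cli \
  --image=solsson/kafka:1.0.0 --restart=Never -- /bin/bash

# ...then, inside that pod, talk to the in-cluster services directly:
./bin/kafka-topics.sh --zookeeper zookeeper.kafka:2181 --list
```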
When I diff the results, I get identical output except for one block. The passing test cluster:
The failing test cluster:
@solsson Using current master bbed23e I adjusted the
With these new settings, in all three of my kubernetes clusters, I can repeatably, reliably tear down all the kubernetes resources from this repository (including lingering persistent volumes) and re-apply them with no problems. With
By the way, the yahoo kafka-manager problem was my own doing. I had somehow gotten into the habit of incorrectly entering the kafka bootstrap service
I appreciate your careful attention, patience and help during all of this. #122 was very helpful.
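The exact values were lost above, but going by the earlier comments in this thread, the adjustment presumably amounted to this in the broker configuration (a sketch; the ConfigMap name is an assumption):

```
# Edit the broker ConfigMap and re-apply/restart the brokers, keeping
# three replicas but requiring only the leader's ack for a commit:
kubectl -n kafka edit configmap broker-config
#   default.replication.factor=3
#   min.insync.replicas=1
```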
I'm not sure if the sporadic successful runs were attributable to changes in kafka internals because I occasionally did this:
The topic that was created was not the same as the
I went through and diffed our outside-of-kubernetes Kafka server.properties with the one here, and I noticed two big differences. The first:
The second is that we omit this section:
That last section makes me wonder if that was my real problem all along.
Setting
Changing
Sorry I didn't pay enough attention to this finding. Maybe docker-compose can provide a better test case for
In fact I think Confluent's image looks inspired by wurstmeister's. I started out with Confluent's image, but with Kubernetes we have the ConfigMap feature, so there's less need for using environment variables with services that use an actual config file. I wanted an image that was as close as possible to downloading and installing from https://kafka.apache.org/. There's some background on this choice in #46. This thread is very important. I consider reverting the change to
I will try to find more time to research this myself. Thanks for your efforts @albamoro and @StevenACoffman.
This is interesting & could have some bearing on the question: my tests fail when isr=2 and I use the new, bootstrap-server-based consumer. My topics are autocreated on producer request, so maybe creating them using --zookeeper does something differently? @solsson I've left the docker-compose.yml here: https://gist.github.com/albamoro/a56ed9aff40a10c2580d678a55b2b5d9 You'll need to create the backbone network beforehand:
I start it using:
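The commands themselves were lost in this thread; presumably something along these lines, with the network name taken from the prose above:

```
# One-time: create the external network the compose file attaches to
docker network create backbone

# Bring the cluster up in the background
docker-compose up -d
```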
I very much appreciate your efforts too, thank you all, including @allquantor for raising it.
@solsson I appreciate the goal of making something that works for developers and works in production. We've used Kafka in production for 5 years at Ithaka. In just November, our 5-node kafka cluster had a topic that handled 30,419,294 messages per day, and another that was about half that. Lots of other busy topics too. We don't specify
I'm not sure
With that said, it sounds very nice in theory, and I'm very interested in understanding how you don't get the same results I do. If there's something else that needs to be turned on to make it work, then I'm all for it.
I've contemplated this while re-reading relevant parts of the kafka book... with acks=1 the leader doesn't know if any replica is up to date or not. This has worked well for us too, but so has min.insync.replicas=2 in QA for the last month or so. Excerpts from Neha Narkhede, Gwen Shapira, and Todd Palino, "Kafka: The Definitive Guide":
In our case, as we grow with Kafka, I think it is important to require at least 1 successful leader->follower replication, so that we catch cluster problems sooner rather than later. With Kubernetes, loss of a node should be a non-event. It probably can be with commit="written to the leader" thanks to graceful shutdown, but I'd like higher consistency guarantees. Thus we'll keep running with
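That policy boils down to topics like the following (a sketch; topic name and Zookeeper address are assumptions): with three replicas and min.insync.replicas=2, every acknowledged write (acks=all) exists on at least two brokers, so any single broker can fail without losing acknowledged data or blocking producers.

```
# Every acks=all write must reach the leader plus at least one follower
./bin/kafka-topics.sh --zookeeper zookeeper.kafka:2181 \
  --create --topic events --replication-factor 3 \
  --config min.insync.replicas=2
```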
@StevenACoffman Good summary of the path we're taking at Yolean :) |
Actually there's been a readiness test checking for under-replicated partitions since #95, which I had forgotten about :) Prometheus + #128 lets you alert on
I still haven't investigated if there's a difference between auto-created topics and those created through the CLI.
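The same condition can also be checked manually with a stock CLI flag (service name assumed):

```
# List only partitions whose ISR is smaller than their replica set
./bin/kafka-topics.sh --zookeeper zookeeper.kafka:2181 \
  --describe --under-replicated-partitions
```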
@solsson Hi! I'm getting:
With:
Can it be related to this issue? Also seeing errors like this in Zookeeper:
I don't think I consider that anymore :) This is slightly dated, but I found it useful for our persistent data use case: https://www.slideshare.net/gwenshap/kafka-reliability-when-it-absolutely-positively-has-to-be-there In addition, when working with #134, I learned that clients can't specify >1 acks. They can only choose to adhere to "all". It's an argument for min.insync.replicas > 1, that clients (or topics) can always specify
Also, I've tried to double-check that min.insync.replicas=2 actually means leader + 1 follower, not 2 followers. It's seemingly implicit in the kafka docs, but here they write "The leader is considered one of the in-sync replicas.".
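On the "clients (or topics)" point: while a client can only choose between acks=0, 1, and all, min.insync.replicas can be tightened per topic over the broker default. A sketch with assumed names:

```
# Require two in-sync replicas (leader included) for this topic only
./bin/kafka-configs.sh --zookeeper zookeeper.kafka:2181 \
  --alter --entity-type topics --entity-name important-topic \
  --add-config min.insync.replicas=2
```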
I've seen this too, frequently since #107. Might be normal with auto-created topics. But there's actually a timing issue with test start (which, I realized today, @albamoro @allquantor @StevenACoffman could be a cause of the original issue). You never know which container gets to create the topic first: the consumer (which oddly is allowed to create topics), the producer, or the topic creation job. If it's the job, you won't get any errors in producer or consumer, but on the other hand you get replication factor =
@lenadroid The Zookeeper issue must be something else. Please create a separate issue.
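One way to take that race out of the test (a sketch; names are assumptions) is to create the topic explicitly, and idempotently, before any producer or consumer starts, so auto-creation never decides the replication settings:

```
# --if-not-exists makes re-runs safe if the topic already exists
./bin/kafka-topics.sh --zookeeper zookeeper.kafka:2181 \
  --create --if-not-exists --topic test-produce-consume \
  --partitions 1 --replication-factor 3 \
  --config min.insync.replicas=2
```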
I think this issue will conclude the 3.1 cycle, dubbed "The painstaking path to min.insync.replicas=2". Do you think #140 has fixed the issue(s) here?
I released 3.1.0 now. Will leave this issue open, until I get positive reports or until I find the time to go through current open issues and close inactive ones.
I've only had my three non-toy clusters running with this for 2 days, but so far it's been very solid, even with
Sounds good. Made all the difference for me to learn about these additional replication factor props. I will close this then.
For reference, we use this configuration in our Kafka clusters outside of kubernetes; I have summarized the differences below. We have upgraded our clusters many times, so please don't assume these are still current best practices, even if they've proven fairly reliable for us. Aside from retention policies, I find some very intriguing differences, and I wonder if you have any opinions on them, @solsson
For very important topics with settings that differ from the default, we have a few external programs constantly ensuring that those topics exist and have the proper retention periods (e.g. 1 week instead of 4 hours). We have considered flipping this: defaulting to retaining for 1 week and changing the topic settings for all topics not on the whitelist.
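Such a reconciler essentially loops over commands like this (a sketch; the topic name is an assumption, and 1 week is 604800000 ms):

```
# Ensure a 1-week retention on an important topic
./bin/kafka-configs.sh --zookeeper zookeeper.kafka:2181 \
  --alter --entity-type topics --entity-name important-topic \
  --add-config retention.ms=604800000
```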
Wow, that's some differences for sure :) We haven't done any performance tuning, and I think we're on recent defaults. With hindsight, and with #140 fresh in memory, we shouldn't have started from the sample
For example the essential
I updated the table to include a column for defaults. Now that 3.1 is released, would removing default values (or commenting them out) be warranted here? That would allow documentation to highlight the reason for any divergence from defaults. Our situation is that we initially set all values explicitly, and painstakingly performance-tuned when issues arose. However, as we upgraded multiple times and the recommended defaults changed, we didn't retain the history of which values were intentional differences for our usage and which were due to outdated adherence to "best practices" that were contextual to those older versions. Part of the appeal of this project is that it affords us the ability to cheaply experiment and revisit these settings.
I created #148 to try to pursue that idea. It's not too late to make amends :)
Running the Kafka tests, this one does not complete.
Logs from the test case:
Could it be that the producer does not generate enough data? There are no errors in the producer or consumer.
Using kubernetes 1.8.5 on GKE.