with Yolean/kubernetes-kafka brokers.
…ing due to retries
@solsson I know this is merged, but I just applied this (changed topic and namespace, otherwise identical) and am getting this:
I'm going to investigate more in the morning.
@StevenACoffman Could it be the memory limits? You probably have significantly higher log volumes than I do. "No buffer space available" could indicate that.
That was certainly one problem. I removed the limits, but am still seeing some failures, and the pods are never restarted successfully, as in #17
I get only the same kafka disconnects (~5 per hour per pod) as with |
@StevenACoffman What peaks do you see in memory use under your load? I'll create a PR that increases the limits, but not too far, because I want to see whether buffers grow (#11 (comment)) or there are memory leaks.
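For reference, the change would go in the container's resources stanza of the fluent-bit DaemonSet; the figures below are only a sketch to show where the knobs sit, not the values I intend to propose:

```yaml
# Sketch only: the resources stanza of the fluent-bit DaemonSet container.
# Request and limit figures are illustrative, not the values from this PR.
resources:
  requests:
    memory: 50Mi
    cpu: 50m
  limits:
    memory: 200Mi
    cpu: 100m
```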
These are currently the highest metrics I have in one randomly selected cluster:
I've been focusing on getting things to run reliably (both kafka and fluent-bit) before fine-tuning memory limits. Most of my fluent-bit pods fell over pretty quickly in a busy cluster with the limits you have. Are you scraping the new prometheus endpoint in your test clusters? I would expect that to add a bit of overhead. The fluent-bit memory usage documentation says:
Given the |
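For reference, the setting that documentation is mostly about is the input-side Mem_Buf_Limit. In this repo it would live in the ConfigMap the DaemonSet mounts; the sketch below uses an illustrative ConfigMap name, log path and a made-up 5MB cap:

```yaml
# Sketch only: capping the tail input's in-memory buffer via Mem_Buf_Limit.
# When the cap is reached the input is paused until buffered data is flushed.
# ConfigMap name, log path and the 5MB figure are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [INPUT]
        Name           tail
        Path           /var/log/containers/*.log
        Mem_Buf_Limit  5MB
```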
I think the problem is the lack of a "check" in out_kafka: when we deliver data to Kafka, we ingest the records into the Kafka thread, which has its own memory buffer, and at that point we ACK to the Fluent Bit engine. Instead we should wait for those records to be flushed before ACKing; otherwise Fluent Bit keeps ingesting records into out_kafka.
After 30 minutes, it stayed happily at
This is the cluster in which I've been breaking kafka in interesting ways, so perhaps it is related to Eduardo's comment above.
What is weird is that many of the fluent-bit pods that have been up for 3+ days in that same cluster stay around 0.5 MB (e.g., 516K, 492K, 416K). When they go over memory limits, they go way over. Maybe the nodes where fluent-bit stays low are not generating as many log files?
Yes we do. I agree this should be accounted for in the recommended limits.
Then |
hmm, I think the missing part of the resource-usage estimate is how much memory the output plugins also require; out_kafka in particular may consume a lot due to buffering.
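If the output side really is the dominant consumer, one lever worth noting (if I read the plugin docs right): out_kafka can pass rdkafka.* properties through to librdkafka, so the producer's own queue can be bounded. A rough sketch of the relevant ConfigMap fragment, with placeholder broker, topic and queue size:

```yaml
# Sketch only: bounding librdkafka's producer queue from the out_kafka section
# of the fluent-bit ConfigMap. Broker, topic and the 2048 KB cap are placeholders.
data:
  fluent-bit.conf: |
    [OUTPUT]
        Name                                kafka
        Match                               *
        Brokers                             bootstrap.kafka:9092
        Topics                              fluent-bit
        rdkafka.queue.buffering.max.kbytes  2048
```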
It's interesting that @StevenACoffman runs with an experimental kafka config; that could cause producer errors. I guess anyone who likes to tweak limits can make a calculation based on average
@edsiper In other words, since hitting the memory limit means a pod restart, fluent-bit will not re-process from its sources the messages that had gone into the buffer? What's the effect with |
@edsiper I should set |
@solsson I've created fluent/fluent-bit#495 to track the enhancement.
This is a rebase of #11. In addition: