This repository has been archived by the owner on Apr 24, 2023. It is now read-only.

Add out_kafka example #16

Merged
edsiper merged 14 commits into fluent:0.13-dev on Jan 16, 2018

Conversation

@solsson (Contributor) commented Jan 16, 2018

This is a rebase of #11. In addition:

  • Adds example memory limits.
  • Adds annotations for the new prometheus exporter feature.
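Roughly the shape of those two additions on the DaemonSet's pod template; the annotation keys and the request value are illustrative rather than copied from the manifest, while the 10Mi limit matches the figure discussed further down:

  spec:
    template:
      metadata:
        annotations:
          prometheus.io/scrape: "true"
          prometheus.io/port: "2020"                       # the built-in HTTP server seen in the logs below
          prometheus.io/path: "/api/v1/metrics/prometheus" # assumed metrics path
      spec:
        containers:
          - name: fluent-bit
            resources:
              limits:
                memory: 10Mi   # example limit; see the "max 10Mi" discussion below
              requests:
                memory: 5Mi    # illustrative value, not taken from the PR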

@solsson mentioned this pull request on Jan 16, 2018
@edsiper merged commit 1263a3b into fluent:0.13-dev on Jan 16, 2018
@StevenACoffman commented:

@solsson I know this is merged, but I just applied this (changed topic and namespace, otherwise identical) and am getting this:

kangaroo/fluent-bit-rq45j[fluent-bit]: [2018/01/17 03:42:04] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-rq45j[fluent-bit]: [2018/01/17 03:42:04] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-dncm5[fluent-bit]: [2018/01/17 03:42:12] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-dncm5[fluent-bit]: [2018/01/17 03:42:12] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-lw88d[fluent-bit]: [2018/01/17 03:42:13] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-lw88d[fluent-bit]: [2018/01/17 03:42:13] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:49] [ info] [engine] started
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:49] [ info] [out_kafka] brokers='bootstrap.kafka:9092' topics='k8s-firehose'
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:49] [ info] [filter_kube] https=1 host=kubernetes.default.svc port=443
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:49] [ info] [filter_kube] local POD info OK
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:49] [ info] [filter_kube] testing connectivity with API server...
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:50] [ info] [engine] started
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:50] [ info] [out_kafka] brokers='bootstrap.kafka:9092' topics='k8s-firehose'
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:50] [ info] [filter_kube] https=1 host=kubernetes.default.svc port=443
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:50] [ info] [filter_kube] local POD info OK
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:50] [ info] [filter_kube] testing connectivity with API server...
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:51] [ info] [filter_kube] API server connectivity OK
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:43:51] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:55] [ info] [filter_kube] API server connectivity OK
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:55] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:56] [error] [sqldb] error=disk I/O error
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:56] [error] [sqldb] error=disk I/O error
kangaroo/fluent-bit-g2lxp[fluent-bit]: socket: No buffer space available
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:56] [error] [io] could not create socket
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:56] [error] [filter_kube] upstream connection error
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:57] [error] [sqldb] error=disk I/O error
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:43:57] [error] [plugins/in_tail/tail_file.c:742 errno=2] No such file or directory
kangaroo/fluent-bit-77jx4[fluent-bit]: [2018/01/17 03:47:08] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-77jx4[fluent-bit]: [2018/01/17 03:47:08] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:48:58] [ info] [engine] started
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:48:58] [ info] [out_kafka] brokers='bootstrap.kafka:9092' topics='k8s-firehose'
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:48:58] [ info] [filter_kube] https=1 host=kubernetes.default.svc port=443
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:48:58] [ info] [filter_kube] local POD info OK
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:48:58] [ info] [filter_kube] testing connectivity with API server...
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:49:04] [ info] [filter_kube] API server connectivity OK
kangaroo/fluent-bit-npsz8[fluent-bit]: [2018/01/17 03:49:04] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:49:09] [ info] [engine] started
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:49:09] [ info] [out_kafka] brokers='bootstrap.kafka:9092' topics='k8s-firehose'
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:49:09] [ info] [filter_kube] https=1 host=kubernetes.default.svc port=443
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:49:09] [ info] [filter_kube] local POD info OK
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:49:09] [ info] [filter_kube] testing connectivity with API server...
kangaroo/fluent-bit-g2lxp[fluent-bit]: [2018/01/17 03:49:14] [ info] [filter_kube] API server connectivity OK

I'm going to investigate more in the morning.

@solsson (Contributor, Author) commented Jan 17, 2018

@StevenACoffman Could it be the memory limits? You probably have significantly higher log volumes than I do. "No buffer space available" could indicate that.

@StevenACoffman commented:

That was certainly one problem. I removed the limits, but I'm still seeing some failures, and the pods are never restarted successfully, as in #17.

@solsson (Contributor, Author) commented Jan 17, 2018

I only get the same Kafka disconnects (~5 per hour per pod) as with fluent-bit-kafka-dev:0.5. I'm running the limits from the minikube example in QA (max 10Mi) and got one OOMKilled across 6 pods over 19h. With the old image I suspect memory use was increasing slowly over time, without any increase in log volume, so I'm trying to spot that with tight limits.

@solsson (Contributor, Author) commented Jan 22, 2018

@StevenACoffman What peaks do you see in memory use under your load? I'll create a PR that increases the limits, but not by much, because I want to see whether buffers grow (#11 (comment)) or there are memory leaks.

@StevenACoffman commented:

These are currently the highest metrics I have in one randomly selected cluster:

container_memory_cache{container_name="POD",id="/kubepods/burstable/pod2efd65ad-fd2f-11e7-a785-0a1463037f40/124f69c80bd7311e43a9c4a1aaa3f03f500b5700eb6aec2b7ec790f836d0a101",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_fluent-bit-kwxbs_kangaroo_2efd65ad-fd2f-11e7-a785-0a1463037f40_0",namespace="kangaroo",pod_name="fluent-bit-kwxbs"} 28672
container_memory_rss{container_name="POD",id="/kubepods/burstable/pod2efd65ad-fd2f-11e7-a785-0a1463037f40/124f69c80bd7311e43a9c4a1aaa3f03f500b5700eb6aec2b7ec790f836d0a101",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_fluent-bit-kwxbs_kangaroo_2efd65ad-fd2f-11e7-a785-0a1463037f40_0",namespace="kangaroo",pod_name="fluent-bit-kwxbs"} 36864
container_memory_usage_bytes{container_name="POD",id="/kubepods/burstable/pod2efd65ad-fd2f-11e7-a785-0a1463037f40/124f69c80bd7311e43a9c4a1aaa3f03f500b5700eb6aec2b7ec790f836d0a101",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_fluent-bit-kwxbs_kangaroo_2efd65ad-fd2f-11e7-a785-0a1463037f40_0",namespace="kangaroo",pod_name="fluent-bit-kwxbs"} 573440
container_network_receive_bytes_total{container_name="POD",id="/kubepods/burstable/pod2efd65ad-fd2f-11e7-a785-0a1463037f40/124f69c80bd7311e43a9c4a1aaa3f03f500b5700eb6aec2b7ec790f836d0a101",image="gcr.io/google_containers/pause-amd64:3.0",interface="eth0",name="k8s_POD_fluent-bit-kwxbs_kangaroo_2efd65ad-fd2f-11e7-a785-0a1463037f40_0",namespace="kangaroo",pod_name="fluent-bit-kwxbs"} 5.69402814e+08

I've been focussing on getting things to run reliably (both kafka and fluent-bit) before fine-tuning memory limits. Most of my fluent-bit pods fell over pretty quickly in a busy cluster with the limits you have. Are you scraping the new prometheus endpoint in your test clusters? I would expect that to add a bit of overhead.

The fluent-bit memory usage documentation says:

So, if we impose a limit of 10MB for the input plugins and considering the worse case scenario of the output plugin consuming 20MB extra, as a minimum we need (30MB x 1.2) = 36MB.

Given the Mem_Buf_Limit of 5MB, we'd then need 13 MB, not counting the prometheus endpoint. I'm seeing usage higher than that.
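For context, Mem_Buf_Limit is set per input plugin in the Fluent Bit config that the example ships as a ConfigMap; a minimal sketch, with the key name and log path assumed rather than copied from the repo:

  fluent-bit-input.conf: |
    [INPUT]
        Name           tail
        Path           /var/log/containers/*.log
        # Pause this input once its in-memory buffer reaches the limit;
        # the "paused (mem buf overlimit)" messages quoted further down come from this mechanism.
        Mem_Buf_Limit  5MB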

@edsiper (Member) commented Jan 22, 2018

I think the problem is the lack of a "check" in out_kafka: when we deliver data to Kafka, we hand the records to the Kafka thread, which has its own memory buffer, and at that point we ACK to the Fluent Bit engine. Instead we should wait for those records to be flushed before ACKing; otherwise Fluent Bit keeps ingesting records into out_kafka.

@StevenACoffman commented:

(screenshot: cAdvisor memory-usage graph for the fluent-bit container of pod fluent-bit-kwxbs in the kangaroo namespace)

I looked in the cAdvisor UI for this container, which has been up for 3 days. I'm going to terminate it and look again.

@StevenACoffman commented:

After 30 minutes, it stayed happily at 416K. The other ones that are using a lot of memory are complaining in their logs:

[2018/01/22 19:10:50] [debug] [filter_kube] could not merge log as requested
[2018/01/22 19:10:50] [debug] [input tail.0] [mem buf] size = 5184238
[2018/01/22 19:10:50] [debug] [input] tail.0 paused (mem buf overlimit)

This is the cluster in which I've been breaking Kafka in interesting ways, so perhaps it's related to Eduardo's comment above.

@StevenACoffman commented:

What is weird is that many of the fluent-bit pods that have been up for 3+ days in that same cluster stay around 0.5 MB (e.g., 516K, 492K, 416K). When they go over memory limits, they go way over. Maybe the nodes where fluent-bit stays low are not generating as many log files?

@solsson (Contributor, Author) commented Jan 22, 2018

Are you scraping the new prometheus endpoint in your test clusters?

Yes, we are. I agree this should be accounted for in the recommended limits.

So, if we impose a limit of 10MB for the input plugins and considering the worse case scenario of the output plugin consuming 20MB extra, as a minimum we need (30MB x 1.2) = 36MB.

Given the Mem_Buf_Limit of 5MB, we'd then need 13 MB, not counting the prometheus endpoint. I'm seeing usage higher than that.

Then 20Mi could in theory be sufficient, with prometheus exports? Of course it's very annoying to have restarts of these pods, and it might affect testing negatively. A memory limit of twice the expected use isn't unreasonable, and over time metrics could tell if it can safely be reduced.

@edsiper (Member) commented Jan 22, 2018

Hmm, I think the missing part of the resource-usage picture is how much memory the output plugins also require; out_kafka in particular may consume a lot due to buffering.

@solsson (Contributor, Author) commented Jan 22, 2018

Hmm, I think the missing part of the resource-usage picture is how much memory the output plugins also require; out_kafka in particular may consume a lot due to buffering.

It's interesting that @StevenACoffman runs with an experimental Kafka config; that could cause producer errors.

I guess anyone who wants to tweak the limits can do a calculation based on the average fluentbit_output_proc_bytes_total rate and how much Kafka downtime they want to ride out before dropping log messages.
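As a rough illustration of that calculation (figures invented for the sake of the example, not measured):

  buffer headroom ≈ average output rate × Kafka downtime to tolerate
                  ≈ 50 kB/s × 60 s ≈ 3 MB on top of baseline usage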

I think the problem is the lack of a "check" in out_kafka: when we deliver data to Kafka, we hand the records to the Kafka thread, which has its own memory buffer, and at that point we ACK to the Fluent Bit engine. Instead we should wait for those records to be flushed before ACKing; otherwise Fluent Bit keeps ingesting records into out_kafka.

@edsiper In other words, since hitting the memory limit means a pod restart, fluent-bit will not re-process from the sources the messages that had already gone into the buffer?

What's the effect with Retry_Limit false (or an actual value) here?
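For reference, Retry_Limit is a per-output option set in the [OUTPUT] section of the Kafka output config; a minimal sketch, with the ConfigMap key name assumed:

  fluent-bit-output.conf: |
    [OUTPUT]
        Name        kafka
        Match       *
        # False = retry failed chunks indefinitely; an integer N drops a chunk after N failed retries.
        Retry_Limit False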

@solsson (Contributor, Author) commented Jan 23, 2018

@edsiper I should set the request.required.acks=1 librdkafka property in the example daemonset's config. It's unlikely that anyone would want it higher for logs, and it's a useful example for those who want to change it to 0. How do I do that?
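For what it's worth, later Fluent Bit releases let out_kafka pass properties straight to librdkafka via an rdkafka. prefix; assuming that support, the example's output config could look roughly like this (the ConfigMap key name and surrounding keys are illustrative, broker and topic taken from the logs above):

  fluent-bit-output.conf: |
    [OUTPUT]
        Name     kafka
        Match    *
        Brokers  bootstrap.kafka:9092
        Topics   k8s-firehose
        # Forwarded to librdkafka: 1 = wait for the partition leader's ack, 0 = fire and forget.
        rdkafka.request.required.acks 1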

@edsiper (Member) commented Jan 29, 2018

@solsson I've created fluent/fluent-bit#495 to track the enhancement.
