Excessive memory usage on kubernetes #894
Comments
Is this even excessive memory usage? Are there any recommendations for what the resource limits/requests should be?
If all Pods send around 200MB of data within 5 seconds, yeah, it will be killed. While Fluent Bit receives data it does not deliver the logs until the Flush time expires. My suggestion is to set Flush to 1 (one second) and add a Mem_Buf_Limit option to the TCP input plugin just for protection; you can read more about memory handling here:
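For reference, a minimal sketch of that suggestion (the listen address, json format, and the 5MB limit are example values, not taken from this issue's actual config; port 5170 matches the hostPort in the DaemonSet further down):

[SERVICE]
    Flush 1

[INPUT]
    Name          tcp
    Listen        0.0.0.0
    Port          5170
    Format        json
    Mem_Buf_Limit 5MB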
@edsiper It takes about 10-15 minutes for the Fluent Bit pod to be killed; its memory graph is a nice straight line that never looks like it frees any memory. I have tried various settings for Mem_Buf_Limit but none of them make any difference.
Did you try Flush 1?
Yes, those graphs are with it set to Flush 1.
Looks like the issue is with our app: it never closed the TCP connection to Fluent Bit and instead just reused it for each batch of logs. Now we close the connection after each batch of logs and that has fixed the issue.
@tomstreet I am curious to learn more about the issue. My expectation is that Fluent Bit will protect itself from that scenario. Would you please share some steps to reproduce the problem?
Sure, so the config is above; here is the DaemonSet YAML:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit-logging
    kubernetes.io/cluster-service: "true"
spec:
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: fluent-bit-logging
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:0.14.7
        imagePullPolicy: Always
        ports:
        - containerPort: 2020
        - containerPort: 5170
          hostPort: 5170
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        resources:
          limits:
            cpu: 2
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
      terminationGracePeriodSeconds: 10
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule

Our app is written in C# and here is a simplified version of the log emitter:

using System.Net.Sockets;
using System.Threading.Tasks;

public class Emitter
{
    private TcpClient _client;
    private readonly FluentBitSettings _settings;

    // Constructor added here so the sample is self-contained; the snippet in
    // the issue did not show how _settings gets initialized.
    public Emitter(FluentBitSettings settings)
    {
        _settings = settings;
    }

    // Reuse the existing connection if it is still open, otherwise dispose it
    // and establish a new one.
    private async Task Connect()
    {
        if (_client != null)
        {
            if (_client.Connected)
            {
                return;
            }
            _client.Dispose();
            _client = null;
        }
        _client = new TcpClient();
        await _client.ConnectAsync(_settings.Host, _settings.Port);
    }

    private void Disconnect()
    {
        _client?.Dispose();
        _client = null;
    }

    public async Task Emit(byte[] logsBatch)
    {
        try
        {
            await Connect();
            var tcpStream = _client.GetStream();
            await tcpStream.WriteAsync(logsBatch);
            await tcpStream.FlushAsync();
        }
        finally
        {
            // Closing the connection after every batch is the fix mentioned above;
            // keeping one long-lived connection open caused Fluent Bit's memory to grow.
            Disconnect();
        }
    }
}

If we remove the Disconnect() call in the finally block and keep reusing the connection, Fluent Bit's memory starts climbing again until the pod is killed.
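For context, a hypothetical caller for the emitter above; FluentBitSettings is assumed to be a simple settings class with Host and Port properties (it is not shown in the issue), and the address value is a placeholder for the node hostPort from the DaemonSet:

// Requires: using System.Text; (for Encoding)
var emitter = new Emitter(new FluentBitSettings { Host = "<node-ip>", Port = 5170 });
byte[] batch = Encoding.UTF8.GetBytes("{\"log\":\"hello from the app\"}\n");
// Each call opens a TCP connection, writes one batch, and disposes the
// connection in the finally block, so Fluent Bit sees the socket close.
await emitter.Emit(batch);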
Similar problem.

When Elasticsearch is heavily loaded it returns HTTP 429 errors to Fluent Bit, and Fluent Bit keeps the unsent logs in its main memory for retry. Fluent Bit retries the configured number of times (as set in the output plugin's retry option).
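For that scenario, a hedged sketch of the relevant knobs (the tail input and all values are illustrative, not the reporter's config): Mem_Buf_Limit caps how much an input buffers in memory, and Retry_Limit caps how many times a failed chunk is re-sent before it is dropped:

[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    Mem_Buf_Limit 5MB

[OUTPUT]
    Name        es
    Match       *
    Host        elasticsearch
    Port        9200
    Retry_Limit 3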
I don't think this issue has been resolved.
Running on Azure Kubernetes Service (Kubernetes v1.11.3) as a DaemonSet using the fluent/fluent-bit:0.14.6 image. The nodes are quite small, with each one running roughly 15 containers that send JSON logs over TCP. The pod memory limit is currently set to 200Mi and Fluent Bit keeps hitting this and restarting. Any suggestions? Here is the config:
parsers.conf: