
After upgrade, Fluentd complains that some chunks are not in gzip format #1522

Closed · serhatcetinkaya opened this issue Mar 24, 2021 · 14 comments
Assignees: sumo-drosiek
Labels: bug (Something isn't working)

@serhatcetinkaya

Describe the bug
We recently migrated to the new structure of the Sumo Logic collection in Kubernetes. After the migration, Fluentd started reporting an error saying that some chunk is not in gzip format. I followed the troubleshooting guide and restarted every pod while deleting and recreating their PVCs. The problem is still there; can you help us find the cause?
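
For reference, the restart and PVC cleanup described above might look roughly like the following; the sumologic namespace and collection-sumologic-1 pod name are taken from later in this thread, and <fluentd-buffer-pvc> is a placeholder, so adjust all of them to your deployment:

kubectl get pvc -n sumologic   # identify the fluentd buffer PVC
kubectl delete pvc/<fluentd-buffer-pvc> pod/collection-sumologic-1 -n sumologic   # the StatefulSet recreates both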

Logs

2021-03-24 11:06:52 +0000 [warn]: #0 [sumologic.endpoint.logs.default] failed to flush the buffer. retry_time=0 next_retry_seconds=2021-03-24 11:06:53 +0000 chunk="5be45feac120cff4dccdc496747843a8" error_class=Zlib::GzipFile::Error error="not in gzip format"
2021-03-24 11:06:52 +0000 [warn]: #0 suppressed same stacktrace
2021-03-24 11:06:53 +0000 [warn]: #0 [sumologic.endpoint.logs.default] failed to flush the buffer. retry_time=1 next_retry_seconds=2021-03-24 11:06:54 +0000 chunk="5be45feac120cff4dccdc496747843a8" error_class=Zlib::GzipFile::Error error="not in gzip format"
2021-03-24 11:06:53 +0000 [warn]: #0 suppressed same stacktrace
2021-03-24 11:06:54 +0000 [warn]: #0 [sumologic.endpoint.logs.default] failed to flush the buffer. retry_time=2 next_retry_seconds=2021-03-24 11:06:56 +0000 chunk="5be45feac120cff4dccdc496747843a8" error_class=Zlib::GzipFile::Error error="not in gzip format"
2021-03-24 11:06:54 +0000 [warn]: #0 suppressed same stacktrace
2021-03-24 11:06:56 +0000 [warn]: #0 [sumologic.endpoint.logs.default] failed to flush the buffer. retry_time=3 next_retry_seconds=2021-03-24 11:07:00 +0000 chunk="5be45feac120cff4dccdc496747843a8" error_class=Zlib::GzipFile::Error error="not in gzip format"
2021-03-24 11:06:56 +0000 [warn]: #0 suppressed same stacktrace

Command used to install/upgrade Collection
We are non-helm users.

Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: collection-sumologic
  name: collection-sumologic
data:
  buffer.output.conf: |-
    compress "gzip"
    flush_interval "5s"
    flush_thread_count "8"
    chunk_limit_size "1m"
    total_limit_size "128m"
    queued_chunks_limit_size "128"
    overflow_action drop_oldest_chunk
    retry_max_interval "10m"
    retry_forever "true"
  common.conf: |-
    # Prevent fluentd from handling records containing its own logs.
    <match fluentd.**>
      @type null
    </match>
    # expose the Fluentd metrics to Prometheus
    <source>
      @type prometheus
      metrics_path /metrics
      port 24231
    </source>
    <source>
      @type prometheus_output_monitor
    </source>
    <source>
      @type http
      port 9880
      bind 0.0.0.0
    </source>
    <system>
      log_level info
    </system>
  fluent.conf: |-
    @include common.conf
    @include metrics.conf
    @include logs.conf
  logs.conf: |-
    <source>
      @type forward
      port 24321
      bind 0.0.0.0
    </source>
    @include logs.source.containers.conf
    @include logs.source.systemd.conf
    @include logs.source.default.conf
    @include logs.source.sys-logs.conf
  logs.enhance.k8s.metadata.filter.conf: |-
    cache_size  "10000"
    cache_ttl  "7200"
    cache_refresh "3600"
    cache_refresh_variation "900"
    in_namespace_path '$.kubernetes.namespace_name'
    in_pod_path '$.kubernetes.pod_name'
    core_api_versions v1
    api_groups apps/v1,extensions/v1beta1
    data_type logs
  logs.kubernetes.metadata.filter.conf: |-
    annotation_match ["sumologic\.com.*"]
    de_dot false
    watch "true"
    ca_file ""
    verify_ssl "true"
    client_cert ""
    client_key ""
    bearer_token_file ""
    cache_size "10000"
    cache_ttl "7200"
    tag_to_kubernetes_name_regexp '.+?\.containers\.(?<pod_name>[^_]+)_(?<namespace>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$'
  logs.kubernetes.sumologic.filter.conf: |-
    source_name "%{namespace}.%{pod}.%{container}"
    source_host 
    log_format "fields"
    source_category "%{namespace}/%{pod_name}"
    source_category_prefix "kubernetes/"
    source_category_replace_dash "/"
    exclude_pod_regex ""
    exclude_container_regex ""
    exclude_host_regex ""
  logs.output.conf: |-
    data_type logs
    log_key log
    endpoint "#{ENV['SUMO_ENDPOINT_DEFAULT_LOGS_SOURCE']}"
    verify_ssl "true"
    log_format "fields"
    add_timestamp "true"
    timestamp_key "timestamp"
    proxy_uri ""
    compress "true"
    compress_encoding "gzip"
  logs.source.containers.conf: |-
    <filter containers.**>
      @type record_transformer
      enable_ruby
      renew_record true
      <record>
        log    ${record["log"].split(/[\n\t]+/).map! {|item| JSON.parse(item)["log"]}.any? ? record["log"].split(/[\n\t]+/).map! {|item| JSON.parse(item)["log"]}.join("") : record["log"] rescue record["log"]}
        stream ${[record["log"].split(/[\n\t]+/)[0]].map! {|item| JSON.parse(item)["stream"]}.any? ? [record["log"].split(/[\n\t]+/)[0]].map! {|item| JSON.parse(item)["stream"]}.join("") : record["stream"] rescue record["stream"]}
        time   ${[record["log"].split(/[\n\t]+/)[0]].map! {|item| JSON.parse(item)["time"]}.any? ? [record["log"].split(/[\n\t]+/)[0]].map! {|item| JSON.parse(item)["time"]}.join("") : record["time"] rescue record["time"]}
      </record>
    </filter>

    # match all  container logs and label them @NORMAL
    <match containers.**>
      @type relabel
      @label @NORMAL
    </match>
    <label @NORMAL>
      # only match fluentd logs based on fluentd container log file name.
      # by default, this is <filter **collection-sumologic-fluentd**>
      <filter sumologic-fluentd**>
        # only ingest fluentd logs of levels: {error, fatal} and warning messages if buffer is full
        @type grep
        <regexp>
          key log
          pattern /\[error\]|\[fatal\]|drop_oldest_chunk|retry succeeded/
        </regexp>
      </filter>
  
    
      <filter sumologic-otelcol**>
        @type grep
        <regexp>
          key log
          # Select only known error/warning/fatal/panic levels or logs coming from one of the sources known to provide useful data
          pattern /\"level\":\"(error|warning|fatal|panic|dpanic)\"|\"caller\":\"(builder|service|kube|static)/
        </regexp>
      </filter>

      # third-party Kubernetes metadata filter plugin
      <filter containers.**>
        @type kubernetes_metadata
        @log_level error
        @include logs.kubernetes.metadata.filter.conf
      </filter>
      # Sumo Logic Kubernetes metadata enrichment filter plugin
      <filter containers.**>
        @type enhance_k8s_metadata
        @log_level error
        @include logs.enhance.k8s.metadata.filter.conf
      </filter>
      
      # Kubernetes Sumo Logic filter plugin
      <filter containers.**>
        @type kubernetes_sumologic
        @include logs.kubernetes.sumologic.filter.conf
        
        exclude_namespace_regex ""
      </filter>
      
      <match containers.**>
        @type copy
        <store>
          @type sumologic
          @id sumologic.endpoint.logs
          sumo_client "k8s_2.0.3"
          @log_level error
          @include logs.output.conf
          <buffer>
            @type file
            path /fluentd/buffer/logs.containers
            @include buffer.output.conf
          </buffer>
        </store>
      </match>
    </label>

  logs.source.systemd.conf: |-
    <match host.kubelet.**>
      @type relabel
      @label @KUBELET
    </match>
    <label @KUBELET>
      <filter host.kubelet.**>
        @type kubernetes_sumologic
        source_category "kubelet"
        source_name "k8s_kubelet"
        source_category_prefix "kubernetes/"
        source_category_replace_dash "/"
        exclude_facility_regex ""
        exclude_host_regex ""
        exclude_priority_regex ""
        exclude_unit_regex ""
      </filter>
      
      <match **>
        @type copy
        <store>
          @type sumologic
          @id sumologic.endpoint.logs.kubelet
          sumo_client "k8s_2.0.3"
          @include logs.output.conf
          <buffer>
            @type file
            path /fluentd/buffer/logs.kubelet
            @include buffer.output.conf
          </buffer>
        </store>
      </match>
    </label>
    
    <match host.**>
      @type relabel
      @label @SYSTEMD
    </match>
    <label @SYSTEMD>
      <filter host.**>
        @type kubernetes_sumologic
        source_name "k8s_systemd"
        source_category "system"
        source_category_prefix "kubernetes/"
        source_category_replace_dash "/"
        exclude_facility_regex ""
        exclude_host_regex ""
        exclude_priority_regex ""
        exclude_unit_regex ""
      </filter>
      <filter host.**>
        @type record_modifier
        <record>
          _sumo_metadata ${record["_sumo_metadata"][:source] = tag_parts[1]; record["_sumo_metadata"]}
        </record>
      </filter>
      
      <match **>
        @type copy
        <store>
          @type sumologic
          @id sumologic.endpoint.logs.systemd
          sumo_client "k8s_2.0.3"
          @include logs.output.conf
          <buffer>
            @type file
            path /fluentd/buffer/logs.systemd
            @include buffer.output.conf
          </buffer>
        </store>
      </match>
    </label>
  
  logs.source.default.conf: |-
    <filter **>
      @type grep
      <exclude>
        key message
        pattern /disable filter chain optimization/
      </exclude>
    </filter>
    <filter **>
      @type kubernetes_sumologic
      source_name "k8s_default"
      source_category "default"
      source_category_prefix "kubernetes/"
      source_category_replace_dash "/"
      exclude_facility_regex ""
      exclude_host_regex ""
      exclude_priority_regex ""
      exclude_unit_regex ""
    </filter>
    <match **>
      @type copy
      <store>
        @type sumologic
        @id sumologic.endpoint.logs.default
        sumo_client "k8s_2.0.3"
        @include logs.output.conf
        <buffer>
          @type file
          path /fluentd/buffer/logs.default
          @include buffer.output.conf
        </buffer>
      </store>
    </match>

  logs.source.sys-logs.conf: |-
    <match syslog_messages.**>
      @type sumologic
      @id sumologic.endpoint.logs.messages
      @include logs.output.conf
      <buffer>
        @type memory
        @include buffer.output.conf
      </buffer>
    </match>
    <match syslog_secure.**>
      @type sumologic
      @id sumologic.endpoint.logs.secure
      @include logs.output.conf
      <buffer>
        @type memory
        @include buffer.output.conf
      </buffer>
    </match>
    <match syslog_audit.**>
      @type sumologic
      @id sumologic.endpoint.logs.audit
      @include logs.output.conf
      <buffer>
        @type memory
        @include buffer.output.conf
      </buffer>
    </match>
    <match syslog_cron.**>
      @type sumologic
      @id sumologic.endpoint.logs.cron
      @include logs.output.conf
      <buffer>
        @type memory
        @include buffer.output.conf
      </buffer>
    </match>
    <match syslog_maillog.**>
      @type sumologic
      @id sumologic.endpoint.logs.maillog
      @include logs.output.conf
      <buffer>
        @type memory
        @include buffer.output.conf
      </buffer>
    </match>

  metrics.conf: |-
    <source>
      @type http
      port 9888
      <parse>
        @type protobuf
      </parse>
    </source>
    <match prometheus.metrics**>
      @type datapoint
      @label @DATAPOINT
    </match>
    <label @DATAPOINT>
      <filter prometheus.metrics**>
        @type record_modifier
        <record>
          cluster CCCCCC
        </record>
      </filter>
      <filter prometheus.metrics**>
        @type enhance_k8s_metadata
        cache_size  "10000"
        cache_ttl  "7200"
        cache_refresh "3600"
        cache_refresh_variation "900"
        core_api_versions v1
        api_groups apps/v1,extensions/v1beta1
      </filter>
      
      <filter prometheus.metrics**>
        @type prometheus_format
        relabel container_name:container,pod_name:pod
      </filter>
      
      <match prometheus.metrics.apiserver**>
        @type copy
        <store>
          @type sumologic
          @id sumologic.endpoint.metrics.apiserver
          sumo_client "k8s_2.0.3"
          endpoint "#{ENV['SUMO_ENDPOINT_APISERVER_METRICS_SOURCE']}"
          @include metrics.output.conf
          <buffer>
            @type file
            path /fluentd/buffer/metrics.apiserver
            @include buffer.output.conf
          </buffer>
        </store>
      </match>
      
      <match prometheus.metrics.container**>
        @type copy
        <store>
          @type sumologic
          @id sumologic.endpoint.metrics.container
          sumo_client "k8s_2.0.3"
          endpoint "#{ENV['SUMO_ENDPOINT_KUBELET_METRICS_SOURCE']}"
          @include metrics.output.conf
          <buffer>
            @type file
            path /fluentd/buffer/metrics.container
            @include buffer.output.conf
          </buffer>
        </store>
      </match>
      
      <match prometheus.metrics.control-plane**>
        @type copy
        <store>
          @type sumologic
          @id sumologic.endpoint.metrics.control.plane
          sumo_client "k8s_2.0.3"
          endpoint "#{ENV['SUMO_ENDPOINT_CONTROL_PLANE_METRICS_SOURCE']}"
          @include metrics.output.conf
          <buffer>
            @type file
            path /fluentd/buffer/metrics.control_plane
            @include buffer.output.conf
          </buffer>
        </store>
      </match>
      
      <match prometheus.metrics.controller-manager**>
        @type copy
        <store>
          @type sumologic
          @id sumologic.endpoint.metrics.kube.controller.manager
          sumo_client "k8s_2.0.3"
          endpoint "#{ENV['SUMO_ENDPOINT_CONTROLLER_METRICS_SOURCE']}"
          @include metrics.output.conf
          <buffer>
            @type file
            path /fluentd/buffer/metrics.controller
            @include buffer.output.conf
          </buffer>
        </store>
      </match>
      
      <match prometheus.metrics.kubelet**>
        @type copy
        <store>
          @type sumologic
          @id sumologic.endpoint.metrics.kubelet
          sumo_client "k8s_2.0.3"
          endpoint "#{ENV['SUMO_ENDPOINT_KUBELET_METRICS_SOURCE']}"
          @include metrics.output.conf
          <buffer>
            @type file
            path /fluentd/buffer/metrics.kubelet
            @include buffer.output.conf
          </buffer>
        </store>
      </match>
      
      <match prometheus.metrics.node**>
        @type copy
        <store>
          @type sumologic
          @id sumologic.endpoint.metrics.node.exporter
          sumo_client "k8s_2.0.3"
          endpoint "#{ENV['SUMO_ENDPOINT_NODE_METRICS_SOURCE']}"
          @include metrics.output.conf
          <buffer>
            @type file
            path /fluentd/buffer/metrics.node
            @include buffer.output.conf
          </buffer>
        </store>
      </match>
      
      <match prometheus.metrics.scheduler**>
        @type copy
        <store>
          @type sumologic
          @id sumologic.endpoint.metrics.kube.scheduler
          sumo_client "k8s_2.0.3"
          endpoint "#{ENV['SUMO_ENDPOINT_SCHEDULER_METRICS_SOURCE']}"
          @include metrics.output.conf
          <buffer>
            @type file
            path /fluentd/buffer/metrics.scheduler
            @include buffer.output.conf
          </buffer>
        </store>
      </match>
      
      <match prometheus.metrics.state**>
        @type copy
        <store>
          @type sumologic
          @id sumologic.endpoint.metrics.kube.state
          sumo_client "k8s_2.0.3"
          endpoint "#{ENV['SUMO_ENDPOINT_STATE_METRICS_SOURCE']}"
          @include metrics.output.conf
          <buffer>
            @type file
            path /fluentd/buffer/metrics.state
            @include buffer.output.conf
          </buffer>
        </store>
      </match>
      
      <match prometheus.metrics**>
        @type copy
        <store>
          @type sumologic
          @id sumologic.endpoint.metrics
          sumo_client "k8s_2.0.3"
          endpoint "#{ENV['SUMO_ENDPOINT_DEFAULT_METRICS_SOURCE']}"
          @include metrics.output.conf
          <buffer>
            @type file
            path /fluentd/buffer/metrics.default
            @include buffer.output.conf
          </buffer>
        </store>
      </match>
    </label>

  metrics.output.conf: |-
    data_type metrics
    metric_data_format prometheus
    disable_cookies true
    proxy_uri ""
    compress "true"
    compress_encoding "gzip"

To Reproduce
We are not able to reproduce.

Expected behavior
Not to see this error.

Environment (please complete the following information):

  • Collection version (e.g. helm ls -n sumologic): public.ecr.aws/sumologic/kubernetes-fluentd:1.12.0-sumo-1
  • Kubernetes version (e.g. kubectl version): v1.18.9-eks-d1db3c
  • Cloud provider: AWS
  • Others:

Anything else we need to know
I did kubectl exec into the fluentd container and found the following for the problematic files; in this example, buffer.b5bde2b7225ad70f7d1eeb473af979652.log and buffer.b5bde2d0693d05a4d7dd397e388bbac72.log are problematic:

fluent@collection-sumologic-1:/$ find /fluentd/buffer/ -name "*5bde2b7225ad70f7d1eeb473af979652*"
/fluentd/buffer/logs.default/buffer.b5bde2b7225ad70f7d1eeb473af979652.log
fluent@collection-sumologic-1:/$ more /fluentd/buffer/logs.default/buffer.b5bde2b7225ad70f7d1eeb473af979652.log
fluent@collection-sumologic-1:/$ find /fluentd/buffer/ -name "*b5bde2d0693d05a4d7dd397e388bbac72*"
/fluentd/buffer/logs.default/buffer.b5bde2d0693d05a4d7dd397e388bbac72.log
fluent@collection-sumologic-1:/$ more /fluentd/buffer/logs.default/buffer.b5bde2d0693d05a4d7dd397e388bbac72.log
fluent@collection-sumologic-1:/$ ls -l /fluentd/buffer/logs.default/
total 8
-rw-r--r-- 1 fluent fluent    0 Mar 19 12:18 buffer.b5bde2b7225ad70f7d1eeb473af979652.log
-rw-r--r-- 1 fluent fluent    0 Mar 19 12:25 buffer.b5bde2d0693d05a4d7dd397e388bbac72.log
-rw-r--r-- 1 fluent fluent 3753 Mar 22 08:30 buffer.b5be1be15c62766245a80e6d8561c420c.log
-rw-r--r-- 1 fluent fluent   86 Mar 22 08:30 buffer.b5be1be15c62766245a80e6d8561c420c.log.meta

Also, I am not sure if this is a coincidence, but we always see the problematic chunks under the /fluentd/buffer/logs.default directory.
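
As a generic check (not from the Sumo Logic docs), you can look at the first two bytes of a chunk file to see whether it actually contains gzip data; gzip streams start with the magic bytes 1f 8b, and the assumption here is that, with compress "gzip" set on the buffer, healthy chunk payloads are written gzip-compressed. Empty chunks like the 0-byte files above print nothing:

fluent@collection-sumologic-1:/$ for f in /fluentd/buffer/logs.default/buffer.*.log; do printf '%s: ' "$f"; head -c 2 "$f" | od -An -tx1; done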

serhatcetinkaya added the bug label on Mar 24, 2021
@Aaron-ML

We are also seeing this on 2.0.5 using the helm chart.

@serhatcetinkaya
Author

Do you have any guesses about what might be the reason?
@sumo-drosiek
@pmalek-sumo
@astencel-sumo

@sumo-drosiek
Contributor

@Aaron-ML, @serhatcetinkaya
Could you go through the troubleshooting guide again, and then kubectl exec into the pod and run ls -al on the buffer directories when the error occurs?

I don't have a specific idea for now, so I want to check whether the problematic files are really created again after cleaning up the environment.
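
For example, from outside the pod that could look like this (the sumologic namespace and collection-sumologic-1 pod name are assumptions based on this thread; adjust them to your cluster):

kubectl exec -n sumologic collection-sumologic-1 -- ls -alR /fluentd/buffer/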

sumo-drosiek self-assigned this on Apr 13, 2021
@serhatcetinkaya
Author

Hello @sumo-drosiek,

The issue happened again. I did kubectl exec into the pod and ran the ls commands as you asked:

fluent@collection-sumologic-1:/$ ls -laR /fluentd/
/fluentd/:
total 8
drwxr-xr-x  1 fluent fluent   20 Apr 13 23:39 .
drwxr-xr-x  1 root   root     43 Apr 13 23:39 ..
drwxrwsr-x 16 root   fluent 4096 Mar 22 09:12 buffer
drwxrwsrwx  3 root   fluent 4096 Apr 11 08:47 etc
drwxr-xr-x  2 fluent fluent    6 Jan  6 13:06 log
drwxr-xr-x  2 fluent fluent    6 Jan  6 13:06 plugins

/fluentd/buffer:
total 144
drwxrwsr-x 16 root   fluent  4096 Mar 22 09:12 .
drwxr-xr-x  1 fluent fluent    20 Apr 13 23:39 ..
drwxrwsr-x  2 fluent fluent 36864 Apr 14 07:04 logs.containers
drwxrwsr-x  2 fluent fluent 24576 Apr 14 07:04 logs.default
drwxrwsr-x  2 fluent fluent 24576 Mar 30 09:07 logs.kubelet
drwxrwsr-x  2 fluent fluent  4096 Apr 14 07:04 logs.systemd
drwxrws---  2 root   fluent 16384 Mar 22 09:12 lost+found
drwxrwsr-x  2 fluent fluent  4096 Mar 22 09:12 metrics.apiserver
drwxrwsr-x  2 fluent fluent  4096 Mar 22 09:12 metrics.container
drwxrwsr-x  2 fluent fluent  4096 Mar 22 09:12 metrics.control_plane
drwxrwsr-x  2 fluent fluent  4096 Mar 22 09:12 metrics.controller
drwxrwsr-x  2 fluent fluent  4096 Mar 22 09:12 metrics.default
drwxrwsr-x  2 fluent fluent  4096 Mar 22 09:12 metrics.kubelet
drwxrwsr-x  2 fluent fluent  4096 Mar 22 09:12 metrics.node
drwxrwsr-x  2 fluent fluent  4096 Mar 22 09:12 metrics.scheduler
drwxrwsr-x  2 fluent fluent  4096 Mar 22 09:12 metrics.state

/fluentd/buffer/logs.containers:
total 64
drwxrwsr-x  2 fluent fluent 36864 Apr 14 07:04 .
drwxrwsr-x 16 root   fluent  4096 Mar 22 09:12 ..
-rw-r--r--  1 fluent fluent  1852 Apr 14 07:04 buffer.b5bfe95cf7e20838aed9014d09980aba8.log
-rw-r--r--  1 fluent fluent   219 Apr 14 07:04 buffer.b5bfe95cf7e20838aed9014d09980aba8.log.meta
-rw-r--r--  1 fluent fluent   537 Apr 14 07:04 buffer.b5bfe95cf7e98edfbbb12d566b002de1a.log
-rw-r--r--  1 fluent fluent   224 Apr 14 07:04 buffer.b5bfe95cf7e98edfbbb12d566b002de1a.log.meta
-rw-r--r--  1 fluent fluent  1222 Apr 14 07:04 buffer.q5bfe95cca51f52703231ea1ccef32d66.log
-rw-r--r--  1 fluent fluent   219 Apr 14 07:04 buffer.q5bfe95cca51f52703231ea1ccef32d66.log.meta

/fluentd/buffer/logs.default:
total 44
drwxrwsr-x  2 fluent fluent 24576 Apr 14 07:04 .
drwxrwsr-x 16 root   fluent  4096 Mar 22 09:12 ..
-rw-r--r--  1 fluent fluent     0 Apr 13 23:39 buffer.b5bfe325e05a7dd74410c7aa98830ed7d.log
-rw-r--r--  1 fluent fluent  1478 Apr 14 07:04 buffer.b5bfe95cf7e6b122d55dbd64c4eb27e24.log
-rw-r--r--  1 fluent fluent   108 Apr 14 07:04 buffer.b5bfe95cf7e6b122d55dbd64c4eb27e24.log.meta
-rw-r--r--  1 fluent fluent   874 Apr 14 07:04 buffer.b5bfe95d06b2b1b66c5ab634826f9cfb4.log
-rw-r--r--  1 fluent fluent    86 Apr 14 07:04 buffer.b5bfe95d06b2b1b66c5ab634826f9cfb4.log.meta

/fluentd/buffer/logs.kubelet:
total 28
drwxrwsr-x  2 fluent fluent 24576 Mar 30 09:07 .
drwxrwsr-x 16 root   fluent  4096 Mar 22 09:12 ..

/fluentd/buffer/logs.systemd:
total 8
drwxrwsr-x  2 fluent fluent 4096 Apr 14 07:04 .
drwxrwsr-x 16 root   fluent 4096 Mar 22 09:12 ..

/fluentd/buffer/lost+found:
total 20
drwxrws---  2 root fluent 16384 Mar 22 09:12 .
drwxrwsr-x 16 root fluent  4096 Mar 22 09:12 ..

/fluentd/buffer/metrics.apiserver:
total 8
drwxrwsr-x  2 fluent fluent 4096 Mar 22 09:12 .
drwxrwsr-x 16 root   fluent 4096 Mar 22 09:12 ..

/fluentd/buffer/metrics.container:
total 8
drwxrwsr-x  2 fluent fluent 4096 Mar 22 09:12 .
drwxrwsr-x 16 root   fluent 4096 Mar 22 09:12 ..

/fluentd/buffer/metrics.control_plane:
total 8
drwxrwsr-x  2 fluent fluent 4096 Mar 22 09:12 .
drwxrwsr-x 16 root   fluent 4096 Mar 22 09:12 ..

/fluentd/buffer/metrics.controller:
total 8
drwxrwsr-x  2 fluent fluent 4096 Mar 22 09:12 .
drwxrwsr-x 16 root   fluent 4096 Mar 22 09:12 ..

/fluentd/buffer/metrics.default:
total 8
drwxrwsr-x  2 fluent fluent 4096 Mar 22 09:12 .
drwxrwsr-x 16 root   fluent 4096 Mar 22 09:12 ..

/fluentd/buffer/metrics.kubelet:
total 8
drwxrwsr-x  2 fluent fluent 4096 Mar 22 09:12 .
drwxrwsr-x 16 root   fluent 4096 Mar 22 09:12 ..

/fluentd/buffer/metrics.node:
total 8
drwxrwsr-x  2 fluent fluent 4096 Mar 22 09:12 .
drwxrwsr-x 16 root   fluent 4096 Mar 22 09:12 ..

/fluentd/buffer/metrics.scheduler:
total 8
drwxrwsr-x  2 fluent fluent 4096 Mar 22 09:12 .
drwxrwsr-x 16 root   fluent 4096 Mar 22 09:12 ..

/fluentd/buffer/metrics.state:
total 8
drwxrwsr-x  2 fluent fluent 4096 Mar 22 09:12 .
drwxrwsr-x 16 root   fluent 4096 Mar 22 09:12 ..

/fluentd/etc:
total 8
drwxrwsrwx 3 root   fluent 4096 Apr 11 08:47 .
drwxr-xr-x 1 fluent fluent   20 Apr 13 23:39 ..
drwxr-sr-x 2 root   fluent 4096 Apr 11 08:47 ..2021_04_11_08_47_00.046418225
lrwxrwxrwx 1 root   root     31 Apr 11 08:47 ..data -> ..2021_04_11_08_47_00.046418225
lrwxrwxrwx 1 root   root     25 Apr 11 08:47 buffer.output.conf -> ..data/buffer.output.conf
lrwxrwxrwx 1 root   root     18 Apr 11 08:47 common.conf -> ..data/common.conf
lrwxrwxrwx 1 root   root     18 Apr 11 08:47 fluent.conf -> ..data/fluent.conf
lrwxrwxrwx 1 root   root     16 Apr 11 08:47 logs.conf -> ..data/logs.conf
lrwxrwxrwx 1 root   root     44 Apr 11 08:47 logs.enhance.k8s.metadata.filter.conf -> ..data/logs.enhance.k8s.metadata.filter.conf
lrwxrwxrwx 1 root   root     43 Apr 11 08:47 logs.kubernetes.metadata.filter.conf -> ..data/logs.kubernetes.metadata.filter.conf
lrwxrwxrwx 1 root   root     44 Apr 11 08:47 logs.kubernetes.sumologic.filter.conf -> ..data/logs.kubernetes.sumologic.filter.conf
lrwxrwxrwx 1 root   root     23 Apr 11 08:47 logs.output.conf -> ..data/logs.output.conf
lrwxrwxrwx 1 root   root     34 Apr 11 08:47 logs.source.containers.conf -> ..data/logs.source.containers.conf
lrwxrwxrwx 1 root   root     31 Apr 11 08:47 logs.source.default.conf -> ..data/logs.source.default.conf
lrwxrwxrwx 1 root   root     32 Apr 11 08:47 logs.source.sys-logs.conf -> ..data/logs.source.sys-logs.conf
lrwxrwxrwx 1 root   root     31 Apr 11 08:47 logs.source.systemd.conf -> ..data/logs.source.systemd.conf
lrwxrwxrwx 1 root   root     19 Apr 11 08:47 metrics.conf -> ..data/metrics.conf
lrwxrwxrwx 1 root   root     26 Apr 11 08:47 metrics.output.conf -> ..data/metrics.output.conf

/fluentd/etc/..2021_04_11_08_47_00.046418225:
total 68
drwxr-sr-x 2 root fluent 4096 Apr 11 08:47 .
drwxrwsrwx 3 root fluent 4096 Apr 11 08:47 ..
-rw-r--r-- 1 root fluent  215 Apr 11 08:47 buffer.output.conf
-rw-r--r-- 1 root fluent  371 Apr 11 08:47 common.conf
-rw-r--r-- 1 root fluent   61 Apr 11 08:47 fluent.conf
-rw-r--r-- 1 root fluent  202 Apr 11 08:47 logs.conf
-rw-r--r-- 1 root fluent  246 Apr 11 08:47 logs.enhance.k8s.metadata.filter.conf
-rw-r--r-- 1 root fluent  321 Apr 11 08:47 logs.kubernetes.metadata.filter.conf
-rw-r--r-- 1 root fluent  262 Apr 11 08:47 logs.kubernetes.sumologic.filter.conf
-rw-r--r-- 1 root fluent  220 Apr 11 08:47 logs.output.conf
-rw-r--r-- 1 root fluent 2574 Apr 11 08:47 logs.source.containers.conf
-rw-r--r-- 1 root fluent  698 Apr 11 08:47 logs.source.default.conf
-rw-r--r-- 1 root fluent 1004 Apr 11 08:47 logs.source.sys-logs.conf
-rw-r--r-- 1 root fluent 1637 Apr 11 08:47 logs.source.systemd.conf
-rw-r--r-- 1 root fluent 4501 Apr 11 08:47 metrics.conf
-rw-r--r-- 1 root fluent  122 Apr 11 08:47 metrics.output.conf

/fluentd/log:
total 0
drwxr-xr-x 2 fluent fluent  6 Jan  6 13:06 .
drwxr-xr-x 1 fluent fluent 20 Apr 13 23:39 ..

/fluentd/plugins:
total 0
drwxr-xr-x 2 fluent fluent  6 Jan  6 13:06 .
drwxr-xr-x 1 fluent fluent 20 Apr 13 23:39 ..

Also, here is a similar but more compact output:

fluent@collection-sumologic-1:/$ du -a /fluentd/
4       /fluentd/etc/..2021_04_11_08_47_00.046418225/logs.conf
4       /fluentd/etc/..2021_04_11_08_47_00.046418225/logs.kubernetes.metadata.filter.conf
4       /fluentd/etc/..2021_04_11_08_47_00.046418225/buffer.output.conf
4       /fluentd/etc/..2021_04_11_08_47_00.046418225/logs.output.conf
4       /fluentd/etc/..2021_04_11_08_47_00.046418225/common.conf
4       /fluentd/etc/..2021_04_11_08_47_00.046418225/logs.enhance.k8s.metadata.filter.conf
4       /fluentd/etc/..2021_04_11_08_47_00.046418225/fluent.conf
4       /fluentd/etc/..2021_04_11_08_47_00.046418225/metrics.output.conf
4       /fluentd/etc/..2021_04_11_08_47_00.046418225/logs.source.sys-logs.conf
4       /fluentd/etc/..2021_04_11_08_47_00.046418225/logs.source.default.conf
8       /fluentd/etc/..2021_04_11_08_47_00.046418225/metrics.conf
4       /fluentd/etc/..2021_04_11_08_47_00.046418225/logs.source.containers.conf
4       /fluentd/etc/..2021_04_11_08_47_00.046418225/logs.kubernetes.sumologic.filter.conf
4       /fluentd/etc/..2021_04_11_08_47_00.046418225/logs.source.systemd.conf
64      /fluentd/etc/..2021_04_11_08_47_00.046418225
0       /fluentd/etc/logs.kubernetes.sumologic.filter.conf
0       /fluentd/etc/logs.source.systemd.conf
0       /fluentd/etc/buffer.output.conf
0       /fluentd/etc/logs.output.conf
0       /fluentd/etc/common.conf
0       /fluentd/etc/logs.conf
0       /fluentd/etc/logs.kubernetes.metadata.filter.conf
0       /fluentd/etc/logs.enhance.k8s.metadata.filter.conf
0       /fluentd/etc/fluent.conf
0       /fluentd/etc/logs.source.default.conf
0       /fluentd/etc/metrics.conf
0       /fluentd/etc/logs.source.containers.conf
0       /fluentd/etc/metrics.output.conf
0       /fluentd/etc/logs.source.sys-logs.conf
0       /fluentd/etc/..data
68      /fluentd/etc
0       /fluentd/log
0       /fluentd/plugins
4       /fluentd/buffer/metrics.node
4       /fluentd/buffer/metrics.container
16      /fluentd/buffer/lost+found
4       /fluentd/buffer/metrics.scheduler
4       /fluentd/buffer/logs.containers/buffer.b5bfe95b99251f986f3f32ac056069812.log.meta
4       /fluentd/buffer/logs.containers/buffer.b5bfe95b99251f986f3f32ac056069812.log
4       /fluentd/buffer/logs.containers/buffer.q5bfe95b6b5ed23c0a2c269fcd13f785d.log
4       /fluentd/buffer/logs.containers/buffer.q5bfe95b6b5ed23c0a2c269fcd13f785d.log.meta
52      /fluentd/buffer/logs.containers
4       /fluentd/buffer/metrics.kubelet
4       /fluentd/buffer/metrics.state
4       /fluentd/buffer/logs.systemd
24      /fluentd/buffer/logs.kubelet
4       /fluentd/buffer/metrics.controller
4       /fluentd/buffer/metrics.apiserver
4       /fluentd/buffer/metrics.control_plane
4       /fluentd/buffer/logs.default/buffer.b5bfe95b678969fcf27c8a8e7a3249509.log.meta
4       /fluentd/buffer/logs.default/buffer.q5bfe95b43968588be4b848764fb81485.log
4       /fluentd/buffer/logs.default/buffer.b5bfe95b678969fcf27c8a8e7a3249509.log
4       /fluentd/buffer/logs.default/buffer.q5bfe95b43968588be4b848764fb81485.log.meta
0       /fluentd/buffer/logs.default/buffer.b5bfe325e05a7dd74410c7aa98830ed7d.log
40      /fluentd/buffer/logs.default
4       /fluentd/buffer/metrics.default
176     /fluentd/buffer
244     /fluentd/

I am also attaching logs from the container if you wish to take a look:

sumo-fluentd.logs.zip

@sumo-drosiek
Contributor

@serhatcetinkaya, one more question: how long after the restart did the issue occur?

@serhatcetinkaya
Author

@sumo-drosiek we have the collection deployed in 10 different Kubernetes clusters, and so far there doesn't seem to be any pattern: some clusters hit the issue hours after a restart, others only after days.

@sumo-drosiek
Contributor

@serhatcetinkaya
We released v2.1.1.
It doesn't change much in the fluentd image, but maybe it will solve the issue.

In the meantime, I will try to reproduce the issue.

@serhatcetinkaya
Author

Thank you @sumo-drosiek, I am beginning to update the image on our side 👍

@djsly
Contributor

djsly commented Apr 21, 2021

We got hit by this as well:

#1001 (comment)

@sumo-drosiek
Contributor

@djsly, @serhatcetinkaya
As we are not able to reproduce the issue, the only solution we can offer at this moment is to disable gzip compression. I'm going to update the documentation to help with that.
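
For a non-helm setup like the ConfigMap above, a minimal sketch of that change, assuming the "not in gzip format" error comes from the buffer-level compression configured in buffer.output.conf (rather than from the HTTP compress/compress_encoding settings in logs.output.conf), could look like this:

  buffer.output.conf: |-
    # was: compress "gzip" ("text" stores buffer chunks uncompressed)
    compress "text"
    flush_interval "5s"
    flush_thread_count "8"
    chunk_limit_size "1m"
    total_limit_size "128m"
    queued_chunks_limit_size "128"
    overflow_action drop_oldest_chunk
    retry_max_interval "10m"
    retry_forever "true"

Chunks already written with gzip compression would presumably still fail after this change, so clearing the buffer PVCs as described earlier in the thread may also be needed.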


In case you would like to help us with the issue, we prepared a fluentd image based on fluentd 1.11.5 instead of 1.12.0 (this is a proposed solution, but we are unable to verify it). To use it, please add the following config to values.yaml:

fluentd:
  image:
    tag: v1.11.5-sumo-2

We could also move the investigation forward if you could provide us with the stacktrace of the exception from fluentd, which we understand is not easy due to overflow and file rotation. To disable suppression of repeated stacktraces in fluentd, you can add --suppress-repeated-stacktrace false to the statefulset command, using the following command (please mind the statefulset and namespace names):

kubectl patch -n sumologic statefulsets collection-sumologic-fluentd-logs -p '{"spec": {"template": {"spec": {"containers": [{"name": "fluentd", "command": ["tini", "--", "/bin/entrypoint.sh", "--suppress-repeated-stacktrace", "false"]}]}}}}'

Note: This can generate a lot of additional logs, which can result in exceeding limits in Sumo and can even make the cluster unhealthy. By design, warn logs shouldn't be forwarded to Sumo, but please be aware that it can impact overall collection behavior.

To prevent fluentd logs from being processed, please add the following filter to the fluent-bit configuration in values.yaml:

fluent-bit:
  config:
    filters: |-
      [FILTER]
          Name    grep
          Match   *fluentd*
          Exclude log .

      [FILTER]
          Name kubernetes
          Match kube.*
          Merge_Log On
          Keep_Log Off
          K8S-Logging.Parser On
          K8S-Logging.Exclude On

@serhatcetinkaya
Author

@serhatcetinkaya
We released v2.1.1.
It doesn't change much in the fluentd image, but maybe it will solve the issue.

In the meantime, I will try to reproduce the issue.

Hi @sumo-drosiek,

After making this change we haven't seen the error in any of our environments since April 19. I was going to update the discussion but wanted to wait until the end of the month to make sure it really solved the problem. Just to let you know, we are fine now. I will monitor things for a couple more days and update this discussion again.

Thanks for the help

@sumo-drosiek
Contributor

@serhatcetinkaya Thank you for the update

@djsly just to confirm: did you follow the troubleshooting guide?

@sumo-drosiek
Contributor

Documentation has been updated: https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/main/deploy/docs/Troubleshoot_Collection.md#gzip-compression-errors

@serhatcetinkaya
Author

I will close this ticket as the issue is resolved on our side.
Thanks for the help @sumo-drosiek
