Fluentd fails to cleanup buffer metadata #2593

Closed
bhperry opened this issue Aug 29, 2019 · 7 comments · Fixed by #2598
Labels
bug Something isn't working

Comments

@bhperry

bhperry commented Aug 29, 2019


Describe the bug

I have fluentd tailing kubernetes logs and sending them to elasticsearch and s3. Both outputs are configured to use file buffers in order to avoid losing logs if something happens to the fluentd pod. On one cluster in particular, the s3 file buffer has been filling up with a huge number of empty buffer metadata files (all zero bytes), to the point that it uses up all the inodes on the volume. There is no associated log buffer file, just the metadata. Some of these empty metadata files were months old. I had to manually clean up all the zero-byte metadata files in order to restore cluster functionality. I'm looking for some insight into why these blank metadata files are created, and why they are never cleaned up by fluentd.
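
For reference, here is a rough sketch of that kind of cleanup (the directory is the s3 buffer path from the config below; this is an illustration of the idea rather than the exact commands I ran, and it should only be run while fluentd is stopped):

    # Delete zero-byte .meta files that have no companion chunk file.
    buffer_dir = "/var/log/fluent/s3"

    Dir.glob(File.join(buffer_dir, "*.log.meta")).each do |meta_path|
      chunk_path = meta_path.sub(/\.meta\z/, "")   # e.g. buffer.bXXXX.log
      next unless File.zero?(meta_path)            # only empty metadata files
      next if File.exist?(chunk_path)              # keep metadata that still has a chunk
      File.unlink(meta_path)
    end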

In the past, we were seeing issues with fluentd throughput, and the log buffers would get backed up and fill up all the disk space, so I do have it configured to aggressively create and flush chunks.

${archive_path} is configured from kubernetes metadata like so: ${'container-logs/' + record['kubernetes']['namespace_name'] + '/' + record['kubernetes']['container_name'] + '/' + record['kubernetes']['pod_name'] + '/'}. That along with the 60s chunk interval can make for a lot of chunks when many pods are redeployed, but I still would not expect to see over a million chunks, and it doesn't explain why these empty metadata files are left in place, seemingly forever.

Also asked a question about this here, but it seems to be a different issue from what the original question was about, so I figured I'd create a new one.

To Reproduce

Still working on reproducing this behavior in a dev cluster. Best I can tell, it involves lots of pod redeploys and nightly cronjob pods.

Expected behavior
Empty metadata files should be cleaned up by fluentd.

Your Environment

  • Fluentd or td-agent version: td-agent 1.3.3
  • Operating system (from /etc/os-release):
    NAME="Container Linux by CoreOS"
    ID=coreos
    VERSION=1967.6.0
    VERSION_ID=1967.6.0
    BUILD_ID=2019-02-12-2138
    PRETTY_NAME="Container Linux by CoreOS 1967.6.0 (Rhyolite)"
    ANSI_COLOR="38;5;75"
    HOME_URL="https://coreos.com/"
    BUG_REPORT_URL="https://issues.coreos.com"
    COREOS_BOARD="amd64-usr"
  • Kernel version: 4.14.96-coreos-r1


Your Configuration

    <match **>
      @type copy_ex
      <store ignore_error>
        @id elasticsearch
        @type elasticsearch
        @log_level debug
        id_key _hash
        remove_keys _hash,_tag # Remove temp key
        include_tag_key true
        host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
        port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
        scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME'] || 'http'}"
        logstash_prefix "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'logstash'}"
        logstash_format true
        reload_connections false
        reconnect_on_error true
        reload_on_failure true
        <buffer>
          @type file
          path /var/log/fluent/elasticsearch
          flush_mode interval
          retry_type exponential_backoff
          flush_interval 5s
          retry_forever
          retry_max_interval 30m
          chunk_limit_size 2M
          overflow_action block
        </buffer>
      </store>
      <store ignore_error>
        @id s3
        @type s3
        @log_level info
        aws_key_id "#{ENV['AWS_ACCESS_KEY_ID']}"
        aws_sec_key "#{ENV['AWS_SECRET_ACCESS_KEY']}"
        s3_region "#{ENV['AWS_DEFAULT_REGION']}"
        s3_bucket "#{ENV['AWS_S3_BUCKET']}"
        store_as json
        path ${archive_path}
        time_slice_format %Y-%m-%d-%H:%M
        <buffer archive_path,time>
          @type file
          path /var/log/fluent/s3
          timekey 60s
          timekey_wait 10s
          timekey_use_utc true # use utc
          chunk_limit_size 64m
          flush_thread_count 4
          flush_mode interval
          flush_interval 60s
          retry_forever
          overflow_action block
        </buffer>
      </store>
    </match>

Your Error Log

Couldn't find any error logs, even with debug mode on and the ignore_error flag removed.

bhperry added the bug label on Aug 29, 2019
@repeatedly
Member

repeatedly commented Aug 30, 2019

Hmm... I'm not sure why this happens, because fluentd just deletes the metadata file after chunk flush:

File.unlink(@path, @meta_path)

fluentd doesn't truncate the metadata file, so a 0-size metadata file is a curious situation.

Couldn't find any error logs. Even with debug mode on and removing the ignore_error flag.

Since there are no error logs, I assume this problem happens outside of fluentd.
fluentd v1.7 supports the file_single buffer plugin (https://docs.fluentd.org/buffer/file_single), which may resolve your problem.
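
For example, the s3 buffer section could look roughly like this after switching (a minimal sketch; /var/log/fluent/s3_single is just a placeholder for a new directory, and whether every existing chunk key and option carries over to file_single should be checked against the docs above):

    <buffer archive_path,time>
      @type file_single
      # use a new directory so the old "file" chunks are not mixed in
      path /var/log/fluent/s3_single
      # remaining timekey / flush / retry options as in the existing <buffer> block
    </buffer>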

@bhperry
Author

bhperry commented Aug 30, 2019

Thanks, that is helpful. Any migration tips for moving to file_single without losing logs? I'm assuming it wouldn't know what to do with the existing buffer files if I did a hard cutover.

@bhperry
Author

bhperry commented Aug 30, 2019

I guess maybe leave both configurations in place for a while, but have the old one match on nothing?
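
Something along these lines, maybe (a rough sketch; never.matches.anything is just a placeholder tag, and the s3 options are elided). Since the file buffer resumes any chunks it finds under its path at startup, keeping the old store configured should let the leftover chunks flush out before it gets removed:

    # Old s3 store kept around only to drain the existing file buffer;
    # new traffic goes to the new store with the file_single buffer.
    <match never.matches.anything>
      @type s3
      # same s3 options as the existing store
      <buffer archive_path,time>
        @type file
        path /var/log/fluent/s3
      </buffer>
    </match>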

@bhperry
Author

bhperry commented Aug 30, 2019

I did find a number of warnings like this from around the time the issue happened:
[s3] failed to write data into buffer by buffer overflow action=:block and error_class=Errno::ENOSPC error="No space left on device @ io_write - /var/log/fluent/elasticsearch/buffer.b590d1984b5c61afe21a9c0095dfbc3d1.log"

Is it possible that fluentd creates the meta file, and then leaves it there empty if the chunk fails to be written to disk?
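
For illustration, this is the general pattern that would leave a zero-byte .meta behind (plain Ruby, not fluentd's actual code): opening a file for writing creates it immediately, so if the write itself hits ENOSPC and the error path never unlinks the new file, an empty file stays on disk.

    # Illustrative sketch only, not fluentd's implementation.
    meta_path = "/var/log/fluent/s3/buffer.bXXXX.log.meta"  # placeholder name
    metadata  = Marshal.dump({})                            # stand-in for real chunk metadata
    begin
      File.open(meta_path, "wb") do |f|  # the file now exists with size 0
        f.write(metadata)                # raises Errno::ENOSPC when the disk is full
      end
    rescue Errno::ENOSPC
      # unless something like File.unlink(meta_path) happens here,
      # the empty .meta file is left behind
      raise
    end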

@bhperry
Author

bhperry commented Aug 30, 2019

Yep :)

Filled up most of the disk, then created a pod that logs a ton of junk data to keep the disk full. After doing so, about 8000 empty meta files were generated within about a minute.

Tried out td-agent 1.7.0 as well, same behavior.

Here's a log where it created a meta file that has no data in it (wasn't seeing any logs about them until I removed overflow_action block):

2019-08-30 21:19:24 +0000 [warn]: #0 [s3] failed to write data into buffer by buffer overflow action=:throw_exception
2019-08-30 21:19:24 +0000 [error]: #0 ignore emit error error_class=Fluent::Plugin::Buffer::BufferOverflowError error="can't create buffer metadata for /var/log/fluent/s3/buffer.*.log. Stop creating buffer files: error = No space left on device @ io_write - /var/log/fluent/s3/buffer.b5915c2d2600ffcb343628cfffc1b11af.log.meta"

And the actual file:

ip-10-240-4-155 s3 # ls /var/log/fluent/s3/buffer.b5915c2d2600ffcb343628cfffc1b11af.log.meta -al
-rw-r--r--. 1 root root 0 Aug 30 21:19 /var/log/fluent/s3/buffer.b5915c2d2600ffcb343628cfffc1b11af.log.meta

@ganmacs
Member

ganmacs commented Sep 2, 2019

Thank you for your hard work digging into the root cause! Your information was very helpful to us.
#2598 should fix this issue.

@bhperry
Author

bhperry commented Sep 2, 2019

Great! Thanks for fixing that up so quickly!

aiwantaozi added a commit to aiwantaozi/fluentd that referenced this issue Sep 16, 2019
Problem:
Fluentd will start creating hundreds of thousands of 0 byte buffer log files when the partition it is on runs out of space.
fluent/fluentd#2593

Solution:
Upgrade gems version

Issue:
rancher/rancher#22689
cjellick pushed a commit to rancher/fluentd that referenced this issue Sep 17, 2019
Problem:
Fluentd will start creating hundreds of thousands of 0 byte buffer log files when the partition it is on runs out of space.
fluent/fluentd#2593

Solution:
Upgrade gems version

Issue:
rancher/rancher#22689