Fluentd fails to cleanup buffer metadata #2593

Closed
bhperry opened this issue Aug 29, 2019 · 7 comments · Fixed by #2598
Labels
bug Something isn't working

Comments

@bhperry

bhperry commented Aug 29, 2019


Describe the bug

I have fluentd tailing kubernetes logs and sending them to elasticsearch and s3. Both outputs are configured to use file buffers in order to avoid losing logs if something happens to the fluentd pod. On one cluster in particular, the s3 file buffer has been filling up with a huge number of empty buffer metadata files (all zero bytes), to the point that it uses up all the inodes on the volume. There is no associated log buffer file, just the metadata. Some of these empty metadata files were months old. I had to manually clean up all the zero-byte metadata files in order to restore cluster functionality. I'm looking for some insight into why these blank metadata files are created, and why they are never cleaned up by fluentd.
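
For reference, here is a rough sketch of that kind of cleanup (the directory is the s3 buffer path from the config below; this is an illustration of the idea rather than the exact commands I ran, and it should only be run while fluentd is stopped):

    # Delete zero-byte .meta files that have no companion chunk file.
    buffer_dir = "/var/log/fluent/s3"

    Dir.glob(File.join(buffer_dir, "*.log.meta")).each do |meta_path|
      chunk_path = meta_path.sub(/\.meta\z/, "")   # e.g. buffer.bXXXX.log
      next unless File.zero?(meta_path)            # only empty metadata files
      next if File.exist?(chunk_path)              # keep metadata that still has a chunk
      File.unlink(meta_path)
    end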

In the past, we were seeing issues with fluentd throughput, and the log buffers would get backed up and fill up all the disk space, so I do have it configured to aggressively create and flush chunks.

${archive_path} is configured from kubernetes metadata like so: ${'container-logs/' + record['kubernetes']['namespace_name'] + '/' + record['kubernetes']['container_name'] + '/' + record['kubernetes']['pod_name'] + '/'}. That along with the 60s chunk interval can make for a lot of chunks when many pods are redeployed, but I still would not expect to see over a million chunks, and it doesn't explain why these empty metadata files are left in place, seemingly forever.

Also asked a question about this here, but it seems to be a different issue from what the original question was about, so I figured I'd create a new one.

To Reproduce

Still working on reproducing this behavior in a dev cluster. Best I can tell, it involves lots of pod redeploys and nightly cronjob pods.

Expected behavior
Empty metadata files should be cleaned up by fluentd.

Your Environment

  • Fluentd or td-agent version: td-agent 1.3.3
  • Operating system (from /etc/os-release):
    NAME="Container Linux by CoreOS"
    ID=coreos
    VERSION=1967.6.0
    VERSION_ID=1967.6.0
    BUILD_ID=2019-02-12-2138
    PRETTY_NAME="Container Linux by CoreOS 1967.6.0 (Rhyolite)"
    ANSI_COLOR="38;5;75"
    HOME_URL="https://coreos.com/"
    BUG_REPORT_URL="https://issues.coreos.com"
    COREOS_BOARD="amd64-usr"
  • Kernel version: 4.14.96-coreos-r1


Your Configuration

    <match **>
      @type copy_ex
      <store ignore_error>
        @id elasticsearch
        @type elasticsearch
        @log_level debug
        id_key _hash
        remove_keys _hash,_tag # Remove temp key
        include_tag_key true
        host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
        port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
        scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME'] || 'http'}"
        logstash_prefix "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'logstash'}"
        logstash_format true
        reload_connections false
        reconnect_on_error true
        reload_on_failure true
        <buffer>
          @type file
          path /var/log/fluent/elasticsearch
          flush_mode interval
          retry_type exponential_backoff
          flush_interval 5s
          retry_forever
          retry_max_interval 30m
          chunk_limit_size 2M
          overflow_action block
        </buffer>
      </store>
      <store ignore_error>
        @id s3
        @type s3
        @log_level info
        aws_key_id "#{ENV['AWS_ACCESS_KEY_ID']}"
        aws_sec_key "#{ENV['AWS_SECRET_ACCESS_KEY']}"
        s3_region "#{ENV['AWS_DEFAULT_REGION']}"
        s3_bucket "#{ENV['AWS_S3_BUCKET']}"
        store_as json
        path ${archive_path}
        time_slice_format %Y-%m-%d-%H:%M
        <buffer archive_path,time>
          @type file
          path /var/log/fluent/s3
          timekey 60s
          timekey_wait 10s
          timekey_use_utc true # use utc
          chunk_limit_size 64m
          flush_thread_count 4
          flush_mode interval
          flush_interval 60s
          retry_forever
          overflow_action block
        </buffer>
      </store>
    </match>

Your Error Log

Couldn't find any error logs, even with debug mode on and the ignore_error flag removed.

bhperry added the bug label on Aug 29, 2019
@repeatedly
Member

repeatedly commented Aug 30, 2019

Hmm... I'm not sure why this happens, because fluentd just deletes the metadata file after chunk flush:

File.unlink(@path, @meta_path)

fluentd doesn't truncate the metadata file, so a 0-size metadata file is a curious situation.

Couldn't find any error logs. Even with debug mode on and removing the ignore_error flag.

Since there are no error logs, I assume this problem happens outside of fluentd.
fluentd v1.7 supports the file_single buffer plugin (https://docs.fluentd.org/buffer/file_single), which may resolve your problem.
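
For example, the s3 buffer section could look roughly like this after switching (a minimal sketch; /var/log/fluent/s3_single is just a placeholder for a new directory, and whether every existing chunk key and option carries over to file_single should be checked against the docs above):

    <buffer archive_path,time>
      @type file_single
      # use a new directory so the old "file" chunks are not mixed in
      path /var/log/fluent/s3_single
      # remaining timekey / flush / retry options as in the existing <buffer> block
    </buffer>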

@bhperry
Author

bhperry commented Aug 30, 2019

Thanks, that is helpful. Any migration tips for moving to file_single without losing logs? I'm assuming it wouldn't know what to do with the existing buffer files if I did a hard cutover.

@bhperry
Author

bhperry commented Aug 30, 2019

I guess maybe leave both configurations in place for a while, but have the old one match on nothing?
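
Something along these lines, maybe (a rough sketch; never.matches.anything is just a placeholder tag, and the s3 options are elided). Since the file buffer resumes any chunks it finds under its path at startup, keeping the old store configured should let the leftover chunks flush out before it gets removed:

    # Old s3 store kept around only to drain the existing file buffer;
    # new traffic goes to the new store with the file_single buffer.
    <match never.matches.anything>
      @type s3
      # same s3 options as the existing store
      <buffer archive_path,time>
        @type file
        path /var/log/fluent/s3
      </buffer>
    </match>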

@bhperry
Author

bhperry commented Aug 30, 2019

I did find a number of warnings like this from around the time the issue happened:
[s3] failed to write data into buffer by buffer overflow action=:block and error_class=Errno::ENOSPC error="No space left on device @ io_write - /var/log/fluent/elasticsearch/buffer.b590d1984b5c61afe21a9c0095dfbc3d1.log"

Is it possible that fluentd creates the meta file, and then leaves it there empty if the chunk fails to be written to disk?
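
For illustration, this is the general pattern that would leave a zero-byte .meta behind (plain Ruby, not fluentd's actual code): opening a file for writing creates it immediately, so if the write itself hits ENOSPC and the error path never unlinks the new file, an empty file stays on disk.

    # Illustrative sketch only, not fluentd's implementation.
    meta_path = "/var/log/fluent/s3/buffer.bXXXX.log.meta"  # placeholder name
    metadata  = Marshal.dump({})                            # stand-in for real chunk metadata
    begin
      File.open(meta_path, "wb") do |f|  # the file now exists with size 0
        f.write(metadata)                # raises Errno::ENOSPC when the disk is full
      end
    rescue Errno::ENOSPC
      # unless something like File.unlink(meta_path) happens here,
      # the empty .meta file is left behind
      raise
    end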

@bhperry
Author

bhperry commented Aug 30, 2019

Yep :)

Filled up most of the disk, then created a pod that logs a ton of junk data to keep the disk full. After doing so, about 8000 empty meta files were generated within about a minute.

Tried out td-agent 1.7.0 as well, same behavior.

Here's a log where it created a meta file that has no data in it (wasn't seeing any logs about them until I removed overflow_action block):

2019-08-30 21:19:24 +0000 [warn]: #0 [s3] failed to write data into buffer by buffer overflow action=:throw_exception
2019-08-30 21:19:24 +0000 [error]: #0 ignore emit error error_class=Fluent::Plugin::Buffer::BufferOverflowError error="can't create buffer metadata for /var/log/fluent/s3/buffer.*.log. Stop creating buffer files: error = No space left on device @ io_write - /var/log/fluent/s3/buffer.b5915c2d2600ffcb343628cfffc1b11af.log.meta"

And the actual file:

ip-10-240-4-155 s3 # ls /var/log/fluent/s3/buffer.b5915c2d2600ffcb343628cfffc1b11af.log.meta -al
-rw-r--r--. 1 root root 0 Aug 30 21:19 /var/log/fluent/s3/buffer.b5915c2d2600ffcb343628cfffc1b11af.log.meta

@ganmacs
Member

ganmacs commented Sep 2, 2019

Thank you for your hard work digging into the root cause! Your information was very helpful to us.
#2598 should fix this issue.

@bhperry
Author

bhperry commented Sep 2, 2019

Great! Thanks for fixing that up so quickly!

aiwantaozi added a commit to aiwantaozi/fluentd that referenced this issue Sep 16, 2019
Problem:
Fluentd will start creating hundreds of thousands of 0 byte buffer log files when the partition it is on runs out of space.
fluent/fluentd#2593

Solution:
Upgrade gems version

Issue:
rancher/rancher#22689
cjellick pushed a commit to rancher/fluentd that referenced this issue Sep 17, 2019
Problem:
Fluentd will start creating hundreds of thousands of 0 byte buffer log files when the partition it is on runs out of space.
fluent/fluentd#2593

Solution:
Upgrade gems version

Issue:
rancher/rancher#22689