2023 High Impact Issues Notice/Catalogue Ticket #542
This was referenced Feb 9, 2023
Created upstream issues for two of the underlying causes of the keepalive networking crash:
This was referenced Feb 13, 2023
Issue for duplicate tag: fluent/fluent-bit#6849
Issue for what we think is part of the keepalive issue: fluent/fluent-bit#6838
We now recommend 2.31.11.
AWS for Fluent Bit Q1 2023 High Impact Issues Notice for Customers
The AWS for Fluent Bit team is aware of four high impact issues, which the entire team is actively working to fix. We apologize for the inconvenience these have caused. Keep checking the docs for updates.
https://github.com/aws/aws-for-fluent-bit/releases
Purpose
Messaging for customers using AWS for Fluent Bit who need to know whether they are impacted by these issues and how to mitigate or resolve them. This doc will continue to be updated with customer guidance.
This doc answers two questions:
The purpose of this doc is not to troubleshoot or explain how the code caused these issues. Links will be added to other docs explaining that.
Known Issues
Please check the section of this doc for each issue for an up-to-date description and list of known mitigations.
See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.
GitHub Tracking Issue
#542
June 1st - Stable Version Upgraded to 2.31.11
As of June 1st, we have upgraded our stable version to 2.31.11. We now recommend this version or higher for all users.
Our stable version is marked here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/AWS_FOR_FLUENT_BIT_STABLE_VERSION
Note on Out of Memory or OOMKill
While the CloudWatch Hang Issue did cause out of memory for some customers, this is not always the case. Furthermore, there are many possible causes of out of memory. None of the other issues noted in this doc should cause out of memory.
The most common cause of OOMKill is simply running under high throughput and the solution is to follow our guide here and update your settings: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention
CloudWatch Hang Issue
Fluent Bit can hang/freeze when the cloudwatch_logs output plugin is used, causing log loss. This generally only happens at very high throughput. Sometimes, this will cause an out of memory or OOMKill.
See: Migrating to cloudwatch go plugin from cloudwatch_logs C plugin at the end of this doc.
Versions Impacted
See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.
All versions prior to 2.28.4
Resolution
This issue was fixed in 2.29.0+
Please note the Duplicate Tag Match SIGSEGV Issue explained in this doc which was introduced in 2.29.0.
Mitigations
Use the cloudwatch go plugin, which is not impacted in any version. See: Migrating to cloudwatch go plugin from cloudwatch_logs C plugin at the end of this doc.
Duplicate Tag Match SIGSEGV Issue
There is an issue when two outputs match the same log tags: https://docs.fluentbit.io/manual/concepts/key-concepts
It only occurs if one of the outputs in the duplicate pair is a cloudwatch_logs output.
For example:
Or for example:
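A minimal sketch of a configuration of the kind described above, in which two outputs match the same tags and one of them is cloudwatch_logs. The region, log group names, and stream prefix are hypothetical placeholders, not values from any customer report.

```
# Hypothetical values throughout: region, log group names, stream prefix.
[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-west-2
    log_group_name    app-logs
    log_stream_prefix app-

# Second output with the same Match pattern, so every tag is matched twice.
[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-west-2
    log_group_name    app-logs-copy
    log_stream_prefix app-
```

Two Match patterns do not need to be textually identical to overlap; a wildcard pattern and a more specific pattern can both select the same tag.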
Relevant Background information on FireLens Tags
https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#firelens-tag-and-match-pattern-and-generated-config
Versions Impacted
See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.
This issue affects 2.29.0+ with cloudwatch_logs outputs. All current customer reports involve cloudwatch_logs and 2.29.0+. If you do not use any cloudwatch_logs outputs, or you use version 2.28.4 or lower, the issue will not occur.
Resolved in: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.31.2
Resolution
https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.31.2
Mitigations
The following mitigations are recommended, in order of the likelihood that we expect them to reduce the frequency of the issue:
Migrate from the cloudwatch_logs output to the older cloudwatch output. See: Migrating to cloudwatch go plugin from cloudwatch_logs C plugin at the end of this doc.
Keepalive and Scheduler SIGSEGV
AWS is aware of an issue in the core networking and scheduler logic of Fluent Bit that causes it to crash with SIGSEGV.
Versions Impacted
See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.
All versions are impacted. The issue is much more likely to occur in 2.29.0+ and all known customer reports are for 2.29.0+
The only way to know for sure if you are impacted by this issue is if you see a SIGSEGV with a stack trace like the following:
Resolution
This bug has been resolved in 2.31.3: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.31.3
Mitigations
Disable net.keepalive in your output configuration. This should prevent the issue from occurring:
https://docs.fluentbit.io/manual/administration/networking
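As a sketch of this mitigation, assuming a cloudwatch_logs output with hypothetical region and log group names, the net.keepalive option is set to Off inside the affected output section:

```
[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    # Hypothetical region and names; replace with your own.
    region            us-west-2
    log_group_name    app-logs
    log_stream_prefix app-
    # Disable keepalive so connections are not reused by this output.
    net.keepalive     Off
```

The setting applies per output, so it would need to be added to each output section you want to protect.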
The AWS team believes that this issue impacts all versions; however, all current customer reports are for 2.29.0+, so downgrading may reduce its frequency. If you decide to downgrade, please read the notice in this doc about the CloudWatch Hang Issue.
S3 SIGSEGV related to asynchronous networking
This issue affects users of the Fluent Bit S3 output who enabled use_put_object: https://docs.fluentbit.io/manual/pipeline/outputs/s3
The issue causes Fluent Bit to crash. It is not known to occur frequently.
Versions Impacted
All versions prior to 2.31.0.
See: FAQ: Which version am I using and how do I change which version I am using? at the end of this doc.
Resolution
Upgrade to 2.31.0.
S3 SIGSEGV with preserve_data_ordering option
Tracked here: #552
Mitigations
Suspected to be introduced in 2.31.1. Either downgrade, or turn the feature off:
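A minimal sketch of turning the feature off, assuming an S3 output with a hypothetical bucket name and region:

```
[OUTPUT]
    Name                   s3
    Match                  *
    # Hypothetical bucket and region; replace with your own.
    bucket                 my-example-bucket
    region                 us-west-2
    # Turn off the option associated with the crash.
    preserve_data_ordering Off
```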
Or, we have also released 2.31.4 and 2.31.5 with reverts of all recent S3 changes. These recent changes seem to have either introduced the issue or made it more frequent.
Migrating to cloudwatch go plugin from cloudwatch_logs C plugin
Please see:
The following options are only supported with Name cloudwatch_logs and must be removed if you switch to Name cloudwatch.
metric_namespace
metric_dimensions
auto_retry_requests
workers
net.connect_timeout
net.connect_timeout_log_error
net.dns.mode
net.dns.prefer_ipv4
net.dns.resolver
net.keepalive
net.keepalive_idle_timeout
net.keepalive_max_recycle
net.source_address
With cloudwatch you can put $() template variables in the log_group_name and log_stream_name options. You can then use default_log_group_name and default_log_stream_name as fallback names if templating fails.
cloudwatch supports direct templating of ECS metadata when you run in ECS: $(ecs_task_id), $(ecs_cluster), or $(ecs_task_arn). With cloudwatch_logs you can only inject values from the log JSONs. If you want to use ECS Metadata in your config with cloudwatch_logs please see: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/init-metadata
cloudwatch_logs templates go in the log_group_template or log_stream_template options and use a $var syntax (see doc). Fallback names if templating fails go in the log_group_name, log_stream_name, or log_stream_prefix options.
Example migration from cloudwatch_logs to cloudwatch:
After migration:
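To make the before and after concrete, here is a sketch of one such migration. The region, log group and stream names, and the kubernetes['pod_name'] record key are all hypothetical and assume your log records carry that key. The first block is a cloudwatch_logs output that uses a template with a fallback stream name, plus two options that only cloudwatch_logs supports.

```
[OUTPUT]
    Name                cloudwatch_logs
    Match               *
    region              us-west-2
    log_group_name      app-logs
    # Fallback stream name used if templating fails.
    log_stream_name     fallback-stream
    # Template in the $var syntax of cloudwatch_logs.
    log_stream_template $kubernetes['pod_name']
    # Only supported by cloudwatch_logs; must be removed when migrating.
    auto_retry_requests true
    net.keepalive       Off
```

A corresponding cloudwatch output under the same assumptions: the template moves into log_stream_name using the $() syntax, the fallback moves to default_log_stream_name, and the cloudwatch_logs-only options are dropped.

```
[OUTPUT]
    Name                    cloudwatch
    Match                   *
    region                  us-west-2
    log_group_name          app-logs
    # Template in the $() syntax of cloudwatch.
    log_stream_name         $(kubernetes['pod_name'])
    # Fallback stream name used if templating fails.
    default_log_stream_name fallback-stream
```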
The following options are only supported with Name cloudwatch and must be removed if you switch to Name cloudwatch_logs.
default_log_group_name
default_log_stream_name
new_log_group_tags
credentials_endpoint
With cloudwatch you can put $() template variables in the log_group_name and log_stream_name options. You can then use default_log_group_name and default_log_stream_name as fallback names if templating fails.
cloudwatch supports direct templating of ECS metadata when you run in ECS: $(ecs_task_id), $(ecs_cluster), or $(ecs_task_arn). With cloudwatch_logs you can only inject values from the log JSONs. If you want to use ECS Metadata in your config with cloudwatch_logs please see: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/init-metadata
cloudwatch_logs templates go in the log_group_template or log_stream_template options and use a $var syntax (see doc). Fallback names if templating fails go in the log_group_name, log_stream_name, or log_stream_prefix options.
FAQ: Which version am I using and how do I change which version I am using?
The first log statement printed by AWS for Fluent Bit is always the version used:
Public Images
Public container images for aws-for-fluent-bit can be found on both:
The above are useful for finding the correct version/tag combination to use when a request to change version is required. Additional information related to public images and tags can be found at https://github.com/aws/aws-for-fluent-bit#public-images.