[Elastic Agent] Error extracting container id in kubernetes #27216

Closed
blakerouse opened this issue Aug 3, 2021 · 32 comments · Fixed by #27689
Labels
Team:Integrations

Comments

@blakerouse
Contributor

Running the https://github.com/elastic/beats/tree/master/deploy/kubernetes/elastic-agent-standalone deployment in GKE with version 7.13.4 results in the running Filebeat repeatedly logging the following error:

[elastic_agent.filebeat][error] Error extracting container id - source value does not contain matcher's logs_path '/var/lib/docker/containers/'.

I was trying to reproduce #25435, but came across this issue instead.

botelastic bot added the needs_team label Aug 3, 2021
ChrsMark added the Team:Integrations label Aug 3, 2021
@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

botelastic bot removed the needs_team label Aug 3, 2021
@ChrsMark
Member

ChrsMark commented Aug 3, 2021

After debugging this offline we found that the error comes from add_kubernetes_metadata. While Elastic Agent does not explicitly enable the processor, the underlying Filebeat process

/usr/share/elastic-agent/state/data/install/filebeat-7.13.4-linux-x86_64/filebeat -E setup.ilm.enabled=false ...

runs with the default config at /usr/share/elastic-agent/state/data/install/filebeat-7.13.4-linux-x86_64/filebeat.yml, which has the processor enabled by default. This means that even though the processor initialises properly, it fails because the matchers are not properly configured.
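For reference, the processors block in the stock filebeat.yml bundled with the 7.13.x package looks roughly like the sketch below (paraphrased, not an exact copy of the shipped file). With no matchers configured, add_kubernetes_metadata falls back to its default logs_path matcher, which expects log paths under /var/lib/docker/containers/ and therefore emits the error above for anything else:

processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~   # enabled, but with no matchers -> default logs_path matcher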

Although it seems harmless, it is filling up the logs, so we need a way to handle it better. I think that improving the logging in the processor's code might help here; tbh I don't see any reason to log this kind of message as an error, maybe we can switch them to debug.

@masci , @exekias , @MichaelKatsoulis do you think we can push it for 7.15 (even as a bug-fix after ff)?

@exekias
Contributor

exekias commented Aug 3, 2021

How about not enabling this by default? We mostly rely on dynamic inputs, also for K8s logs, so add_k8s_metadata, while still useful for other cases, is not doing much atm (I believe?). The same goes for add_docker_metadata.

@ChrsMark
Member

ChrsMark commented Aug 3, 2021

You mean not enabling the processors in the Beats configs in general, right? Or disabling them only when they are run by Agent?

@blakerouse
Contributor Author

Being that Elastic Agent is now GA and the way forward, maybe just disabling them in the default configuration would be okay?

@exekias
Contributor

exekias commented Aug 4, 2021

Being that Elastic Agent is now GA and the way forward, maybe just disabling them in the default configuration would be okay?

This would be a breaking change. I was thinking more about disabling them in agent only

@MichaelKatsoulis
Contributor

This would be a breaking change. I was thinking more about disabling them in agent only

This is something that needs to be updated on the agent side so that it will not use the default filebeat configuration.

Removing add_kubernetes_metadata from the default config will affect also non-agent uses of filebeat.

@exekias
Contributor

exekias commented Aug 10, 2021

@blakerouse would it be possible to remove this processor from the Agent beats?

@blakerouse
Contributor Author

@exekias At the moment we rely on the default configuration that is shipped with a Beat; by changing that one behavior we affect all other Beats that might rely on something from their default configuration.

We would need to send an empty list of processors in the configuration through the control protocol, but would filebeat even reload that section?

I understand that removing the default breaks things for others, but only in the case where they are using the default configuration without any changes, correct? Is filebeat even usable with a default configuration and no changes?

@MichaelKatsoulis
Contributor

To my understanding @exekias, the add_kubernetes_metadata processor is needed in Kubernetes environments only.
The default config of filebeat is overridden in the proposed manifests we have for deploying filebeat in kubernetes (the configmap).

So removing the processor from the default config wouldn't affect that.
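As an illustration of that point (a minimal sketch with placeholder names, not the actual manifest), the standalone kubernetes deployment mounts its own filebeat.yml from a ConfigMap, so the default config baked into the package is never read:

apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config              # illustrative name
  namespace: kube-system
data:
  filebeat.yml: |-
    filebeat.autodiscover:
      providers:
        - type: kubernetes
          hints.enabled: true
    processors:
      - add_cloud_metadata: ~        # note: no add_kubernetes_metadata here
    output.elasticsearch:
      hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']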

@MichaelKatsoulis
Contributor

Could we maybe leverage the if statements in the filebeat/metricbeat yaml, like in packetbeat.yml?

@ChrsMark
Member

What @MichaelKatsoulis proposed above sounds good to me. We need a condition to verify that metadata already exists, maybe check for kubernetes.namespace and, if it is present, skip the processor. But still, will it be considered a breaking change in Beats? Even if it is considered a breaking change, it is for a good reason in general, since it protects metadata from being overridden by the processor. Thoughts @exekias?
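A rough sketch of that idea at the configuration level (hedged: has_fields is an existing Beats condition, but whether this exact guard is the right place for the check is what is being discussed here):

processors:
  - add_kubernetes_metadata:
      when:
        not:
          has_fields: ['kubernetes.namespace']   # skip if k8s metadata is already set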

@exekias
Contributor

exekias commented Aug 31, 2021

Any option sounds good to me. Also consider that, given the proximity of 8.0, the possibility of doing this as a breaking change is not that far off.

@ChrsMark
Member

After rethinking this and chatting offline with Mike, I think we can avoid making the change at the configuration level with the two options below:

  1. Resolve the log flooding issue by fixing the logging levels. The level for these specific messages was set to error by Improve some logging messages for add_kubernetes_metadata processor #16866 but was previously debug, so maybe we can revisit this change.
  2. Skip add_kubernetes_metadata enrichment if k8s metadata is already present. This is something we already do in the add_cloud_metadata processor when metadata has already been added by the aws module.

Personally I'm +1 for applying both changes.

@MichaelKatsoulis
Contributor

Skip add_kubernetes_metadata enrichment if k8s metadata is already present. This is something we already do in the add_cloud_metadata processor when metadata has already been added by the aws module.

@exekias If you don't have any objection to that proposal, I will create a PR to fix this. I believe it is the correct approach, as the add_kubernetes_metadata processor is not actually needed in scenarios where the metadata is already present thanks to the kubernetes dynamic provider.
It also tackles the problem at its source rather than updating configuration files.

@exekias
Contributor

exekias commented Sep 1, 2021

SGTM!

@adammike

adammike commented Oct 1, 2021

Any idea when this is going to make it into a release? This bug is still present in 7.15.0

@ChrsMark
Member

ChrsMark commented Oct 3, 2021

@adammike this one will be fixed in the 7.16 version of elastic-agent.

@adammike

adammike commented Oct 4, 2021

Seeing as 7.15.1 is not out yet, I assume 7.16 is months away?

@ChrsMark
Member

ChrsMark commented Oct 5, 2021

7.16 is not coupled with any of the 7.15.x releases; the scopes are different. However, 7.16 is not frozen yet, so it will take some time, but not too much :). Btw this is not a critical bug, you can just ignore it, right? The only problem is that it might overflow the logs/disk.

@tomsseisums

tomsseisums commented Nov 15, 2021

Btw this is not a critical bug, you can just ignore it, right? The only problem is that it might overflow the logs/disk.

In our case, using Elastic Cloud, this simply kills everything, because it logs/sends these like 10 times PER SECOND.

@ChrsMark
Member

Hey @tomsseisums, sorry to hear that :(. Could you use 7.16.0-SNAPSHOT until the official release of 7.16 (it's coming really soon)? Unfortunately we have missed the 7.15.x releases.

@tomsseisums

@ChrsMark Elastic Cloud itself is limited to 7.15, and upgrading the agent to the 7.16 snapshot results in:

Error: fail to enroll: fail to execute request to fleet-server: status code: 400, fleet-server returned an error: UnsupportedVersion, message: version is not supported

@ChrsMark
Member

Well, in Elastic Cloud you can choose snapshot versions in GCP Belgium, I think. However, this will take you out of any support/SLA, so be sure that you actually want to do this and understand what implications it would have for future updates.

@and-stuber

Concluding... isn't it possible to use the Elastic Agent to monitor K8s today, using Elastic Cloud?

@LaurisJakobsons

@ChrsMark In our case, it seems like the issue still remains, at least to some extent. When elastic agent is started, it still fills the logs with something like 10k to 20k entries per minute. Although it seems to cool down after a while, and the errors disappear once it starts skipping add_kubernetes_metadata (which supposedly was added as a fix for this problem). Is that the best we can do for now?

@rjbaucells

I just started an Elastic Cloud trial and I see this error using 8.0; is the issue fixed?

@ChrsMark
Member

Hey folks! We have identified that the issue persists, but for another reason, explained at #29767. This will be resolved properly with elastic/elastic-agent#90, so I would suggest following that issue too (fyi @ph).

@WoodyWoodsta

@ChrsMark For me, the errors (also in the range of 20k per minute) appear to be caused by the fact that /var/lib/docker/containers is an empty folder. I don't see anywhere in the discussions anyone proposing to solve what is probably the immediate problem: that add_kubernetes_metadata fails if that folder is empty.

Personally I don't care if that processor is enabled by default without a way to change it, so long as it doesn't fail at this magnitude if what it's trying to find is empty (which is clearly a valid scenario).

Notes

My on-prem cluster is running with the containerd CRI.

Mounting individual folders like /var/log/pods and /var/log/containers for containerd kubernetes log ingestion avoids the problem. Mounting the entire /var/log folder (which I need for syslog and auth ingestion) introduces the problem for me, so I'm working under the assumption that the empty /var/lib/docker/containers is the cause of the issue.
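For completeness, a minimal sketch of the narrower mounts described above in a Filebeat/Agent DaemonSet (volume and mount names are illustrative):

volumeMounts:                    # in the container spec
  - name: varlogcontainers
    mountPath: /var/log/containers
    readOnly: true
  - name: varlogpods
    mountPath: /var/log/pods
    readOnly: true
volumes:                         # in the pod spec
  - name: varlogcontainers
    hostPath:
      path: /var/log/containers
  - name: varlogpods
    hostPath:
      path: /var/log/pods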

@ChrsMark
Member

ChrsMark commented Aug 31, 2022

@ChrsMark For me, the errors (also in the range of 20k per minute) appear to be caused by the fact that /var/lib/docker/containers is an empty folder. [...]

Hey @WoodyWoodsta! If the processor is failing while it is enabled intentionally, then we should handle that in another issue. In the case of Elastic Agent, the processor should not be enabled by default, and that is the purpose of this issue.

Are you running Elastic Agent and seeing this issue? If so please keep track of elastic/elastic-agent#90 (fyi @jlind23).

If you still want to use the processor but hit issues, please open another issue, because that is a different use case. In any case, at the moment the Elastic Agent automatically adds k8s metadata for most cases without the need to enable the processor.
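For context, the automatic enrichment referred to here comes from the kubernetes dynamic provider in the standalone agent policy; a minimal sketch (values illustrative, as in the standalone manifests):

providers.kubernetes:
  node: ${NODE_NAME}             # NODE_NAME is typically injected via the Downward API
  scope: node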

@WoodyWoodsta

@ChrsMark Thanks - I just wanted to point out that, on top of the processor being enabled/disabled by default (which seems to be the focus of the discussions in related issues and threads), if the /var/lib/docker/containers directory is empty, it fails. If I've understood correctly, anyone with that processor enabled, whether intentionally or not, who uses containerd for their cluster but has docker installed on a node, will have this error.

If that sounds like a separate thing to you, I'm more than happy to open a new issue!

@ChrsMark
Member

Yes @WoodyWoodsta, feel free to file a different issue for this :). It's quite possible that this is a configuration issue or just a corner case we need to fix. Let's take the discussion there once we have the new issue, though.

elastic locked as resolved and limited conversation to collaborators Aug 31, 2022