Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stopping of modules started by kubernetes autodiscover #10476

Merged
merged 11 commits into from
Feb 7, 2019

Conversation

jsoriano
Copy link
Member

@jsoriano jsoriano commented Feb 1, 2019

Kubernetes autodiscover only emits events for containers with
an ID in pods with an IP, but when a pod is being terminated,
their containers can lack of ID and the pod itself can lack of IP.
This leads to modules that are never stopped because the
delete event that should stop them lacks of the needed
information.

This change makes two things to avoid this problem:

  • Don't require the pod to have an IP on stop events.
  • Use IDs for containers that don't depend on its state.

@jsoriano jsoriano added bug review libbeat needs_backport PR is waiting to be backported to other branches. containers Related to containers use case v7.0.0 Team:Integrations Label for the Integrations team v6.7.0 v7.0.0-beta1 v6.6.1 labels Feb 1, 2019
@jsoriano jsoriano self-assigned this Feb 1, 2019
@jsoriano jsoriano requested a review from a team as a code owner February 1, 2019 09:43
@jsoriano jsoriano requested a review from a team February 1, 2019 10:05
@jsoriano jsoriano force-pushed the kubernetes-autodiscover-terminating branch from 12df0f0 to b67d810 Compare February 1, 2019 14:11
@exekias
Copy link
Contributor

exekias commented Feb 3, 2019

well done @jsoriano, thank you for taking this! ❤️

ruflin
ruflin previously approved these changes Feb 4, 2019
Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Could you rebase to get CI green?

When deleting pods, they are first marked for deletion and then, after a
grace period or after all containers have stopped, they are actually
deleted.

We were stopping modules only on pod deletion, but at this moment the
container can lack of information of containers that were running when
the modules started, but have been stopping during the grace period.

This change schedules additional stops as soon as we receive an event
with a deletion timestamp. At this moment the pods still contains the
information of all containers and thus the stop is actually done.
@jsoriano jsoriano force-pushed the kubernetes-autodiscover-terminating branch from b67d810 to 5a6ce23 Compare February 5, 2019 22:45
@jsoriano jsoriano changed the title Don't restart modules on terminated pod updates Schedule stop of modules started by kubernetes autodiscover as soon as possible Feb 5, 2019
@jsoriano
Copy link
Member Author

jsoriano commented Feb 5, 2019

TLDR; My original assumption was wrong, modules were not being re-started because of a race condition between last update and delete, but they weren't being stopped because on delete we may not have enough information to stop the modules.

On my first implementation of this change I messed up with the DeletionTimestamp, I was using the GetNanos() method but I have discovered that it always returns 0, so the modules were being deleted as soon as the pod is marked for termination. This fixed the original issue, no modules were running after the pod is stopped, but it was stopping the pods too soon, what can be a problem in some beats.
When I discovered that, I tried to fix it by using Seconds, confirming that all the timestamps were properly calculated. Then I saw the original problem happening again. And then I discovered what was happening.
We only generate events for pods that have an IP, and for containers that have an ID, because we need them, but when a pod is stopping, containers can be dead and without an ID, so now I schedule modules stop as soon as we know that the pod is stopping, at this moment the containers are still alive.

@jsoriano
Copy link
Member Author

jsoriano commented Feb 5, 2019

Writting my previous comment made me think that we may need an identifier for containers that don't depend on its state, so we can start and stop modules normally.

@jsoriano jsoriano changed the title Schedule stop of modules started by kubernetes autodiscover as soon as possible Fix stopping of modules started by kubernetes autodiscover Feb 5, 2019
@jsoriano
Copy link
Member Author

jsoriano commented Feb 5, 2019

Writting my previous comment made me think that we may need an identifier for containers that don't depend on its state, so we can start and stop modules normally.

Yes, much simpler, it works by using another ID, description updated.


// This must be an id that doesn't depend on the state of the container
// so it works also on `stop` if containers have been already deleted.
eventId := fmt.Sprintf("%s.%s", pod.Metadata.GetUid(), c.GetName())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var eventId should be eventID

@jsoriano
Copy link
Member Author

jsoriano commented Feb 6, 2019

@jsoriano I think failing tests are related. Could be that some tests need update.

Except for the adjustement in an event id, failures were legit, I was overwriting also the value for container.id, what was wrong. Great to have these tests 🙂

@jsoriano
Copy link
Member Author

jsoriano commented Feb 6, 2019

jenkins, test this again please

@jsoriano
Copy link
Member Author

jsoriano commented Feb 7, 2019

jenkins, test this again

@jsoriano
Copy link
Member Author

jsoriano commented Feb 7, 2019

Oh, this was not the build I wanted to relaunch 🤦‍♂️

@jsoriano
Copy link
Member Author

jsoriano commented Feb 7, 2019

jenkins, test this again

@jsoriano
Copy link
Member Author

jsoriano commented Feb 7, 2019

I am merging this to start with the backports, I'll do follow-ups with further fixes if needed.

@jsoriano jsoriano merged commit 15f2f26 into elastic:master Feb 7, 2019
@jsoriano jsoriano deleted the kubernetes-autodiscover-terminating branch February 7, 2019 19:03
jsoriano added a commit to jsoriano/beats that referenced this pull request Feb 7, 2019
…0476)

Kubernetes autodiscover only emits events for containers with
an ID in pods with an IP, but when a pod is being terminated,
their containers can lack of ID and the pod itself can lack of IP.
This leads to modules that are never stopped because the
delete event that should stop them lacks of the needed
information.

This change makes two things to avoid this problem:
    * Don't require the pod to have an IP on stop events.
    * Use IDs for containers that don't depend on its state.

(cherry picked from commit 15f2f26)
@jsoriano jsoriano added v7.2.0 and removed needs_backport PR is waiting to be backported to other branches. labels Feb 7, 2019
jsoriano added a commit to jsoriano/beats that referenced this pull request Feb 7, 2019
…0476)

Kubernetes autodiscover only emits events for containers with
an ID in pods with an IP, but when a pod is being terminated,
their containers can lack of ID and the pod itself can lack of IP.
This leads to modules that are never stopped because the
delete event that should stop them lacks of the needed
information.

This change makes two things to avoid this problem:
    * Don't require the pod to have an IP on stop events.
    * Use IDs for containers that don't depend on its state.

(cherry picked from commit 15f2f26)
jsoriano added a commit to jsoriano/beats that referenced this pull request Feb 7, 2019
…0476)

Kubernetes autodiscover only emits events for containers with
an ID in pods with an IP, but when a pod is being terminated,
their containers can lack of ID and the pod itself can lack of IP.
This leads to modules that are never stopped because the
delete event that should stop them lacks of the needed
information.

This change makes two things to avoid this problem:
    * Don't require the pod to have an IP on stop events.
    * Use IDs for containers that don't depend on its state.

(cherry picked from commit 15f2f26)
jsoriano added a commit to jsoriano/beats that referenced this pull request Feb 7, 2019
…0476)

Kubernetes autodiscover only emits events for containers with
an ID in pods with an IP, but when a pod is being terminated,
their containers can lack of ID and the pod itself can lack of IP.
This leads to modules that are never stopped because the
delete event that should stop them lacks of the needed
information.

This change makes two things to avoid this problem:
    * Don't require the pod to have an IP on stop events.
    * Use IDs for containers that don't depend on its state.

(cherry picked from commit 15f2f26)
jsoriano added a commit that referenced this pull request Feb 11, 2019
…10643)

Kubernetes autodiscover only emits events for containers with
an ID in pods with an IP, but when a pod is being terminated,
their containers can lack of ID and the pod itself can lack of IP.
This leads to modules that are never stopped because the
delete event that should stop them lacks of the needed
information.

This change makes two things to avoid this problem:
    * Don't require the pod to have an IP on stop events.
    * Use IDs for containers that don't depend on its state.

(cherry picked from commit 15f2f26)
jsoriano added a commit that referenced this pull request Feb 11, 2019
…10642)

Kubernetes autodiscover only emits events for containers with
an ID in pods with an IP, but when a pod is being terminated,
their containers can lack of ID and the pod itself can lack of IP.
This leads to modules that are never stopped because the
delete event that should stop them lacks of the needed
information.

This change makes two things to avoid this problem:
    * Don't require the pod to have an IP on stop events.
    * Use IDs for containers that don't depend on its state.

(cherry picked from commit 15f2f26)
jsoriano added a commit that referenced this pull request Feb 11, 2019
…10641)

Kubernetes autodiscover only emits events for containers with
an ID in pods with an IP, but when a pod is being terminated,
their containers can lack of ID and the pod itself can lack of IP.
This leads to modules that are never stopped because the
delete event that should stop them lacks of the needed
information.

This change makes two things to avoid this problem:
    * Don't require the pod to have an IP on stop events.
    * Use IDs for containers that don't depend on its state.

(cherry picked from commit 15f2f26)
jsoriano added a commit that referenced this pull request Feb 13, 2019
…10644)

Kubernetes autodiscover only emits events for containers with
an ID in pods with an IP, but when a pod is being terminated,
their containers can lack of ID and the pod itself can lack of IP.
This leads to modules that are never stopped because the
delete event that should stop them lacks of the needed
information.

This change makes two things to avoid this problem:
    * Don't require the pod to have an IP on stop events.
    * Use IDs for containers that don't depend on its state.

(cherry picked from commit 15f2f26)
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
…0476) (elastic#10644)

Kubernetes autodiscover only emits events for containers with
an ID in pods with an IP, but when a pod is being terminated,
their containers can lack of ID and the pod itself can lack of IP.
This leads to modules that are never stopped because the
delete event that should stop them lacks of the needed
information.

This change makes two things to avoid this problem:
    * Don't require the pod to have an IP on stop events.
    * Use IDs for containers that don't depend on its state.

(cherry picked from commit 4162aa9)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants