kubernetes_logs error handling RFC #7527

Open · 7 tasks

binarylogic opened this issue May 20, 2021 · 11 comments
Labels
  • needs: rfc (Needs an RFC before work can begin.)
  • platform: kubernetes (Anything `kubernetes` platform related)
  • source: kubernetes_logs (Anything `kubernetes_logs` source related)
  • type: task (Generic non-code related tasks)

Comments

@binarylogic
Contributor

binarylogic commented May 20, 2021

Given:

  1. The complexity of the kubernetes_logs source
  2. The fact that we've failed to handle errors in multiple places
  3. And the lack of ownership/knowledge in this area of the code

I think we'd benefit from an error-handling RFC for this source. It would let us see the errors from a bird's-eye view, determine how to handle each one explicitly, and serve as a resource for posterity.

@StephenWakely
Contributor

StephenWakely commented Jun 10, 2021

As noted by a user on Discord, a "connection reset by peer" error can still cause Vector to stop processing logs.

@DC-NunoAl

DC-NunoAl commented Jun 16, 2021

I'm currently facing the same problem: "connection reset by peer".

We're using:

  • Vector 0.14.0
  • AKS

Vector is watching around 10 pods, using a label selector to disregard other pods.
We have 2 Vector agent pods (2 nodes).

We're also using the following resource limits:

resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "1024Mi"
    cpu: "2000m"

This problem was triggered by attempting to send 1000000 logs in 5 minutes.

Some internal vector logs:

2021-06-16T15:29:51.198229718Z Jun 16 15:29:51.198 ERROR source{component_kind="source" component_name=k8s component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch invocation failed. error=Recoverable { source: Request { source: CallRequest { source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 110, kind: TimedOut, message: "Connection timed out" })) } } } internal_log_rate_secs=5
2021-06-16T15:29:51.198276522Z Jun 16 15:29:51.198 WARN source{component_kind="source" component_name=k8s component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Http Error in invocation! Your k8s metadata may be stale. Continuing Loop. error=Request { source: CallRequest { source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 110, kind: TimedOut, message: "Connection timed out" })) } }
2021-06-16T15:29:51.198394834Z Watch invocation failed.
2021-06-16T15:29:51.198410936Z Http Error in invocation! Your k8s metadata may be stale. Continuing Loop.
2021-06-16T15:29:52.205654677Z Jun 16 15:29:52.205 ERROR source{component_kind="source" component_name=k8s component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
2021-06-16T15:29:52.205706682Z Jun 16 15:29:52.205 WARN source{component_kind="source" component_name=k8s component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
2021-06-16T15:29:52.205812992Z Watch stream failed.
2021-06-16T15:29:52.205930404Z Handling desync.

@denispanferov

denispanferov commented Jul 22, 2021

> I'm currently facing the same problem: "connection reset by peer". […]

I have a similar problem.
Vector 0.14.0
Kubernetes 1.14.10

Jul 22 11:41:56.117 ERROR source{component_kind="source" component_name=svc1 component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
Jul 22 11:41:56.117  WARN source{component_kind="source" component_name=svc1 component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync

Also, sometimes Vector does not collect logs for some pods (e.g. a service has 4 replicas; Vector collects logs for 2 of them, and logs from the other 2 are absent). This seemed to be fixed by kubernetes_logs source doesn't detect new pods #5846, but it still reproduces.

@spencergilbert
Contributor

Again, desync was reported to cause a failure to collect logs: #8616

@alexgavrisco
Contributor

Given that desync happens on a regular basis (e.g. I see our control plane periodically log lines like `cacher.go:162] Terminating all watchers from cacher *policy.PodSecurityPolicy` for different resources), shouldn't it just be part of the standard flow rather than being treated as an error?
After checking the flow for desync, it seems to be related to this assumption. That would explain why the internal state appears to track only existing pods, without adding new ones.
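As a rough, self-contained sketch (stand-in types, not Vector's real watcher internals), the kind of loop I have in mind treats desync as part of the normal flow: re-list, rebuild local state, and resume watching, while only genuinely fatal errors surface loudly:

// Rough sketch with stand-in types; not Vector's actual watcher API.
enum WatchOutcome {
    Event(String),       // a watch event to apply to the local metadata cache
    Desync,              // resource version too old; local state must be rebuilt
    Recoverable(String), // transient network/HTTP error; retry the watch
    Fatal(String),       // e.g. BadStatus { status: 403 }; must surface loudly
}

fn main() {
    // Stubbed sequence standing in for successive watch invocations.
    let outcomes = vec![
        WatchOutcome::Event("pod added".into()),
        WatchOutcome::Recoverable("connection reset by peer".into()),
        WatchOutcome::Desync,
        WatchOutcome::Event("pod added after re-list".into()),
        WatchOutcome::Fatal("BadStatus { status: 403 }".into()),
    ];

    for outcome in outcomes {
        match outcome {
            WatchOutcome::Event(ev) => println!("apply event: {ev}"),
            WatchOutcome::Desync => {
                // Expected behaviour: the API server periodically terminates
                // watchers, so re-list, rebuild the pod metadata cache, and
                // resume watching; not an error, and not a reason to stop.
                println!("desync: re-list, rebuild metadata cache, resume watch");
            }
            WatchOutcome::Recoverable(err) => {
                // Transient transport errors: back off and retry the watch.
                eprintln!("recoverable watch error, will retry: {err}");
            }
            WatchOutcome::Fatal(err) => {
                // Fatal errors should not end in a silent "graceful" shutdown
                // of the source while logs quietly stop flowing.
                eprintln!("fatal watch error: {err}");
            }
        }
    }
}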

Due to the nature of this issue, it is a major problem for us. While I seem to have found a way to indirectly detect Vector pods that no longer forward logs from new pods (via stream metrics), it means I can only react by restarting Vector.
I pushed hard for a Vector upgrade from a really old version (0.8) to 0.13+, which makes a rollback almost impossible :(

@spencergilbert
Contributor

Previous work from @StephenWakely should have made every error scenario retried rather than a hard failure, so theoretically desync should already be "part of the standard flow" 🤔

@alexgavrisco
Contributor

Yes, I remember that patch as well (it's why I upgraded from 0.13.1 to 0.15 in the first place). Judging from the counters/logs that Vector emits, I concluded that desync is still considered an error.
I haven't checked whether all Vector pods show the same pattern (desync in the logs, then no more requests/stream objects consumed), but at least the few pods I did check show the same symptoms.

@alexgavrisco
Contributor

Discovered a few more pods that got into the same state. All of them follow the same pattern:

Aug 05 00:57:02.437 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Recoverable { source: K8sStream { source: Reading { source: hyper::Error(Body, Custom { kind: UnexpectedEof, error: "unexpected EOF during chunk size line" }) } } } internal_log_rate_secs=5
Aug 05 00:57:02.437  WARN source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Http Error in invocation! Your k8s metadata may be stale. Continuing Loop. error=K8sStream { source: Reading { source: hyper::Error(Body, Custom { kind: UnexpectedEof, error: "unexpected EOF during chunk size line" }) } }
Aug 05 00:57:04.115 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Internal log [Watch stream failed.] is being rate limited.
Aug 05 00:57:04.115  WARN source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
Aug 05 00:57:08.279 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch invocation failed. error=Other { source: BadStatus { status: 403 } } internal_log_rate_secs=5
Aug 05 00:57:08.279 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::kubernetes::reflector: Watcher error. error=BadStatus { status: 403 }
Aug 05 00:57:08.279 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Reflector process exited with an error. error=watch invocation failed
Aug 05 00:57:08.281  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Event processing loop completed gracefully.
Aug 05 00:57:08.282  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: File server completed gracefully.
Aug 05 00:57:08.282  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Done.

I've detected them by checking for 0 events related to watch streams or checkpointing. However, I'm hesitant to use that as an alert condition. I'd prefer that Vector simply terminate if the kubernetes_logs source enters an inconsistent state.
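
For illustration, a minimal sketch of what I mean by terminating instead of a "graceful" completion (hypothetical code, not Vector's actual shutdown path):

// Hypothetical sketch: abort the process when the reflector ends with an
// error, instead of letting the kubernetes_logs source wind down "gracefully"
// while log collection silently stops. A non-zero exit gets the pod restarted
// by Kubernetes and is easy to see, with no custom metric checks needed.
use std::process;

// Stand-in for the reflector task's terminal result.
fn run_reflector() -> Result<(), String> {
    Err("watch invocation failed: BadStatus { status: 403 }".into())
}

fn main() {
    if let Err(err) = run_reflector() {
        eprintln!("reflector exited with an error, aborting: {err}");
        process::exit(1);
    }
    println!("reflector completed gracefully");
}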

@alexgavrisco
Contributor

@binarylogic @StephenWakely @spencergilbert Is there any attack plan (or a reliable workaround) for this issue?
It's quite concerning that some Vector pods can simply stop forwarding logs in any Kubernetes cluster, and manually checking every Vector pod doesn't scale.

@MOZGIII
Contributor

MOZGIII commented Sep 17, 2021

Apparently, a lot of the error conditions come from hyper errors not being handled properly. That's a shame, because the errors we need to handle aren't all well documented.
There are some clearly unmatched expectations in the data-stream parsing logic, and that logic could be coupled more tightly with hyper-specific errors to solve most of these cases (as far as I can tell from the reports).
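
As a purely illustrative sketch (stand-in types rather than hyper's actual error type or Vector's code), the kind of explicit classification I have in mind looks roughly like this:

// Illustrative only: classify each stream/watch error explicitly and choose a
// disposition, so nothing unexpected falls through and stops the source.
#[derive(Debug)]
enum Disposition {
    RetryWatch, // transient transport problem: back off and re-invoke the watch
    Relist,     // body ended mid-stream / state is stale: re-list, then watch again
    Fatal,      // e.g. authorization failures: surface loudly, don't exit quietly
}

// Stand-in for the subset of information we'd read off a hyper/HTTP error.
struct StreamError {
    is_connect: bool,         // failed to establish the connection
    is_incomplete_body: bool, // body ended unexpectedly (EOF mid-chunk)
    http_status: Option<u16>, // set when the API server returned a bad status
}

fn classify(err: &StreamError) -> Disposition {
    match err {
        StreamError { http_status: Some(s), .. } if *s == 401 || *s == 403 => Disposition::Fatal,
        StreamError { is_connect: true, .. } => Disposition::RetryWatch,
        StreamError { is_incomplete_body: true, .. } => Disposition::Relist,
        // Anything unrecognized: retry conservatively rather than shutting down.
        _ => Disposition::RetryWatch,
    }
}

fn main() {
    // The "unexpected EOF during chunk size line" error from the reports above:
    let eof = StreamError { is_connect: false, is_incomplete_body: true, http_status: None };
    println!("{:?}", classify(&eof)); // Relist

    // The BadStatus { status: 403 } error:
    let forbidden = StreamError { is_connect: false, is_incomplete_body: false, http_status: Some(403) };
    println!("{:?}", classify(&forbidden)); // Fatal
}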

Hope this helps :)

@StephenWakely
Contributor

> Is there any attack plan (or a reliable workaround) for this issue?

We are hoping to nail this one down next quarter.
