kubernetes_logs error handling RFC #7527

Open · 7 tasks

binarylogic opened this issue May 20, 2021 · 11 comments
Labels
  • needs: rfc (Needs an RFC before work can begin.)
  • platform: kubernetes (Anything `kubernetes` platform related)
  • source: kubernetes_logs (Anything `kubernetes_logs` source related)
  • type: task (Generic non-code related tasks)

Comments

@binarylogic
Contributor

binarylogic commented May 20, 2021

Given:

  1. The complexity of the kubernetes_logs source
  2. The fact that we've failed to handle errors in multiple places
  3. And the lack of ownership/knowledge in this area of the code

I think we'd benefit from an error-handling RFC for this source. It would let us see the errors from a bird's-eye view, determine how to handle each one explicitly, and serve as a resource for posterity.

@StephenWakely
Contributor

StephenWakely commented Jun 10, 2021

As noted by a user on Discord, a "connection reset by peer" error can still cause Vector to stop processing logs.

@DC-NunoAl

DC-NunoAl commented Jun 16, 2021

I'm currently facing the same problem: "connection reset by peer".

We're using:

  • Vector 0.14.0
  • AKS

Vector is watching around 10 pods, using a label selector to disregard other pods.
We have 2 Vector agent pods (2 nodes).

We're also using the following resource limits:

resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "1024Mi"
    cpu: "2000m"

This problem was triggered by attempting to send 1000000 logs in 5 minutes.

Some internal vector logs:

2021-06-16T15:29:51.198229718Z Jun 16 15:29:51.198 ERROR source{component_kind="source" component_name=k8s component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch invocation failed. error=Recoverable { source: Request { source: CallRequest { source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 110, kind: TimedOut, message: "Connection timed out" })) } } } internal_log_rate_secs=5
2021-06-16T15:29:51.198276522Z Jun 16 15:29:51.198 WARN source{component_kind="source" component_name=k8s component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Http Error in invocation! Your k8s metadata may be stale. Continuing Loop. error=Request { source: CallRequest { source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 110, kind: TimedOut, message: "Connection timed out" })) } }
2021-06-16T15:29:51.198394834Z Watch invocation failed.
2021-06-16T15:29:51.198410936Z Http Error in invocation! Your k8s metadata may be stale. Continuing Loop.
2021-06-16T15:29:52.205654677Z Jun 16 15:29:52.205 ERROR source{component_kind="source" component_name=k8s component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
2021-06-16T15:29:52.205706682Z Jun 16 15:29:52.205 WARN source{component_kind="source" component_name=k8s component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
2021-06-16T15:29:52.205812992Z Watch stream failed.
2021-06-16T15:29:52.205930404Z Handling desync.

@denispanferov

denispanferov commented Jul 22, 2021

> I'm currently facing the same problem: "connection reset by peer". […]

I have a similar problem.
Vector 0.14.0
Kubernetes 1.14.10

Jul 22 11:41:56.117 ERROR source{component_kind="source" component_name=svc1 component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
Jul 22 11:41:56.117  WARN source{component_kind="source" component_name=svc1 component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync

Also, sometimes Vector does not collect logs for some pods (e.g. a service has 4 replicas; Vector collects logs for 2 of them, and logs from the other 2 are absent). This seemed to be fixed by kubernetes_logs source doesn't detect new pods #5846, but it still reproduces.

@spencergilbert
Contributor

Again, desync was reported to cause a failure to collect logs: #8616

@alexgavrisco
Contributor

Given that desync happens on a regular basis (e.g. I see our control plane periodically log lines like `cacher.go:162] Terminating all watchers from cacher *policy.PodSecurityPolicy` for different resources), shouldn't it just be part of the standard flow rather than being treated as an error?
After checking the flow for desync, it seems to be related to this assumption. That would explain why the internal state appears to track only existing pods, without adding new ones.
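As a rough, self-contained sketch (stand-in types, not Vector's real watcher internals), the kind of loop I have in mind treats desync as part of the normal flow: re-list, rebuild local state, and resume watching, while only genuinely fatal errors surface loudly:

// Rough sketch with stand-in types; not Vector's actual watcher API.
enum WatchOutcome {
    Event(String),       // a watch event to apply to the local metadata cache
    Desync,              // resource version too old; local state must be rebuilt
    Recoverable(String), // transient network/HTTP error; retry the watch
    Fatal(String),       // e.g. BadStatus { status: 403 }; must surface loudly
}

fn main() {
    // Stubbed sequence standing in for successive watch invocations.
    let outcomes = vec![
        WatchOutcome::Event("pod added".into()),
        WatchOutcome::Recoverable("connection reset by peer".into()),
        WatchOutcome::Desync,
        WatchOutcome::Event("pod added after re-list".into()),
        WatchOutcome::Fatal("BadStatus { status: 403 }".into()),
    ];

    for outcome in outcomes {
        match outcome {
            WatchOutcome::Event(ev) => println!("apply event: {ev}"),
            WatchOutcome::Desync => {
                // Expected behaviour: the API server periodically terminates
                // watchers, so re-list, rebuild the pod metadata cache, and
                // resume watching; not an error, and not a reason to stop.
                println!("desync: re-list, rebuild metadata cache, resume watch");
            }
            WatchOutcome::Recoverable(err) => {
                // Transient transport errors: back off and retry the watch.
                eprintln!("recoverable watch error, will retry: {err}");
            }
            WatchOutcome::Fatal(err) => {
                // Fatal errors should not end in a silent "graceful" shutdown
                // of the source while logs quietly stop flowing.
                eprintln!("fatal watch error: {err}");
            }
        }
    }
}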

Due to the nature of this issue, it is a major problem for us. While I seem to have found a way to indirectly detect Vector pods that no longer forward logs from new pods (via stream metrics), it means I can only react by restarting Vector.
I pushed hard for a Vector upgrade from a really old version (0.8) to 0.13+, which makes a rollback almost impossible :(

@spencergilbert
Contributor

Previous work from @StephenWakely should have made every error scenario retried rather than a hard failure, so theoretically desync should already be "part of the standard flow" 🤔

@alexgavrisco
Contributor

Yes, I remember that patch as well (it's why I upgraded from 0.13.1 to 0.15 in the first place). Judging from the counters/logs that Vector emits, I concluded that desync is still considered an error.
I haven't checked whether all Vector pods show the same pattern (desync in the logs, then no more requests/stream objects consumed), but at least the few pods I did check show the same symptoms.

@alexgavrisco
Contributor

Discovered a few more pods that got into the same state. All of them follow the same pattern:

Aug 05 00:57:02.437 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Recoverable { source: K8sStream { source: Reading { source: hyper::Error(Body, Custom { kind: UnexpectedEof, error: "unexpected EOF during chunk size line" }) } } } internal_log_rate_secs=5
Aug 05 00:57:02.437  WARN source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Http Error in invocation! Your k8s metadata may be stale. Continuing Loop. error=K8sStream { source: Reading { source: hyper::Error(Body, Custom { kind: UnexpectedEof, error: "unexpected EOF during chunk size line" }) } }
Aug 05 00:57:04.115 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Internal log [Watch stream failed.] is being rate limited.
Aug 05 00:57:04.115  WARN source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
Aug 05 00:57:08.279 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch invocation failed. error=Other { source: BadStatus { status: 403 } } internal_log_rate_secs=5
Aug 05 00:57:08.279 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::kubernetes::reflector: Watcher error. error=BadStatus { status: 403 }
Aug 05 00:57:08.279 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Reflector process exited with an error. error=watch invocation failed
Aug 05 00:57:08.281  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Event processing loop completed gracefully.
Aug 05 00:57:08.282  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: File server completed gracefully.
Aug 05 00:57:08.282  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Done.

I've detected them by checking for 0 events related to watch streams or checkpointing. However, I'm hesitant to use that as an alert condition. I'd prefer that Vector simply terminate if the kubernetes_logs source enters an inconsistent state.
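
For illustration, a minimal sketch of what I mean by terminating instead of a "graceful" completion (hypothetical code, not Vector's actual shutdown path):

// Hypothetical sketch: abort the process when the reflector ends with an
// error, instead of letting the kubernetes_logs source wind down "gracefully"
// while log collection silently stops. A non-zero exit gets the pod restarted
// by Kubernetes and is easy to see, with no custom metric checks needed.
use std::process;

// Stand-in for the reflector task's terminal result.
fn run_reflector() -> Result<(), String> {
    Err("watch invocation failed: BadStatus { status: 403 }".into())
}

fn main() {
    if let Err(err) = run_reflector() {
        eprintln!("reflector exited with an error, aborting: {err}");
        process::exit(1);
    }
    println!("reflector completed gracefully");
}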

@alexgavrisco
Contributor

@binarylogic @StephenWakely @spencergilbert Is there any attack plan (or a reliable workaround) for this issue?
It's quite concerning that some Vector pods can simply stop forwarding logs in any Kubernetes cluster, and manually checking every Vector pod doesn't scale.

@MOZGIII
Contributor

MOZGIII commented Sep 17, 2021

Apparently, a lot of the error conditions come from hyper errors not being handled properly. That's a shame, because the errors we need to handle aren't all well documented.
There are some clearly unmatched expectations in the data-stream parsing logic, and that logic could be coupled more tightly with hyper-specific errors to solve most of these cases (as far as I can tell from the reports).
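
As a purely illustrative sketch (stand-in types rather than hyper's actual error type or Vector's code), the kind of explicit classification I have in mind looks roughly like this:

// Illustrative only: classify each stream/watch error explicitly and choose a
// disposition, so nothing unexpected falls through and stops the source.
#[derive(Debug)]
enum Disposition {
    RetryWatch, // transient transport problem: back off and re-invoke the watch
    Relist,     // body ended mid-stream / state is stale: re-list, then watch again
    Fatal,      // e.g. authorization failures: surface loudly, don't exit quietly
}

// Stand-in for the subset of information we'd read off a hyper/HTTP error.
struct StreamError {
    is_connect: bool,         // failed to establish the connection
    is_incomplete_body: bool, // body ended unexpectedly (EOF mid-chunk)
    http_status: Option<u16>, // set when the API server returned a bad status
}

fn classify(err: &StreamError) -> Disposition {
    match err {
        StreamError { http_status: Some(s), .. } if *s == 401 || *s == 403 => Disposition::Fatal,
        StreamError { is_connect: true, .. } => Disposition::RetryWatch,
        StreamError { is_incomplete_body: true, .. } => Disposition::Relist,
        // Anything unrecognized: retry conservatively rather than shutting down.
        _ => Disposition::RetryWatch,
    }
}

fn main() {
    // The "unexpected EOF during chunk size line" error from the reports above:
    let eof = StreamError { is_connect: false, is_incomplete_body: true, http_status: None };
    println!("{:?}", classify(&eof)); // Relist

    // The BadStatus { status: 403 } error:
    let forbidden = StreamError { is_connect: false, is_incomplete_body: false, http_status: Some(403) };
    println!("{:?}", classify(&forbidden)); // Fatal
}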

Hope this helps :)

@StephenWakely
Contributor

> Is there any attack plan (or a reliable workaround) for this issue?

We are hoping to nail this one down next quarter.
