kubernetes_logs error handling RFC #7527

Given the kubernetes_logs source, I think we'd benefit from an error handling RFC for it. This would allow us to see the errors from a bird's-eye view, determine how to explicitly handle them, and be a resource for posterity.
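As a purely illustrative starting point for that bird's-eye view, the sketch below inventories the failure modes reported in the comments on this issue and the handling options discussed there. The scenario names and fields are invented for this summary; they are not Vector's actual error types or configuration.

```yaml
# Hypothetical inventory of kubernetes_logs failure modes mentioned in this
# thread -- a discussion aid, not an authoritative taxonomy.
error_scenarios:
  - name: connection_reset_by_peer
    layer: transport (connection to the Kubernetes API server)
    observed: Vector stops processing logs entirely
    candidate_handling: retry with backoff; emit an error counter on every occurrence
  - name: watch_stream_desync
    layer: Kubernetes watch API
    observed: logs from some pods (e.g. new replicas) are silently never collected
    candidate_handling: rebuild the watch state, or terminate so the orchestrator
      restarts the Vector pod
  - name: inconsistent_source_state
    layer: kubernetes_logs source internals
    observed: zero watch-stream/checkpointing activity while pods keep logging
    candidate_handling: surface a health signal that can gate readiness, or fail hard
```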
Comments
As noted by a user on Discord, a "connection reset by peer" error can still cause Vector to stop processing logs.
I'm currently facing the same problem - "connection reset by peer". We're using:
Vector is looking at around 10 pods, using a label selector to disregard other pods. We're also using resource limits on the Vector container (see the illustrative sketch below):
This problem was triggered by attempting to send 1,000,000 logs in 5 minutes. Some internal Vector logs:
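Since the reporter's exact configuration isn't reproduced above, here is a minimal, hypothetical sketch of that kind of setup: a kubernetes_logs source scoped with a label selector, plus container resource limits on the Vector pod. All names and values (the "app=my-app" selector, the component name, the CPU/memory figures) are placeholders, not the original settings.

```yaml
# Hypothetical vector.yaml: collect logs only from pods matching a label
# selector ("app=my-app" is a placeholder -- substitute your own label).
sources:
  app_pod_logs:
    type: kubernetes_logs
    extra_label_selector: "app=my-app"

---
# Hypothetical fragment of the Vector container spec in the DaemonSet;
# the requests/limits below are placeholder values, not the reporter's.
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```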
I have a similar problem.
Also, Vector sometimes does not collect logs for some pods (e.g. a service has 4 replicas; Vector collects logs for 2 of them, while logs from the other 2 are absent). It seems this was fixed in …
Again - desync is reported to cause a failure to collect logs: #8616
Given that desync is something that happens on a regular basis (e.g. I see our control plane periodically logging messages like this for different resources), and due to the nature of this issue, this becomes a major problem for us. While it seems I found a way to indirectly detect Vector pods that stop forwarding logs from new pods (via stream metrics), it means I can only react by restarting Vector.
Previous work from @StephenWakely should have set all error scenarios to be retried, and not a hard failure - so theoretically we should have already been "part of the standard flow" 🤔
Yes, I remember that patch as well (this is why I upgraded from 0.13.1 to 0.15 in the first place). At least from the counters/logs that Vector emits, I concluded that desync is considered an error.
I discovered a few more pods that got into the same state. All of them follow the same pattern:
I detected them by checking for 0 events related to watch streams or checkpointing. However, I'm hesitant to use this as an alert condition; I'd rather prefer for Vector to just terminate in case the kubernetes_logs source enters an inconsistent state.
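For anyone who wants to replicate that kind of detection, below is a minimal sketch of exposing Vector's own metrics so a flat-lining kubernetes_logs component can be spotted from outside. The component names and port are placeholders, and the exact per-component counter names to alert on (for example events_in_total versus component_received_events_total) differ between Vector versions, so check the internal_metrics documentation for the version you run.

```yaml
# Hypothetical vector.yaml fragment: publish Vector's internal metrics so an
# external monitor can alert when the kubernetes_logs source shows zero
# watch-stream/checkpoint activity even though pods keep logging.
sources:
  k8s_logs:
    type: kubernetes_logs
  vector_metrics:
    type: internal_metrics

sinks:
  prom_exporter:
    type: prometheus_exporter
    inputs: ["vector_metrics"]
    address: "0.0.0.0:9598"   # placeholder scrape address for Prometheus
```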
@binarylogic @StephenWakely @spencergilbert Is there any attack plan (or a reliable workaround) for this issue?
Apparently, a lot of error conditions come from … Hope this helps :)
We are hoping to nail this one down next quarter.