Proposal: If possible, when a deploy fails, include information about the k8s node. Some deployment failures are caused by issues with the underlying node. Outputting the node name for each failed resource would make these causes much easier to identify.
Having this information in the logs also helps with audits and debugging.
In general this sounds like it would be helpful. Could you be a bit more specific about the types of failures you're seeing, e.g. DS pods not scheduling, container problems, ...?
The other big question: are you interested in PRing this, or is this just a request?
> The other big question: are you interested in PRing this, or is this just a request?
This is just a request, but if we find the time, we will consider putting in the work to implement it.
> Though can you be a bit more specific about the types of failures you're seeing.
A few specific examples we have seen in the past:
- The underlying node has Docker daemon issues, and all the pods scheduled on that node end up in a bad state (stuck in "Terminating" or "Initializing").
- The underlying node has performance issues that cause a timeout. In the specific case we saw, image pulls on one of the nodes were extra slow. Surfacing the node name for all the failed resources would have pointed straight to the node-specific problem.
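To make the request concrete, here is a rough sketch in Python of what surfacing the node name could look like. It assumes pod data in the shape returned by `kubectl get pods -o json`, where the scheduled node is exposed as `spec.nodeName`. The function name `failed_pods_with_nodes`, the readiness rule, and the sample pods and node names are all hypothetical and only for illustration; this is not how the deploy tool works today.

```python
import json

# Hypothetical sample, shaped like the output of `kubectl get pods -o json`.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "web-1"},
     "spec": {"nodeName": "node-a"},
     "status": {"phase": "Running",
                "containerStatuses": [{"name": "app", "ready": true}]}},
    {"metadata": {"name": "web-2"},
     "spec": {"nodeName": "node-b"},
     "status": {"phase": "Pending",
                "containerStatuses": [{"name": "app", "ready": false}]}},
    {"metadata": {"name": "web-3", "deletionTimestamp": "2020-01-01T00:00:00Z"},
     "spec": {"nodeName": "node-b"},
     "status": {"phase": "Running",
                "containerStatuses": [{"name": "app", "ready": true}]}},
    {"metadata": {"name": "web-4"},
     "spec": {},
     "status": {"phase": "Pending"}}
  ]
}
""")


def failed_pods_with_nodes(pod_list):
    """Return (pod name, node name) pairs for pods that look unhealthy.

    The health rule here is an assumption for this sketch: a pod counts as
    failed if it is being deleted, is not Running, or has a container that
    is not ready. Pods that completed successfully are skipped.
    """
    results = []
    for pod in pod_list.get("items", []):
        metadata = pod.get("metadata", {})
        status = pod.get("status", {})
        # spec.nodeName is absent until the scheduler places the pod.
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")

        phase = status.get("phase")
        if phase == "Succeeded":
            continue

        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        # A set deletionTimestamp is what shows up as "Terminating".
        terminating = "deletionTimestamp" in metadata

        if terminating or phase != "Running" or not all_ready:
            results.append((metadata.get("name"), node))
    return results


if __name__ == "__main__":
    for pod_name, node_name in failed_pods_with_nodes(SAMPLE):
        print(f"{pod_name} failed on node {node_name}")
```

With output along these lines, several failed pods sharing one node name would hint at a node-specific problem like the two cases described above.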
Feature request