-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[cri] Handle pod transition states gracefully while listing pod stats #8671
Conversation
Hi @jsturtevant. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@@ -416,6 +426,9 @@ func (c *criService) getSandboxPidCount(ctx context.Context, sandbox sandboxstor | |||
|
|||
processes, err := task.Pids(ctx) | |||
if err != nil { | |||
if errdefs.IsNotFound(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this requires microsoft/hcsshim#1807 but should be ok to merge this and follow up via hcsshim version bump. Alternatively, I can handle the errors directly here.
13c357b
to
74cfd2f
Compare
b0f40c9
to
d0e8742
Compare
looks like the ci integration was a flake |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/hold |
d0e8742
to
05b5bcb
Compare
/hold cancel
|
/ok-to-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would prefer doing this in a common way across platform impls over in podsandboxstats and listpodsandboxstats...
not sure it makes sense to return nil vs err for podsandboxstats.. but agree it makes sense to not fail all of listpodsandboxstats for one pod being not ready or unknown at the point of gathering the stats, that feels like an error
@bobbypage @haircommander do you have any thoughts? I assume some of the changes to soften the errors would apply to Linux side as well |
I am leaning towards returning an err for It might make sense to wrap some of the transient errs to make the decision easier to understand but at this time I'm not sure I know which errors we would really want to fail on for |
I believe most of the time an error is treated as transient. if it's worth logging in the kubelet, then I'd report it. If it's part of normal state changes in containerd, then I would say hide the errors. e.g: is there action a user can take, or a maintainer should be aware of? if so, error. if not, hide |
05b5bcb
to
67ea3f4
Compare
@haircommander @mikebrow I've made the following change e489ff5 |
d3de050
to
5cba93b
Compare
@haircommander @mikebrow PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments regarding using switch for err processing and changing the default for err to continue setting up to return the best can do results and the first/last/list of transient/errors etc.. vs return nil with err on unknown type of err..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
will be interesting on the kubelet side if they will choose to parse and eval the transient stats errors or log and ignore as we are moving to here... or if there will be a subsequent request to go even further and clear the transient errors. FYI @haircommander @dims @saschagrunert @SergeyKanzhelev |
3d7fd8b
to
f1f144f
Compare
@mikebrow - I updated the PR to use a switch statement for style |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
just a couple logistical things.. pls remove the .idea files and squash.. thx! |
.idea/.gitignore
Outdated
@@ -0,0 +1,8 @@ | |||
# Default ignored files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove the folder~
When the pods are transitioning there are several cases where containers might not be in valid state. There were several cases where the stats where failing hard but we should just continue on as they are transient and will be picked up again when kubelet queries for the stats again. Signed-off-by: James Sturtevant <jstur@microsoft.com> Signed-off-by: Mark Rossetti <marosset@microsoft.com>
f1f144f
to
f914edf
Compare
done and done! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/LGTM
any objections to flagging these as cherry pick for 1.7 and 1.6? |
When the pods are transitioning there are several cases where containers might not be in valid state. There were several cases where the stats where failing hard but we should just continue on as they are transient and will be picked up again when kubelet queries for the stats again.
I've found these after running the e2e tests in kubernetes/kubernetes#116968 several times.