Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cri] Handle pod transition states gracefully while listing pod stats #8671

Merged
merged 1 commit into from
Jul 12, 2023

Conversation

jsturtevant
Copy link
Contributor

When the pods are transitioning there are several cases where containers might not be in valid state. There were several cases where the stats where failing hard but we should just continue on as they are transient and will be picked up again when kubelet queries for the stats again.

I've found these after running the e2e tests in kubernetes/kubernetes#116968 several times.

@k8s-ci-robot
Copy link

Hi @jsturtevant. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -416,6 +426,9 @@ func (c *criService) getSandboxPidCount(ctx context.Context, sandbox sandboxstor

processes, err := task.Pids(ctx)
if err != nil {
if errdefs.IsNotFound(err) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this requires microsoft/hcsshim#1807 but should be ok to merge this and follow up via hcsshim version bump. Alternatively, I can handle the errors directly here.

@jsturtevant
Copy link
Contributor Author

looks like the ci integration was a flake

@dcantah

Copy link
Contributor

@marosset marosset left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jsturtevant
Copy link
Contributor Author

/hold
It looks like I goof up a rebase and only have some changes in the SB server

@jsturtevant
Copy link
Contributor Author

/hold cancel
We are good to go! I ran the e2e suite as well. Below is the stats specific tests but I also ran the entire suite

/bin/e2e.test --provider=skeleton --ginkgo.noColor --ginkgo.focus="\[Feature:Windows\] Kubelet-Stats" --ginkgo.skip="\[Serial\]" --node-os-distro=windows --disable-log-dump --ginkgo.progress=true --ginkgo.slowSpecThreshold=120.0 --ginkgo.flakeAttempts=0 --ginkgo.trace=true --ginkgo.timeout=24h --num-nodes=2 --ginkgo.v=true
Ran 2 of 7207 Specs in 28.926 seconds                                                                                                                                                      SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 7205 Skipped       

@mikebrow
Copy link
Member

/ok-to-test

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would prefer doing this in a common way across platform impls over in podsandboxstats and listpodsandboxstats...

not sure it makes sense to return nil vs err for podsandboxstats.. but agree it makes sense to not fail all of listpodsandboxstats for one pod being not ready or unknown at the point of gathering the stats, that feels like an error

@jsturtevant
Copy link
Contributor Author

not sure it makes sense to return nil vs err for podsandboxstats.. but agree it makes sense to not fail all of listpodsandboxstats for one pod being not ready or unknown at the point of gathering the stats, that feels like an error

@bobbypage @haircommander do you have any thoughts? I assume some of the changes to soften the errors would apply to Linux side as well

@jsturtevant
Copy link
Contributor Author

not sure it makes sense to return nil vs err for podsandboxstats.

I am leaning towards returning an err for PodSandboxStats. Then the caller can determine if they want to fail; In the case of the ListPodSandboxStats we don't want to fail most of the time.

It might make sense to wrap some of the transient errs to make the decision easier to understand but at this time I'm not sure I know which errors we would really want to fail on for ListPodSandboxStats as there are a lot of failure returns currently in Linux and Windows in the stats code flow.

@haircommander
Copy link

I believe most of the time an error is treated as transient. if it's worth logging in the kubelet, then I'd report it. If it's part of normal state changes in containerd, then I would say hide the errors. e.g: is there action a user can take, or a maintainer should be aware of? if so, error. if not, hide

@jsturtevant
Copy link
Contributor Author

jsturtevant commented Jun 21, 2023

@haircommander @mikebrow I've made the following change e489ff5

@jsturtevant jsturtevant force-pushed the fix-windows-edge-cases branch 3 times, most recently from d3de050 to 5cba93b Compare June 26, 2023 22:03
@jsturtevant jsturtevant changed the title [cri] Handle Windows pod transition states gracefully [cri] Handle pod transition states gracefully while listing pod stats Jun 26, 2023
@jsturtevant
Copy link
Contributor Author

@haircommander @mikebrow PTAL

Copy link
Contributor

@marosset marosset left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@haircommander
Copy link

LGTM

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments regarding using switch for err processing and changing the default for err to continue setting up to return the best can do results and the first/last/list of transient/errors etc.. vs return nil with err on unknown type of err..

pkg/cri/server/sandbox_stats_list.go Outdated Show resolved Hide resolved
pkg/cri/server/sandbox_stats_list.go Outdated Show resolved Hide resolved
Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mikebrow
Copy link
Member

mikebrow commented Jul 11, 2023

will be interesting on the kubelet side if they will choose to parse and eval the transient stats errors or log and ignore as we are moving to here... or if there will be a subsequent request to go even further and clear the transient errors. FYI @haircommander @dims @saschagrunert @SergeyKanzhelev

@marosset
Copy link
Contributor

See comments regarding using switch for err processing and changing the default for err to continue setting up to return the best can do results and the first/last/list of transient/errors etc.. vs return nil with err on unknown type of err..

@mikebrow - I updated the PR to use a switch statement for style

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mikebrow
Copy link
Member

just a couple logistical things.. pls remove the .idea files and squash.. thx!

.idea/.gitignore Outdated
@@ -0,0 +1,8 @@
# Default ignored files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove the folder~

When the pods are transitioning there are several
cases where containers might not be in valid state.
There were several cases where the stats where
failing hard but we should just continue on as
they are transient and will be picked up again
when kubelet queries for the stats again.

Signed-off-by: James Sturtevant <jstur@microsoft.com>

Signed-off-by: Mark Rossetti <marosset@microsoft.com>
@marosset
Copy link
Contributor

just a couple logistical things.. pls remove the .idea files and squash.. thx!

done and done!

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

@mikebrow mikebrow merged commit 3ed1bc1 into containerd:main Jul 12, 2023
43 checks passed
@mikebrow
Copy link
Member

any objections to flagging these as cherry pick for 1.7 and 1.6?

@mikebrow mikebrow added cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Jul 13, 2023
@jsturtevant
Copy link
Contributor Author

thanks @marosset and @mikebrow

@samuelkarp samuelkarp added area/cri Container Runtime Interface (CRI) cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch and removed cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Aug 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cri Container Runtime Interface (CRI) cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch ok-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants