This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

K8S Pod keep waiting in creating container, such as ErrImagePull #3572

Closed
yqwang-ms opened this issue Sep 5, 2019 · 9 comments

Comments

@yqwang-ms
Member

yqwang-ms commented Sep 5, 2019

If an image cannot be pulled, K8S keeps the Pod in the Pending state (without completing or retrying it) and does not respect the Pod restartPolicy. FrameworkController respects the Pod state, i.e. Pending, so the whole Framework state is AttemptPreparing by design.

        "state": {
          "waiting": {
            "reason": "ErrImagePull",
            "message": "rpc error: code = Unknown desc = Error response from daemon: pull access denied for nginx211, repository does not exist or may require 'docker login'"
          }
        },

K8S still does not provide a Pod spec option that lets the user specify whether to fail the Pod when the image is not found.

So this issue is more of a K8S feature request, but we need to mitigate it somehow, for example:

Expose the reason:

  1. Expose events and Pod waiting messages to the user in the webportal.
  2. Add an allocated state.

Detect it and fail early:

  1. The PAI Runtime init container proactively detects whether the image exists.
  2. FC detects it from Pod events (may not be reliable or stable).
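
A minimal sketch of the "detect it and fail early" idea, assuming the Pod status is available as a plain dict in the shape the K8S API returns. The function name `find_stuck_containers` and the set of reasons treated as suspect are assumptions of this sketch, not FrameworkController or PAI Runtime code; K8S itself keeps retrying on these reasons.

```python
# Waiting reasons seen in this issue, plus ImagePullBackOff.
# Treating them as fatal is an assumption of this sketch.
SUSPECT_WAITING_REASONS = {"ErrImagePull", "ImagePullBackOff", "CreateContainerError"}


def find_stuck_containers(pod_status):
    """Return (name, reason, message) for each container in a suspect waiting state."""
    statuses = (pod_status.get("initContainerStatuses") or []) + (
        pod_status.get("containerStatuses") or []
    )
    stuck = []
    for cs in statuses:
        waiting = cs.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") in SUSPECT_WAITING_REASONS:
            stuck.append((cs.get("name"), waiting["reason"], waiting.get("message", "")))
    return stuck


# Example status, trimmed from the ErrImagePull case in this issue.
example_status = {
    "phase": "Pending",
    "containerStatuses": [
        {
            "name": "main",
            "state": {
                "waiting": {
                    "reason": "ErrImagePull",
                    "message": "pull access denied for nginx211",
                }
            },
        }
    ],
}

print(find_stuck_containers(example_status))
```

A caller would still need to decide how long a container may stay in such a state before it is failed, because a transient registry outage produces the same reasons.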

However, these cannot resolve all the other errors for which K8S keeps trying to create the container. For example, submitting a Pod whose workingDir is a file:

        "state": {
          "waiting": {
            "reason": "CreateContainerError",
            "message": "Error response from daemon: Cannot mkdir: /bin/bash is not a directory"
          }
        },

Previous meeting notes:

For this specific case (permanent image pull error):
1. We should detect this and fail the task (job) early; Binyang has already done some work on it.

For the general case (a long wait after being scheduled, such as a long image pull or a docker container creation error that k8s keeps retrying):
1. We plan to add a new job and task state, such as allocated or initializing, to indicate that we have already counted its resource usage but the user's binary is not running yet. (It is our resource accounting boundary state.)
The state is readable by both machines (programs) and humans.
2. We plan to expose (and refine/enrich) the backend k8s events, so that the user can understand in detail why it is in its current state.
This information is readable only by humans.
Example:
“Failed to pull image "pytorch:pytorch-stable-py37": rpc error: code = Unknown desc = Error response from daemon: pull access denied for pytorch, repository does not exist or may require 'docker login'”
“Pulling Image”
“Mounting Storage”
“Preparing SSH Server”
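
As a rough illustration of the two plans in these notes, the sketch below derives a machine-readable state plus human-readable details from a Pod status dict. The state name "Allocated" and the function `summarize_pod` are hypothetical; the sketch assumes that "allocated" can be approximated as "phase is Pending and the PodScheduled condition is True", which is not necessarily how the final design defines it.

```python
def summarize_pod(pod_status):
    """Return (state, details) for a Pod status dict.

    "Allocated" (hypothetical name) means the Pod is scheduled, so its
    resources are accounted for, but the user's container is not running yet.
    """
    phase = pod_status.get("phase", "Unknown")
    if phase != "Pending":
        return phase, []
    conditions = pod_status.get("conditions") or []
    scheduled = any(
        c.get("type") == "PodScheduled" and c.get("status") == "True"
        for c in conditions
    )
    # Collect the waiting reason and message of each container for display.
    details = []
    for cs in pod_status.get("containerStatuses") or []:
        waiting = cs.get("state", {}).get("waiting")
        if waiting:
            details.append(
                f"{cs.get('name')}: {waiting.get('reason')}: {waiting.get('message', '')}"
            )
    return ("Allocated" if scheduled else "Waiting"), details


scheduled_status = {
    "phase": "Pending",
    "conditions": [{"type": "PodScheduled", "status": "True"}],
    "containerStatuses": [
        {
            "name": "main",
            "state": {
                "waiting": {"reason": "ErrImagePull", "message": "pull access denied"}
            },
        }
    ],
}

print(summarize_pod(scheduled_status))
```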

@yqwang-ms
Member Author

See more details in microsoft/frameworkcontroller#14 (comment)

@fanyangCS
Contributor

We should at least do some precheck in the init container to avoid unnecessary pending states.
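
One possible shape for such a precheck, sketched under assumptions: it asks the registry for the image manifest through the Registry HTTP API V2 endpoint `/v2/<name>/manifests/<reference>` and interprets the status code. The function names and result labels are hypothetical, the registry host in the usage line is a placeholder, and real registries usually require a token exchange first, so a 401 here only means "cannot tell without credentials".

```python
import urllib.error
import urllib.request


def manifest_url(registry, repository, reference):
    """Build the Registry HTTP API V2 manifest URL for an image."""
    return f"https://{registry}/v2/{repository}/manifests/{reference}"


def interpret_status(code):
    """Map an HTTP status code to a coarse precheck result (labels are hypothetical)."""
    if code == 200:
        return "exists"
    if code == 404:
        return "missing"
    if code in (401, 403):
        return "auth-required"
    return "unknown"


def probe_image(registry, repository, reference, timeout=10):
    """Send a HEAD request for the manifest and classify the response."""
    request = urllib.request.Request(
        manifest_url(registry, repository, reference),
        method="HEAD",
        headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return interpret_status(response.status)
    except urllib.error.HTTPError as err:
        # HTTPError must be caught before URLError, its parent class.
        return interpret_status(err.code)
    except urllib.error.URLError:
        return "unknown"


# No network call is made here; probe_image would be called by the precheck.
print(manifest_url("registry.example.com", "team/nginx", "1.17"))
```

Only a "missing" result would justify failing early; "auth-required" and "unknown" are better left to the normal image pull.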

@scarlett2018
Member

@yqwang-ms how frequently does this happen? Do we have an expiry time for the waiting state?

@yqwang-ms
Member Author

This depends on whether the user specifies a nonexistent image. K8S does not provide a way to expire it.
We may have to precheck the image.
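
Since K8S offers no expiry for this waiting state, a controller could enforce its own deadline. The sketch below is one way to do that, assuming the Pod status dict carries the standard `startTime` field in whole-second UTC format; the 30-minute default and the function names are arbitrary assumptions.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a timestamp such as 2019-09-05T08:00:00Z into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def waiting_expired(pod_status, now, deadline=timedelta(minutes=30)):
    """Return True if the Pod has been Pending longer than the deadline since startTime."""
    if pod_status.get("phase") != "Pending" or "startTime" not in pod_status:
        return False
    return now - parse_k8s_time(pod_status["startTime"]) > deadline


pending_status = {"phase": "Pending", "startTime": "2019-09-05T08:00:00Z"}
one_hour_later = datetime(2019, 9, 5, 9, 0, 0, tzinfo=timezone.utc)

print(waiting_expired(pending_status, one_hour_later))
```

A deadline like this also covers the slow-but-valid image pull case, so the threshold would need to be generous or configurable per job.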

@scarlett2018
Member

> This depends on whether the user specifies a nonexistent image. K8S does not provide a way to expire it.
> We may have to precheck the image.

Prechecking the image sounds good.

@fanyangCS
Contributor

#3993

@Binyang2014
Contributor

We need to check ACR images in the runtime. Refer to #3993.

@scarlett2018 scarlett2018 changed the title K8S Pod keep waiting in creating container, such as ErrImagePull P1 - K8S Pod keep waiting in creating container, such as ErrImagePull Dec 30, 2019
@yqwang-ms
Member Author

Tracked in K8S kubernetes/kubernetes#87278

@scarlett2018 scarlett2018 mentioned this issue May 28, 2020
5 tasks
@scarlett2018 scarlett2018 changed the title P1 - K8S Pod keep waiting in creating container, such as ErrImagePull K8S Pod keep waiting in creating container, such as ErrImagePull May 28, 2020
@fanyangCS
Contributor

Closing, as we now expose the relevant events to end users.


4 participants