Issues with the PNS executor and CRI-O #2095

Closed
4 tasks done
rafalbigaj opened this issue Jan 29, 2020 · 2 comments

rafalbigaj commented Jan 29, 2020

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML.
  • I've included the logs.

What happened:
I am experiencing issues with the PNS executor on CRI-O (OpenShift 4.2).
Every run (including https://github.com/argoproj/argo/blob/master/examples/artifact-passing.yaml) fails with:

This step is in Error state with this message: failed to save outputs: Failed to determine pid for containerID 65e682ea2e3e8102b395f608f15f21b1ab56a021cfe5f1d741c4ed20e463c50a: container may have exited too quickly

even though the main container finished successfully.

The problem is in PNSExecutor#updateCtrIDMap.
After the update, the map contains an entry with the key:
crio-7a92a067289f6197148912be1c15f20f0330c7f3c541473d3b9c4043ca137b42.scope
In other words, the container ID is parsed incorrectly: the crio- prefix and the .scope suffix are kept as part of the key. It should be:
7a92a067289f6197148912be1c15f20f0330c7f3c541473d3b9c4043ca137b42

PR fixing this issue: #2096
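
To illustrate the kind of parsing described above, here is a minimal sketch in Go. It is not the code from PR #2096 and not Argo's actual implementation. The helper name parseContainerID is hypothetical, and the sketch assumes the container ID is the last path component of a cgroup entry, optionally wrapped in a systemd-style "<runtime>-" prefix and ".scope" suffix, as in the key shown above.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// parseContainerID is a hypothetical helper for illustration only.
// Assumption: the container ID is the last path component of the cgroup
// entry, possibly in the systemd form "<runtime>-<id>.scope".
func parseContainerID(cgroupPath string) string {
	// Keep only the last path component.
	base := path.Base(cgroupPath)
	// Drop the systemd unit suffix if present.
	base = strings.TrimSuffix(base, ".scope")
	// Drop a "<runtime>-" prefix such as "crio-" if present.
	// A hex container ID contains no '-', so the last '-' ends the prefix.
	if i := strings.LastIndex(base, "-"); i >= 0 {
		base = base[i+1:]
	}
	return base
}

func main() {
	key := "crio-7a92a067289f6197148912be1c15f20f0330c7f3c541473d3b9c4043ca137b42.scope"
	fmt.Println(parseContainerID(key))
	// prints 7a92a067289f6197148912be1c15f20f0330c7f3c541473d3b9c4043ca137b42
}
```

With a normalisation like this, the map key would match the plain container ID that the executor later looks up, which is the mismatch visible in the executor logs below.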

What you expected to happen:
The step was expected to finish successfully. The container ID should be recognised as 65e682ea2e3e8102b395f608f15f21b1ab56a021cfe5f1d741c4ed20e463c50a.

How to reproduce it (as minimally and precisely as possible):
Install Kubeflow on OpenShift 4.2 and run a sample pipeline such as:
https://github.com/argoproj/argo/blob/master/examples/artifact-passing.yaml

Anything else we need to know?:

Environment:

  • Argo version:
$ argo version
2.3.5
  • Kubernetes version:
$ kubectl version -o yaml
clientVersion:
  buildDate: "2019-11-14T04:24:29Z"
  compiler: gc
  gitCommit: b3cbbae08ec52a7fc73d334838e18d17e8512749
  gitTreeState: clean
  gitVersion: v1.16.3
  goVersion: go1.12.13
  major: "1"
  minor: "16"
  platform: darwin/amd64
serverVersion:
  buildDate: "2019-10-10T22:04:13Z"
  compiler: gc
  gitCommit: 2e5ed54
  gitTreeState: clean
  gitVersion: v1.14.6+2e5ed54
  goVersion: go1.12.8
  major: "1"
  minor: 14+
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
This step is in Error state with this message: failed to save outputs: Failed to determine pid for containerID 65e682ea2e3e8102b395f608f15f21b1ab56a021cfe5f1d741c4ed20e463c50a: container may have exited too quickly
  • executor logs:
kubectl logs <failedpodname> -c wait
time="2020-01-28T15:06:32Z" level=info msg="Waiting on main container"
time="2020-01-28T15:06:32Z" level=warning msg="Polling root processes (1m0s)"
time="2020-01-28T15:06:37Z" level=info msg="pid 34: &{root 253 2147484141 {518511044 63715334690 0x29189c0} {64515 434110592 12 16877 0 0 0 0 253 4096 0 {1579737916 908000000} {1579737890 518511044} {1579737897 507511044} [0 0 0]}}"
time="2020-01-28T15:06:37Z" level=info msg="Secured filehandle on /proc/34/root"
time="2020-01-28T15:06:37Z" level=info msg="containerID crio-65e682ea2e3e8102b395f608f15f21b1ab56a021cfe5f1d741c4ed20e463c50a.scope mapped to pid 34"
time="2020-01-28T15:06:38Z" level=info msg="pid 34: &{root 28 2147484141 {649301379 63715820797 0x29189c0} {2097394 293675539 1 16877 0 0 0 0 28 4096 0 {1580223997 782300938} {1580223997 649301379} {1580223997 992300242} [0 0 0]}}"
time="2020-01-28T15:06:38Z" level=info msg="Secured filehandle on /proc/34/root"
time="2020-01-28T15:06:38Z" level=info msg="pid 34: &{root 28 2147484141 {649301379 63715820797 0x29189c0} {2097394 293675539 1 16877 0 0 0 0 28 4096 0 {1580223997 782300938} {1580223997 649301379} {1580223997 992300242} [0 0 0]}}"
time="2020-01-28T15:06:38Z" level=info msg="pid 34: &{root 28 2147484141 {649301379 63715820797 0x29189c0} {2097394 293675539 1 16877 0 0 0 0 28 4096 0 {1580223997 782300938} {1580223997 649301379} {1580223997 992300242} [0 0 0]}}"
time="2020-01-28T15:06:38Z" level=info msg="pid 34: &{root 39 2147484141 {649301379 63715820797 0x29189c0} {2097394 293675539 1 16877 0 0 0 0 39 4096 0 {1580223997 782300938} {1580223997 649301379} {1580223998 180299618} [0 0 0]}}"
time="2020-01-28T15:06:38Z" level=info msg="pid 34: &{root 39 2147484141 {649301379 63715820797 0x29189c0} {2097394 293675539 1 16877 0 0 0 0 39 4096 0 {1580223997 782300938} {1580223997 649301379} {1580223998 180299618} [0 0 0]}}"
time="2020-01-28T15:06:38Z" level=info msg="pid 34: &{root 39 2147484141 {649301379 63715820797 0x29189c0} {2097394 293675539 1 16877 0 0 0 0 39 4096 0 {1580223997 782300938} {1580223997 649301379} {1580223998 180299618} [0 0 0]}}"
time="2020-01-28T15:06:38Z" level=info msg="pid 34: &{root 39 2147484141 {649301379 63715820797 0x29189c0} {2097394 293675539 1 16877 0 0 0 0 39 4096 0 {1580223997 782300938} {1580223997 649301379} {1580223998 180299618} [0 0 0]}}"
time="2020-01-28T15:06:38Z" level=info msg="pid 34: &{root 39 2147484141 {649301379 63715820797 0x29189c0} {2097394 293675539 1 16877 0 0 0 0 39 4096 0 {1580223997 782300938} {1580223997 649301379} {1580223998 180299618} [0 0 0]}}"
time="2020-01-28T15:06:38Z" level=info msg="main container started with container ID: 65e682ea2e3e8102b395f608f15f21b1ab56a021cfe5f1d741c4ed20e463c50a"
time="2020-01-28T15:06:38Z" level=info msg="Starting annotations monitor"
time="2020-01-28T15:06:38Z" level=info msg="pid 34: &{root 39 2147484141 {649301379 63715820797 0x29189c0} {2097394 293675539 1 16877 0 0 0 0 39 4096 0 {1580223997 782300938} {1580223997 649301379} {1580223998 180299618} [0 0 0]}}"
time="2020-01-28T15:06:38Z" level=info msg="Starting deadline monitor"
time="2020-01-28T15:06:38Z" level=info msg="pid 34: &{root 39 2147484141 {649301379 63715820797 0x29189c0} {2097394 293675539 1 16877 0 0 0 0 39 4096 0 {1580223997 782300938} {1580223997 649301379} {1580223998 180299618} [0 0 0]}}"
time="2020-01-28T15:06:38Z" level=warning msg="Failed to wait for container id '65e682ea2e3e8102b395f608f15f21b1ab56a021cfe5f1d741c4ed20e463c50a': Failed to determine pid for containerID 65e682ea2e3e8102b395f608f15f21b1ab56a021cfe5f1d741c4ed20e463c50a: container may have exited too quickly"
  • workflow-controller logs:
kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name)
time="2020-01-28T15:06:22Z" level=info msg="Processing workflow" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:22Z" level=info msg="All of node kfp-on-wml-training-mnxhk.create-secret-kubernetes-cluster dependencies [] completed" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:23Z" level=info msg="Created pod: kfp-on-wml-training-mnxhk.create-secret-kubernetes-cluster (kfp-on-wml-training-mnxhk-143671368)" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:23Z" level=info msg="Pod node kfp-on-wml-training-mnxhk.create-secret-kubernetes-cluster (kfp-on-wml-training-mnxhk-143671368) initialized Pending" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:23Z" level=info msg="Workflow update successful" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:24Z" level=info msg="Processing workflow" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:24Z" level=info msg="Updating node kfp-on-wml-training-mnxhk.create-secret-kubernetes-cluster (kfp-on-wml-training-mnxhk-143671368) message: ContainerCreating"
time="2020-01-28T15:06:24Z" level=info msg="Workflow update successful" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:25Z" level=info msg="Processing workflow" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:31Z" level=info msg="Processing workflow" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:38Z" level=info msg="Processing workflow" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:38Z" level=info msg="Updating node kfp-on-wml-training-mnxhk.create-secret-kubernetes-cluster (kfp-on-wml-training-mnxhk-143671368) status Pending -> Running"
time="2020-01-28T15:06:38Z" level=info msg="Workflow update successful" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:39Z" level=info msg="Processing workflow" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:56Z" level=info msg="Processing workflow" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:56Z" level=info msg="Updating node kfp-on-wml-training-mnxhk.create-secret-kubernetes-cluster (kfp-on-wml-training-mnxhk-143671368) status Running -> Error"
time="2020-01-28T15:06:56Z" level=info msg="Updating node kfp-on-wml-training-mnxhk.create-secret-kubernetes-cluster (kfp-on-wml-training-mnxhk-143671368) message: failed to save outputs: Failed to determine pid for containerID 65e682ea2e3e8102b395f608f15f21b1ab56a021cfe5f1d741c4ed20e463c50a: container may have exited too quickly"
time="2020-01-28T15:06:56Z" level=info msg="node kfp-on-wml-training-mnxhk (kfp-on-wml-training-mnxhk) phase Running -> Error" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:56Z" level=info msg="node kfp-on-wml-training-mnxhk (kfp-on-wml-training-mnxhk) finished: 2020-01-28 15:06:56.622434172 +0000 UTC" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:56Z" level=info msg="Checking daemoned children of kfp-on-wml-training-mnxhk" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:56Z" level=info msg="Updated phase Running -> Error" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:56Z" level=info msg="Marking workflow completed" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:56Z" level=info msg="Checking daemoned children of " namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:56Z" level=info msg="Workflow update successful" namespace=kubeflow workflow=kfp-on-wml-training-mnxhk
time="2020-01-28T15:06:57Z" level=info msg="Labeled pod kubeflow/kfp-on-wml-training-mnxhk-143671368 completed"

Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.


alexec commented Jan 29, 2020

Hi @rafalbigaj, would you be interested in submitting a fix for this?

@rafalbigaj

@alexec Sure, the fix is already in PR: #2096
