Pod's phase is failed, but the workflow node status is succeeded. #3879
@juchaosong can you try it on the latest build, v2.9.5?
Can you provide the Argo controller logs?
We use Alibaba Cloud ECI pods, which work something like virtual-kubelet. The pod failed because there was no stock available. I'm not sure whether Alibaba Cloud should change the container state to terminated instead of waiting. Does Kubernetes standardize that when a pod fails, the container state must change to terminated?
Yes, one of the containers should be in the terminated state for a pod failure.
But did you see that any node was removed on your cluster?
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
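The lifecycle expectation above can be sketched as a small conformance check. This is a minimal sketch with simplified stand-in types (the real types live in `k8s.io/api/core/v1`; the type and function names here are hypothetical):

```go
package main

import "fmt"

// Minimal stand-ins for the k8s.io/api/core/v1 status types; only the
// fields relevant to this issue are modelled.
type ContainerStateTerminated struct {
	ExitCode int32
}

type ContainerState struct {
	Terminated *ContainerStateTerminated
}

type ContainerStatus struct {
	Name  string
	State ContainerState
}

type PodStatus struct {
	Phase             string // "Pending", "Running", "Succeeded", "Failed", ...
	ContainerStatuses []ContainerStatus
}

// conformsToLifecycle encodes the expectation above: a pod reporting
// the Failed phase should have at least one container in the
// Terminated state.
func conformsToLifecycle(s PodStatus) bool {
	if s.Phase != "Failed" {
		return true
	}
	for _, cs := range s.ContainerStatuses {
		if cs.State.Terminated != nil {
			return true
		}
	}
	return false
}

func main() {
	// The shape reported by the virtual-kubelet pod in this issue:
	// phase Failed, but the container never left the waiting state.
	bad := PodStatus{
		Phase: "Failed",
		ContainerStatuses: []ContainerStatus{
			{Name: "main"}, // State.Terminated left nil
		},
	}
	fmt.Println(conformsToLifecycle(bad)) // false
}
```

A provider such as virtual-kubelet that reports `Phase: Failed` while leaving every container's `Terminated` state nil would fail this check, which matches the pod status pasted in this issue.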
The node that hosts the failed pod is not a real node; it's created by virtual-kubelet. If the pod information is wrong, it may be a problem with the virtual-kubelet implementation.
It looks like virtual-kubelet will not set the terminated state if the container is not running.
I think I encountered a similar issue on GKE, running Argo 2.10.0. Last night I submitted a workflow with 1024 pods in parallel, processing data and uploading to Google Cloud Storage (via the gsutil command rather than artifact outputs). This morning I noticed the whole workflow was done, yet one file was missing. I looked into the logs and realized the controller somehow thought the node was done in a very short time. I tried to find the pod's log but couldn't find anything. I think the pod never even got a chance to be fully up and running before it died, yet somehow the Argo controller thinks the pod succeeded. Here's the log for the controller:
Normally each of these nodes would take 20 minutes to finish, but this node went from start to finish in only 1 minute, which is certainly not right. I cannot reproduce it, and there isn't much other log I can find. This is all I can provide.
Can you find controller logs like these?
Available for testing in v2.11.0-rc1. |
Summary
What happened/what you expected to happen?
Pod's phase is failed, but the workflow node status is succeeded.
Diagnostics
What version of Argo Workflows are you running?
v2.7.0
The pod's yaml
The workflow node status yaml
You can see the pod's phase is failed, but the workflow node status is succeeded.
According to https://github.com/argoproj/argo/blob/v2.7.0/workflow/controller/operator.go#L1070-L1074, because the containerStatuses' terminated field is null, the node is updated to succeeded: https://github.com/argoproj/argo/blob/v2.7.0/workflow/controller/operator.go#L1165.
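The failure-detection logic described above can be illustrated with a short sketch. This is a hypothetical simplification of the linked operator code, with made-up type and function names, not the actual Argo source:

```go
package main

import "fmt"

// Simplified stand-in for the k8s.io/api/core/v1 terminated state;
// only the fields relevant here.
type Terminated struct {
	ExitCode int32
	Message  string
}

type ContainerStatus struct {
	Name       string
	Terminated *Terminated // nil unless the container has terminated
}

// assessFailure is a hypothetical simplification of the controller
// logic linked above: it derives a failure message only from containers
// whose state is Terminated with a non-zero exit code. If every
// Terminated field is nil (as in the pod status pasted in this issue),
// it finds nothing, and the caller falls through to marking the
// workflow node succeeded even though the pod phase is Failed.
func assessFailure(statuses []ContainerStatus) (string, bool) {
	for _, cs := range statuses {
		if t := cs.Terminated; t != nil && t.ExitCode != 0 {
			return fmt.Sprintf("container %q failed with exit code %d: %s",
				cs.Name, t.ExitCode, t.Message), true
		}
	}
	return "", false
}

func main() {
	// The problematic shape: the pod phase is Failed, but the
	// container's Terminated state was never set.
	msg, failed := assessFailure([]ContainerStatus{{Name: "main", Terminated: nil}})
	fmt.Printf("failed=%v msg=%q\n", failed, msg) // failed=false msg=""
}
```

Under this reading, the bug surfaces whenever a provider reports `Phase: Failed` without ever transitioning a container to the terminated state: the loop above finds no terminated container to blame, so no failure is inferred.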
I'm not sure whether this is an Argo bug or an incorrect pod container status.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.