failed to save outputs: Get <URL> unexpected EOF #3522

Closed
4 tasks done
lueenavarro opened this issue Jul 20, 2020 · 5 comments · Fixed by #4253

@lueenavarro
Contributor

lueenavarro commented Jul 20, 2020

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML.
  • I've included the logs.

What happened:

I have a pod that runs for about 4 hours and then exits. The program exits after 4 hours as it should, but the pod's status is ERROR. The logs say: "failed to save outputs: Get https://172.21.0.1:443/api/v1/namespaces/default/pods/nrsa-daily-schedule-nrl7m-2508837627: unexpected EOF"

What you expected to happen:
The pod's status should be SUCCESS.

How to reproduce it (as minimally and precisely as possible):
Run a pod that runs for about 4 hours.

Anything else we need to know?:

Environment:

  • Argo version:
v2.9.2
  • Kubernetes version

$ kubectl version -o yaml

clientVersion:
  buildDate: "2020-03-12T21:00:06Z"
  compiler: gc
  gitCommit: ec6eb119b81be488b030e849b9e64fda4caaf33c
  gitTreeState: clean
  gitVersion: v1.16.8
  goVersion: go1.13.8
  major: "1"
  minor: "16"
  platform: linux/amd64
serverVersion:
  buildDate: "2020-06-17T18:32:22Z"
  compiler: gc
  gitCommit: 3305158dfe9ee1f89f596ef260135dcba881848c
  gitTreeState: clean
  gitVersion: v1.17.7+IKS
  goVersion: go1.13.9
  major: "1"
  minor: "17"
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
DEBU[0000] CLI version                                   version="{v2.9.3 2020-07-15T01:15:50Z 9407e19b3a1c61ad4043e382484fd0b6b15574f2 v2.9.3 clean go1.13.4 gc linux/amd64}"
DEBU[0000] Client options                                opts="{{ false false}  0x1676790 0xc0000ed180}"
Name:                nrsa-daily-schedule-nrl7m
Namespace:           default
ServiceAccount:      default
Status:              Error
Conditions:
 Completed           True
Created:             Sun Jul 19 01:00:00 +0000 (1 day ago)
Started:             Sun Jul 19 01:00:00 +0000 (1 day ago)
Finished:            Sun Jul 19 05:15:24 +0000 (1 day ago)
Duration:            4 hours 15 minutes
ResourcesDuration:   5h45m1s*(1 cpu),5h45m1s*(100Mi memory)

STEP                                 TEMPLATE               PODNAME                               DURATION  MESSAGE
 ⚠ nrsa-daily-schedule-nrl7m         nrsa-daily/nrsa-daily
 ├-✔ ON-START                        workflow-checker       nrsa-daily-schedule-nrl7m-3371647226  6s
 ├-✔ NRV001T                         job-checker            nrsa-daily-schedule-nrl7m-874292473   6s
 ├-✔ NRV060                          job-checker            nrsa-daily-schedule-nrl7m-2967513970  6s
 ├-⚠ TIME-00-15                      job-checker            nrsa-daily-schedule-nrl7m-2508837627  4h        failed to save outputs: Get https://172.21.0.1:443/api/v1/namespaces/default/pods/nrsa-daily-schedule-nrl7m-2508837627: unexpected EOF
 ├-✔ TIME-08-00                      job-checker            nrsa-daily-schedule-nrl7m-2352215589  6s
 ├-✔ NRV001                          job-checker            nrsa-daily-schedule-nrl7m-1038485143  5s
 ├-✔ NRV172                          job-checker            nrsa-daily-schedule-nrl7m-2473321238  5s
 ├-✔ NRV003                          job-checker            nrsa-daily-schedule-nrl7m-1072040381  4s
 ├-✔ NRV089                          job-checker            nrsa-daily-schedule-nrl7m-905544119   5s
 ├-✔ NRV085                          job-checker            nrsa-daily-schedule-nrl7m-704212691   5s

 ✔ nrsa-daily-schedule-nrl7m.onExit  clean-up
 ├---✔ pretty-print-json             pretty-print-json      nrsa-daily-schedule-nrl7m-486789590   4s
 ├---✔ send-email                    send-email             nrsa-daily-schedule-nrl7m-3972607695  6s
 └---✔ ON-EXIT                       workflow-checker       nrsa-daily-schedule-nrl7m-1629532479  4s
  • executor logs:

kubectl logs -c init

Program exited at 2020-07-19T00:15:00.104-05:00

kubectl logs -c wait

time="2020-07-19T01:00:11.461Z" level=info msg="Starting Workflow Executor" version=v2.9.2
time="2020-07-19T01:00:11.466Z" level=info msg="Creating a K8sAPI executor"
time="2020-07-19T01:00:11.466Z" level=info msg="Executor (version: v2.9.2, build_date: 2020-07-08T23:57:25Z) initialized (pod: default/nrsa-daily-schedule-nrl7m-2508837627) with template:\n{\"name\":\"job-checker\",\"arguments\":{},\"inputs\":{\"parameters\":[{\"name\":\"cron\",\"default\":\"\",\"value\":\"15 0 * * *\"},{\"name\":\"jobName\",\"default\":\"\",\"value\":\"\"},{\"name\":\"fileName\",\"default\":\"\",\"value\":\"\"},{\"name\":\"schedule\",\"default\":\"\",\"value\":\"ANYDAY\"},{\"name\":\"timeout\",\"default\":\"1800\",\"value\":\"86400\"},{\"name\":\"retries\",\"default\":\"0\",\"value\":\"0\"}]},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"us.icr.io/et-travel/job-checker:latest\",\"env\":[{\"name\":\"CRON\",\"value\":\"15 0 * * *\"},{\"name\":\"JOB_NAME\"},{\"name\":\"FILE_NAME\"},{\"name\":\"BASE_URL\",\"value\":\"http://nrsa-dev-service.default:80/gapwalk-application/script\"},{\"name\":\"SCHEDULE\",\"value\":\"ANYDAY\"},{\"name\":\"SUCCESS_MESSAGE\"},{\"name\":\"DB_HOST\",\"valueFrom\":{\"secretKeyRef\":{\"name\":\"nrsa-scheduler-logs-database\",\"key\":\"DB_HOST\"}}},{\"name\":\"DB_NAME\",\"valueFrom\":{\"secretKeyRef\":{\"name\":\"nrsa-scheduler-logs-database\",\"key\":\"DB_NAME\"}}},{\"name\":\"DB_USER\",\"valueFrom\":{\"secretKeyRef\":{\"name\":\"nrsa-scheduler-logs-database\",\"key\":\"DB_USER\"}}},{\"name\":\"DB_PASSWORD\",\"valueFrom\":{\"secretKeyRef\":{\"name\":\"nrsa-scheduler-logs-database\",\"key\":\"DB_PASSWORD\"}}},{\"name\":\"DB_PORT\",\"valueFrom\":{\"secretKeyRef\":{\"name\":\"nrsa-scheduler-logs-database\",\"key\":\"DB_PORT\"}}},{\"name\":\"SSL\",\"value\":\"true\"},{\"name\":\"NODE_TLS_REJECT_UNAUTHORIZED\",\"value\":\"0\"},{\"name\":\"WORKFLOW_NAME\",\"value\":\"nrsa-daily-schedule-nrl7m\"}],\"resources\":{},\"volumeMounts\":[{\"name\":\"nrsa-dev-storage\",\"mountPath\":\"/usr/src/app/files\"}],\"imagePullPolicy\":\"Always\"},\"volumes\":[{\"name\":\"nrsa-dev-storage\",\"persistentVolumeClaim\":{\"claimName\":\"nrsa-dev-10iops-pvc\"}}],\"podSpecPatch\":\"{\\\"activeDeadlineSeconds\\\":86400, \\\"retryStrategy\\\": {\\\"limit\\\": 0}}\"}"
time="2020-07-19T01:00:11.466Z" level=info msg="Waiting on main container"
time="2020-07-19T01:00:14.567Z" level=info msg="main container started with container ID: fa0fa9a3f26016e9c26261aac16fd7e41aefe54ed3eec910cc9087a253f4e862"
time="2020-07-19T01:00:14.567Z" level=info msg="Starting annotations monitor"
time="2020-07-19T01:00:14.580Z" level=info msg="Waiting for container fa0fa9a3f26016e9c26261aac16fd7e41aefe54ed3eec910cc9087a253f4e862 to complete"
time="2020-07-19T01:00:14.580Z" level=info msg="Starting to wait completion of containerID fa0fa9a3f26016e9c26261aac16fd7e41aefe54ed3eec910cc9087a253f4e862 ..."
time="2020-07-19T01:00:14.580Z" level=info msg="Starting deadline monitor"
time="2020-07-19T01:00:24.585Z" level=info msg="/argo/podmetadata/annotations updated"
time="2020-07-19T01:05:11.467Z" level=info msg="Alloc=6421 TotalAlloc=35999 Sys=70592 NumGC=10 Goroutines=10"
time="2020-07-19T01:10:11.466Z" level=info msg="Alloc=7634 TotalAlloc=57055 Sys=70592 NumGC=15 Goroutines=10"
time="2020-07-19T01:15:11.467Z" level=info msg="Alloc=4815 TotalAlloc=78116 Sys=70592 NumGC=21 Goroutines=10"
time="2020-07-19T01:20:11.467Z" level=info msg="Alloc=5581 TotalAlloc=99212 Sys=70592 NumGC=26 Goroutines=10"
time="2020-07-19T01:25:11.467Z" level=info msg="Alloc=6721 TotalAlloc=120295 Sys=70592 NumGC=31 Goroutines=10"
time="2020-07-19T01:30:11.466Z" level=info msg="Alloc=7779 TotalAlloc=141426 Sys=70592 NumGC=36 Goroutines=10"
time="2020-07-19T01:35:11.466Z" level=info msg="Alloc=8653 TotalAlloc=162495 Sys=70592 NumGC=41 Goroutines=10"
time="2020-07-19T01:40:11.467Z" level=info msg="Alloc=5847 TotalAlloc=183561 Sys=70592 NumGC=47 Goroutines=9"
time="2020-07-19T01:45:11.467Z" level=info msg="Alloc=6988 TotalAlloc=204831 Sys=70592 NumGC=52 Goroutines=9"
time="2020-07-19T01:50:11.467Z" level=info msg="Alloc=7950 TotalAlloc=225935 Sys=70592 NumGC=57 Goroutines=9"
time="2020-07-19T01:55:11.467Z" level=info msg="Alloc=8656 TotalAlloc=247019 Sys=70592 NumGC=62 Goroutines=9"
time="2020-07-19T02:00:11.467Z" level=info msg="Alloc=5337 TotalAlloc=268107 Sys=70592 NumGC=68 Goroutines=9"
time="2020-07-19T02:05:11.466Z" level=info msg="Alloc=6365 TotalAlloc=289225 Sys=70592 NumGC=73 Goroutines=9"
time="2020-07-19T02:10:11.467Z" level=info msg="Alloc=7742 TotalAlloc=310309 Sys=70592 NumGC=78 Goroutines=9"
time="2020-07-19T02:15:11.467Z" level=info msg="Alloc=8845 TotalAlloc=331398 Sys=70592 NumGC=83 Goroutines=9"
time="2020-07-19T02:20:11.467Z" level=info msg="Alloc=5269 TotalAlloc=352489 Sys=70592 NumGC=89 Goroutines=9"
time="2020-07-19T02:25:11.467Z" level=info msg="Alloc=6071 TotalAlloc=373563 Sys=70592 NumGC=94 Goroutines=9"
time="2020-07-19T02:29:38.457Z" level=warning msg="Failed to wait for container id 'fa0fa9a3f26016e9c26261aac16fd7e41aefe54ed3eec910cc9087a253f4e862': Get https://172.21.0.1:443/api/v1/namespaces/default/pods/nrsa-daily-schedule-nrl7m-2508837627: unexpected EOF"
time="2020-07-19T02:29:38.458Z" level=error msg="executor error: Get https://172.21.0.1:443/api/v1/namespaces/default/pods/nrsa-daily-schedule-nrl7m-2508837627: unexpected EOF"
time="2020-07-19T02:29:38.458Z" level=info msg="Annotations monitor stopped"
time="2020-07-19T02:29:38.458Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2020-07-19T02:29:38.458Z" level=info msg="Capturing script exit code"
time="2020-07-19T02:29:38.458Z" level=info msg="Getting exit code of fa0fa9a3f26016e9c26261aac16fd7e41aefe54ed3eec910cc9087a253f4e862"
time="2020-07-19T02:29:38.511Z" level=info msg="No output parameters"
time="2020-07-19T02:29:38.511Z" level=info msg="No output artifacts"
time="2020-07-19T02:29:38.511Z" level=info msg="Killing sidecars"
time="2020-07-19T02:29:38.531Z" level=info msg="Alloc=8348 TotalAlloc=391670 Sys=70592 NumGC=98 Goroutines=8"
time="2020-07-19T02:29:38.562Z" level=fatal msg="Get https://172.21.0.1:443/api/v1/namespaces/default/pods/nrsa-daily-schedule-nrl7m-2508837627: unexpected EOF"
  • workflow-controller logs:

kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name)

	Line 49096: time="2020-07-19T01:00:08Z" level=info msg="All of node nrsa-daily-schedule-nrl7m.TIME-00-15 dependencies [ON-START] completed" namespace=default workflow=nrsa-daily-schedule-nrl7m
	Line 49096: time="2020-07-19T01:00:08Z" level=info msg="All of node nrsa-daily-schedule-nrl7m.TIME-00-15 dependencies [ON-START] completed" namespace=default workflow=nrsa-daily-schedule-nrl7m
	Line 49098: time="2020-07-19T01:00:08Z" level=info msg="Created pod: nrsa-daily-schedule-nrl7m.TIME-00-15 (nrsa-daily-schedule-nrl7m-2508837627)" namespace=default workflow=nrsa-daily-schedule-nrl7m
	Line 49098: time="2020-07-19T01:00:08Z" level=info msg="Created pod: nrsa-daily-schedule-nrl7m.TIME-00-15 (nrsa-daily-schedule-nrl7m-2508837627)" namespace=default workflow=nrsa-daily-schedule-nrl7m
	Line 49098: time="2020-07-19T01:00:08Z" level=info msg="Created pod: nrsa-daily-schedule-nrl7m.TIME-00-15 (nrsa-daily-schedule-nrl7m-2508837627)" namespace=default workflow=nrsa-daily-schedule-nrl7m

Logs

argo get <workflowname>
kubectl logs <failedpodname> -c init
kubectl logs <failedpodname> -c wait
kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name)

YAML
task:

          - arguments:
              parameters:
                - name: cron
                  value: 15 0 * * *
                - name: schedule
                  value: "ANYDAY"
                - name: timeout
                  value: 86400 # 1 day
            dependencies: 
              - "ON-START"
            name: TIME-00-15
            template: job-checker

template:

    - podSpecPatch: '{"activeDeadlineSeconds":{{inputs.parameters.timeout}}, "retryStrategy": {"limit": {{inputs.parameters.retries}}}}'
      volumes:
        - name: nrsa-dev-storage
          persistentVolumeClaim:
            claimName: nrsa-dev-10iops-pvc
      container:
        volumeMounts:
            - name: nrsa-dev-storage
              mountPath: /usr/src/app/files
        env:
          - name: CRON
            value: "{{inputs.parameters.cron}}"
          - name: JOB_NAME
            value: "{{inputs.parameters.jobName}}"
          - name: FILE_NAME
            value: "{{inputs.parameters.fileName}}"
          - name: BASE_URL
            value: "http://nrsa-dev-service.default:80/gapwalk-application/script"
          - name: SCHEDULE
            value: "{{inputs.parameters.schedule}}"
          - name: SUCCESS_MESSAGE
            value: ""
          - name: DB_HOST
            valueFrom:
              secretKeyRef:
                key: DB_HOST
                name: nrsa-scheduler-logs-database
          - name: DB_NAME
            valueFrom:
              secretKeyRef:
                key: DB_NAME
                name: nrsa-scheduler-logs-database
          - name: DB_USER
            valueFrom:
              secretKeyRef:
                key: DB_USER
                name: nrsa-scheduler-logs-database
          - name: DB_PASSWORD
            valueFrom:
              secretKeyRef:
                key: DB_PASSWORD
                name: nrsa-scheduler-logs-database
          - name: DB_PORT
            valueFrom:
              secretKeyRef:
                key: DB_PORT
                name: nrsa-scheduler-logs-database
          - name: SSL
            value: "true"
          - name: NODE_TLS_REJECT_UNAUTHORIZED
            value: "0"
          - name: WORKFLOW_NAME
            value: "{{workflow.name}}"
        image: us.icr.io/et-travel/job-checker:latest
        imagePullPolicy: Always
      inputs:
        parameters:
          - name: cron # define cron or jobName but not both
            default: ""
          - name: jobName
            default: ""
          - name: fileName
            default: ""
          - name: schedule
            default: ""
          - name: timeout
            default: 1800
          - name: retries
            default: 0
      name: job-checker

Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@simster7 simster7 assigned simster7 and unassigned simster7 Jul 20, 2020
@alexec
Contributor

alexec commented Jul 20, 2020

Someone should probably investigate whether this is similar to the K8S Executor bug, and whether it is fragile and needs retries.
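
A minimal sketch of what such retries could look like, written in Python for illustration and not taken from the executor's own code. The name `retry_transient`, its parameters, and the choice of exception types are all hypothetical; the idea is only to retry a flaky pod GET a few times with exponential backoff before giving up:

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_transient(
    call: Callable[[], T],
    transient: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError, EOFError),
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying errors listed in `transient` with exponential backoff.

    Hypothetical helper: the exception types stand in for whatever the real
    client raises on "unexpected EOF" or a TLS handshake timeout.
    """
    for attempt in range(attempts):
        try:
            return call()
        except transient:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_get_pod():
        # Simulated API call that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise EOFError("unexpected EOF")
        return {"phase": "Succeeded"}

    print(retry_transient(flaky_get_pod, sleep=lambda seconds: None))
```

Errors that are not in `transient` propagate immediately, so a genuine failure is not masked by the retry loop.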

@lueenavarro
Contributor Author

Today I got a somewhat different message in the same scenario:
failed to save outputs: Get https://172.21.0.1:443/api/v1/namespaces/default/pods/nrsa-daily-schedule-4rcb2-1289784058: net/http: TLS handshake timeout

@maryoush
Contributor

Hi,
We have the same issue.
We have partly mitigated this problem by using retries at the step level,
but the solution is not ideal.
This unexpected EOF puts the step into Error rather than Failed, so we had to use a retry with retryPolicy "Always":


retryStrategy:
  limit: 3
  retryPolicy: "Always"


Additionally, this retry can also re-run steps that failed for business reasons.
We would really like to see a fix in the client calling the Argo REST API.
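
One way to keep retries away from business failures, sketched in Python under the assumption that the step's message is available as a plain string. `TRANSIENT_MARKERS` and `is_transient` are hypothetical names, and the two markers are simply the messages reported in this thread. Matching on message text is brittle and meant only as an illustration of the distinction:

```python
# Messages seen in this thread that point at the API connection, not the job.
TRANSIENT_MARKERS = ("unexpected EOF", "TLS handshake timeout")


def is_transient(message: str) -> bool:
    """Return True if the message looks like an infrastructure error."""
    return any(marker in message for marker in TRANSIENT_MARKERS)


if __name__ == "__main__":
    infra = "failed to save outputs: Get <URL>: unexpected EOF"
    business = "job returned a non-zero exit code"  # made-up example message
    print(is_transient(infra), is_transient(business))
```

Only messages for which `is_transient` returns True would be worth retrying; everything else would be reported as a real failure.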

@lueenavarro
Contributor Author

A retry is not possible in our case. I think I was able to solve this by using the kubelet executor instead of the k8sapi executor, but I still have to test whether that really solves the problem.

@maryoush
Contributor

maryoush commented Oct 9, 2020

@lueenavarro I believe 'docker' is the default executor?
