
[testing] postsubmit test failure 2021-08-12 #6311

Closed · Tracked by #5779
Bobgy opened this issue Aug 12, 2021 · 11 comments · Fixed by #6325 or #6379
Bobgy (Contributor) commented Aug 12, 2021

https://oss-prow.knative.dev/view/gs/oss-prow/logs/kubeflow-pipeline-postsubmit-standalone-component-test/1425587097510612992#1:build-log.txt%3A5394

For the standalone components test, the build fails when building the sample test image:

sample-test-rb7fv-2769807467: error: google-api-core 1.26.0 is installed but google-api-core<3.0dev,>=1.29.0 is required by {'google-cloud-storage'}
sample-test-rb7fv-2769807467: The command '/bin/sh -c cd /sdk/python && python3 setup.py install' returned a non-zero code: 1
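
For reference, a minimal sketch (not part of the sample test image) of reproducing a similar check outside the Docker build, assuming setuptools/pkg_resources is available; the constraint string mirrors what google-cloud-storage declares:

# Sketch: pkg_resources runs a dependency check similar to the one that fails
# during `python3 setup.py install`; it raises a conflict when the installed
# google-api-core (1.26.0 here) does not satisfy the declared constraint.
import pkg_resources

try:
    pkg_resources.require("google-api-core>=1.29.0,<3.0dev")
except pkg_resources.ResolutionError as err:
    print(f"dependency problem: {err}")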

Bobgy (Contributor, Author) commented Aug 12, 2021

Before this error showed up, there was a different error that we should also fix:
https://oss-prow.knative.dev/view/gs/oss-prow/logs/kubeflow-pipeline-postsubmit-integration-test/1425263307043901440#1:build-log.txt%3A9072

The error looks like a bug in Argo; the docker executor also does not support optional output artifacts:

executor error: path /tmp/outputs/MLPipeline_UI_metadata/data does not exist in archive /tmp/argo/outputs/artifacts/launch-python-MLPipeline-UI-metadata.tgz
github.com/argoproj/argo-workflows/v3/errors.New
	/go/src/github.com/argoproj/argo-workflows/errors/errors.go:49
github.com/argoproj/argo-workflows/v3/errors.Errorf
	/go/src/github.com/argoproj/argo-workflows/errors/errors.go:55

EDIT: I'm not sure now, because I found a passing postsubmit test after the 1.7.0-rc.3 release, which proves the current Argo version can successfully run postsubmit tests.

EDIT2: I think I'm getting closer to the root cause. Here's the Argo outputs annotation for the failed step in the dataflow pipeline sample:

workflows.argoproj.io/outputs: '{"artifacts":[{"name":"mlpipeline-ui-metadata","path":"/tmp/outputs/MLPipeline_UI_metadata/data","optional":true},{"name":"launch-python-MLPipeline-UI-metadata","path":"/tmp/outputs/MLPipeline_UI_metadata/data"},{"name":"launch-python-job_id","path":"/tmp/outputs/job_id/data"},{"name":"main-logs","s3":{"key":"artifacts/dataflow-launch-python-pipeline-8hdph/2021/08/11/dataflow-launch-python-pipeline-8hdph-2670089030/main.log"}}]}'

Note that the "launch-python-MLPipeline-UI-metadata" artifact should have "optional": true, but it does not. That caused the task to fail, so I guess this is a side-effect bug introduced by one of the recent KFP v1 compiler changes.

reference: pod info for the failed pod in dataflow sample pipeline: https://storage.googleapis.com/oss-prow/logs/kubeflow-pipeline-postsubmit-integration-test/1425515476460507136/artifacts/pods_info/dataflow-launch-python-pipeline-8hdph-2670089030.txt
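
For illustration, a small sketch (a hypothetical helper, not part of the test infra) that parses the outputs annotation above and reports legacy artifacts that are not marked optional:

# Sketch: parse the workflows.argoproj.io/outputs annotation and list
# mlpipeline-ui-metadata / mlpipeline-metrics style artifacts that lack
# "optional": true. With the annotation above, it reports
# "launch-python-MLPipeline-UI-metadata".
import json

LEGACY_MARKERS = ("MLPipeline_UI_metadata", "MLPipeline-UI-metadata",
                  "mlpipeline-ui-metadata", "mlpipeline-metrics")

def missing_optional(annotation_json: str) -> list:
    outputs = json.loads(annotation_json)
    return [
        artifact["name"]
        for artifact in outputs.get("artifacts", [])
        if any(marker in artifact["name"] for marker in LEGACY_MARKERS)
        and not artifact.get("optional", False)
    ]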

neuromage (Contributor) commented:

Thanks @Bobgy. Can you point me to the pipeline sample that fails?

zijianjoy (Collaborator) commented:

/assign @neuromage

google-oss-robot pushed a commit that referenced this issue Aug 13, 2021
…and mlpipeline-metrics. Fixes #6311  (#6325)

* Fix compiler bug for legacy outputs mlpipeline-ui-metadata and
mlpipeline-metrics.

* fix - only delete legacy outputs in non-v2 pipelines.

Bobgy (Contributor, Author) commented Aug 13, 2021

/reopen
postsubmit test is still failing: f743dde

google-oss-robot commented:

@Bobgy: Reopened this issue.

In response to this:

/reopen
postsubmit test is still failing: f743dde

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Bobgy mentioned this issue Aug 16, 2021
Bobgy (Contributor, Author) commented Aug 16, 2021

The build started to fail again; the error message shows up when building the inverse proxy image:

Digest: sha256:2e3c5ecd1a55b32056f3ce0c4aaac05e31b85c361d501bf3f8a81bec14c4fe87
Status: Downloaded newer image for gcr.io/inverting-proxy/agent@sha256:2e3c5ecd1a55b32056f3ce0c4aaac05e31b85c361d501bf3f8a81bec14c4fe87
---> ce436ce6c655
Step 2/14 : RUN apt-get update && apt-get install -y curl jq python-pip
---> Running in 29b8ef43d3d1
Get:1 http://deb.debian.org/debian buster InRelease [122 kB]
Get:2 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:3 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Reading package lists...
E: Repository 'http://security.debian.org/debian-security buster/updates InRelease' changed its 'Suite' value from 'stable' to 'oldstable'
E: Repository 'http://deb.debian.org/debian buster InRelease' changed its 'Suite' value from 'stable' to 'oldstable'
E: Repository 'http://deb.debian.org/debian buster-updates InRelease' changed its 'Suite' value from 'stable-updates' to 'oldstable-updates'
The command '/bin/sh -c apt-get update && apt-get install -y curl jq python-pip' returned a non-zero code: 100

https://forums.linuxmint.com/viewtopic.php?t=355148&p=2053997 seems to describe the problem

I think this basically means the inverse proxy Debian base image is too old; if we used a newer version, we should not hit this problem.

UPDATE: the latest version of the inverse proxy image (gcr.io/inverting-proxy/agent) still has an outdated Debian version.

We can work around it with:

apt update --allow-releaseinfo-change

UPDATE2: sent #6351
UPDATE3: after the above PR was merged, the build recovered and the postsubmit test fails as before, as mentioned in #6311 (comment).

Bobgy (Contributor, Author) commented Aug 16, 2021

The latest postsubmit test run still fails, but only the container builder and parameterized TFX samples are failing.
The container builder failed because the Kaniko job timed out, which is expected flakiness.
The TFX sample will be fixed by Jiyong.

https://storage.googleapis.com/oss-prow/logs/kubeflow-pipeline-postsubmit-integration-test/1427197975674753024/build-log.txt

Bobgy (Contributor, Author) commented Aug 16, 2021

The TFX sample fails with:

sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489: tensorflow.python.framework.errors_impl.InvalidArgumentError: Error executing an HTTP request: HTTP response code 400 with body '{
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489:   "error": {
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489:     "code": 400,
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489:     "message": "Invalid bucket name: '{{kfp-default-bucket}}'",
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489:     "errors": [
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489:       {
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489:         "message": "Invalid bucket name: '{{kfp-default-bucket}}'",
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489:         "domain": "global",
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489:         "reason": "invalid"
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489:       }
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489:     ]
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489:   }
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489: }
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489: '
sample-test-xcp5g-222029471: parameterized-tfx-oss-7qb9g-305390489: 	 when reading metadata of gs://{{kfp-default-bucket}}/tfx_taxi_simple/fe7841ca-0e29-4989-b13a-e32ea09b5d58/model_serving/1629110410
sample-test-xcp5g-222029471: 

So it seems parameterization isn't fully working in the test infra.

log: https://oss-prow.knative.dev/view/gs/oss-prow/logs/kubeflow-pipeline-postsubmit-integration-test/1427197975674753024#1:build-log.txt%3A16035
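
A minimal sketch (hypothetical component and parameter names, not the actual TFX sample) of the alternative: pass the GCS path as a pipeline runtime parameter so it is resolved at run time, instead of relying on the {{kfp-default-bucket}} placeholder being substituted inside component parameters:

# Sketch (KFP v1 SDK): take the GCS path as a runtime parameter instead of
# embedding a {{...}} placeholder in component parameters. Names are
# illustrative only.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def use_pipeline_root(pipeline_root: str) -> str:
    print("pipeline root:", pipeline_root)
    return pipeline_root

use_pipeline_root_op = create_component_from_func(use_pipeline_root)

@dsl.pipeline(name="parameterized-sample-sketch")
def pipeline(pipeline_root: str = "gs://your-bucket/tfx_taxi_simple"):
    use_pipeline_root_op(pipeline_root)

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(pipeline, "pipeline.yaml")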

Bobgy (Contributor, Author) commented Aug 17, 2021

I got the error message even though I filled in {{kfp-default-bucket}} from the pipeline parameter.
It seems that the placeholder in the model pusher is not resolved.

cc @jiyongjung0

jiyongjung0 pushed a commit to jiyongjung0/pipelines that referenced this issue Aug 18, 2021
{} Placeholder doesn't work well in component parameters and
it is better to have it as a runtime parameter for flexibility.

kubeflow#6311 (comment)
google-oss-robot pushed a commit that referenced this issue Aug 18, 2021
…ple. (#6373)

{} Placeholder doesn't work well in component parameters and
it is better to have it as a runtime parameter for flexibility.

#6311 (comment)

Bobgy (Contributor, Author) commented Aug 18, 2021

Besides the TFX pipeline issues, I see the following error message in the test infra very often:

failed to wait for main container to complete: timed out waiting for the condition: container does not exist

example pod https://storage.googleapis.com/oss-prow/pr-logs/pull/kubeflow_pipelines/6363/kubeflow-pipeline-e2e-test/1427862991075807232/artifacts/pods_info/integration-test-c7z5r-4280080516.txt

I identified the root cause and sent a fix to Argo: argoproj/argo-workflows#6561

Bobgy (Contributor, Author) commented Aug 20, 2021

There's one last thing: we reached the model count quota again, because of #2356.
