Get the mnist notebook tests to pass #65

Closed
jlewi opened this issue Jun 26, 2020 · 21 comments

@jlewi
Contributor

jlewi commented Jun 26, 2020

Split off from #42

#63 added the notebook tests. The Tekton workflows are being fired off:

https://k8s-testgrid.appspot.com/sig-big-data#kubeflow-gcp-blueprints-master-periodic
https://kf-ci-v1.endpoints.kubeflow-ci.cloud.goog/tekton/#/namespaces/kf-ci/pipelineruns/mnist-22vjq

The copy buckets step is failing, though, which causes the task to abort before it runs the step that copies the test artifacts.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.98
area/engprod 0.89


@jlewi
Contributor Author

jlewi commented Jun 30, 2020

@jlewi
Contributor Author

jlewi commented Jun 30, 2020

Error is:

Error while finding module specification for 'kubeflow.testing.tekton_client' (ModuleNotFoundError: No module named 'kubeflow')
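
This is the message `python -m` prints when it cannot resolve the module, usually because the directory containing the top-level package is not on the Python path. A minimal sketch for checking this before launching the step follows; the helper name `module_is_importable` is hypothetical and not part of kubeflow/testing.

```python
import importlib.util


def module_is_importable(name: str) -> bool:
    """Return True if `name` can be resolved on the current sys.path."""
    try:
        # find_spec imports parent packages first, so a missing top-level
        # package (e.g. `kubeflow`) raises ModuleNotFoundError instead of
        # returning None.
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False
```

Calling `module_is_importable("kubeflow.testing.tekton_client")` inside the worker image would show whether the source checkout is visible to the interpreter.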

@jlewi
Contributor Author

jlewi commented Jun 30, 2020

jlewi pushed a commit to jlewi/testing that referenced this issue Jun 30, 2020
jlewi pushed a commit to jlewi/testing that referenced this issue Jun 30, 2020
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue Jun 30, 2020
jlewi changed the title from "The mnist notebook tests are not reporting results to test grid" to "Get the mnist notebook tests to pass" Jul 1, 2020
@jlewi
Contributor Author

jlewi commented Jul 1, 2020

The latest run
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1278139102348185600
https://kf-ci-v1.endpoints.kubeflow-ci.cloud.goog/tekton/#/namespaces/kf-ci/pipelineruns/mnist-x7jtk

The copy buckets command ends up being

gsutil cp -r / gs://kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1278139102348185600
  • The source path is just /, so there's a bug in how the command is built.
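
A guard like the following would have stopped the step before it tried to copy the whole filesystem. This is only a sketch: `build_copy_command` is a hypothetical helper, not something in the Tekton task, and it assumes the source is a local directory and the destination a `gs://` URL.

```python
from typing import List


def build_copy_command(src: str, dest: str) -> List[str]:
    """Build a `gsutil cp -r` argv, refusing an empty or root source path."""
    normalized = src.strip().rstrip("/")
    if not normalized:
        # An unset parameter tends to render as "" or "/", which would
        # recursively copy everything on the filesystem.
        raise ValueError(f"refusing to copy from suspicious source {src!r}")
    if not dest.startswith("gs://"):
        raise ValueError(f"destination must be a gs:// URL, got {dest!r}")
    return ["gsutil", "cp", "-r", normalized, dest]
```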

Copy artifacts step is also failing

INFO|2020-07-01T01:37:51|/srcCache/kubeflow/testing/py/kubeflow/testing/tekton_client.py|391| Walking through directory: /workspace/artifacts
INFO|2020-07-01T01:37:51|/srcCache/kubeflow/testing/py/kubeflow/testing/tekton_client.py|400| Parsing JUNIT: junit_notebook.xml
ERROR|2020-07-01T01:37:51|/srcCache/kubeflow/testing/py/kubeflow/testing/tekton_client.py|411| pytest has failure: message not found
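
The last line reports a failure whose message could not be found. As a rough illustration of reading failures from a JUnit file while tolerating a missing `message` attribute, here is a sketch that assumes the common JUnit XML layout (`testcase` elements with optional `failure` or `error` children). It is not the code in tekton_client.py, and `find_failures` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def find_failures(junit_xml: str) -> List[Tuple[str, str]]:
    """Return (test name, message) for each failed or errored test case."""
    root = ET.fromstring(junit_xml)
    failures = []
    # iter() walks the whole tree, so this handles both a bare <testsuite>
    # root and a <testsuites> wrapper.
    for case in root.iter("testcase"):
        for tag in ("failure", "error"):
            node = case.find(tag)
            if node is not None:
                # The `message` attribute may be absent; fall back to the
                # element text instead of failing on a missing key.
                message = node.get("message") or (node.text or "").strip()
                failures.append((case.get("name", "<unnamed>"), message))
    return failures
```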

@jlewi
Contributor Author

jlewi commented Jul 1, 2020

Logs for the papermill job
https://console.cloud.google.com/logs/viewer?project=kubeflow-ci-deployment&interval=P7D&advancedFilter=resource.type%3D%22k8s_container%22%0Alabels.%22k8s-pod%2Fjob-name%22+%3D+%22mnist-gcp-013506-929%22%0A

Running the notebook fails with an import error

ImportError: cannot import name 'V1alpha2TensorRTSpec'

@jlewi
Contributor Author

jlewi commented Jul 1, 2020

The path used to upload the notebook to GCS doesn't look wrong, but there is an exception when trying to upload it:

INFO|2020-07-01T01:36:08|Uploading notebook to gs://kubeflow-ci-deployment_ci-temp/mnist_test|/src/kubeflow/testing/py/kubeflow/testing/notebook_tests/execute_notebook.py|65|
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/google/auth/compute_engine/credentials.py", line 96, in refresh
    self._retrieve_info(request)
  File "/usr/local/lib/python3.6/dist-packages/google/auth/compute_engine/credentials.py", line 77, in _retrieve_info
    request, service_account=self._service_account_email
  File "/usr/local/lib/python3.6/dist-packages/google/auth/compute_engine/_metadata.py", line 219, in get_service_account_info
    recursive=True,
  File "/usr/local/lib/python3.6/dist-packages/google/auth/compute_engine/_metadata.py", line 172, in get
    response,
google.auth.exceptions.TransportError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Enginemetadata service. Status: 500 Response:\nb'Could not recursively fetch uri\\n'", <google.auth.transport.requests._Response object at 0x7f834cbd10f0>)

jlewi pushed a commit to jlewi/testing that referenced this issue Jul 1, 2020
* Related GoogleCloudPlatform/kubeflow-distribution#51 get-credentials isn't finding any clusters
  because when using Fire the parameter should be --pattern not --location

* Related GoogleCloudPlatform/kubeflow-distribution#65 When copying the bucket output
  in the notebook tests the parameter should be params.notebook-output
  not params.output
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue Jul 1, 2020
* Related GoogleCloudPlatform/kubeflow-distribution#51 get-credentials isn't finding any clusters
  because when using Fire the parameter should be --pattern not --location

* Related GoogleCloudPlatform/kubeflow-distribution#65 When copying the bucket output
  in the notebook tests the parameter should be params.notebook-output
  not params.output
@jlewi
Contributor Author

jlewi commented Jul 2, 2020

Latest run
https://k8s-testgrid.appspot.com/sig-big-data#kubeflow-gcp-blueprints-master-periodic&group-by-hierarchy-pattern=%5B%5Cw-%5D%2B
https://kf-ci-v1.endpoints.kubeflow-ci.cloud.goog/tekton/#/namespaces/kf-ci/pipelineruns/mnist-f878t

Latest failure

ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
INFO|2020-07-02T01:07:44|/usr/local/lib/python3.8/dist-packages/oauth2client/transport.py|157| Attempting refresh to obtain initial access_token
WARNING|2020-07-02T01:07:44|/usr/local/lib/python3.8/dist-packages/googleapiclient/http.py|123| Invalid JSON content from response: b'{\n  "error": {\n    "code": 403,\n    "message": "Required \\"container.clusters.list\\" permission(s) for \\"projects/get-credentials\\".",\n    "status": "PERMISSION_DENIED"\n  }\n}\n'
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/srcCache/kubeflow/testing/py/kubeflow/testing/get_kf_testing_cluster.py", line 429, in <module>
    fire.Fire(CredentialHelper)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/srcCache/kubeflow/testing/py/kubeflow/testing/get_kf_testing_cluster.py", line 379, in get_credentials
    c = _get_latest_cluster(project, location, pattern)
  File "/srcCache/kubeflow/testing/py/kubeflow/testing/get_kf_testing_cluster.py", line 235, in _get_latest_cluster
    for c in _iter_cluster(project, location):
  File "/srcCache/kubeflow/testing/py/kubeflow/testing/get_kf_testing_cluster.py", line 153, in _iter_cluster
    clusters = clusters_client.list(parent=parent).execute()
  File "/usr/local/lib/python3.8/dist-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/googleapiclient/http.py", line 907, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 403 when requesting https://container.googleapis.com/v1/projects/get-credentials/locations/us-central1-c/clusters?alt=json returned "Required "container.clusters.list" permission(s) for "projects/get-credentials".">

Step failed

@jlewi
Contributor Author

jlewi commented Jul 2, 2020

Here was the command executed

name: get-credential
args:
  - '-m'
  - kubeflow.testing.get_kf_testing_cluster
  - get-credentials
  - '--pattern=$(inputs.params.testing-cluster-pattern)'
  - '--location=$(inputs.params.testing-cluster-location)'
  - get-credentials
command:
  - python

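So it looks like a bug in the command: get-credentials is passed twice, and the second occurrence appears to be picked up as the project (the 403 above is for "projects/get-credentials").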
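
The args list above contains `get-credentials` twice. A small lint over a step's args can flag repeated positional arguments before the task runs. The sketch below is an illustration, not part of the test infrastructure; `repeated_positionals` is a hypothetical helper, and it simply treats anything not starting with `-` as positional.

```python
from collections import Counter
from typing import List


def repeated_positionals(args: List[str]) -> List[str]:
    """Return non-flag arguments that occur more than once, in first-seen order."""
    positionals = [a for a in args if not a.startswith("-")]
    counts = Counter(positionals)
    repeated = []
    for arg in positionals:
        if counts[arg] > 1 and arg not in repeated:
            repeated.append(arg)
    return repeated


# The args of the get-credential step as shown in the comment above.
step_args = [
    "-m",
    "kubeflow.testing.get_kf_testing_cluster",
    "get-credentials",
    "--pattern=$(inputs.params.testing-cluster-pattern)",
    "--location=$(inputs.params.testing-cluster-location)",
    "get-credentials",
]
```

Running `repeated_positionals(step_args)` on this list reports the duplicated subcommand.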

jlewi pushed a commit to jlewi/testing that referenced this issue Jul 2, 2020
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue Jul 2, 2020
…ploy a stateful set that is in sync with the worker image (#713)

* To support debugging of the test worker image deploy a stateful set that
is in sync with the worker image

* Fix get-credentials in notebook tasks

* Related to GoogleCloudPlatform/kubeflow-distribution#65
@jlewi
Contributor Author

jlewi commented Jul 2, 2020

Latest run
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1278686481346465793
https://kf-ci-v1.endpoints.kubeflow-ci.cloud.goog/tekton/#/namespaces/kf-ci/pipelineruns/mnist-np5kn

https://console.cloud.google.com/logs/viewer?project=kubeflow-ci-deployment&interval=P7D&advancedFilter=resource.type%3D%22k8s_container%22%0Alabels.%22k8s-pod%2Fjob-name%22+%3D+%22mnist-gcp-135002-5cc%22%0A

Copying papermill output failed

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/google/auth/compute_engine/credentials.py", line 96, in refresh
    self._retrieve_info(request)
  File "/usr/local/lib/python3.6/dist-packages/google/auth/compute_engine/credentials.py", line 77, in _retrieve_info
    request, service_account=self._service_account_email
  File "/usr/local/lib/python3.6/dist-packages/google/auth/compute_engine/_metadata.py", line 219, in get_service_account_info
    recursive=True,
  File "/usr/local/lib/python3.6/dist-packages/google/auth/compute_engine/_metadata.py", line 172, in get
    response,
google.auth.exceptions.TransportError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Enginemetadata service. Status: 500 Response:\nb'Could not recursively fetch uri\\n'", <google.auth.transport.requests._Response object at 0x7f834cbd10f0>)

Are we not setting the workload identity correctly?
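
One way to investigate is to ask the metadata server directly, from inside the pod, which service account it hands out. The following is a minimal debugging sketch using only the standard library; the function names are hypothetical, and it assumes the pod can reach `metadata.google.internal`.

```python
import urllib.request

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)


def build_metadata_request(url: str = METADATA_URL) -> urllib.request.Request:
    """Build the request; the metadata server requires this header."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def current_service_account(timeout: float = 5.0) -> str:
    """Return the service account email the pod is running as.

    Only works from inside GCE/GKE; elsewhere the hostname will not resolve.
    """
    with urllib.request.urlopen(build_metadata_request(), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()
```

If `current_service_account()` raises an HTTP error from inside the job's pod, the problem is likely in the identity setup rather than in the notebook code.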

No results are reported in test grid

gsutil ls -la gs://kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1278686481346465793
     47306  2020-07-02T13:51:40Z  gs://kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1278686481346465793/build-log.txt#1593697900767497  metageneration=1
       555  2020-07-02T13:51:41Z  gs://kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1278686481346465793/finished.json#1593697901423058  metageneration=1
       120  2020-07-02T13:49:46Z  gs://kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1278686481346465793/image.yaml#1593697786672809  metageneration=1
      6712  2020-07-02T13:51:30Z  gs://kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1278686481346465793/junit_notebook.xml#1593697890349663  metageneration=1
     10571  2020-07-02T13:52:08Z  gs://kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1278686481346465793/podinfo.json#1593697928473542  metageneration=1
      2694  2020-07-02T13:52:04Z  gs://kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1278686481346465793/prowjob.json#1593697924436418  metageneration=1
       724  2020-07-02T14:37:50Z  gs://kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1278686481346465793/started.json#1593700670784310  metageneration=1
                                 gs://kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1278686481346465793/artifacts/

Looks like the directory for junit_notebook.xml is wrong.

Should be in the artifacts subdirectory.
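
A minimal sketch of computing the destination, assuming the junit file should sit directly under `artifacts/` in the job's output directory (the actual fix may place it in a further subdirectory). `junit_destination` is a hypothetical helper.

```python
import posixpath


def junit_destination(job_output_dir: str, filename: str) -> str:
    """Return the GCS path under `artifacts/` for a junit file."""
    if not (filename.startswith("junit_") and filename.endswith(".xml")):
        raise ValueError(f"expected a junit_*.xml file name, got {filename!r}")
    # GCS object names use forward slashes, so posixpath is the right joiner
    # regardless of the platform running the script.
    return posixpath.join(job_output_dir.rstrip("/"), "artifacts", filename)
```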

jlewi pushed a commit to jlewi/testing that referenced this issue Jul 2, 2020
* We need to upload the junits to the artifacts/junit_* directory

Related to GoogleCloudPlatform/kubeflow-distribution#65
jlewi pushed a commit to jlewi/testing that referenced this issue Jul 2, 2020
* We need to upload the junits to the artifacts/junit_* directory

Related to GoogleCloudPlatform/kubeflow-distribution#65
@jlewi
Contributor Author

jlewi commented Jul 2, 2020

Filed kubeflow/examples#806 about the actual error in the notebook

jlewi pushed a commit to jlewi/testing that referenced this issue Jul 2, 2020
* We need to upload the junits to the artifacts/junit_* directory

Related to GoogleCloudPlatform/kubeflow-distribution#65
jlewi pushed a commit to jlewi/testing that referenced this issue Jul 2, 2020
* We need to upload the junits to the artifacts/junit_* directory

Related to GoogleCloudPlatform/kubeflow-distribution#65
@NikeNano

NikeNano commented Jul 2, 2020

I made a few small attempts to also test this on the mpi-operator repo, just triggering the mnist notebook, but it seems to fail as well. I don't fully understand whether it is due to the same issue. I'm keeping track of this issue so I can fix it.

@jlewi
Contributor Author

jlewi commented Jul 2, 2020

kubeflow/testing#716 fixed the issue with the junit file not being copied to GCS.
https://testgrid.k8s.io/sig-big-data#kubeflow-gcp-blueprints-master-periodic

k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue Jul 3, 2020
* We need to upload the junits to the artifacts/junit_* directory

Related to GoogleCloudPlatform/kubeflow-distribution#65
@NikeNano

NikeNano commented Jul 4, 2020

I have tried it out and, as I understand the logs, I get some issues with JUnit: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubeflow_mpi-operator/244/kubeflow-mpi-operator-presubmit/1279419183024574464/ . Currently I am testing with a dummy notebook that only prints "Done".

@jlewi
Contributor Author

jlewi commented Jul 6, 2020

kubeflow/examples#807 should fix the KFServing error.
The notebook still doesn't run. It fails trying to launch Kaniko pods with

RefreshError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Enginemetadata service. Status: 500 Response:\nb'Could not recursively fetch uri\\n'", <google.auth.transport.requests._Response object at 0x7f840290de10>)

This is most likely due to problems with workload identity, which is being tracked in #61.

@jlewi
Contributor Author

jlewi commented Jul 6, 2020

@NikeNano Your logs indicate that your test harness was unable to find the HTML file on GCS containing the rendered notebook. I suspect it's because the job running on the KF cluster didn't have permission to write the rendered notebook to GCS. As of right now, the test infra for the notebooks appears to be working. If you are having additional problems with your notebook test, please file a separate issue specific to your test and let's use that to track it.

@NikeNano

NikeNano commented Jul 6, 2020

Thanks for the help @jlewi.

@jlewi
Contributor Author

jlewi commented Jul 8, 2020

https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1280696285153726467
https://kf-ci-v1.endpoints.kubeflow-ci.cloud.goog/tekton/#/namespaces/kf-ci/pipelineruns/mnist-r88kk

Looks like a 404 when reading the image file:

      logging.info(f"Reading file {image_file}")
      contents = util.read_file(image_file)

The image file doesn't exist

gs://kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1280696285153726467/artifacts/junit_mnist-notebook/image.yaml

@jlewi
Contributor Author

jlewi commented Jul 8, 2020

The image step writes the image file to:
gs://kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1280696285153726467/image.yaml

@jlewi
Contributor Author

jlewi commented Jul 8, 2020

So artifacts-gcs isn't set consistently across the steps.
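
A quick check along these lines can catch the mismatch before a run. This is only a sketch: the step names below are hypothetical, and the two values are the directories observed in the comments above (where the image step wrote image.yaml versus where the later step looked for it).

```python
from typing import Dict, Set


def distinct_values(step_params: Dict[str, str]) -> Set[str]:
    """Return the distinct values of a parameter across steps."""
    # Trailing slashes are ignored so "gs://b/x" and "gs://b/x/" compare equal.
    return {value.rstrip("/") for value in step_params.values()}


def is_consistent(step_params: Dict[str, str]) -> bool:
    """True when every step was given the same value."""
    return len(distinct_values(step_params)) <= 1


RUN = "gs://kubernetes-jenkins/logs/kubeflow-gcp-blueprints-master-periodic/1280696285153726467"

# Hypothetical step names mapped to the artifacts-gcs value each one used.
artifacts_gcs = {
    "write-image-file": RUN,
    "read-image-file": RUN + "/artifacts/junit_mnist-notebook",
}
```

Here `is_consistent(artifacts_gcs)` is false, which matches the 404: the reader looks in a different directory from the one the writer used.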

@jlewi
Contributor Author

jlewi commented Jul 8, 2020

@jlewi jlewi closed this as completed Jul 8, 2020