Distributions readiness for KF 1.5 #2146

Closed
kimwnasptd opened this issue Feb 17, 2022 · 33 comments

Comments

@kimwnasptd
Member

kimwnasptd commented Feb 17, 2022

Prev issue #2038

Distribution testing phase - handbook

The goal of this issue is to track the progress of distributions alongside the 1.5 release and to coordinate our communications. The first goal is to expose any issues we bump into here, so that all distros can keep an eye on them.

While we hope all distros will manage to be ready when the KF 1.5 release is out, this is sometimes impossible to achieve. In this issue we want to keep track of the progress of distributions towards the KF 1.5 release, and also of which distros will keep working on KF 1.5 even if they can't meet the KF 1.5 deadline.

Without further ado, here's the list of distros we have in mind:

| Distribution | Representatives | State |
|---|---|---|
| Arrikto EKF | @kimwnasptd | |
| Arrikto MiniKF | @kimwnasptd | |
| Azure | | |
| AWS | @surajkota | |
| Charmed Kubeflow | @DomFleischmann | |
| Google Cloud | @zijianjoy | |
| IBM | @yhwang | |
| Nutanix | @johnugeorge | |
| Kubeflow with Argo CD | @davidspek | |
| Openshift | @nakfour @LaVLaS | |

So let's use this issue to expose our state while testing the KF 1.5 release, and also to give users a heads up about the progress of distros with the KF 1.5 release.

@kimwnasptd
Member Author

We urge everyone to start their testing from the latest v1.5.0-rc.1 manifests tag. If anyone bumps into a problem, please open an issue and add a comment here as well so that we can all be in sync.

Regarding Arrikto's plans for the KF 1.5 release, we are also targeting having our products ready by the deadline. Even if we don't manage that, we will still be testing over the following weeks and reporting bugs.

@kimwnasptd
Member Author

And also
cc @kubeflow/release-team

@zijianjoy
Contributor

Hello @kimwnasptd, which model-web-app does the central dashboard integrate with? There are KFServing and KServe. I am curious how to choose between these two web apps in the central dashboard.

@surajkota
Contributor

surajkota commented Feb 18, 2022

Created a tracking issue for AWS distribution work - awslabs/kubeflow-manifests#91

As part of the distribution testing phase, we are targeting getting Generic/Vanilla Kubeflow (i.e. as-is from this repository) working on EKS. Other features and the release will follow.

@zijianjoy
Contributor

zijianjoy commented Feb 23, 2022

Hello Kimonas, I would like to provide an update which requires changes to manifests as we are validating the Google Cloud distribution.

  1. Update KFP to v1.8.1-rc.0: https://github.com/kubeflow/pipelines/releases/tag/1.8.1-rc.0. This includes only fixes and no features.
  2. I encountered issues when deploying a kfserving endpoint using the mnist sample. The way I resolved this issue is by running the following commands:
kubectl patch mutatingwebhookconfiguration inferenceservice.serving.kubeflow.org --patch '{"webhooks":[{"name": "inferenceservice.kfserving-webhook-server.v1beta1.defaulter","objectSelector":{"matchExpressions":[{"key":"serving.kubeflow.org/inferenceservice", "operator": "Exists"}]}}]}'
 
kubectl patch ValidatingWebhookConfiguration inferenceservice.serving.kubeflow.org --patch '{"webhooks":[{"name": "inferenceservice.kfserving-webhook-server.v1beta1.validator","objectSelector":{"matchExpressions":[{"key":"serving.kubeflow.org/inferenceservice", "operator": "Exists"}]}}]}'

I think it is related to kserve/kserve#568 (comment). My testing environment is GKE v1.20.12. The error message is:

Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': '48360d9d-9621-43e8-a580-f40d74568b19', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '3e136267-4e52-4e29-9aa1-764e7dadc339', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'b7e6649d-ea7f-442f-b7b5-7ea82514ebd3', 'Date': 'Wed, 23 Feb 2022 22:54:10 GMT', 'Content-Length': '717'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"inferenceservice.kfserving-webhook-server.v1beta1.validator\": Post \"https://kfserving-webhook-server-service.kubeflow.svc:443/validate-serving-kubeflow-org-v1beta1-inferenceservice?timeout=30s\": x509: certificate signed by unknown authority","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"inferenceservice.kfserving-webhook-server.v1beta1.validator\": Post \"https://kfserving-webhook-server-service.kubeflow.svc:443/validate-serving-kubeflow-org-v1beta1-inferenceservice?timeout=30s\": x509: certificate signed by unknown authority"}]},"code":500}
  3. I encountered the following issue when saving data during the Kubeflow - Serve Model using KFServing step:
    https://github.com/kubeflow/pipelines/blob/master/samples/contrib/kubeflow-e2e-mnist/kubeflow-e2e-mnist.ipynb
                                                                                                                
Traceback (most recent call last):
  File "kfservingdeployer.py", line 437, in <module>
    main()
{'apiVersion': 'serving.kubeflow.org/v1beta1', 'kind': 'InferenceService', 'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'}, 'creationTimestamp': '2022-02-23T23:16:40Z', 'finalizers': ['inferenceservice.finalizers'], 'generation': 2, 'managedFields': [{'apiVersion': 'serving.kubeflow.org/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:annotations': {'.': {}, 'f:sidecar.istio.io/inject': {}}}, 'f:spec': {'.': {}, 'f:predictor': {'.': {}, 'f:tensorflow': {'.': {}, 'f:storageUri': {}}}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2022-02-23T23:16:40Z'}, {'apiVersion': 'serving.kubeflow.org/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:finalizers': {}}, 'f:spec': {'f:predictor': {'f:tensorflow': {'f:name': {}, 'f:resources': {}}}}, 'f:status': {}}, 'manager': 'manager', 'operation': 'Update', 'time': '2022-02-23T23:16:40Z'}], 'name': 'mnist-e2e-v1beta1-validator', 'namespace': 'jamxl', 'resourceVersion': '110397', 'uid': '706c317d-dc10-4f6d-94c4-307e42a5d7be'}, 'spec': {'predictor': {'tensorflow': {'name': '', 'resources': {}, 'storageUri': 'pvc://end-to-end-pipeline-6wmv9-model-volume/'}}}}
  File "kfservingdeployer.py", line 394, in main
    for condition in model_status["status"]["conditions"]:
KeyError: 'status'
time="2022-02-23T23:21:43.263Z" level=error msg="cannot save artifact /tmp/outputs/InferenceService_Status/data" argo=true error="stat /tmp/outputs/InferenceService_Status/data: no such file or directory"
Error: exit status 1

Do you know how to resolve the last issue? @kimwnasptd @andreyvelich

@ryansteakley

ryansteakley commented Feb 25, 2022

@kimwnasptd Checking in here from AWS. I attempted a vanilla installation into a fresh EKS cluster on Kubernetes 1.19, installing the manifests using the single-line command. I cannot connect via port-forwarding, and the issue appears to come down to the cache-deployer-deployment pod being stuck in an error state.

echo 'ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.'
ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.

When I change the cache-deployer image back to 1.5.0 from 1.8.0, it works properly again. I saw you had run into the same issue. Do you know how to resolve it? kubeflow/pipelines#7093 (comment)

images:
- name: gcr.io/ml-pipeline/cache-deployer
  newTag: 1.8.0

@zijianjoy
Contributor

After discussion and debugging, we found that issues 2 and 3 in #2146 (comment) occur because I deployed KFServing and KServe together. My current suggestion is to deploy only one of them (KFServing), until we figure out how to migrate to KServe successfully and validate it using an updated mnist E2E script.

@kimwnasptd
Member Author

@zijianjoy @ryansteakley thank you very much for exposing your progress!

Hello @kimwnasptd, which model-web-app does the central dashboard integrate with? There are KFServing and KServe. I am curious how to choose between these two web apps in the central dashboard.

I'll provide some instructions for this very soon, on how someone will be able to use the KServe app. I'll also make this the default app that will be used by the dashboard, but there are some rough edges right now. I'll create the issues accordingly and give a heads up here again.

  1. Update KFP to v1.8.1-rc.0: https://github.com/kubeflow/pipelines/releases/tag/1.8.1-rc.0. This includes only fixes and no features.

I'll also make a PR to update our manifests with this latest RC.

  2. I encountered issues when deploying a kfserving endpoint using the mnist sample. The way I resolved this issue is by running the following commands:

I haven't bumped into this while testing the manifests. It's also not clear to me yet why this error happened now, since we had the same KFServing 0.6.1 manifests from the KF 1.4 release. In any case, thank you James for providing instructions for handling it. I'll look more into it.
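For readers adapting the workaround quoted earlier in this thread: the two `kubectl patch` bodies differ only in the webhook name. A minimal Python sketch that builds them, assuming nothing beyond the webhook names and the label key that appear in those commands (the helper name `object_selector_patch` is hypothetical):

```python
import json

# Label key copied from the kubectl commands earlier in the thread.
LABEL_KEY = "serving.kubeflow.org/inferenceservice"


def object_selector_patch(webhook_name: str, label_key: str = LABEL_KEY) -> str:
    """Build the JSON patch body that limits one webhook to labelled objects.

    Hypothetical helper for illustration; it only assembles the same
    structure that the original commands pass to --patch.
    """
    patch = {
        "webhooks": [
            {
                "name": webhook_name,
                "objectSelector": {
                    "matchExpressions": [
                        {"key": label_key, "operator": "Exists"}
                    ]
                },
            }
        ]
    }
    return json.dumps(patch)


defaulter_patch = object_selector_patch(
    "inferenceservice.kfserving-webhook-server.v1beta1.defaulter"
)
validator_patch = object_selector_patch(
    "inferenceservice.kfserving-webhook-server.v1beta1.validator"
)
print(defaulter_patch)
print(validator_patch)
```

Each string could then be passed as the `--patch` argument, as in the original commands.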

  3. I encountered the following issue when saving data during the Kubeflow - Serve Model using KFServing step:
    https://github.com/kubeflow/pipelines/blob/master/samples/contrib/kubeflow-e2e-mnist/kubeflow-e2e-mnist.ipynb

I haven't bumped into this either. It looks like the InferenceService fails to get a status field. I'll open a new issue to track this independently and ping you there as well to debug further.
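The `KeyError: 'status'` in the traceback earlier in this thread comes from indexing `model_status["status"]` on an object that has no `status` field yet. A minimal sketch of a more defensive read follows. It is not the actual fix in `kfservingdeployer.py`; the helper names and the `Ready` condition check are assumptions made for illustration:

```python
from typing import Any, Dict, List


def get_conditions(model_status: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return status.conditions, or an empty list when status is missing."""
    status = model_status.get("status") or {}
    return status.get("conditions") or []


def is_ready(model_status: Dict[str, Any]) -> bool:
    """Assumed readiness check: a condition of type Ready with status "True"."""
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in get_conditions(model_status)
    )


# Trimmed version of the object printed in the traceback: no "status" key.
no_status = {
    "apiVersion": "serving.kubeflow.org/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "mnist-e2e-v1beta1-validator"},
    "spec": {"predictor": {"tensorflow": {"storageUri": "pvc://end-to-end-pipeline-6wmv9-model-volume/"}}},
}

print(get_conditions(no_status))  # empty list instead of a KeyError
print(is_ready(no_status))
```

A caller could keep polling while `is_ready` returns false instead of crashing on the first read.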

@kimwnasptd
Member Author

@ryansteakley regarding #2146 (comment) can you double check you are using the v1.5.0-rc.1 of the manifests?

That RC includes KFP 1.8.0, which in turn includes the fix for the cache-deployer AFAIK kubeflow/pipelines#7273

@ryansteakley

ryansteakley commented Feb 26, 2022

@kimwnasptd Yes, I'm checking out the v1.5.0-rc.1 tag of the manifests to test the vanilla kubeflow on EKS 1.19 using 1.20 kubectl locally.

@zijianjoy
Contributor

@kimwnasptd Google Cloud distribution is ready. 🚀

@yhwang
Member

yhwang commented Feb 28, 2022

@kimwnasptd IBM IKS is ready for k8s 1.21.
However, I am waiting for knative 0.22.3 and am going to try it out on k8s 1.22.

@pwzhong

pwzhong commented Mar 1, 2022

@kimwnasptd Kubeflow Azure distribution has remained in v1.2 for two years, while v1.5 is coming up soon. Do you know if there is any plan to release Azure distribution with a more recent version? Who is the point of contact/representative?

I believe this is blocking Azure users from using Kubeflow. v1.2 is on old k8s and istio versions, and v1.4 has no clear documentation for Azure. It is quite hard to make it run on k8s 1.20+, which is all that AKS supports.

@kimwnasptd
Member Author

A heads up, we've cut the new RC of the manifests.

I've added a more detailed explanation in #2112 (comment)

@kimwnasptd
Member Author

@kimwnasptd Kubeflow Azure distribution has remained in v1.2 for two years, while v1.5 is coming up soon. Do you know if there is any plan to release Azure distribution with a more recent version? Who is the point of contact/representative?

@pwzhong unfortunately I don't have any more insights on this. We've tried to reach out to the maintainers throughout the releases, but we didn't get any feedback.

@johnugeorge
Member

@kimwnasptd Nutanix Karbon is ready for k8s 1.21. Tested with latest RC - v1.5.0-rc.2

@surajkota
Contributor

surajkota commented Mar 8, 2022

Update from the AWS side, Status: GREEN

Given the testing timeframe, we have tested 1.5.0-rc2 and will continue testing.

Manually tested that current kubeflow/manifests master works with EKS 1.20.
Also successfully ran the tests in https://github.com/kubeflow/manifests/tree/master/tests/e2e

Originally posted by @akartsky in awslabs/kubeflow-manifests#91 (comment)

@yhwang
Member

yhwang commented Mar 8, 2022

For IBM IKS, I re-ran all test cases using v1.5.0-rc.2. Everything is good on k8s 1.21. We are using KServe. KFServing is not verified.

@akartsky

akartsky commented Mar 9, 2022

From AWS, Status: GREEN

Successfully tested EKS 1.19, 1.20 & 1.21 and ran mnist-e2e test from this PR: #2164

AWS Release Tracker : awslabs/kubeflow-manifests#91

@akartsky

akartsky commented Mar 9, 2022

From AWS, Status: RED

I just noticed an issue with the cache-deployer Pod, even with the latest master and rc2.

I did not notice it before because this pod stays in the Running state for a few seconds before going into a crash loop,
and I was able to run the sample pipelines/notebooks tests successfully.

@surajkota
Contributor

surajkota commented Mar 9, 2022

@jbottum, @kubeflow/release-team We would like to request an extension on the release so we can get help resolving #2165 on EKS. Please let me know your thoughts.

Apologies for the last-minute request. We were under the impression that the issue was no longer present in the latest rc2.

@surajkota
Contributor

surajkota commented Mar 9, 2022

@yhwang @johnugeorge @zijianjoy have you checked the cache-deployer-deployment pod in your Kubeflow deployment while testing?

It stays in the Running state for a few seconds while it retries and then restarts.

@jbottum

jbottum commented Mar 9, 2022

@surajkota thanks for this report. If this is a reproducible bug, then I would consider it a P1, which could block the release. I am trying to understand the context, i.e. is this caching for Pipelines (https://www.kubeflow.org/docs/components/pipelines/overview/caching/)? @kimwnasptd I believe you were going to test today with RC2 + final fixes. Have you been able to reproduce the referenced issue?

@yhwang
Member

yhwang commented Mar 9, 2022

@surajkota for IBM IKS, I don't see that issue and the caching function works properly. I do have a test case to verify caching and it works well.

@Tomcli
Member

Tomcli commented Mar 9, 2022

@yhwang @johnugeorge @zijianjoy have you checked the cache-deployer-deployment pod in your Kubeflow deployment while testing?

It stays in the Running state for a few seconds while it retries and then restarts.

I think we fixed the cert issue for minikube and IBM Cloud with this PR
kubeflow/pipelines#7273

I'm not sure how EKS handles the v1 CertificateSigningRequest, maybe you can update the list of Permitted subjects?
https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers

@johnugeorge
Member

@surajkota It works for Nutanix K8s 1.21 as well.

@jbottum

jbottum commented Mar 9, 2022

@theadactyl per our discussion, here is the tracking issue for KF 1.5.

@kimwnasptd
Member Author

A status update, we are trying to get to the bottom of this alongside @akartsky and @surajkota.

Currently our main suspect is the K8s API Server on EKS, which doesn't populate status.certificate in the CertificateSigningRequest object created by the KFP cache-deployer script, even though the CSR is approved. See #2165.

We are looking into gathering more logs from the control plane to have a better overview. This seems to be specific to EKS. If we don't get to the bottom of it within the next 2 hours I'll cut the final release, and we'll be more than happy to include any necessary fixes in a KF 1.5.1 patch release.
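The symptom described above (an approved CSR whose `status.certificate` never gets filled in) can be illustrated with a bounded polling loop. This is only a sketch of the retry pattern, not the cache-deployer script itself. `wait_for_certificate` and the `fetch_csr` callable are hypothetical, and the CSR is represented as a plain dict rather than fetched from a cluster:

```python
import time
from typing import Any, Callable, Dict, Optional


def wait_for_certificate(
    fetch_csr: Callable[[], Dict[str, Any]],
    attempts: int = 10,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll a CSR until status.certificate is set, giving up after `attempts`.

    `fetch_csr` is a hypothetical callable returning the CSR as a dict.
    Returns the certificate string, or None if it never appears.
    """
    for attempt in range(1, attempts + 1):
        csr = fetch_csr()
        certificate = (csr.get("status") or {}).get("certificate")
        if certificate:
            return certificate
        if attempt < attempts:
            sleep(delay_seconds)
    return None


# Simulated EKS-like case: the CSR is approved, but no certificate is issued.
approved_without_cert = {
    "status": {"conditions": [{"type": "Approved", "status": "True"}]}
}
result = wait_for_certificate(
    lambda: approved_without_cert, attempts=10, sleep=lambda _: None
)
print(result)  # None: approval alone does not produce a certificate
```

The point of the sketch is that approval and issuance are separate steps: the loop can exhaust its attempts even though the `Approved` condition is present.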

@kimwnasptd
Member Author

We've gotten to the bottom of the issue. This is a problem with any K8s cluster that does not support using signerName: kubernetes.io/kubelet-serving in CertificateSigningRequests, and EKS is such a case.

I want to further understand the following first:

  1. What is the best practice around such certificates?
  2. Is it a problem to give a certificate, aimed to be used by kubelet, to the cache-deployer webhook?
  3. What is the long term solution and how quickly could it be implemented?

I'd like to first have answers to the above before pushing the release button. For this reason I'll be delaying the release by just one more day, to take a look with a clearer mind and to have answers to the above and a solid plan going forward.

We'll also add more technical details into #2165, which we'll at some point bring back to the KFP repo to discuss next steps.

cc @kubeflow/release-team

@surajkota
Contributor

surajkota commented Mar 10, 2022

Thanks @kimwnasptd for the summary. Adding more context:

Both PRs (kubeflow/pipelines#6668, kubeflow/pipelines#7273) are related. The cache-deployer-deployment pod is requesting a cert through a CSR with signerName kubernetes.io/kubelet-serving.

EKS only issues certificates for CSRs with signerName kubernetes.io/kubelet-serving for actual kubelets based on the information in the official K8s documentation:

kubernetes.io/kubelet-serving: signs serving certificates that are honored as a valid kubelet serving certificate by the API server, but has no other guarantees. Never auto-approved by kube-controller-manager.

It is not supported since it is not recommended in Kubernetes upstream and EKS believes allowing this is unsafe. Kubernetes recommends using the cert-manager controller instead, which is already being discussed here: kubeflow/pipelines#4695. IMO this is the right long-term fix.

But given the timeframe, I am not sure if it is feasible to complete this. Since this Kubeflow release does not aim to support 1.22, an alternative for this release is to revert both the PRs and use CSR v1beta1 API and signerName legacy-unknown. This would mean pipelines would only work in K8s 1.21 and below since kubernetes.io/legacy-unknown is not supported in stable v1 API of CSR and hence it will not work for K8s 1.22 and above. kubeflow/pipelines#4695 will need to be addressed for K8s 1.22 and above.

Another alternative is to release 1.5.1 with the right fix i.e. using cert manager if other distributions do not see this as an issue.
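The version constraint behind the revert option can be written down as a small check. This is a sketch resting on two facts from the thread and the upstream Kubernetes docs: `certificates.k8s.io/v1beta1` is no longer served from K8s 1.22, and `kubernetes.io/legacy-unknown` cannot be used with the `v1` API. The function name `csr_request_accepted` is hypothetical, and the check says nothing about whether a given cluster's signer will actually issue the certificate (the EKS case above):

```python
V1 = "certificates.k8s.io/v1"
V1BETA1 = "certificates.k8s.io/v1beta1"
LEGACY_UNKNOWN = "kubernetes.io/legacy-unknown"
KUBELET_SERVING = "kubernetes.io/kubelet-serving"

V1_AVAILABLE_FROM = 19    # CSR v1 API is available from Kubernetes 1.19
V1BETA1_REMOVED_IN = 22   # CSR v1beta1 API is no longer served from 1.22


def csr_request_accepted(api_version: str, signer_name: str, k8s_minor: int) -> bool:
    """Hypothetical check: would the API server accept this CSR shape at all?

    It does not model whether the signer then issues a certificate.
    """
    if api_version == V1BETA1:
        return k8s_minor < V1BETA1_REMOVED_IN
    if api_version == V1:
        return k8s_minor >= V1_AVAILABLE_FROM and signer_name != LEGACY_UNKNOWN
    return False


for minor in (20, 21, 22):
    print(
        minor,
        csr_request_accepted(V1BETA1, LEGACY_UNKNOWN, minor),
        csr_request_accepted(V1, KUBELET_SERVING, minor),
    )
```

Under these assumptions, the v1beta1 plus legacy-unknown combination stops being accepted at 1.22, which is why the revert would only cover K8s 1.21 and below.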

Please let us know your thoughts on this

Originally posted by @surajkota in #2165 (comment)

@kimwnasptd
Member Author

As mentioned in #2165 (comment) I'll move on with the KF 1.5.0 release now.

We'll be targeting a long-term fix for 1.5.1.

@juliusvonkohout
Member

/close

There has been no activity for a long time. Please reopen if necessary.

@google-oss-prow

@juliusvonkohout: Closing this issue.

In response to this:

/close

There has been no activity for a long time. Please reopen if necessary.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
