Distributions readiness for KF 1.5 #2146
We urge everyone to start their testing from the latest Regarding Arrikto's plans for the KF 1.5 release, we are targeting to also have our products ready for the deadline. But even if we don't manage, we will still be testing these following weeks and reporting bugs. |
And also |
Hello @kimwnasptd , which |
Created a tracking issue for the AWS distribution work: awslabs/kubeflow-manifests#91. We are targeting getting generic/vanilla Kubeflow (i.e. as-is from this repository) working on EKS as part of the distribution testing phase. Other features and a release will follow. |
Hello Kimonas, I would like to provide an update which requires changes to manifests as we are validating the Google Cloud distribution.
I think it is related to kserve/kserve#568 (comment). My testing environment is GKE v1.20.12. The error message is:
Do you know how to resolve the last issue? @kimwnasptd @andreyvelich |
@kimwnasptd Checking in here from AWS. I attempted a vanilla installation into a fresh EKS cluster on Kubernetes version 1.19, installing the manifests using the single-line command. I cannot connect via port-forwarding, and it looks like the issue is down to the
When changing the image in cache-deployer back to 1.5.0 from 1.8.0, it works properly again. I saw you had run into the same issue. Do you know how to resolve it? kubeflow/pipelines#7093 (comment)
|
After discussion and debugging, we found that issues 2 and 3 in #2146 (comment) occurred because I deployed KFServing and KServe together. My current suggestion is to deploy only one of them (KFServing) until we figure out how to migrate to KServe successfully and validate it using an updated mnist E2E script. |
@zijianjoy @ryansteakley thank you very much for exposing your progress!
I'll provide some instructions for this very soon, on how someone will be able to use the KServe app. I'll also make this the default app that will be used by the dashboard, but there are some rough edges right now. I'll create the issues accordingly and give a heads up here again.
I'll also make a PR to update our manifests with this latest RC.
I haven't bumped into this while testing the manifests. It's also not clear to me yet why this error happened now, since we had the same KFServing 0.6.1 manifests from the KF 1.4 release. In any case, thank you James for providing instructions for handling it. I'll look more into it.
I hadn't bumped into this either. It looks like the InferenceService fails to get a |
@ryansteakley regarding #2146 (comment) can you double check you are using the That RC includes KFP 1.8.0, which in turn includes the fix for the |
@kimwnasptd Yes, I'm checking out the v1.5.0-rc.1 tag of the manifests to test the vanilla kubeflow on EKS 1.19 using 1.20 kubectl locally. |
@kimwnasptd Google Cloud distribution is ready. 🚀 |
@kimwnasptd IBM IKS is ready for k8s 1.21. |
@kimwnasptd The Kubeflow Azure distribution has remained at v1.2 for two years, while v1.5 is coming up soon. Do you know if there is any plan to release the Azure distribution with a more recent version? Who is the point of contact/representative? I believe this is blocking Azure users from using Kubeflow. v1.2 is on old k8s and Istio versions, and v1.4 has no clear documentation for Azure, so it is quite hard to make it run on k8s 1.20+, which are the only versions AKS supports. |
A heads up, we've cut the new RC of the manifests. I've added a more detailed explanation in #2112 (comment) |
@pwzhong unfortunately I don't have any more insights on this. We've tried to reach out to the maintainers throughout the releases, but we didn't get any feedback. |
@kimwnasptd Nutanix Karbon is ready for k8s 1.21. Tested with latest RC - v1.5.0-rc.2 |
Update from the AWS side, Status: GREEN. Given the timeframe of testing, we have tested 1.5.0-rc2 and will continue testing. Manually tested that current kubeflow/manifests master works with EKS 1.20. Originally posted by @akartsky in awslabs/kubeflow-manifests#91 (comment) |
For IBM IKS, I re-ran all test cases using v1.5.0-rc.2. Everything is good on k8s 1.21. We are using KServe. KFServing is not verified. |
From AWS, Status: GREEN. Successfully tested EKS 1.19, 1.20 & 1.21 and ran the mnist-e2e test from this PR: #2164. AWS Release Tracker: awslabs/kubeflow-manifests#91 |
From AWS, Status: RED. I just noticed an issue with the cache-deployer Pod, even with the latest master and rc2. I did not notice it before because this pod stays in the Running state for a few seconds before going into the crash loop. |
@yhwang @johnugeorge @zijianjoy have you checked the cache-deployer-deployment pod in your Kubeflow deployment while testing? It stays in the Running state for a few seconds while it retries and then restarts. |
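A pod that briefly returns to Running between restarts is easy to miss in a quick look at the pod list. Below is a minimal Python sketch of one way to flag it, assuming the JSON printed by `kubectl get pods -n kubeflow -o json` is passed in as text. The function name `find_restarting_pods`, the `kubeflow` namespace, and the sample pod data are illustrative assumptions, not output from any of the clusters discussed in this thread.

```python
import json


def find_restarting_pods(pod_list_json, min_restarts=1):
    """Return (pod, container, restart count, waiting reason) tuples.

    Expects the JSON text of a Kubernetes PodList, for example from
    `kubectl get pods -n kubeflow -o json` (namespace is an assumption).
    """
    flagged = []
    for pod in json.loads(pod_list_json).get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "")
        for cs in pod.get("status", {}).get("containerStatuses", []):
            restarts = cs.get("restartCount", 0)
            # state.waiting.reason is e.g. "CrashLoopBackOff" while backing
            # off; it is absent while the container is briefly running again,
            # so the restart count is checked as well.
            reason = cs.get("state", {}).get("waiting", {}).get("reason")
            if restarts >= min_restarts or reason == "CrashLoopBackOff":
                flagged.append((pod_name, cs.get("name", ""), restarts, reason))
    return flagged


# Made-up sample data for illustration only.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"name": "cache-deployer-deployment-abc"},
            "status": {"containerStatuses": [
                {"name": "main", "restartCount": 7,
                 "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            ]},
        },
        {
            "metadata": {"name": "ml-pipeline-xyz"},
            "status": {"containerStatuses": [
                {"name": "ml-pipeline-api-server", "restartCount": 0,
                 "state": {"running": {"startedAt": "2022-03-01T00:00:00Z"}}},
            ]},
        },
    ]
})

flagged = find_restarting_pods(SAMPLE)
print(flagged)
```

Checking the restart count rather than only the current phase is the point of the sketch: a container that is momentarily running again still carries a non-zero `restartCount`.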
@surajkota thanks for this report, if this is a reproducible bug, then I would consider it a P1, which could block the release. I am trying to understand the context, i.e. is this caching for Pipelines, https://www.kubeflow.org/docs/components/pipelines/overview/caching/. @kimwnasptd I believe you were going to test today with RC2 + final fixes. Have you been able to reproduce the referenced issue? |
@surajkota for IBM IKS, I don't see that issue and the caching function works properly. I do have a test case to verify caching and it works well. |
I think we fixed the cert issue for minikube and IBM Cloud with this PR. I'm not sure how EKS handles the v1 CertificateSigningRequest, maybe you can update the list of |
@surajkota It works for Nutanix K8s 1.21 as well. |
@theadactyl per our discussion, here is the tracking issue for KF 1.5, |
A status update: we are trying to get to the bottom of this alongside @akartsky and @surajkota. Currently our main suspect is the K8s API Server on EKS, that can't create the certificate in We are looking into gathering more logs from the control plane to have a better overview. This seems to be specific to EKS. If we don't get to the bottom of it within the next 2 hours, I'll cut the final release, and we'll be more than happy to include any necessary fixes in a KF 1.5.1 patch release. |
We've gotten to the bottom of the issue. This is a problem with any K8s cluster that does not support using I want to further understand the following first:
I'd like to have an answer to the above before pushing the release button. For this reason I'll be delaying the release by just one more day, to look at it with a clearer mind, get answers to the above, and have a solid plan going forward. We'll also add more technical details to #2165, which we'll at some point bring back to the KFP repo to discuss next steps. cc @kubeflow/release-team |
Thanks @kimwnasptd for the summary. Adding more context: both PRs (kubeflow/pipelines#6668, kubeflow/pipelines#7273) are related. EKS only issues certificates for CSRs with signerName kubernetes.io/kubelet-serving: signs serving certificates that are honored as a valid kubelet serving certificate by the API server, but has no other guarantees. Never auto-approved by kube-controller-manager. It is not supported since it is not recommended in Kubernetes upstream and EKS believes allowing this is unsafe. Kubernetes recommends using the cert-manager controller instead, which is already being discussed here: kubeflow/pipelines#4695. IMO this is the right long-term fix. But given the timeframe, I am not sure if it is feasible to complete this. Since this Kubeflow release does not aim to support 1.22, an alternative for this release is to revert both PRs and use CSR Another alternative is to release 1.5.1 with the right fix, i.e. using cert-manager, if other distributions do not see this as an issue. Please let us know your thoughts on this. Originally posted by @surajkota in #2165 (comment) |
As mentioned in #2165 (comment), I'll move on with the KF 1.5.0 release now. We'll be targeting a long-term fix for 1.5.1. |
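For readers unfamiliar with the field being discussed, here is a minimal Python sketch of the shape of a `certificates.k8s.io/v1` CertificateSigningRequest that names the `kubernetes.io/kubelet-serving` signer. In the v1 API, `spec.signerName` and `spec.usages` are required. The helper name `build_csr_manifest`, the object name, and the placeholder PEM are hypothetical; this does not reproduce what the KFP cache deployer actually submits.

```python
import base64
import json

KUBELET_SERVING_SIGNER = "kubernetes.io/kubelet-serving"


def build_csr_manifest(name, csr_pem, signer_name=KUBELET_SERVING_SIGNER):
    """Build a certificates.k8s.io/v1 CertificateSigningRequest as a dict."""
    return {
        "apiVersion": "certificates.k8s.io/v1",
        "kind": "CertificateSigningRequest",
        "metadata": {"name": name},
        "spec": {
            # spec.request carries the PEM-encoded PKCS#10 CSR, base64-encoded.
            "request": base64.b64encode(csr_pem).decode("ascii"),
            # Required in v1: the signer that is asked to issue the certificate.
            "signerName": signer_name,
            # Key usages for a serving certificate. Note that the
            # kubelet-serving signer also restricts the CSR subject; the
            # upstream Kubernetes docs list the exact rules.
            "usages": ["digital signature", "key encipherment", "server auth"],
        },
    }


# Placeholder bytes only; a real request would be generated with a tool
# such as openssl or cfssl.
PLACEHOLDER_PEM = (
    b"-----BEGIN CERTIFICATE REQUEST-----\n"
    b"...\n"
    b"-----END CERTIFICATE REQUEST-----\n"
)

manifest = build_csr_manifest("example-csr", PLACEHOLDER_PEM)
print(json.dumps(manifest, indent=2))
```

Whether a cluster then signs such a request depends on which signers its control plane supports, which is the difference between distributions described above.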
/close There has been no activity for a long time. Please reopen if necessary. |
@juliusvonkohout: Closing this issue. In response to this:
Prev issue #2038
Distribution testing phase - handbook
The goal of this issue is to track the progress of distributions alongside the 1.5 release and to coordinate our communications. The first goal is to expose here any issues we bump into, so that all distros can keep an eye on issues that arise.
While we hope all distros will manage to be ready when the KF 1.5 release is out, this is sometimes impossible to achieve. In this issue we want to keep track of both the progress of distributions towards the KF 1.5 release and which distros will keep working on KF 1.5 even if they can't meet the KF 1.5 deadline.
Without further ado, here's the list of distros we have in mind:
So let's use this issue to expose our state while testing the KF 1.5 release, and also to give a heads up to users about the progress of distros with the KF 1.5 release.