Bootstrap services to deploy Kubeflow with Cognito #20
Conversation
overall LGTM
Would it make sense to include a script that reads the config and best-effort deletes the resources? That way customers could retry delete failures, or just clean up resources without incurring applicable billing costs.
@rrrkharse yes, cleanup can be a separate script. Let me change this in the next PR when I update the README, ok?
@goswamig @rrrkharse ready to merge. Fixed the bugs and separated the cleanup script.
LGTM
### Description of your changes:
- Tested and ported only the relevant files from the aws distro
- Ported Cognito-related source files from v1.3-branch; no changes except cleaning up unused manifests

Will port other changes in separate PRs, but move them all to the aws directory so it is easy to maintain in the future. This is the first step towards isolating all AWS-related changes.

### Testing
Manual. Followed the installation steps here - https://github.com/kubeflow/manifests/tree/v1.4-branch#kubeflow-components-versions - and used scripts from #20 for the rest. Created a profile, logged in to verify the user is detected, and created a notebook.
```yaml
route53:
  rootDomain:
    certARN: arn:aws:acm:us-east-1:123456789012:certificate/9d8c4bbc-3b02-4a48-8c7d-d91441c6e5af
    hostedZoneId: Z05345723JQZG48XL32CW
    name: xxxxx.xxx.xx.xxx
  subDomain:
    hostedZoneId: XXXXX
    name: platform5.xxxxx.xxx.xx.xxx
    us-east-1-certARN: arn:aws:acm:us-east-1:123456789012:certificate/373cc726-f525-4bc7-b7bf-d1d7b641c238
```
Are the values of all three subcomponents different in the root domain vs the subdomain?
@goswamig Yes
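As a side note, the nested config above can be read with PyYAML and plain dict access. The sketch below inlines a trimmed, hypothetical excerpt of `config.yaml` so it is runnable; the key names follow the config shown above, but the domain values are placeholders.

```python
import yaml  # PyYAML

# Hypothetical excerpt of config.yaml, inlined so the sketch is self-contained.
CONFIG = """
route53:
  rootDomain:
    hostedZoneId: Z05345723JQZG48XL32CW
    name: example.com
  subDomain:
    hostedZoneId: ZSUB12345
    name: platform5.example.com
"""

config = yaml.safe_load(CONFIG)
root = config["route53"]["rootDomain"]
sub = config["route53"]["subDomain"]

print(root["hostedZoneId"])  # -> Z05345723JQZG48XL32CW
print(sub["name"])           # -> platform5.example.com
```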
Looking forward to the README changes.
### Description of your changes:
All AWS-related changes are confined to the `distributions/aws` directory with this PR.

- RDS and S3 integration for pipelines:
  - ported the `aws` directory from v1.3-branch under `apps/pipeline/upstream/env` to `distributions/aws/apps/pipeline`
  - updated the base resources location in `distributions/aws/apps/pipeline/kustomization.yaml`
- RDS integration for Katib:
  - ported the `katib-external-db-with-kubeflow` directory from the v1.3-branch directory under `apps/katib/upstream/installs` to `distributions/aws/apps`
  - updated the base resources location in `distributions/aws/apps/katib-external-db-with-kubeflow/kustomization.yaml` and created a patches directory for the db-manager patch. The `db-manager.yaml` patch is a copy from `katib-external-db` because patches need to be under the kustomization directory
- DLC-based Jupyter notebooks:
  - created a `jupyter-web-app` directory under `distributions/aws/apps` and created an overlay with a spawner_ui_config using AWS DLC-based images for TensorFlow and PyTorch. The spawner_ui_config is a copy of the file `apps/jupyter/jupyter-web-app/upstream/base/configs/spawner_ui_config.yaml`, except for the notebook server image updates
- Changes from #20 ported to v1.4-branch. No changes here.

### Testing
Manual, WIP. Deleted the generic components installed in an existing 1.4 deployment and installed the AWS-modified components.
- [x] Pipelines: ran the sample data-passing pipeline; verified the databases and run details in RDS and the artifacts pushed to S3
- [x] Notebook: created a notebook server with the tf-cpu image and ran some commands
- [x] Katib: the `observation_logs` table was created by db-manager, so assuming the connection is good. A test run will be addressed as part of #30 #27
### Description of your changes:
As described in #14, there are several steps to prepare the resources before deploying Kubeflow with Cognito. Using the console to bootstrap these resources is time-consuming and error-prone. This PR aims to improve the UX of everything related to configuring the hosted zone, creating ACM certificates, and creating the Cognito user pool in sections 1.0-3.0 and section 6.0 of the README.

With this package:
1. Fill in `config.yaml`
1. Run `cognito_pre_deployment.py`, which will bootstrap all resources and update the resource names/ids/ARNs in `config.yaml`
1. Use the values in `config.yaml` to set up Ingress and the ALB controller, and deploy Kubeflow
1. Update `config.yaml` with the ALB DNS name and run `cognito_post_deployment.py` to update the hosted zone with the ALB address

I will update the README once these changes are merged, in a separate PR, to avoid iterating on that.
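The post-deployment step amounts to pointing the subdomain at the ALB in Route53. A minimal sketch of the kind of record update involved is below; the function name, record type (CNAME rather than an alias record), and TTL are illustrative assumptions, not the actual script's API. Only the request-building part is executed here; the boto3 call it would feed is shown in a comment.

```python
# Sketch: build a Route53 ChangeResourceRecordSets request that upserts a
# record pointing the Kubeflow subdomain at the ALB DNS name.
def build_change_batch(subdomain: str, alb_dns: str) -> dict:
    # Shape follows the Route53 ChangeResourceRecordSets API request.
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": subdomain,
                "Type": "CNAME",
                "TTL": 300,
                "ResourceRecords": [{"Value": alb_dns}],
            },
        }]
    }

batch = build_change_batch(
    "platform5.example.com",
    "k8s-alb-1234567890.us-east-1.elb.amazonaws.com",
)
# With boto3 this would be submitted as:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId=sub_hosted_zone_id, ChangeBatch=batch)
print(batch["Changes"][0]["Action"])  # -> UPSERT
```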
### Testing
Manual
- Follow sections 4 and 5 of the README to deploy Kubeflow
- Final config looks like this:
### Areas of improvement
Load the config values into a class instead of individual variables. This does not look possible out of the box using just pyyaml and dataclass, because of the nested fields (ideas welcome), and needs a more advanced implementation to serialize/deserialize the input. This needs further investigation.
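One possible direction for the nested-fields limitation: dataclasses will not recurse into nested dicts on their own, but an explicit `from_dict` classmethod per level works with just the stdlib. A sketch, with class and field names mirroring the config keys but trimmed to two fields for illustration (not the repository's implementation):

```python
from dataclasses import dataclass

@dataclass
class Domain:
    hostedZoneId: str
    name: str

@dataclass
class Route53:
    rootDomain: Domain
    subDomain: Domain

    @classmethod
    def from_dict(cls, d: dict) -> "Route53":
        # Recurse into nested sections explicitly; Domain(**...) alone
        # would not be applied by the dataclass machinery.
        return cls(
            rootDomain=Domain(**d["rootDomain"]),
            subDomain=Domain(**d["subDomain"]),
        )

# e.g. the dict produced by yaml.safe_load(...)["route53"]
raw = {
    "rootDomain": {"hostedZoneId": "ZROOT", "name": "example.com"},
    "subDomain": {"hostedZoneId": "ZSUB", "name": "platform.example.com"},
}
cfg = Route53.from_dict(raw)
print(cfg.subDomain.name)  # -> platform.example.com
```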
### Notes to reviewer
Why is the cleanup in case of error best effort?
During testing, I found that deleting the custom domain in Cognito removes the entry, but sometimes there is a delay before CloudFront disassociates from the certificate. I have increased the timeout to wait before deleting the certificate, but it may not always be sufficient. In that case, the logs will print the certificate ARN and call out that the deletion was skipped.
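The best-effort pattern described above can be sketched as a retry loop that gives up gracefully. This is an illustration, not the script's actual code; the attempt count and wait are placeholder values, and `delete_fn` stands in for the real ACM call (with boto3, something like `lambda arn: boto3.client("acm").delete_certificate(CertificateArn=arn)`).

```python
import time

def best_effort_delete(cert_arn: str, delete_fn,
                       attempts: int = 6, wait_seconds: float = 30.0) -> bool:
    """Try to delete a certificate, retrying while it is still in use.

    Returns True on success; logs the ARN and returns False if every attempt
    fails, so the caller can continue cleaning up the remaining resources.
    """
    for _ in range(attempts):
        try:
            delete_fn(cert_arn)
            return True
        except Exception:
            # With ACM this is typically a ResourceInUseException while the
            # CloudFront distribution is still detaching from the certificate.
            time.sleep(wait_seconds)
    print(f"Skipped deletion, certificate still in use: {cert_arn}")
    return False

# Demo with a fake delete that fails twice before succeeding.
calls = {"n": 0}
def fake_delete(arn):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("in use")

print(best_effort_delete("arn:aws:acm:...:certificate/abc",
                         fake_delete, wait_seconds=0))  # -> True
```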