Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bootstrap servcies to deploy Kubeflow with Cognito #20

Merged
merged 4 commits into from
Nov 23, 2021

Conversation

surajkota
Copy link
Contributor

@surajkota surajkota commented Nov 17, 2021

Description of your changes:

As described in #14, there are several steps to prep the resources before deploying kubeflow with Cognito. Using console to bootstrap these resources is time taking and error prone. This PR aims to enhance the UX of all things related to configuring hosted zone, creating Acm Certificates and Cognito userpool in section 1.0-3.0 and section 6.0 of the README.

With this package:

  1. User will input the entries such as desired subdomain name, deployment region, userpool name in config.yaml.
    1. (Optional) root domain name and hosted zone id to configure the subdomain NS record and create a certificate, if its hosted in route53
  2. (Section 1-3) Run the cognito_pre_deployment.py which will boootstrap all resources and update the resource names/ids/arn in the config.
    1. Script cleans up all the resources in case of any error in creation (best effort)
  3. (Section 4-5) Use the values in config.yaml to setup Ingress, ALB controller and deploy kubeflow
  4. (Section 6) Update the config.yaml with ALB DNS name and run the cognito_post_deployment.py to update the hosted zone with ALB address
  5. (Section 7) Create users in Cognito and corresponding profiles in Kubeflow
    1. Login to dashboard

I will update the readme once these changes are merged in a separate PR to avoid iterating on that.

Testing

Manual

pip install -r distributions/aws/infra_configs/scripts/requirements.txt
# update config.yaml
cd distributions/aws/infra_configs
PYTHONPATH=.. python scripts/cognito_pre_deployment.py

Follow section 4 and 5 of readme to deploy Kubeflow

PYTHONPATH=.. python scripts/cognito_post_deployment.py

Final config looks like this:

cognitoUserpool:
  ARN: arn:aws:cognito-idp:us-east-1:123456789012:userpool/us-east-1_yasI9dbxF
  appClientId: 5jmk7ljl2a74jk3n0a0fvj3l31
  domain: xxxxxxxxxx.cloudfront.net
  name: kf-userpool
kubeflow:
  ALBDNS: 72d70454-istiosystem-istio-2aa2-1234567890.us-east-1.elb.amazonaws.com
  region: us-east-1
route53:
  rootDomain:
    certARN: arn:aws:acm:us-east-1:123456789012:certificate/9d8c4bbc-3b02-4a48-8c7d-d91441c6e5af
    hostedZoneId: Z05345723JQZG48XL32CW
    name: xxxxx.xxx.xx.xxx
  subDomain:
    hostedZoneId: XXXXX
    name: platform5.xxxxx.xxx.xx.xxx
    us-east-1-certARN: arn:aws:acm:us-east-1:123456789012:certificate/373cc726-f525-4bc7-b7bf-d1d7b641c238

Areas of improvement

Load the config values to a class instead of individual variables. This does not look possible out of the box just using pyyaml and dataclass because of nested fields (ideas welcome) and needs a more advanced implementation to serialize/deserialize the input. This needs further investigation

Notes to reviewer

Why is the clean up in case of error best effort?
During testing, I found that deleting the custom domain in Coginto removes the entry but sometimes there is delay in the cloudfront disassociating from the certificate. I have increased the timeout to wait for deleting the certificate but it may not be sufficient all the time. In this case, the logs will print the certificate arn and call out that deletion was skipped

@surajkota surajkota self-assigned this Nov 17, 2021
Copy link
Contributor

@rrrkharse rrrkharse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall LGTM

would it make sense to include a script that reads the config and best effort deletes the resources? this way customers could retry delete failures or just clean up resources without incurring applicable billing costs

@surajkota surajkota force-pushed the v1.3-cognito-automation-v1 branch 2 times, most recently from 1bab715 to 3331c28 Compare November 18, 2021 20:58
@surajkota
Copy link
Contributor Author

@rrrkharse yes, clean up can be a separate script. Let me change this in the next PR when I update the README, ok?

@surajkota
Copy link
Contributor Author

@goswamig @rrrkharse ready to merge. fixed the bugs and separated the cleanup script

Copy link
Contributor

@rrrkharse rrrkharse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@surajkota
Copy link
Contributor Author

#26

surajkota added a commit that referenced this pull request Nov 23, 2021
### Description of your changes:
- Tested and ported only relevant files from aws distro
- Ported cognito related source files from v1.3-branch, no changes except clean up unused manifests

Will port others changes in separate PRs but move all them to aws directory so its easy to maintain in future. This will be first step towards isolating all aws related changes.

### Testing
Manual
Followed Installation steps here - https://github.com/kubeflow/manifests/tree/v1.4-branch#kubeflow-components-versions
and used scripts from #20 for the rest
Created profile, logged in to verify user is detected and created a notebook
Copy link
Member

@goswamig goswamig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

route53:
  rootDomain:
    certARN: arn:aws:acm:us-east-1:123456789012:certificate/9d8c4bbc-3b02-4a48-8c7d-d91441c6e5af
    hostedZoneId: Z05345723JQZG48XL32CW
    name: xxxxx.xxx.xx.xxx
  subDomain:
    hostedZoneId: XXXXX
    name: platform5.xxxxx.xxx.xx.xxx
    us-east-1-certARN: arn:aws:acm:us-east-1:123456789012:certificate/373cc726-f525-4bc7-b7bf-d1d7b641c238

Are the value of all three subcomponents difference in root domain vs subdomain ?

@surajkota
Copy link
Contributor Author

route53:
  rootDomain:
    certARN: arn:aws:acm:us-east-1:123456789012:certificate/9d8c4bbc-3b02-4a48-8c7d-d91441c6e5af
    hostedZoneId: Z05345723JQZG48XL32CW
    name: xxxxx.xxx.xx.xxx
  subDomain:
    hostedZoneId: XXXXX
    name: platform5.xxxxx.xxx.xx.xxx
    us-east-1-certARN: arn:aws:acm:us-east-1:123456789012:certificate/373cc726-f525-4bc7-b7bf-d1d7b641c238

Are the value of all three subcomponents difference in root domain vs subdomain ?

@goswamig Yes

Copy link
Member

@goswamig goswamig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking forward for README changes.

@surajkota surajkota merged commit 554cced into awslabs:v1.3-branch Nov 23, 2021
@surajkota surajkota mentioned this pull request Nov 30, 2021
3 tasks
surajkota added a commit that referenced this pull request Dec 2, 2021
### Description of your changes:

All AWS related changes are confined to `distributions/aws` directory with this PR

- RDS and S3 integration for pipelines: 
  - ported `aws` directory from v1.3-branch under `apps/pipeline/upstream/env` to `distributions/aws/apps/pipeline`
  - updated the base resources location in `distributions/aws/apps/pipeline/kustomization.yaml`
- RDS integration for Katib: 
   - ported `katib-external-db-with-kubeflow` directory from v1.3-branch directory under `apps/katib/upstream/installs` to `distributions/aws/apps`
   - updated base resources location in `distributions/aws/apps/katib-external-db-with-kubeflow/kustomization.yaml` and created a patches directory for db-manager patch. `db-manager.yaml` patch is a copy from `katib-external-db` because patches need to be under kustomization directory
- DLC based jupyter notebooks
  - created `jupyter-web-app` directory under `distributions/aws/apps` and created an overlay with spawner_ui_config with AWS DLC based images for tensorflow and pytorch. spawner_ui_config is a copy of the file `apps/jupyter/jupyter-web-app/upstream/base/configs/spawner_ui_config.yaml` except for notebook server images updates
- Changes from #20 to v1.4-branch. No changes here
 
### Testing
Manual, WIP
Deleted generic components installed in an existing 1.4 deployment and installed aws modified components
- [x] Pipelines: Ran sample data passing pipeline, verified databases and run detail in RDS and artifacts pushed in S3
- [x] Notebook: created notebook server with tf-cpu image and ran some commands
- [x] Katib: obseravtion_logs table was created by db-manager so assuming the connection is good. test run will be addressed as part of #30 

#27
surajkota added a commit that referenced this pull request Dec 8, 2021
### Description of your changes
- Add readme for scripts created in #20 
- Made some minor changes to the README and scripts based on user experience while testing

### Testing
Followed the readme to deploy Kubeflow with Cognito

#26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants