Setup Kubeflow

In this part, you will set up Kubeflow on Google Kubernetes Engine.

Requirements

  • A GCP project with sufficient GPU quota to train the model with the desired GPU type, e.g., K80, P100, or V100
  • The kubectl CLI (command line interface) pointing to the Kubernetes cluster
    • Make sure that you can run kubectl get nodes from your terminal successfully
  • The ksonnet CLI, v0.9.2 or higher: ks
    • To install a particular version of ksonnet, run:

      export KS_VER=ks_0.11.0_linux_amd64
      wget -O /tmp/$KS_VER.tar.gz https://github.com/ksonnet/ksonnet/releases/download/v0.11.0/$KS_VER.tar.gz
      mkdir -p ${HOME}/bin
      tar -xvf /tmp/$KS_VER.tar.gz -C ${HOME}/bin
      export PATH=$PATH:${HOME}/bin/$KS_VER
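
Before moving on, it can help to confirm that both CLIs are reachable from your shell. The following is a minimal sketch and not part of the original guide; the helper name check_cli is hypothetical.

```shell
# Hypothetical helper: succeeds if the named command is on PATH.
check_cli() {
  command -v "$1" >/dev/null 2>&1
}

# Report the status of each CLI this guide relies on.
for tool in kubectl ks; do
  if check_cli "$tool"; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done
```

If either tool is reported as missing, revisit the requirements above before continuing.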
      

Kubeflow setup

Refer to the user guide for detailed instructions on how to set up Kubeflow on your Kubernetes cluster. Specifically, complete the following sections:

  • Deploy Kubeflow on Google Kubernetes Engine
    • The zone(s) you choose must offer the GPU types you specify, e.g., us-central1-b for NVIDIA V100
    • If you run into API rate limiting errors, ensure you have a ${GITHUB_TOKEN} environment variable set.
    • If you run into RBAC permissions issues running ks apply commands, be sure you have created a cluster-admin ClusterRoleBinding for your username.
  • Setup a persistent disk using Google Filestore
    • We need a shared persistent disk to store our trained model as containers' filesystems are ephemeral.
    • For this example, provision a 10GB cluster-wide shared NFS mount with the name kubeflow-gcfs.
    • Enable the component in the Kubeflow cluster with
    ks apply default -c google-cloud-filestore-pv
    
  • Enable the Seldon component to serve our PyTorch models
    • Install the seldon package and generate the core component as per the instructions
    • Enable the component in the Kubeflow cluster with
    ks apply default -c seldon
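
For the RBAC permissions issue mentioned above, one common approach is to bind your own account to the built-in cluster-admin role with kubectl create clusterrolebinding. The sketch below is an illustration rather than a step from the user guide: the function name grant_cluster_admin and the binding name default-admin are arbitrary choices, and the usage comment assumes the gcloud CLI is installed and authenticated.

```shell
# Hypothetical helper: bind a user to the built-in cluster-admin ClusterRole.
# The binding name "default-admin" is an arbitrary, illustrative choice.
grant_cluster_admin() {
  kubectl create clusterrolebinding default-admin \
    --clusterrole=cluster-admin \
    --user="$1"
}

# Example (assumes gcloud is installed and authenticated):
#   grant_cluster_admin "$(gcloud config get-value account)"
```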
    

After completing these steps, you should have the following ready:

  • A ksonnet app in a directory named ks_app
  • Output similar to the following from the kubectl get pods command:
ambassador-7fb86f6bc5-8xj5h                      3/3       Running     0          13m
ambassador-7fb86f6bc5-h2hmt                      3/3       Running     0          13m
ambassador-7fb86f6bc5-tz98s                      3/3       Running     0          13m
argo-ui-7b6585d85d-mmxbb                         1/1       Running     0          13m
centraldashboard-f8d7d97fb-5g76z                 1/1       Running     0          13m
cert-manager-798d77c76-wflcb                     1/1       Running     0          12m
cloud-endpoints-controller-b55c586b6-7pdnn       1/1       Running     0          12m
cm-acme-http-solver-8lcvn                        1/1       Running     0          11m
envoy-79ff8d86b-4dgg7                            2/2       Running     2          12m
envoy-79ff8d86b-t6mv8                            2/2       Running     2          12m
envoy-79ff8d86b-xbfdj                            2/2       Running     2          12m
iap-enabler-6565b855d6-j2rt6                     1/1       Running     0          12m
ingress-bootstrap-kfzmg                          1/1       Running     0          12m
kube-metacontroller-6d567f9fb4-6g5sr             1/1       Running     0          12m
modeldb-backend-69dfc464df-h9lt7                 1/1       Running     0          12m
modeldb-db-6cf5bb764-qcxc7                       1/1       Running     0          12m
modeldb-frontend-795bcf6df9-8smn8                1/1       Running     0          12m
pytorch-operator-76f89745c-2qtt9                 1/1       Running     0          12m
seldon-redis-6cfc779655-lvphr                    1/1       Running     0          6m
seldon-seldon-cluster-manager-54c7794dcb-jdq6p   1/1       Running     0          6m
set-gcfs-permissions-w65j9                       0/1       Completed   0          18s
spartakus-volunteer-77446c94b5-7f5vf             1/1       Running     0          12m
studyjob-controller-68f5948984-c66g5             1/1       Running     0          12m
tf-hub-0                                         1/1       Running     0          13m
tf-job-dashboard-7cddcdf9c4-bqwbg                1/1       Running     0          13m
tf-job-operator-v1alpha2-6566f45db-v78tp         1/1       Running     0          13m
vizier-core-6d8c9d7bf7-chkkc                     1/1       Running     1          12m
vizier-db-cc59bc8bd-bgvp6                        1/1       Running     0          12m
vizier-suggestion-grid-76fb6b49c7-c22ss          1/1       Running     0          12m
vizier-suggestion-random-c5c64dfc9-svqc8         1/1       Running     0          12m
whoami-app-b7fb9f875-hchzc                       1/1       Running     0          12m
workflow-controller-59c7967f59-df424             1/1       Running     0          13m
  • A Seldon component to serve our models
  • A PyTorch operator to train our models
  • A 10GB Filestore instance volume "kubeflow-gcfs"
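
With this many pods, scanning the listing by eye is error-prone. As a rough sketch (not part of the original guide), the hypothetical helper not_ready_pods below filters a pod listing down to the pods whose status is neither Running nor Completed. It assumes the listing has no header row, which kubectl produces with the --no-headers flag.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print the name of every pod whose STATUS column (field 3) is neither
# Running nor Completed.
not_ready_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Example, run against the current namespace:
#   kubectl get pods --no-headers | not_ready_pods
```

An empty result suggests that every pod has either started or finished successfully.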

Summary

  • We deployed a Kubernetes cluster with Deployment Manager using the kfctl.sh script
  • We created a ksonnet app for our Kubeflow deployment
  • We deployed the Kubeflow components to our Kubernetes cluster
  • We created a persistent disk using Google Filestore for storing our trained model

Next: Training the model