In this part, you will set up Kubeflow on Google Kubernetes Engine (GKE). Make sure you have the following prerequisites in place (a quick verification sketch follows this list):
- A GCP project with sufficient GPU quota to train the model with the desired GPU, e.g., K80, P100, V100...
- The `kubectl` CLI (command line interface) pointing to the Kubernetes cluster. Make sure that you can run `kubectl get nodes` from your terminal successfully.
- The ksonnet CLI, v0.9.2 or higher: `ks`. In case you want to install a particular version of ksonnet, you can run:

      export KS_VER=ks_0.11.0_linux_amd64
      wget -O /tmp/$KS_VER.tar.gz https://github.com/ksonnet/ksonnet/releases/download/v0.11.0/$KS_VER.tar.gz
      mkdir -p ${HOME}/bin
      tar -xvf /tmp/$KS_VER.tar.gz -C ${HOME}/bin
      export PATH=$PATH:${HOME}/bin/$KS_VER
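A quick way to sanity-check these prerequisites from your terminal is sketched below. This is a minimal sketch rather than part of the official instructions; the region `us-central1` is only an example, so point it at whichever region will host your cluster.

```bash
# Verify cluster access and the ksonnet CLI version.
kubectl get nodes        # should list your GKE nodes without errors
ks version               # should report v0.9.2 or higher

# GPU quota is granted per region; 'us-central1' below is just an example.
gcloud compute regions describe us-central1 | grep -i nvidia
```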
Refer to the user guide for detailed instructions on how to set up Kubeflow on your Kubernetes cluster. Specifically, complete the following sections:
- Deploy Kubeflow on Google Kubernetes Engine
  - Zone(s) must have the GPU types you specify, e.g., us-central1-b for NVIDIA V100
  - If you run into API rate limiting errors, ensure you have a `${GITHUB_TOKEN}` environment variable set.
  - If you run into RBAC permissions issues running `ks apply` commands, be sure you have created a `cluster-admin` ClusterRoleBinding for your username. Both fixes are sketched right after this section.
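Both troubleshooting items above reduce to a couple of shell commands. A minimal sketch, assuming you deploy with your gcloud account identity; the binding name `kf-admin` and the token placeholder are arbitrary:

```bash
# Avoid GitHub API rate limiting when ksonnet fetches registries.
export GITHUB_TOKEN=<your-github-personal-access-token>

# Grant your own identity cluster-admin so 'ks apply' can create RBAC-protected resources.
kubectl create clusterrolebinding kf-admin \
  --clusterrole=cluster-admin \
  --user=$(gcloud config get-value account)
```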
- Set up a persistent disk using Google Filestore
  - We need a shared persistent disk to store our trained model, since containers' filesystems are ephemeral.
  - For this example, provision a 10GB cluster-wide shared NFS mount with the name `kubeflow-gcfs`.
  - Enable the component in the Kubeflow cluster with `ks apply default -c google-cloud-filestore-pv` (see the sketch after this section).
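For reference, the Filestore-backed volume is managed through ksonnet. The sketch below is hedged: the component name `google-cloud-filestore-pv` comes from the instructions above, but the individual parameter names are assumptions, so inspect them with `ks param list` before setting anything.

```bash
# See which parameters the component actually exposes.
ks param list google-cloud-filestore-pv

# Hypothetical parameter names -- substitute whatever 'ks param list' reports.
# ks param set google-cloud-filestore-pv name kubeflow-gcfs
# ks param set google-cloud-filestore-pv storageCapacity 10

# Deploy the volume and check that the claim binds.
ks apply default -c google-cloud-filestore-pv
kubectl get pv,pvc
```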
- Enable the Seldon component to serve our PyTorch models
  - Install the seldon package and generate the core component as per the instructions.
  - Enable the component in the Kubeflow cluster with `ks apply default -c seldon` (see the sketch after this section).
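In ksonnet terms, the Seldon steps look roughly like the sketch below. The package and prototype names follow the Kubeflow instructions linked above; double-check them against the docs for your Kubeflow version before running.

```bash
# Install the Seldon ksonnet package from the Kubeflow registry.
ks pkg install kubeflow/seldon

# Generate the core Seldon component, then deploy it.
ks generate seldon seldon
ks apply default -c seldon
```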
After completing these steps, you should have the following ready:
- A ksonnet app in a directory named `ks_app`
- Output similar to the following from the `kubectl get pods` command:
NAME READY STATUS RESTARTS AGE
ambassador-7fb86f6bc5-8xj5h 3/3 Running 0 13m
ambassador-7fb86f6bc5-h2hmt 3/3 Running 0 13m
ambassador-7fb86f6bc5-tz98s 3/3 Running 0 13m
argo-ui-7b6585d85d-mmxbb 1/1 Running 0 13m
centraldashboard-f8d7d97fb-5g76z 1/1 Running 0 13m
cert-manager-798d77c76-wflcb 1/1 Running 0 12m
cloud-endpoints-controller-b55c586b6-7pdnn 1/1 Running 0 12m
cm-acme-http-solver-8lcvn 1/1 Running 0 11m
envoy-79ff8d86b-4dgg7 2/2 Running 2 12m
envoy-79ff8d86b-t6mv8 2/2 Running 2 12m
envoy-79ff8d86b-xbfdj 2/2 Running 2 12m
iap-enabler-6565b855d6-j2rt6 1/1 Running 0 12m
ingress-bootstrap-kfzmg 1/1 Running 0 12m
kube-metacontroller-6d567f9fb4-6g5sr 1/1 Running 0 12m
modeldb-backend-69dfc464df-h9lt7 1/1 Running 0 12m
modeldb-db-6cf5bb764-qcxc7 1/1 Running 0 12m
modeldb-frontend-795bcf6df9-8smn8 1/1 Running 0 12m
pytorch-operator-76f89745c-2qtt9 1/1 Running 0 12m
seldon-redis-6cfc779655-lvphr 1/1 Running 0 6m
seldon-seldon-cluster-manager-54c7794dcb-jdq6p 1/1 Running 0 6m
set-gcfs-permissions-w65j9 0/1 Completed 0 18s
spartakus-volunteer-77446c94b5-7f5vf 1/1 Running 0 12m
studyjob-controller-68f5948984-c66g5 1/1 Running 0 12m
tf-hub-0 1/1 Running 0 13m
tf-job-dashboard-7cddcdf9c4-bqwbg 1/1 Running 0 13m
tf-job-operator-v1alpha2-6566f45db-v78tp 1/1 Running 0 13m
vizier-core-6d8c9d7bf7-chkkc 1/1 Running 1 12m
vizier-db-cc59bc8bd-bgvp6 1/1 Running 0 12m
vizier-suggestion-grid-76fb6b49c7-c22ss 1/1 Running 0 12m
vizier-suggestion-random-c5c64dfc9-svqc8 1/1 Running 0 12m
whoami-app-b7fb9f875-hchzc 1/1 Running 0 12m
workflow-controller-59c7967f59-df424 1/1 Running 0 13m
- A Seldon component to serve our models
- A PyTorch operator to train our models
- A 10GB cluster-wide volume named `kubeflow-gcfs`, backed by Google Filestore
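If you want to confirm those pieces from the command line, here is a minimal sketch; the grep patterns simply match the pod and volume names shown above, and you may need to add a `-n <namespace>` flag if you deployed Kubeflow into a dedicated namespace.

```bash
# Seldon and the PyTorch operator should be Running.
kubectl get pods | grep -E 'seldon|pytorch-operator'

# The shared Filestore-backed claim should be Bound.
kubectl get pvc | grep kubeflow-gcfs
```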
In summary:
- We deployed a Kubernetes cluster with Deployment Manager using the kfctl.sh script
- We created a ksonnet app for our Kubeflow deployment
- We deployed the Kubeflow components to our Kubernetes cluster
- We created a persistent disk using Google Filestore for storing our trained model
Next: Training the model