Create a basic example for using Katib #116

jlewi · 2018-06-14T05:23:56Z

Do we have a simple example that walks through using Katib with a simple model (e.g. mnist)?

I'd like a simple example that we can ask folks to follow to provide a friction log.

lluunn · 2018-06-14T06:03:57Z

I was using https://github.com/kubeflow/katib/blob/master/examples/client-example.go

YujiOshima · 2018-06-14T06:21:08Z

I think @jlewi is asking about the training script.
The example @lluunn referred to uses the MXNet Python MNIST example.
But it is inconvenient for users to work out the expected input and log format.
It might be good to store an example training script in this repo.
@jlewi Is my understanding correct?

jlewi · 2018-06-14T16:41:55Z

I think I'm looking for something higher level. I'd like instructions that walk people through using Katib. So the goal would be to show how customers can use Katib to do hyperparameter tuning for a simple problem (e.g. mnist). The algorithm could also be very simple (e.g. grid search). This example might refer to the pieces mentioned above. So my expectation is that the instructions would be something like the following:

1. Deploy Kubeflow
2. Enable/Deploy Katib
3. Run grid search for sample problem (e.g. mnist)
4. Open the UI to browse the results.
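
To make the "run grid search" step concrete, here is a minimal sketch of what grid search over hyperparameters means. It is a generic illustration and does not use Katib's API; the parameter names (`learning_rate`, `num_layers`) and the `toy_objective` function are hypothetical stand-ins for a real training run.

```python
import itertools


def grid(search_space):
    """Yield one dict of parameter values per point on the grid."""
    names = sorted(search_space)
    for values in itertools.product(*(search_space[n] for n in names)):
        yield dict(zip(names, values))


def toy_objective(params):
    # Hypothetical stand-in for training a model and returning its loss.
    return (params["learning_rate"] - 0.01) ** 2 + 0.001 * abs(params["num_layers"] - 3)


# Hypothetical search space: every combination of these values is one trial.
search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "num_layers": [2, 3, 4],
}

trials = [(toy_objective(p), p) for p in grid(search_space)]
best_loss, best_params = min(trials, key=lambda t: t[0])
print(f"ran {len(trials)} trials, best params: {best_params}")
```

In a real setup each trial would be a training job launched on the cluster rather than a local function call.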

jlewi · 2018-06-22T21:32:35Z

@ankushagarwal will investigate what we can do in the next sprint to integrate with TFJob and create an example.

jlewi · 2018-06-29T12:06:35Z

https://github.com/kubeflow/kubeflow/pull/989/files

Describes running client-example.go.

A simple thing to do would be to turn that into a K8s job and include that as a prototype within the Katib package.

An advantage of running it on the cluster is that we wouldn't have to port-forward any ports to make the client work.

jlewi · 2018-06-29T22:55:53Z

The GKE demo is actually using our GitHub issue summarization example
https://github.com/kubeflow/katib/blob/master/examples/GKEDemo/GKE-Demo.md

Looks like the code for the model is actually being duplicated to that location.


…ue example.

* Some of the code is copied over from https://github.com/kubeflow/katib/tree/master/examples/GKEDemo
* I think it makes sense to centralize all the code in a single place.
* Update the controller program (git-issue-summarize-demo.go) so that we can specify the Docker image containing the training code.
* Create a ksonnet deployment for running the controller on the cluster.
* The HP tuning job isn't functional; here's an incomplete list of issues:
  * The training jobs launched fail because they don't have GCP credentials, so they can't download the data.
  * We don't actually extract and report metrics back to Katib.

Related to: kubeflow/katib#116

* The GKEDemo is using the GitHub summarization example; I think we should put all of the code for that demo in kubeflow/examples (see kubeflow/examples#161)
* The main code is the Katib HP controller git-issue-summarize-demo.go
* We don't need the manifests for deploying Katib because we can deploy Katib using the Kubeflow ksonnet package.
* The code in docker-image duplicates the code in kubeflow/examples, so we shouldn't need it.

Related to: kubeflow#116


lluunn · 2018-07-11T22:42:58Z

Now that we have https://github.com/kubeflow/examples/tree/master/github_issue_summarization/hp-tune, what's the TODO here?

jlewi · 2018-07-18T04:48:23Z

@lluunn it doesn't actually work.

See for example kubeflow/examples#163 and kubeflow/examples#162

The Go program you pointed to doesn't successfully create the training jobs (I think the issue was that GCP credentials weren't supplied as a volume), and it doesn't extract the metrics.

wukong1992 · 2018-08-02T11:58:27Z

@jlewi
Now I want to test other examples. When I write an example training model, I don't know how to define its log format, and Katib relies on the logs to get the accuracy.

YujiOshima · 2018-08-02T17:58:15Z

@wukong1992 Yes, currently Katib collects metrics from stdout.
Your training code should print logs in {metrics name}={value} style.
For example, when your objective value name is loss and the metrics are recall and precision, your training code should print something like this:

epoch 1:
loss=0.3
recall=0.5
precision=0.4

epoch 2:
loss=0.2
recall=0.55
precision=0.5

Katib will collect all of the logged metrics.
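
As a rough sketch of the format described above, the following Python shows a training loop emitting `{metrics name}={value}` lines to stdout, plus a small parser that reads them back. The parser is only an illustration of why one metric per line is easy to collect; it is not Katib's actual collector, and the names `format_metrics`, `parse_metrics`, and `METRIC_LINE` are hypothetical.

```python
import re

# One metric per line: name=value, where value is a plain number.
METRIC_LINE = re.compile(
    r"^\s*([A-Za-z_][\w.-]*)\s*=\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$"
)


def format_metrics(epoch, metrics):
    """Render one epoch's metrics in name=value style, one per line."""
    lines = [f"epoch {epoch}:"]
    lines += [f"{name}={value}" for name, value in metrics.items()]
    return "\n".join(lines)


def parse_metrics(log_text, names):
    """Collect every logged value for the requested metric names."""
    collected = {name: [] for name in names}
    for line in log_text.splitlines():
        match = METRIC_LINE.match(line)
        if match and match.group(1) in collected:
            collected[match.group(1)].append(float(match.group(2)))
    return collected


# Training side: print the metrics to stdout after each epoch.
log = "\n\n".join([
    format_metrics(1, {"loss": 0.3, "recall": 0.5, "precision": 0.4}),
    format_metrics(2, {"loss": 0.2, "recall": 0.55, "precision": 0.5}),
])
print(log)

# Collector side (illustrative only): recover the values from the log text.
parsed = parse_metrics(log, ["loss", "recall", "precision"])
```

Lines such as `epoch 1:` contain no `=` and are skipped by the parser, so free-form progress messages can be mixed in with the metric lines.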

YujiOshima · 2018-09-17T13:47:36Z

Since we now have simple examples of random, grid, and hyperband search, we can close this.
/close

k8s-ci-robot · 2018-09-17T13:47:37Z

@YujiOshima: Closing this issue.


jlewi added priority/p1 area/0.3.0 labels Jun 14, 2018

jlewi added the sprint/2018-06-25-to-07-06 label Jun 22, 2018

jlewi removed the sprint/2018-06-25-to-07-06 label Jun 22, 2018

jlewi added the sprint/2018-07-09-to-07-20 label Jun 29, 2018

jlewi added a commit to jlewi/katib that referenced this issue Jun 30, 2018

Create a Docker container to run the GH example.

55dc3a6

* Add a parameter for the docker image containing the training code
* Rename the parameter to point to the vizier-core service.

Related to: kubeflow#116

jlewi mentioned this issue Jun 30, 2018

Create a Docker container to run the GH example. #130

Closed

jlewi mentioned this issue Jun 30, 2018

Create a deployment to run the HP/Katib controller for the GitHub issue example kubeflow/examples#161

Merged

jlewi mentioned this issue Jun 30, 2018

Move the GKEDemo into kubeflow/examples #135

Merged

jlewi mentioned this issue Aug 6, 2018

Katib docs need to explain how to do a hyperparameter search kubeflow/website#128

Closed

k8s-ci-robot closed this as completed Sep 17, 2018
