Create a basic example for using Katib #116

Closed
jlewi opened this issue Jun 14, 2018 · 12 comments

Comments

@jlewi
Contributor

jlewi commented Jun 14, 2018

Do we have a simple example that walks through using Katib with a simple model (e.g. mnist)?

I'd like a simple example that we can ask folks to follow to provide a friction log.

/cc @YujiOshima

@lluunn
Contributor

lluunn commented Jun 14, 2018

@YujiOshima
Contributor

I think @jlewi is asking about the training script.
In the example @lluunn referred to, we use the MXNet Python MNIST example.
But it is inconvenient for users to find the input and log format there.
It may be good to store an example training script in this repo.
@jlewi Is my understanding correct?

@jlewi
Contributor Author

jlewi commented Jun 14, 2018

I think I'm looking for something higher level. I'd like instructions that walk people through using Katib. The goal would be to show how customers can use Katib to do hyperparameter tuning for a simple problem (e.g. MNIST). The algorithm could also be very simple (e.g. grid search). This example might refer to the pieces mentioned above. I expect the instructions would look something like the following:

  1. Deploy Kubeflow
  2. Enable/Deploy Katib
  3. Run grid search for sample problem (e.g. mnist)
  4. Open the UI to browse the results.
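As a rough illustration of what step 3 involves, here is a minimal Python sketch of how a grid search enumerates trials. It is not Katib's implementation; the search space, the parameter names (`learning_rate`, `num_layers`), and the flag format are assumptions made up for this example.

```python
import itertools


def grid(search_space):
    """Yield one dict per combination of the listed hyperparameter values."""
    names = sorted(search_space)
    for values in itertools.product(*(search_space[n] for n in names)):
        yield dict(zip(names, values))


def to_flags(params):
    """Render a trial as command-line flags for a (hypothetical) training script."""
    return " ".join(f"--{name}={value}" for name, value in params.items())


# Hypothetical search space for an MNIST model; names and values are illustrative.
SEARCH_SPACE = {
    "learning_rate": [0.01, 0.05, 0.1],
    "num_layers": [2, 3],
}

trials = list(grid(SEARCH_SPACE))
for index, params in enumerate(trials):
    # In a real setup each trial would be launched as a training job.
    print(f"trial {index}: {to_flags(params)}")
```

Each trial corresponds to one training run; the tuning system then compares the objective value reported by each run.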

@jlewi
Contributor Author

jlewi commented Jun 22, 2018

@ankushagarwal will investigate what we can do in the next sprint to integrate with TFJob and create an example.

@jlewi
Contributor Author

jlewi commented Jun 29, 2018

https://github.com/kubeflow/kubeflow/pull/989/files

Describes running client-example.go.

A simple thing to do would be to turn that into a K8s job and include that as a prototype within the Katib package.

An advantage of running it on the cluster is that we don't have to port-forward any ports to make the client work.
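A minimal sketch of what wrapping the client in a K8s job could look like, written as Python that emits a `batch/v1` Job manifest as JSON. The image name, the job name, the `MANAGER_ADDR` environment variable, and the `vizier-core:6789` address are all assumptions for illustration; the actual client may be configured differently.

```python
import json


def client_job_manifest(image, manager_addr, name="katib-client-example"):
    """Build a batch/v1 Job that runs the client once inside the cluster."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "client",
                            "image": image,
                            # Hypothetical variable: how the client learns the
                            # in-cluster service address is an assumption here.
                            "env": [{"name": "MANAGER_ADDR", "value": manager_addr}],
                        }
                    ],
                }
            },
        },
    }


manifest = client_job_manifest(
    image="example.com/katib/client-example:latest",  # hypothetical image
    manager_addr="vizier-core:6789",  # assumed service name and port
)
print(json.dumps(manifest, indent=2))
```

Because the job runs inside the cluster, it can reach the service by its cluster DNS name, which is why no port-forwarding is needed. The printed JSON can be piped to `kubectl apply -f -`.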

@jlewi
Contributor Author

jlewi commented Jun 29, 2018

The GKE demo is actually using our GitHub issue summarization example
https://github.com/kubeflow/katib/blob/master/examples/GKEDemo/GKE-Demo.md

Looks like the code for the model is actually being duplicated to that location.

jlewi added a commit to jlewi/katib that referenced this issue Jun 30, 2018
* Add a parameter for the docker image containing the training code
* Rename the parameter to point to the vizier-core service.

Related to: kubeflow#116
jlewi added a commit to jlewi/examples that referenced this issue Jun 30, 2018
…ue example.

* Some of the code is copied over from https://github.com/kubeflow/katib/tree/master/examples/GKEDemo

  * I think it makes sense to centralize all the code in a single place.

* Update the controller program (git-issue-summarize-demo.go) so that it can
  specify the Docker image containing the training code.

* Create a ksonnet deployment for running the controller on the cluster.

* The HP tuning job isn't functional; here's an incomplete list of issues:

  * The training jobs launched fail because they don't have GCP credentials
    so they can't download the data.

  * We don't actually extract and report metrics back to Katib.

Related to: kubeflow/katib#116
jlewi added a commit to jlewi/katib that referenced this issue Jun 30, 2018
* The GKEDemo is using the GitHub summarization example; I think we should
  put all of the code for that demo in kubeflow/examples
  (see kubeflow/examples#161)

* The main code is the Katib HP controller git-issue-summarize-demo.go

* We don't need the manifests for deploying katib because we can deploy
  Katib using the Kubeflow ksonnet package.

* The code in docker-image duplicates the code in kubeflow/examples so we
  shouldn't need it.

Related to: kubeflow#116
k8s-ci-robot pushed a commit that referenced this issue Jul 1, 2018
Related to: #116
k8s-ci-robot pushed a commit to kubeflow/examples that referenced this issue Jul 11, 2018
…ue example. (#161)

Related to: kubeflow/katib#116
@lluunn
Contributor

lluunn commented Jul 11, 2018

@jlewi
Contributor Author

jlewi commented Jul 18, 2018

@lluunn it doesn't actually work.

See for example kubeflow/examples#163 and kubeflow/examples#162

The Go program you indicated doesn't successfully create the training jobs (I think the issue was not supplying GCP credentials as a volume), and it doesn't extract the metrics.

@wukong1992
Contributor

@jlewi
Now I want to test other examples. When I write an example training model, I don't know how to define its log format, and Katib relies on the logs to get the accuracy.

@YujiOshima
Contributor

@wukong1992 Yes, currently Katib collects metrics from stdout.
You should print logs in {metrics name}={value} style.
For example, when your objective value name is loss and the metrics are recall and precision, your training code should print output like this:

epoch 1:
loss=0.3
recall=0.5
precision=0.4

epoch 2:
loss=0.2
recall=0.55
precision=0.5

Katib will collect all of the logged metrics.
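To make the format concrete, here is a small Python sketch that prints metrics in the {metrics name}={value} style and then parses them back out of the log text. The parser only illustrates the idea of collecting metrics from stdout; it is not Katib's collector, and the helper names `format_metrics` and `parse_metrics` are made up for this example.

```python
import re

# One "name=value" pair per line, where value is a plain or scientific-notation number.
METRIC_LINE = re.compile(
    r"^\s*([A-Za-z_][\w.-]*)\s*=\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$"
)


def format_metrics(epoch, metrics):
    """Render one epoch of metrics, one name=value pair per line."""
    lines = [f"epoch {epoch}:"]
    lines += [f"{name}={value}" for name, value in metrics.items()]
    return "\n".join(lines)


def parse_metrics(log_text, names):
    """Collect every logged value for the requested metric names, in order."""
    collected = {name: [] for name in names}
    for line in log_text.splitlines():
        match = METRIC_LINE.match(line)
        if match and match.group(1) in collected:
            collected[match.group(1)].append(float(match.group(2)))
    return collected


# What a training script would write to stdout (values taken from the example above).
log = "\n".join(
    [
        format_metrics(1, {"loss": 0.3, "recall": 0.5, "precision": 0.4}),
        format_metrics(2, {"loss": 0.2, "recall": 0.55, "precision": 0.5}),
    ]
)
print(log)

history = parse_metrics(log, ["loss", "recall", "precision"])
```

Lines such as `epoch 1:` contain no `=` and are ignored by the parser, so extra progress output does not interfere with the metric lines.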

@YujiOshima
Contributor

Since we now have simple examples of random, grid, and hyperband search, we can close this.
/close

@k8s-ci-robot

@YujiOshima: Closing this issue.

