Create a basic example for using Katib #116
Comments
I think @jlewi was talking about the training script.
I think I'm looking for something higher level. I'd like instructions that walk people through using Katib. So the goal would be to show how customers can use Katib to do hyperparameter tuning for a simple problem (e.g. mnist). The algorithm could also be very simple (e.g. grid search). This example might refer to the pieces mentioned above. So my expectation is that the instructions would be something like the following
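To make the grid search idea from the comment above concrete, here is a minimal sketch in plain Python. It does not use Katib's API at all: `grid_search`, `toy_objective`, and the parameter names `learning_rate` and `batch_size` are hypothetical, and the objective is a stand-in for training a real model such as mnist and returning its validation accuracy.

```python
import itertools


def grid_search(objective, search_space):
    """Evaluate every combination in search_space and return the best one.

    search_space maps a parameter name to the list of values to try.
    objective takes a dict of parameters and returns a score (higher is better).
    """
    names = sorted(search_space)
    best_params, best_score = None, float("-inf")
    # itertools.product enumerates the full Cartesian grid of values.
    for values in itertools.product(*(search_space[n] for n in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    # Hypothetical stand-in for "train the model, return validation accuracy".
    # It peaks at learning_rate=0.01 and batch_size=64.
    return (-abs(params["learning_rate"] - 0.01)
            - abs(params["batch_size"] - 64) / 1000.0)


if __name__ == "__main__":
    space = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
    print(grid_search(toy_objective, space))
```

In a real setup each call to the objective would correspond to one training job on the cluster, and the tuning service would own the loop instead of a local script.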
@ankushagarwal Will investigate to see what we can do in the next sprint to integrate with TFJob and create an example.
https://github.com/kubeflow/kubeflow/pull/989/files Describes running A simple thing to do would be to turn that into a K8s job and include it as a prototype within the Katib package. An advantage of running it on the cluster is that we wouldn't have to port-forward any ports to make the client work.
The GKE demo is actually using our GitHub issue summarization example. It looks like the code for the model is being duplicated to that location.
* Add a parameter for the docker image containing the training code
* Rename the parameter to point to the vizier-core service.
Related to: kubeflow#116
…ue example.
* Some of the code is copied over from https://github.com/kubeflow/katib/tree/master/examples/GKEDemo
* I think it makes sense to centralize all the code in a single place.
* Update the controller program (git-issue-summarize-demo.go) so that we can specify the Docker image containing the training code.
* Create a ksonnet deployment for running the controller on the cluster.
* The HP tuning job isn't functional; here's an incomplete list of issues:
  * The training jobs launched fail because they don't have GCP credentials, so they can't download the data.
  * We don't actually extract and report metrics back to Katib.
Related to: kubeflow/katib#116
* The GKEDemo is using the GitHub summarization example; I think we should put all of the code for that demo in kubeflow/examples (see kubeflow/examples#161)
* The main code is the Katib HP controller git-issue-summarize-demo.go
* We don't need the manifests for deploying Katib because we can deploy Katib using the Kubeflow ksonnet package.
* The code in docker-image duplicates the code in kubeflow/examples, so we shouldn't need it.
Related to: kubeflow#116
…ue example. (#161)
* Some of the code is copied over from https://github.com/kubeflow/katib/tree/master/examples/GKEDemo
* I think it makes sense to centralize all the code in a single place.
* Update the controller program (git-issue-summarize-demo.go) so that we can specify the Docker image containing the training code.
* Create a ksonnet deployment for running the controller on the cluster.
* The HP tuning job isn't functional; here's an incomplete list of issues:
  * The training jobs launched fail because they don't have GCP credentials, so they can't download the data.
  * We don't actually extract and report metrics back to Katib.
Related to: kubeflow/katib#116
Now that we have https://github.com/kubeflow/examples/tree/master/github_issue_summarization/hp-tune, what's the TODO here?
@lluunn it doesn't actually work. See for example kubeflow/examples#163 and kubeflow/examples#162. The Go program you indicated doesn't successfully create the training jobs (I think the issue was not supplying GCP credentials as a volume), and it doesn't extract the metrics.
@jlewi
@wukong1992 Yes, currently Katib collects metrics from stdout.
Katib collects the full log of metrics.
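As a rough illustration of what collecting metrics from stdout can look like, here is a minimal sketch. It assumes the training script prints metrics as `name=value` tokens; that format, the `parse_metrics` helper, and the metric names are assumptions made for this example, not a description of Katib's actual collector.

```python
import re

# Assumed log format: whitespace-separated "name=value" tokens,
# e.g. "epoch 3 accuracy=0.91 loss=0.27".
METRIC_RE = re.compile(
    r"([A-Za-z][\w-]*)=([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_metrics(log_text, names):
    """Return every reported value, in log order, for each requested metric."""
    history = {name: [] for name in names}
    for line in log_text.splitlines():
        for name, value in METRIC_RE.findall(line):
            if name in history:
                history[name].append(float(value))
    return history


if __name__ == "__main__":
    log = (
        "epoch 1 accuracy=0.5 loss=1.2\n"
        "saving checkpoint\n"
        "epoch 2 accuracy=0.75 loss=0.8\n"
    )
    print(parse_metrics(log, ["accuracy", "loss"]))
```

Keeping the whole history rather than only the last value matches the idea of collecting the full log of metrics, and it lets the caller decide whether to use the final or the best value.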
Since we have a simple example of random, grid and hyperband, we can close this.
@YujiOshima: Closing this issue. In response to this:
Do we have a simple example that walks through using Katib with a simple model (e.g. mnist)?
I'd like a simple example that we can ask folks to follow to provide a friction log.
/cc @YujiOshima