
Early Stopping Implementation #1344

Merged: 33 commits, Oct 30, 2020

Conversation

andreyvelich
Member

Related: #1330.

I started Early Stopping implementation with API changes.
@gaocegege @johnugeorge Please let me know what do you think about these APIs.

I will work on the controller changes in this PR also.

@kubeflow-bot

This change is Reviewable

@andreyvelich
Member Author

@gaocegege @johnugeorge
I finished the first step of the EarlyStopping implementation. It works for the provided example:
[screenshot from 2020-10-09 showing the working example]

I still need to add/fix unit tests and verify some e2e tests. In the meantime, can you start reviewing it, please?
I am also blocked by #1333, which would help me test Tekton and MPI Jobs.

A few notes:

  1. I made several changes in the manager and controller APIs. Please take a look and let me know your thoughts.
  2. For the Algorithm service I chose port: 6789, portName: suggestion-api. For the EarlyStopping service I chose port: 6788, portName: earlystop-api.
  3. My suggestion is to take the Early Stopping data (image, resources, etc.) from the katib-config. We should do that in a follow-up PR. Currently, I use temporary data.
  4. I didn't add the median stopping rule in this PR. We can submit a follow-up PR with that implementation. Currently, it simply stops Trials with accuracy < 0.8 and Epoch = 9. Both conditions must be met.
  5. As mentioned in Early stopping implementation #1330 (comment), I attached RBAC to the Suggestion deployment so the Early Stopping service can change the Trial status.
  6. I use filters from the metricsFormat field to parse the early stopping rule metrics. You can check it here.
  7. I made a few changes in the metrics collector command. Currently, the training command looks like this:
python3 /opt/mxnet-mnist/mnist.py --num-epochs=10 1>/var/log/katib/metrics.log 2>&1 || 
      if test -f /var/log/katib/$$$$.pid && [ "$(head -n 1 /var/log/katib/$$$$.pid)" = early-stopped ]; then 
        echo Training Container was Early Stopped; 
      else 
        echo Training Container was Failed; 
        exit 1;
      fi && echo completed > /var/log/katib/$$$$.pid

If the Python command fails, it checks that the 7.pid file exists (if the main process PID = 7) and that it contains the early-stopped line.
If yes, the Pod completes. Otherwise, the Pod fails with exit code 1.
This approach uses only one .pid file to monitor the current status (early-stopped or completed).

So, the metrics collector workflow (with main process PID = 7) is:

  1. Wait until all stop rules are reached.
  2. Create the 7.pid file with the early-stopped line (after that, && echo completed > /var/log/katib/$$$$.pid runs).
  3. Kill the child process of PID 7 (only a single child process is supported).
  4. Report metrics to the DB.
  5. Dial the EarlyStopping service with a SetTrialStatus request to update the Trial status.
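A rough Go sketch of this five-step sequence (the function name and callback shape are illustrative; the real sidecar does the pid-file write, process kill, DB report, and gRPC call inline):

```go
package main

import "time"

// runEarlyStopping sketches the workflow above. Each callback stands in for
// real sidecar logic; all names here are assumptions for illustration.
func runEarlyStopping(
	rulesReached func() bool,
	markStopped, killChild, reportMetrics, setTrialStatus func() error,
) error {
	// 1. Wait until all stop rules are reached.
	for !rulesReached() {
		time.Sleep(time.Second)
	}
	// 2-5. Mark the trial early-stopped, kill the training child process,
	// report metrics to the DB, then update the Trial status via the service.
	for _, step := range []func() error{markStopped, killChild, reportMetrics, setTrialStatus} {
		if err := step(); err != nil {
			return err
		}
	}
	return nil
}
```

The ordering matters: the marker file must exist before the child is killed, so the wrapper command sees `early-stopped` when the training process exits.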

Let me know what you think.

@andreyvelich
Member Author

/retest

1 similar comment
@andreyvelich
Member Author

/retest

@andreyvelich
Member Author

/retest

@andreyvelich andreyvelich changed the title [WIP] Early Stopping Implementation Early Stopping Implementation Oct 14, 2020
@andreyvelich
Member Author

/retest

@andreyvelich
Member Author

I have tested Early Stopping on several CRDs.

  1. Kubernetes Job works as expected.
  2. Tekton and MPIJob work as well. The controller reconciles early stopped pods properly.
  3. For PyTorchJob, once a Trial is early stopped, the Master pod completes, but the Workers go into CrashLoopBackOff status because they lose access to the Master.
    However, the PyTorchJob is in Succeeded status and the Katib controller reconciles Trials properly. We can investigate this issue later and modify the PyTorch training container.
    Also, using retain: false in the trialTemplate fixes this problem, since the controller cleans up resources once a Trial is completed.

I removed the WIP status, please take a look @gaocegege @johnugeorge.

/cc @terrytangyuan

@andreyvelich
Member Author

/retest

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gaocegege
Member

Generally LGTM

@andreyvelich
Member Author

andreyvelich commented Oct 27, 2020

After discussing with @johnugeorge, we decided to make a few changes in the APIs. To be consistent between the AutoML Algorithm and Early Stopping, the Experiment and Suggestion are defined as:

type ExperimentSpec struct {
	Algorithm     *common.AlgorithmSpec     `json:"algorithm,omitempty"`
	EarlyStopping *common.EarlyStoppingSpec `json:"earlyStopping,omitempty"`
}

type SuggestionSpec struct {
	Algorithm     *common.AlgorithmSpec     `json:"algorithm"`
	EarlyStopping *common.EarlyStoppingSpec `json:"earlyStopping,omitempty"`
}

type AlgorithmSpec struct {
	AlgorithmName     string             `json:"algorithmName,omitempty"`
	AlgorithmSettings []AlgorithmSetting `json:"algorithmSettings,omitempty"`
}

type AlgorithmSetting struct {
	Name  string `json:"name,omitempty"`
	Value string `json:"value,omitempty"`
}

type EarlyStoppingSpec struct {
	AlgorithmName     string                 `json:"algorithmName,omitempty"`
	AlgorithmSettings []EarlyStoppingSetting `json:"algorithmSettings,omitempty"`
}

type EarlyStoppingSetting struct {
	Name  string `json:"name,omitempty"`
	Value string `json:"value,omitempty"`
}

The original AutoML algorithm settings are located in Suggestion.Spec.Algorithm.AlgorithmSettings; the modified AutoML algorithm settings are located in Suggestion.Status.AlgorithmSettings.
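For illustration, here is a trimmed, self-contained version of these types being filled in (JSON tags and the common package are dropped; "random", "medianstop", and "min_trials_required" are example values, not this PR's defaults):

```go
package main

// Trimmed local copies of the API types, just to show how the fields fit together.
type AlgorithmSetting struct{ Name, Value string }

type AlgorithmSpec struct {
	AlgorithmName     string
	AlgorithmSettings []AlgorithmSetting
}

type EarlyStoppingSetting struct{ Name, Value string }

type EarlyStoppingSpec struct {
	AlgorithmName     string
	AlgorithmSettings []EarlyStoppingSetting
}

type ExperimentSpec struct {
	Algorithm     *AlgorithmSpec
	EarlyStopping *EarlyStoppingSpec
}

// newExampleSpec builds an ExperimentSpec with both an AutoML algorithm and an
// early stopping algorithm configured; the values are illustrative.
func newExampleSpec() ExperimentSpec {
	return ExperimentSpec{
		Algorithm: &AlgorithmSpec{AlgorithmName: "random"},
		EarlyStopping: &EarlyStoppingSpec{
			AlgorithmName: "medianstop",
			AlgorithmSettings: []EarlyStoppingSetting{
				{Name: "min_trials_required", Value: "3"},
			},
		},
	}
}
```

Keeping the two specs shape-identical but with distinct setting types lets each service evolve its settings independently.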

@gaocegege @johnugeorge What do you think about these APIs?

@gaocegege
Member

LGTM. The new API is clean.

@andreyvelich
Member Author

I finished the Median stop implementation, @gaocegege @johnugeorge. Can you take a look, please?

I made a few changes according to the paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46180.pdf.

  1. Added a start_step parameter as the fourth parameter in the -stop-rule flag. The flag looks like this:
    -stop-rule accuracy;0.8;less;4. We start to apply the early stopping rule only after 4 metrics have been reported. (The start step is an optional parameter; without it, we apply the early stopping rule from the first reported metric.)
  2. min_trials_required setting - the number of succeeded Trials required before calculating the median; start_step - the number of reported metrics required before applying the rule.
  3. Send the ObjectiveType to the sidecar in the -o-type flag. According to the paper, we should compare the best objective value at step S. So for the Objective metric I calculate only the best objective value and apply the early stopping rule. Later we can add the ability to compare the latest, min, or max value if needed.
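To make the flag format and the median comparison concrete, here is a minimal sketch (the struct, function names, and the comparison details are assumptions; the actual flag parsing and the medianstop service may differ):

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// StopRule mirrors the "-stop-rule metric;value;comparison;start_step" format
// described above. The type and parser are illustrative, not the real code.
type StopRule struct {
	Metric     string
	Value      float64
	Comparison string // e.g. "less"
	StartStep  int    // 0 = apply the rule from the first reported metric
}

func parseStopRule(s string) (StopRule, error) {
	parts := strings.Split(s, ";")
	if len(parts) != 3 && len(parts) != 4 {
		return StopRule{}, fmt.Errorf("invalid stop rule %q", s)
	}
	value, err := strconv.ParseFloat(parts[1], 64)
	if err != nil {
		return StopRule{}, err
	}
	rule := StopRule{Metric: parts[0], Value: value, Comparison: parts[2]}
	if len(parts) == 4 {
		if rule.StartStep, err = strconv.Atoi(parts[3]); err != nil {
			return StopRule{}, err
		}
	}
	return rule, nil
}

// shouldStop sketches the median rule idea from the paper: stop a running
// trial if its best objective value so far is worse than the median of the
// completed trials' objective values at the same step.
func shouldStop(bestSoFar float64, completed []float64, maximize bool) bool {
	if len(completed) == 0 {
		return false
	}
	vals := append([]float64(nil), completed...)
	sort.Float64s(vals)
	median := vals[len(vals)/2]
	if maximize {
		return bestSoFar < median
	}
	return bestSoFar > median
}
```

Using the best value so far (rather than the latest) makes the check robust to noisy per-step metrics, which is why the paper compares at the best objective value.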

@aws-kf-ci-bot
Contributor

aws-kf-ci-bot commented Oct 28, 2020

@andreyvelich: The following test failed, say /retest to rerun all failed tests:

Test name: kubeflow-katib-presubmit
Commit: 6df9863
Rerun command: /test kubeflow-katib-presubmit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@andreyvelich
Member Author

@johnugeorge @gaocegege If you don't have any objections to this PR, can you give your lgtm so we can make the other changes in follow-up PRs?

@gaocegege
Member

/lgtm

Tried it yesterday.

@gaocegege
Member

Thanks for your contribution! 🎉 👍
