
Early Stopping Implementation #1344

Merged: 33 commits, Oct 30, 2020

Conversation

andreyvelich
Member

Related: #1330.

I started Early Stopping implementation with API changes.
@gaocegege @johnugeorge Please let me know what do you think about these APIs.

I will work on the controller changes in this PR also.

@kubeflow-bot

This change is Reviewable

@andreyvelich
Member Author

@gaocegege @johnugeorge
I finished the first step of the EarlyStopping implementation. It works for the provided example:
[screenshot from 2020-10-09 showing the working example]

I still need to add/fix unit tests and verify some e2e tests. In the meantime, can you start reviewing it, please?
I am also blocked by #1333, which would help me test Tekton and MPI Jobs.

A few notes:

  1. I made several changes in the manager and controller APIs. Please take a look and let me know your thoughts.
  2. For the Algorithm service I chose port: 6789, portName: suggestion-api. For the EarlyStopping service I chose port: 6788, portName: earlystop-api.
  3. My suggestion is to take the Early Stopping data (image, resources, etc.) from the katib-config. We should do that in a follow-up PR. Currently, I use temporary data.
  4. I didn't add the median stopping rule in this PR. We can submit a follow-up PR with that implementation. Currently, it simply stops Trials with accuracy < 0.8 and Epoch = 9. Both conditions must be met.
  5. As mentioned in Early stopping implementation #1330 (comment), I attached RBAC to the Suggestion deployment so the Early Stopping service can change the Trial status.
  6. I use filters from the metricsFormat field to parse the early stopping rule metrics. You can check it here.
  7. I made a few changes in the metrics collector command. Currently, the training command looks like this:
python3 /opt/mxnet-mnist/mnist.py --num-epochs=10 1>/var/log/katib/metrics.log 2>&1 || 
      if test -f /var/log/katib/$$$$.pid && [ "$(head -n 1 /var/log/katib/$$$$.pid)" = early-stopped ]; then 
        echo Training Container was Early Stopped; 
      else 
        echo Training Container was Failed; 
        exit 1;
      fi && echo completed > /var/log/katib/$$$$.pid

If the Python command fails, it checks that the 7.pid file exists (if the main process PID = 7) and that it contains the early-stopped line.
If yes, the Pod completes. Otherwise, the Pod fails with exit code 1.
This approach uses only one .pid file to monitor the current status (early-stopped or completed).

So, the metrics collector workflow (with main process PID = 7) is:

  1. Wait until all stop rules are reached.
  2. Create the 7.pid file with the early-stopped line (after that, && echo completed > /var/log/katib/$$$$.pid runs).
  3. Kill the child process of PID 7 (only a single child process is supported).
  4. Report metrics to the DB.
  5. Dial the EarlyStopping service with a SetTrialStatus request to update the Trial status.
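A rough Go sketch of this five-step sequence (the function name and callback shape are illustrative; the real sidecar does the pid-file write, process kill, DB report, and gRPC call inline):

```go
package main

import "time"

// runEarlyStopping sketches the workflow above. Each callback stands in for
// real sidecar logic; all names here are assumptions for illustration.
func runEarlyStopping(
	rulesReached func() bool,
	markStopped, killChild, reportMetrics, setTrialStatus func() error,
) error {
	// 1. Wait until all stop rules are reached.
	for !rulesReached() {
		time.Sleep(time.Second)
	}
	// 2-5. Mark the trial early-stopped, kill the training child process,
	// report metrics to the DB, then update the Trial status via the service.
	for _, step := range []func() error{markStopped, killChild, reportMetrics, setTrialStatus} {
		if err := step(); err != nil {
			return err
		}
	}
	return nil
}
```

The ordering matters: the marker file must exist before the child is killed, so the wrapper command sees `early-stopped` when the training process exits.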

Let me know what you think.

@andreyvelich
Member Author

/retest

1 similar comment
@andreyvelich
Member Author

/retest

@andreyvelich
Member Author

/retest

@andreyvelich andreyvelich changed the title [WIP] Early Stopping Implementation Early Stopping Implementation Oct 14, 2020
@andreyvelich
Member Author

/retest

@andreyvelich
Member Author

I have tested Early Stopping on several CRDs.

  1. Kubernetes Job works as expected.
  2. Tekton and MPIJob work as well. The controller reconciles early stopped pods properly.
  3. For PyTorchJob, once a Trial is early stopped, the Master pod completes, but the Workers go into CrashLoopBackOff status because they lose access to the Master.
    However, the PyTorchJob is in Succeeded status and the Katib controller reconciles Trials properly. We can investigate this issue later and modify the PyTorch training container.
    Also, using retain: false in the trialTemplate fixes this problem, since the controller cleans up resources once a Trial is completed.

I removed the WIP status, please take a look @gaocegege @johnugeorge.

/cc @terrytangyuan

@andreyvelich
Member Author

/retest

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gaocegege
Member

Generally LGTM

@andreyvelich
Member Author

andreyvelich commented Oct 27, 2020

After discussing with @johnugeorge, we decided to make a few changes in the APIs. To be consistent between the AutoML Algorithm and Early Stopping, the Experiment and Suggestion are defined as:

type ExperimentSpec struct {
	Algorithm     *common.AlgorithmSpec     `json:"algorithm,omitempty"`
	EarlyStopping *common.EarlyStoppingSpec `json:"earlyStopping,omitempty"`
}

type SuggestionSpec struct {
	Algorithm     *common.AlgorithmSpec     `json:"algorithm"`
	EarlyStopping *common.EarlyStoppingSpec `json:"earlyStopping,omitempty"`
}

type AlgorithmSpec struct {
	AlgorithmName     string             `json:"algorithmName,omitempty"`
	AlgorithmSettings []AlgorithmSetting `json:"algorithmSettings,omitempty"`
}

type AlgorithmSetting struct {
	Name  string `json:"name,omitempty"`
	Value string `json:"value,omitempty"`
}

type EarlyStoppingSpec struct {
	AlgorithmName     string                 `json:"algorithmName,omitempty"`
	AlgorithmSettings []EarlyStoppingSetting `json:"algorithmSettings,omitempty"`
}

type EarlyStoppingSetting struct {
	Name  string `json:"name,omitempty"`
	Value string `json:"value,omitempty"`
}

The original AutoML algorithm settings are located in Suggestion.Spec.Algorithm.AlgorithmSettings; the modified AutoML algorithm settings are located in Suggestion.Status.AlgorithmSettings.
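For illustration, here is a trimmed, self-contained version of these types being filled in (JSON tags and the common package are dropped; "random", "medianstop", and "min_trials_required" are example values, not this PR's defaults):

```go
package main

// Trimmed local copies of the API types, just to show how the fields fit together.
type AlgorithmSetting struct{ Name, Value string }

type AlgorithmSpec struct {
	AlgorithmName     string
	AlgorithmSettings []AlgorithmSetting
}

type EarlyStoppingSetting struct{ Name, Value string }

type EarlyStoppingSpec struct {
	AlgorithmName     string
	AlgorithmSettings []EarlyStoppingSetting
}

type ExperimentSpec struct {
	Algorithm     *AlgorithmSpec
	EarlyStopping *EarlyStoppingSpec
}

// newExampleSpec builds an ExperimentSpec with both an AutoML algorithm and an
// early stopping algorithm configured; the values are illustrative.
func newExampleSpec() ExperimentSpec {
	return ExperimentSpec{
		Algorithm: &AlgorithmSpec{AlgorithmName: "random"},
		EarlyStopping: &EarlyStoppingSpec{
			AlgorithmName: "medianstop",
			AlgorithmSettings: []EarlyStoppingSetting{
				{Name: "min_trials_required", Value: "3"},
			},
		},
	}
}
```

Keeping the two specs shape-identical but with distinct setting types lets each service evolve its settings independently.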

@gaocegege @johnugeorge What do you think about these APIs?

@gaocegege
Member

LGTM. The new API is clean.

@andreyvelich
Member Author

I finished the Median stop implementation, @gaocegege @johnugeorge. Can you take a look, please?

I made a few changes according to the paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46180.pdf.

  1. Added a start_step parameter as the fourth parameter in the -stop-rule flag. The flag looks like this:
    -stop-rule accuracy;0.8;less;4. We start to apply the early stopping rule only after 4 metrics have been reported. (The start step is an optional parameter; without it, we apply the early stopping rule from the first reported metric.)
  2. min_trials_required setting - the number of succeeded Trials required before calculating the median; start_step - the number of reported metrics required before applying the rule.
  3. Send the ObjectiveType to the sidecar in the -o-type flag. According to the paper, we should compare the best objective value at step S. So for the Objective metric I calculate only the best objective value and apply the early stopping rule. Later we can add the ability to compare the latest, min, or max value if needed.
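To make the flag format and the median comparison concrete, here is a minimal sketch (the struct, function names, and the comparison details are assumptions; the actual flag parsing and the medianstop service may differ):

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// StopRule mirrors the "-stop-rule metric;value;comparison;start_step" format
// described above. The type and parser are illustrative, not the real code.
type StopRule struct {
	Metric     string
	Value      float64
	Comparison string // e.g. "less"
	StartStep  int    // 0 = apply the rule from the first reported metric
}

func parseStopRule(s string) (StopRule, error) {
	parts := strings.Split(s, ";")
	if len(parts) != 3 && len(parts) != 4 {
		return StopRule{}, fmt.Errorf("invalid stop rule %q", s)
	}
	value, err := strconv.ParseFloat(parts[1], 64)
	if err != nil {
		return StopRule{}, err
	}
	rule := StopRule{Metric: parts[0], Value: value, Comparison: parts[2]}
	if len(parts) == 4 {
		if rule.StartStep, err = strconv.Atoi(parts[3]); err != nil {
			return StopRule{}, err
		}
	}
	return rule, nil
}

// shouldStop sketches the median rule idea from the paper: stop a running
// trial if its best objective value so far is worse than the median of the
// completed trials' objective values at the same step.
func shouldStop(bestSoFar float64, completed []float64, maximize bool) bool {
	if len(completed) == 0 {
		return false
	}
	vals := append([]float64(nil), completed...)
	sort.Float64s(vals)
	median := vals[len(vals)/2]
	if maximize {
		return bestSoFar < median
	}
	return bestSoFar > median
}
```

Using the best value so far (rather than the latest) makes the check robust to noisy per-step metrics, which is why the paper compares at the best objective value.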

@aws-kf-ci-bot
Contributor

aws-kf-ci-bot commented Oct 28, 2020

@andreyvelich: The following test failed, say /retest to rerun all failed tests:

Test name: kubeflow-katib-presubmit
Commit: 6df9863
Rerun command: /test kubeflow-katib-presubmit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@andreyvelich
Member Author

@johnugeorge @gaocegege If you don't have any objections to this PR, can you give your lgtm so we can make the other changes in follow-up PRs?

@gaocegege
Member

/lgtm

Tried it yesterday.

@gaocegege
Member

Thanks for your contribution! 🎉 👍
