Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement for Custom CRD #1333

Merged
merged 11 commits into from
Oct 13, 2020

Conversation

andreyvelich
Copy link
Member

I was testing Custom CRD.
I was able to run Tekton TaskRun as Trial CR. I Will submit PR with few examples soon.
I found these problems while running:

  • I disabled MutateJob in controller. We will track it in this issue: Refactoring Supported Job List #1320.

  • I disabled validation for Job other than SupportedJobList.

  • We should verify PrimaryContainerName in MutateVolume also.

  • I changed INSERT in RegisterObservationLog function. It is better to insert all lines in one SQL query, instead of running separate query for each line. It helps to avoid unnecessary Trial updates.

  • I added reconcile re-queue if metrics are not available. This use-case happens in Tekton job. Tekton task is succeeded once Training container is finished, but metrics can be not collected, yet. Because of that, controller reconciles Trial without observation and turns Trial status in Metrics Unavailable. To avoid it, I try to re-queue controller for maxRequeueCount times to convert Trial to Succeeded status. @gaocegege @johnugeorge What do you think about this approach, can we do it in other way?

/assign @gaocegege @johnugeorge

Add retry on empty observation
@kubeflow-bot
Copy link

This change is Reviewable

@andreyvelich
Copy link
Member Author

I figure out that we send all Trials to Suggestion. I think we don't need to send Trials with unavailable metrics to Suggestion service since we try to reconcile controller until metrics are reported.
WDYT @gaocegege @johnugeorge ?

@gaocegege
Copy link
Member

SGTM, we do not need to send these trials.

@gaocegege
Copy link
Member

/lgtm

If objective metric value is not reported metrics collector reports unavailable value to the DB
Controller reconciles Trial until DB is empty
@k8s-ci-robot k8s-ci-robot removed the lgtm label Sep 18, 2020
@andreyvelich
Copy link
Member Author

@gaocegege We were discussing with @johnugeorge about my first solution with re-queuing controller and saving data in map.
It has few disadvantages.

  1. We don't want to waste resources with this map and have problems with controller restarts as you mentioned before.
  2. It is not a good way to monitor Reconcile restarts. We don't really know how many times Trial controller will be restarted and it depends on user's cluster.

I propose a bit different solution:

  • If metrics collector doesn't find objectiveMetricName in the training logs, it reports unavailable value to the DB.
  • If Trial CRD Job is succeeded, controller re-queues Trial object until DB records with appropriate trial name are empty.
  • If DB has a record with unavailable metric value, Trial changes status to MetricsUnavailable

What do you think about this?

@andreyvelich
Copy link
Member Author

/retest

@andreyvelich
Copy link
Member Author

/retest

@johnugeorge
Copy link
Member

/lgtm
/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 85fc7d0 into kubeflow:master Oct 13, 2020
@andreyvelich andreyvelich deleted the not-validate-supported-job branch October 6, 2021 00:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants