Fix worker error silent #863

Merged
merged 3 commits into from
Oct 9, 2019

Conversation

hougangliu
Member

@hougangliu hougangliu commented Oct 9, 2019

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:




@hougangliu
Member Author

/cc @gaocegege @johnugeorge

Member

@gaocegege gaocegege left a comment

/assign @johnugeorge

}
cmd.Wait()
}

Member

Should we use a third-party tool such as https://github.com/hpcloud/tail?

Member Author

cool, I can update it

Member

How do you signal the completion of tail logs?

Member Author

We don't need to know when tailing the logs completes. When the worker container exits, the metrics collector's main process collects and parses the file, and then exits.

Member

Just wondering if there is a chance to miss certain worker container logs in this case.
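For readers following this thread, here is a minimal sketch of how a log follower can avoid losing trailing lines: it checks for worker exit first and then always performs one more copy, so anything written before the exit signal is still forwarded. This is not the code in this PR. The followFile and demo names, the done channel used to signal worker exit, the polling interval, and the accuracy=... line format are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"time"
)

// followFile copies everything appended to path into out. It keeps polling
// until done is closed, and always runs one more copy after observing that,
// so data written before the close is not dropped.
// (Hypothetical helper, not taken from the PR.)
func followFile(path string, out io.Writer, done <-chan struct{}, poll time.Duration) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	for {
		// Check for completion BEFORE copying: if done is already closed,
		// the copy below acts as the final drain.
		finished := false
		select {
		case <-done:
			finished = true
		default:
		}
		// io.Copy reads up to the current end of file and returns nil on EOF.
		if _, err := io.Copy(out, f); err != nil {
			return err
		}
		if finished {
			return nil
		}
		time.Sleep(poll)
	}
}

// demo simulates a worker that writes two lines to a file and then exits.
func demo() (string, error) {
	f, err := os.CreateTemp("", "metrics-*.log")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	var out bytes.Buffer
	done := make(chan struct{})
	result := make(chan error, 1)
	go func() { result <- followFile(f.Name(), &out, done, 10*time.Millisecond) }()

	if _, err := f.WriteString("accuracy=0.5\n"); err != nil {
		return "", err
	}
	time.Sleep(30 * time.Millisecond)
	if _, err := f.WriteString("accuracy=0.9\n"); err != nil {
		return "", err
	}
	close(done) // the simulated worker has exited

	if err := <-result; err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	got, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(got)
}
```

The ordering inside the loop is the point of the sketch: because the exit check happens before the copy, the last copy always runs after the writer has finished.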

@johnugeorge
Member

Shall we add a failure e2e test to test failed cases?

@k8s-ci-robot k8s-ci-robot added size/S and removed size/M labels Oct 9, 2019
@hougangliu
Member Author

Shall we add a failure e2e test to test failed cases?

Sure, I can try to add one in another PR

@johnugeorge
Member

Please rebase when #861 is merged

@@ -244,7 +244,7 @@ func wrapWorkerContainer(pod *v1.Pod, jobKind, metricsFile string, mc common.Met
if c.Args != nil {
args = append(args, c.Args...)
}
- redirectStr := fmt.Sprintf(" 2>&1 | tee %s", metricsFile)
+ redirectStr := fmt.Sprintf("1>%s 2>&1", metricsFile)
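A note on why the two redirect strings in the hunk above behave differently: in a POSIX shell without pipefail, the exit status of a pipeline is that of its last command, so piping through tee reports tee's status instead of the worker's. Redirecting to the file keeps the worker's own status. The sketch below demonstrates this with a stand-in failing command. The helper names withTee, withRedirect, and exitCode are hypothetical, as is the way the suffix is joined to the command. It assumes sh is available on the PATH.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// withTee appends the old suffix from the hunk: output is piped through tee.
// (Hypothetical helper; how the real code joins the pieces is not shown in the PR excerpt.)
func withTee(cmd, metricsFile string) string {
	return cmd + fmt.Sprintf(" 2>&1 | tee %s", metricsFile)
}

// withRedirect appends the new suffix: output goes straight to the file.
func withRedirect(cmd, metricsFile string) string {
	return cmd + " " + fmt.Sprintf("1>%s 2>&1", metricsFile)
}

// exitCode runs shellCmd under "sh -c" and returns its exit status,
// or -1 if the command could not be run at all.
func exitCode(shellCmd string) int {
	err := exec.Command("sh", "-c", shellCmd).Run()
	if err == nil {
		return 0
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode()
	}
	return -1
}

func main() {
	// "false" stands in for a worker command that fails.
	fmt.Println("with tee:     ", exitCode(withTee("false", "/dev/null")))
	fmt.Println("with redirect:", exitCode(withRedirect("false", "/dev/null")))
}
```

With the tee form the shell reports success even though the command failed, which matches the "silent" worker error this PR's title refers to.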
Member

How do we let users know that kubectl logs works on the sidecar, which also outputs the worker container's logs? Should we rename the sidecar container to "logger" or something?
WDYT? @gaocegege @hougangliu

Member

SGTM

Member Author

SGTM

@k8s-ci-robot k8s-ci-robot added size/M and removed size/S labels Oct 9, 2019
@hougangliu
Member Author

/test kubeflow-katib-presubmit

@gaocegege
Member

Needs a rebase.

@hougangliu
Member Author

Needs a rebase.

Rebased already. It seems CI is on strike again.

@hougangliu
Member Author

/test kubeflow-katib-presubmit

@hougangliu
Member Author

hougangliu commented Oct 9, 2019

@gaocegege @johnugeorge
The test passed after nearly 50 minutes; Katib CI is running slowly again. We could speed it up by reducing the maxTrialCount value in the test.
WDYT?

Member

@gaocegege gaocegege left a comment

/lgtm

@gaocegege
Member

/assign @johnugeorge

@johnugeorge
Member

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit d39ee12 into kubeflow:master Oct 9, 2019