
Fix MPIJob status inaccuracy when killing redundant worker replicas #1616

Closed
wants to merge 1 commit into from

Conversation

PeterChg
Contributor

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

@PeterChg
Contributor Author

/assign @Jeffwan

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: PeterChg
To complete the pull request process, please ask for approval from jeffwan after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

coveralls commented Jun 17, 2022

Pull Request Test Coverage Report for Build 2546467435

  • 4 of 4 (100.0%) changed or added relevant lines in 1 file are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.03%) to 39.979%

Files with Coverage Reduction:
  • pkg/controller.v1/mpi/mpijob_controller.go — 2 new missed lines, 77.7%

Totals Coverage Status:
  • Change from base Build 2531249966: +0.03%
  • Covered Lines: 2334
  • Relevant Lines: 5838

💛 - Coveralls

@terrytangyuan
Member

Could you elaborate why there will be a redundant worker replica?

@PeterChg
Contributor Author

Could you elaborate why there will be a redundant worker replica?

/cc @gaocegege
/cc @zw0610

I see that this scenario is already considered in the code; I also want to know why this happens.

@google-oss-prow google-oss-prow bot requested a review from zw0610 June 20, 2022 11:06
@zw0610
Member

zw0610 commented Jun 23, 2022

If you are referring to Elastic Training with MPIJob, I believe the case could be:

With elastic training enabled, when some worker Pods are failed, evicted, or deleted, the MPIJob should continue to run.

But I'm not sure whether we can describe such a case as redundant workers.

@PeterChg When you say this PR is to fix something, as in the title, could you be more specific about which issue this pull request is going to fix? You could add a description to this pull request, link an existing issue, or create a new issue.
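
A minimal sketch of the scale-down scenario described above, using hypothetical types and pod names rather than the training-operator's actual code: when the desired worker replica count is lowered, pods whose index no longer fits the new count become "redundant" and are deleted by the controller.

```go
// Hypothetical sketch: identify worker pods that become redundant after an
// elastic scale-down. Types and names are illustrative only.
package main

import "fmt"

type workerPod struct {
	name  string
	index int
}

// redundantWorkers returns the pods whose index no longer fits the desired
// replica count after a scale-down.
func redundantWorkers(pods []workerPod, desiredReplicas int) []workerPod {
	var redundant []workerPod
	for _, p := range pods {
		if p.index >= desiredReplicas {
			redundant = append(redundant, p)
		}
	}
	return redundant
}

func main() {
	pods := []workerPod{
		{"mpijob-worker-0", 0},
		{"mpijob-worker-1", 1},
		{"mpijob-worker-2", 2},
		{"mpijob-worker-3", 3},
	}
	// The job is scaled down from 4 to 2 workers; worker-2 and worker-3
	// become redundant and would be deleted by the controller.
	for _, p := range redundantWorkers(pods, 2) {
		fmt.Println("would delete redundant worker:", p.name)
	}
}
```

Whether this situation is best described as "redundant workers" is exactly the terminology question raised above.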

@PeterChg
Contributor Author

PeterChg commented Jun 23, 2022

If you are referring to Elastic Training with MPIJob, I believe the case could be:

With elastic training enabled, when some worker Pods are failed, evicted, or deleted, the MPIJob should continue to run.

But I'm not sure whether we can describe such a case as redundant workers.

@PeterChg When you say this PR is to fix something, as in the title, could you be more specific about which issue this pull request is going to fix? You could add a description to this pull request, link an existing issue, or create a new issue.

/cc @terrytangyuan
What @zw0610 said describes more accurately what this PR fixes. I have updated the PR description.
Without this PR, when Pods are evicted or deleted, their phase becomes Failed before they disappear, and this results in a Failed MPIJob status.
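
A minimal sketch of the status-accounting problem described above, under the assumption that the fix amounts to skipping pods that are already marked for deletion; this is not the PR's exact diff, and the helper below is hypothetical. Without such a guard, a worker pod that is evicted or deleted (for example, a redundant replica removed after a scale-down) briefly reports phase Failed before it disappears, and counting that transient phase marks the whole MPIJob as Failed.

```go
// Hypothetical sketch: count only genuine worker failures, ignoring pods
// that are being deleted (DeletionTimestamp set), so a transient Failed
// phase on an evicted or scaled-down worker does not fail the MPIJob.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// countWorkerFailures counts failed worker pods that are not being deleted.
func countWorkerFailures(pods []corev1.Pod) int {
	failed := 0
	for _, p := range pods {
		if p.DeletionTimestamp != nil {
			// Pod is being removed (e.g. a redundant replica); its
			// transient Failed phase should not fail the job.
			continue
		}
		if p.Status.Phase == corev1.PodFailed {
			failed++
		}
	}
	return failed
}

func main() {
	now := metav1.Now()
	pods := []corev1.Pod{
		{Status: corev1.PodStatus{Phase: corev1.PodRunning}},
		{
			ObjectMeta: metav1.ObjectMeta{DeletionTimestamp: &now},
			Status:     corev1.PodStatus{Phase: corev1.PodFailed},
		},
	}
	fmt.Println("failed workers counted toward job status:", countWorkerFailures(pods))
}
```

Running this prints 0 failed workers: the pod with a DeletionTimestamp is ignored even though its phase is Failed, so the MPIJob status stays accurate.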

@johnugeorge
Member

johnugeorge commented Jun 28, 2022

@zw0610 Do you want to take this in the current release?
/cc @terrytangyuan

Related: #1622

@zw0610
Member

zw0610 commented Jun 28, 2022

@johnugeorge I'm afraid we have to wait for the next release cycle because:

  1. the description of this pull request is still missing, which could be confusing for future developers
  2. I cannot recall a situation where some workers reach Succeeded status instead of Failed, which means I do not fully understand the change in this pull request.

@johnugeorge
Member

Sounds good. We will hold it till next release
/hold

This pull request was closed.