This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Fix a race condition bug that does not store Trial Job cancel status correctly #707

Merged
merged 3 commits into from
Feb 25, 2019


yds05
Contributor

@yds05 yds05 commented Feb 1, 2019

  1. For PAITrainingService, we should rely on the status returned by the PAI REST server to change a trial job's status, rather than setting its cancel status in the cancelTrialJob() method, because jobInfoCollector will subsequently reset the job's status based on the PAI REST server's result.
  2. For RemoteMachineTrainingService, the situation is almost the same: we should not set the trial's status in the cancelTrialJob() method.
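
The two points above can be illustrated with a minimal TypeScript sketch of the race. Everything in it is a simplifying assumption rather than NNI's actual code: the status strings, the `TrialJob` shape, and the functions `cancelBuggy`, `cancelFixed`, and `collect` are hypothetical. It models a collector that overwrites the status from the remote state on every poll, and compares a cancel method that writes the status directly with one that only records the cancellation source.

```typescript
// Hypothetical, simplified model; names and status strings are illustrative only.
type TrialJobStatus = 'RUNNING' | 'USER_CANCELED' | 'EARLY_STOPPED';
type RemoteState = 'RUNNING' | 'STOPPED';

interface TrialJob {
    id: string;
    status: TrialJobStatus;
    isEarlyStopped?: boolean;
}

// Same idea as getJobCancelStatus(isEarlyStopped) in the diff context.
function cancelStatusFor(isEarlyStopped: boolean): TrialJobStatus {
    return isEarlyStopped ? 'EARLY_STOPPED' : 'USER_CANCELED';
}

// Buggy variant: writes the status directly, so the next poll overwrites it
// and the cancellation source is lost.
function cancelBuggy(job: TrialJob, isEarlyStopped: boolean): void {
    job.status = cancelStatusFor(isEarlyStopped);
}

// Fixed variant: only records why the job is being cancelled.
function cancelFixed(job: TrialJob, isEarlyStopped: boolean): void {
    job.isEarlyStopped = isEarlyStopped;
}

// Collector: the only place that derives the status from the remote state.
function collect(job: TrialJob, remote: RemoteState): TrialJobStatus {
    job.status = remote === 'RUNNING'
        ? 'RUNNING'
        : cancelStatusFor(job.isEarlyStopped === true);
    return job.status;
}
```

In this model, if the collector polls while the remote job is still running, the buggy variant's status is reset to `RUNNING` and an early-stopped job later ends up reported as user-cancelled. The fixed variant keeps the flag, so the collector can derive the right final status once the remote side reports the stop.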

deferred.resolve();
}
});

// Set trialjobDetail's early stopped field, to mark the job's cancellation source
trialJobDetail.isEarlyStopped = isEarlyStopped;
Contributor

I think this line has to be put at the beginning of this function.

Contributor Author

@yds05 yds05 Feb 9, 2019

I'm not sure why we should move this line to the beginning.
The beginning of this function contains many condition checks. If we put this line there and one of the checks then fails, the trial job's state may be wrong. Also, I don't see any race condition risk in putting it after we issue the PAI job stop request.

Contributor

Agreed. Then we should put this line before line 328, request(...

Contributor Author

Yes, sure.
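
The ordering agreed on in this thread (after the condition checks, before the stop request is issued) might look like the following sketch. It is not the PAITrainingService code: `StopRequest` is a hypothetical stand-in for the real `request(...)` call, and `TrialJobDetail`, `JobTable`, and the callback signatures are assumptions made for illustration.

```typescript
// Hypothetical types; StopRequest stands in for the HTTP stop request.
interface TrialJobDetail {
    id: string;
    isEarlyStopped?: boolean;
}
type JobTable = { [id: string]: TrialJobDetail | undefined };
type Done = (error?: Error) => void;
type StopRequest = (jobId: string, onDone: Done) => void;

function cancelTrialJob(
    jobs: JobTable,
    jobId: string,
    isEarlyStopped: boolean,
    sendStop: StopRequest,
    onDone: Done
): void {
    const job = jobs[jobId];
    if (job === undefined) {
        // A failed check returns before any state has been touched.
        onDone(new Error(`trial job ${jobId} not found`));
        return;
    }
    // Record the cancellation source after the checks but before the request,
    // so anything that runs once the request is in flight already sees it.
    job.isEarlyStopped = isEarlyStopped;
    sendStop(jobId, onDone);
}
```

Placing the assignment here addresses both concerns raised above: a failed check leaves the job untouched, and no code that reacts to the stop request can observe the job without the flag.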

trialJob.status = getJobCancelStatus(isEarlyStopped);
//TODO: delete and move set USER_CANCELLED/EARLY_STOP in getTrialJob
// Mark the toEarlyStop tag here
trialJob.isEarlyStopped = isEarlyStopped;
Contributor

This line should be put before the await SSHClientUtility (i.e., line 282).

Contributor Author

Done
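
For the remote machine case, the reason for setting the flag before the `await` can be sketched as follows. This is an illustration under assumptions, not the RemoteMachineTrainingService code: `KillCommand` is a hypothetical stand-in for the SSH call, and `RemoteTrialJob` is a reduced job shape. In an `async` function, statements before the first `await` run synchronously, so the flag is in place before control is yielded to other code such as a status collector.

```typescript
// Hypothetical stand-ins; KillCommand plays the role of the remote SSH command.
interface RemoteTrialJob {
    id: string;
    isEarlyStopped?: boolean;
}
type KillCommand = (jobId: string) => Promise<void>;

async function cancelRemoteTrialJob(
    job: RemoteTrialJob,
    isEarlyStopped: boolean,
    kill: KillCommand
): Promise<void> {
    // Runs synchronously, before the function yields at the await below.
    job.isEarlyStopped = isEarlyStopped;
    // Other callbacks may run while this promise is pending.
    await kill(job.id);
}
```

If the assignment came after the `await` instead, a collector that ran while the kill command was still pending would see the job without its cancellation source.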

@scarlett2018 scarlett2018 added this to the 2019 Feb Release v0.5.2 milestone Feb 25, 2019
@yds05 yds05 merged commit 9a3a75c into microsoft:master Feb 25, 2019
5 participants