-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Fix a race condition bug that does not store Trial Job cancel status correctly #707
Conversation
yds05
commented
Feb 1, 2019
- For PAITrainingService, we should rely on the status returned from PAI rest server to change trial job's status, rather than setting it's cancelling status in cancelTrialJob() method, because jobInfoCollector will then reset job's status based on PAI rest server's result.
- For RemoteMachineTrainingService, it almost the same, we should not set trial's status in cancelTrialJob() method.
deferred.resolve(); | ||
} | ||
}); | ||
|
||
// Set trialjobDetail's early stopped field, to mark the job's cancellation source | ||
trialJobDetail.isEarlyStopped = isEarlyStopped; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this line has to be put at the beginning of this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not very sure why we should put this line to the beginning.
The beginning of this function has many condition checking. If we put this line to the beginning of this function but condition checking failed, the trial job status may be wrong. And also, I didn't see any race condition risk to put it after we issues PAI job stop request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, then we should put this line before line 328 request(...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes Sure.
trialJob.status = getJobCancelStatus(isEarlyStopped); | ||
//TODO: delete and move set USER_CANCELLED/EARLY_STOP in getTrialJob | ||
// Mark the toEarlyStop tag here | ||
trialJob.isEarlyStopped = isEarlyStopped; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line should be put before the await SSHClientUtility
(i.e., line 282)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done