-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Reuse OpenPAI jobs to run multiple trials #2521
Conversation
add internal prefix for internal storage methods for clear usage. fix pylint errors minor fixes
rename methods of storageService move trial to a seperated file fix some bugs.
fix openPAI breaking changes
fix minor bugs
to router training service for better understanding.
trialService is used to support different submission types like AML.
TrialDispatcher is easier to understand it's purpose.
|
||
nni_log(LogType.Info, "%s: start to run trial" % self.name) | ||
|
||
trial_working_dir = os.path.realpath(os.path.join(os.curdir, "..", "..", "trials", self.id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a little confused about the folder structure. trial_work_dir is in top level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trial_working_dir should be root dir of a trial, it has code, .nni folder inside.
|
||
return deferred.promise; | ||
Promise.all(connectionPromises); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing await
before Promise.all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
trial_runner_syslogger.pipeReader.set_process_exit() | ||
trial_runner_syslogger.close() | ||
|
||
# the process doesn't exit even main loop exit. So exit it explictly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the process does not exit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in some error cases, subthreads not exit. So it needs to force exit to prevent this kind of situation.
await this.commandChannel.sendCommand(environment, KILL_TRIAL_JOB, trial.id); | ||
trial.isEarlyStopped = isEarlyStopped; | ||
trial.status = trial.isEarlyStopped === true ? | ||
'EARLY_STOPPED' : 'USER_CANCELED'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the old logic, trial.status
is not set here because the cancel could fail, can we ensure the cancel succeed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it cannot guarantee cancel succeed in any case, but it will try best... the runner will kill all child processes, which belows to trial code.
Designed new interface to support reusable training service, currently only applies to OpenPAI, and default disabled.
Other minor changes,