This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Reuse OpenPAI jobs to run multiple trials #2521

Merged on Jun 30, 2020 (54 commits)

@squirrelsc squirrelsc commented Jun 2, 2020

This PR designs a new interface to support a reusable training service. It currently applies only to OpenPAI and is disabled by default.

  1. Replace trial_keeper.py with trial_runner.py. trial_runner holds an environment, receives commands from NNI manager to run or stop a trial, and returns events to NNI manager.
  2. Add a trial dispatcher, which inherits from the original training service interface. It is used to share as much code as possible across all training services and to isolate that code from the individual training services.
  3. Add an EnvironmentService interface to manage environments, including starting and stopping an environment and refreshing the status of environments.
  4. Add a command channel on both the NNI manager side and the trial runner side. It supports different ways of passing messages between them. The currently supported channels are file and WebSocket. The supported commands from NNI manager are start trial, kill trial, and send new parameters. The supported commands from the runner are initialized (to support channels that don't know which runner is connected), trial end, stdout (new type; includes metrics as before), version check (new type), and GPU info (new type).
  5. Add a storage service that wraps a storage backend, such as NFS or Azure Storage, in standard file operations.
  6. Partially support running multiple trials in parallel on the runner side. This is not yet supported on the trial dispatcher side.
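To make the file-based command channel in item 4 more concrete, here is a minimal sketch of the idea, not the code in this PR. The class name `FileCommandChannel`, the JSON-lines format, and the command names `start_trial` and `kill_trial` are all assumptions made for illustration. The sender appends one command per line, and the receiver remembers how far it has read.

```python
import json
import os
import tempfile


class FileCommandChannel:
    """Hypothetical file-backed channel: one JSON command per line."""

    def __init__(self, path):
        self.path = path
        self._offset = 0  # number of bytes this reader has already consumed

    def send(self, command_type, data):
        # Append the command as a single JSON line.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"type": command_type, "data": data}) + "\n")

    def receive(self):
        """Return every command appended since the previous call."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, "rb") as f:
            f.seek(self._offset)
            chunk = f.read()
            self._offset = f.tell()
        lines = chunk.decode("utf-8").splitlines()
        return [json.loads(line) for line in lines if line.strip()]


def demo():
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "commands.jsonl")
        manager_side = FileCommandChannel(path)  # stands in for NNI manager
        runner_side = FileCommandChannel(path)   # stands in for the trial runner
        manager_side.send("start_trial", {"trial_id": "t1"})
        manager_side.send("kill_trial", {"trial_id": "t1"})
        first = runner_side.receive()   # both pending commands
        second = runner_side.receive()  # nothing new since the last read
        return first, second
```

A real channel would also need to handle partially written lines and concurrent writers, which this sketch ignores.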

Other minor changes:

  1. Add log_level to TS UT, so that UT can show debug-level logs.
  2. Expose platform to start info.
  3. Add RouterTrainingService to keep the original OpenPAI training service, and support dynamic IOC binding.
  4. Add more GPU info for future usage, including GPU memory total/free/used and GPU type.
  5. Make some license information consistent.
  6. Fix async/await problems with Array.forEach; this method does not actually await async callbacks.
  7. Fix IT errors when downloading data, which were caused by my earlier PR "Remove duplicate data under /tmp folder, and other small changes" (#2484).
  8. Accelerate some run-loop patterns by reducing sleep time.

squirrelsc and others added 15 commits June 2, 2020 15:19
1. rename storage file name
2. add more log on status changes
3. change isEnd to isAlive for better naming
add internal prefix for internal storage methods for clear usage.
fix pylint errors
minor fixes
rename methods of storageService
move trial to a seperated file
fix some bugs.
fix openPAI breaking changes
fix minor bugs
 to router training service for better understanding.

nni_log(LogType.Info, "%s: start to run trial" % self.name)

trial_working_dir = os.path.realpath(os.path.join(os.curdir, "..", "..", "trials", self.id))
Contributor

I'm a little confused about the folder structure. Is trial_working_dir at the top level?

Member Author

trial_working_dir should be the root dir of a trial. It contains the code and the .nni folder.
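As a sketch of how the path in the diff resolves: the join goes two levels up from the runner's current directory and then into `trials/<id>`. The helper name `resolve_trial_dir`, the explicit `start_dir` parameter, and the `environments/env-1` layout below are assumptions for illustration; the diff itself uses `os.curdir`.

```python
import os
import tempfile


def resolve_trial_dir(trial_id, start_dir=os.curdir):
    # Same join as in the diff, with the starting directory made explicit.
    return os.path.realpath(os.path.join(start_dir, "..", "..", "trials", trial_id))


def demo():
    with tempfile.TemporaryDirectory() as root:
        root = os.path.realpath(root)
        # Hypothetical runner directory, two levels below the shared root.
        runner_dir = os.path.join(root, "environments", "env-1")
        os.makedirs(runner_dir)
        return root, resolve_trial_dir("trial-42", runner_dir)
```

With this layout, the trial directory ends up as a sibling of the hypothetical `environments` folder, under `trials/trial-42` in the shared root.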


return deferred.promise;
Promise.all(connectionPromises);
Contributor

missing await before Promise.all

Member Author

fixed

trial_runner_syslogger.pipeReader.set_process_exit()
trial_runner_syslogger.close()

# The process doesn't exit even after the main loop exits, so exit it explicitly.
Contributor

Why doesn't the process exit?

Member Author

In some error cases, sub-threads do not exit, so a forced exit is needed to keep the process from hanging.
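The behavior described here can be reproduced with a short sketch. A Python process normally waits for all non-daemon threads before exiting, so one stuck thread keeps it alive after the main loop ends. One common way to force the exit is `os._exit`, which skips that wait. Whether the PR uses `os._exit` or another call is not shown in this excerpt, so treat the choice as an assumption.

```python
import subprocess
import sys
import textwrap

# Script run in a child interpreter so the forced exit does not end this process.
CHILD_SCRIPT = textwrap.dedent("""
    import os
    import threading
    import time

    # A non-daemon thread that never finishes, standing in for a stuck sub-thread.
    threading.Thread(target=lambda: time.sleep(3600)).start()

    print("main loop finished", flush=True)
    # Without this call the interpreter would wait for the thread above.
    os._exit(0)
""")


def run_child(timeout=30):
    return subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

Because `os._exit` skips normal cleanup such as flushing buffers and running `atexit` handlers, the sketch flushes stdout explicitly before calling it.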

await this.commandChannel.sendCommand(environment, KILL_TRIAL_JOB, trial.id);
trial.isEarlyStopped = isEarlyStopped;
trial.status = trial.isEarlyStopped === true ?
'EARLY_STOPPED' : 'USER_CANCELED';
Contributor

@chicm-ms commented Jun 29, 2020
In the old logic, trial.status was not set here because the cancel could fail. Can we ensure the cancel succeeds here?

Member Author

It cannot guarantee that the cancel succeeds in every case, but it makes a best effort: the runner kills all child processes that belong to the trial code.
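One way to make a best-effort kill of a trial and all its children is sketched below. It is POSIX-only and not necessarily how the runner in this PR does it. The function names `start_trial` and `kill_trial` and the grace period are assumptions. The trial is started in its own process group so that a single signal reaches every process it spawned.

```python
import os
import signal
import subprocess
import sys
import time


def start_trial(cmd):
    # start_new_session=True gives the trial its own session and process group,
    # so the group id equals the trial's pid (POSIX only).
    return subprocess.Popen(cmd, start_new_session=True)


def kill_trial(proc, grace_seconds=5.0):
    """Send SIGTERM to the whole group, then SIGKILL if it is still alive."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        return proc.wait()  # the group is already gone
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait()


def demo():
    # The trial spawns a grandchild, which inherits the same process group.
    child_code = "import time; time.sleep(600)"
    parent_code = (
        "import subprocess, sys, time; "
        "subprocess.Popen([sys.executable, '-c', %r]); "
        "time.sleep(600)" % child_code
    )
    proc = start_trial([sys.executable, "-c", parent_code])
    time.sleep(0.5)  # give the trial a moment to spawn its child
    return kill_trial(proc)
```

This still cannot guarantee success: a child that moves itself into a different session or process group escapes the group signal, which matches the "best effort" caveat in the reply.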

@squirrelsc changed the title from "[DRAFT] Reuse OpenPAI jobs to run multiple trials" to "Reuse OpenPAI jobs to run multiple trials" on Jun 29, 2020
@squirrelsc squirrelsc linked an issue Jun 30, 2020 that may be closed by this pull request
8 tasks
@squirrelsc squirrelsc merged commit 0b9d6ce into microsoft:master Jun 30, 2020
@squirrelsc squirrelsc deleted the 2391-reuse-job branch June 30, 2020 04:56
@chicm-ms chicm-ms mentioned this pull request Jul 1, 2020
24 tasks
Successfully merging this pull request may close these issues.

Advancing job performance: working pool to reuse job
3 participants