Reuse OpenPAI jobs to run multiple trials #2521

squirrelsc · 2020-06-02T07:22:31Z

Designed a new interface to support a reusable training service. It currently applies only to OpenPAI and is disabled by default.

Replace trial_keeper.py with trial_runner.py. trial_runner holds an environment, receives commands from NNI manager to run or stop a trial, and returns events to NNI manager.
Add a trial dispatcher, which inherits from the original training service interface. It is used to share as much code as possible across all training services while staying isolated from them.
Add an EnvironmentService interface to manage environments, including starting/stopping an environment and refreshing the status of environments.
Add a command channel on both the NNI manager and trial runner sides; it supports different ways to pass messages between them. The currently supported channels are file and web sockets. Supported commands from NNI manager are start, kill trial, and send new parameters; from the runner they are initialized (to support channels that don't know which runner is connected), trial end, stdout (new type, including metrics as before), version check (new type), and GPU info (new type).
Add a storage service that wraps a storage backend, such as NFS or Azure Storage, in standard file operations.
Partially support running multiple trials in parallel on the runner side; this is not yet supported on the trial dispatcher side.
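
To make the command channel idea concrete, here is a minimal sketch of a file-based channel, assuming a JSON-lines message format. The names `CommandType` and `FileChannel`, the command strings, and the message layout are all hypothetical illustrations, not the identifiers or wire format used in this PR.

```python
import json
import os
import tempfile
from enum import Enum


class CommandType(Enum):
    # Commands sent by NNI manager (illustrative names, not NNI's real identifiers).
    START_TRIAL = "start_trial"
    KILL_TRIAL = "kill_trial"
    NEW_PARAMETERS = "new_parameters"
    # Events reported back by the trial runner.
    INITIALIZED = "initialized"
    TRIAL_END = "trial_end"
    STDOUT = "stdout"
    VERSION_CHECK = "version_check"
    GPU_INFO = "gpu_info"


class FileChannel:
    """Append-only file used as a one-way message queue (hypothetical design)."""

    def __init__(self, path):
        self.path = path
        self._offset = 0  # number of bytes already consumed by receive()

    def send(self, command_type, data):
        # One JSON document per line keeps messages easy to split on the reader side.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"type": command_type.value, "data": data}) + "\n")

    def receive(self):
        """Return all messages appended since the previous call."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, "rb") as f:
            f.seek(self._offset)
            chunk = f.read()
            self._offset = f.tell()
        messages = [json.loads(line) for line in chunk.decode("utf-8").splitlines() if line]
        return [(CommandType(m["type"]), m["data"]) for m in messages]


# Usage: the manager side writes commands, the runner side polls for them.
channel_file = os.path.join(tempfile.mkdtemp(), "manager_commands.txt")
manager_side = FileChannel(channel_file)
runner_side = FileChannel(channel_file)
manager_side.send(CommandType.START_TRIAL, {"trial_id": "t1"})
manager_side.send(CommandType.KILL_TRIAL, "t1")
received = runner_side.receive()
```

A second file in the opposite direction would carry runner events back to the manager. This sketch ignores partially written lines and concurrent writers, which a real channel would have to handle.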

Other minor changes:

Add log_level to TS UT, so that UT can show debug-level logs.
Expose platform to start info.
Add RouterTrainingService to keep the original OpenPAI training service and support dynamic IoC binding.
Add more GPU info for future use, including total/free/used GPU memory and GPU type.
Make some license information consistent.
Fix async/await problems with Array.forEach; this method does not actually await async callbacks.
Fix IT errors on downloading data, which were caused by my earlier PR "Remove duplicate data under /tmp folder, and other small changes" (#2484).
Speed up some run-loop patterns by reducing sleep seconds.


SparkSnail · 2020-06-29T02:43:48Z

tools/nni_trial_tool/trial.py

+
+        nni_log(LogType.Info, "%s: start to run trial" % self.name)
+
+        trial_working_dir = os.path.realpath(os.path.join(os.curdir, "..", "..", "trials", self.id))


A little confused about the folder structure. Is trial_working_dir at the top level?

trial_working_dir should be the root dir of a trial; it has the code and the .nni folder inside.
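
As a rough illustration of how the path in the diff resolves, the sketch below applies the same join to an explicit starting directory. The layout `/exp/envs/env1` and the helper name `trial_working_dir` are hypothetical; the real runner starts from `os.curdir` and uses `os.path.realpath`, which also resolves symlinks.

```python
import posixpath


def trial_working_dir(runner_cwd, trial_id):
    # Same join as in the diff, but relative to an explicit directory instead of os.curdir.
    # normpath collapses the ".." segments without touching the file system.
    return posixpath.normpath(posixpath.join(runner_cwd, "..", "..", "trials", trial_id))


# Hypothetical layout: the runner is started two levels below the experiment root.
print(trial_working_dir("/exp/envs/env1", "abc"))  # /exp/trials/abc
```

So the trial directory is a sibling tree of the runner's directory, two levels up, rather than a subfolder of the runner's working directory.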

chicm-ms · 2020-06-29T04:13:11Z

src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts


-        return deferred.promise;
+        Promise.all(connectionPromises);


missing await before Promise.all

chicm-ms · 2020-06-29T06:45:46Z

tools/nni_trial_tool/trial_runner.py

+                trial_runner_syslogger.pipeReader.set_process_exit()
+            trial_runner_syslogger.close()
+
+    # the process doesn't exit even main loop exit. So exit it explictly.


Why doesn't the process exit?

In some error cases, sub-threads do not exit, so the process needs to be force-exited to prevent this kind of situation.
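
The behavior described here can be reproduced in isolation. The sketch below assumes the stuck sub-thread is a non-daemon thread and uses `os._exit` to force the exit; whether trial_runner.py uses exactly this call is not shown in the quoted diff, so treat it as one possible approach. `CHILD` and `run_child` are hypothetical names.

```python
import subprocess
import sys
import textwrap

# Script run in a child interpreter so the forced exit does not kill this process.
CHILD = textwrap.dedent("""
    import os, threading, time

    # A non-daemon worker that never finishes, standing in for a stuck sub-thread.
    threading.Thread(target=lambda: time.sleep(3600)).start()

    # Falling off the end of the main module would block until the worker ends.
    # os._exit terminates the process immediately and skips cleanup handlers.
    os._exit(0)
""")


def run_child(timeout=10):
    """Run the child script and return its exit code; raises if it hangs."""
    return subprocess.run([sys.executable, "-c", CHILD], timeout=timeout).returncode
```

Because `os._exit` skips `atexit` handlers and does not flush stdio buffers, any loggers or files should be closed before it is called, as the diff does with `trial_runner_syslogger.close()`.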

chicm-ms · 2020-06-29T07:00:08Z

src/nni_manager/training_service/reusable/trialDispatcher.ts

+                        await this.commandChannel.sendCommand(environment, KILL_TRIAL_JOB, trial.id);
+                        trial.isEarlyStopped = isEarlyStopped;
+                        trial.status = trial.isEarlyStopped === true ?
+                            'EARLY_STOPPED' : 'USER_CANCELED';


In the old logic, trial.status is not set here because the cancel could fail. Can we ensure the cancel succeeds here?

It cannot guarantee that the cancel succeeds in every case, but it makes a best effort: the runner will kill all child processes that belong to the trial code.
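
A best-effort kill of a trial and its child processes might look like the sketch below. It assumes a POSIX system and that the trial is launched in its own session; `start_trial` and `kill_trial` are hypothetical helpers, not functions from trial_runner.py.

```python
import os
import signal
import subprocess
import sys


def start_trial(command):
    # start_new_session=True makes the trial the leader of a new process group,
    # so its descendants can be signalled together (POSIX only).
    return subprocess.Popen(command, start_new_session=True)


def kill_trial(process, grace_seconds=5.0):
    """Best-effort kill; returns True if the trial's main process is confirmed gone."""
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            # The group id equals the pid because the trial leads its own session.
            os.killpg(process.pid, sig)
        except ProcessLookupError:
            break  # the whole group is already gone
        try:
            process.wait(timeout=grace_seconds)
            break
        except subprocess.TimeoutExpired:
            continue  # escalate from SIGTERM to SIGKILL
    return process.poll() is not None
```

The boolean result gives the caller a way to tell whether the kill was confirmed before recording a final status such as 'USER_CANCELED', which is the concern raised in the review comment.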

squirrelsc and others added 15 commits June 2, 2020 15:19

init changes

a738331

Merge remote-tracking branch 'official/master' into 2391-reuse-job

3177aeb

refactors

2aafac1

1. rename storage file name 2. add more log on status changes 3. change isEnd to isAlive for better naming

refactoring

0435b7f

add internal prefix for internal storage methods for clear usage. fix pylint errors minor fixes

minor fix, and take some review comments.

2e5ef51

move reuse to upper level

6d7bc62

support multi nodes

c67b162

rename methods of storageService move trial to a separate file fix some bugs.

fix eslint errors

e13a620

support multi environments better

59d4a71

fix openPAI breaking changes

Merge remote-tracking branch 'official/master' into 2391-reuse-job

eae0540

code refactor

81c49cf

fix minor bugs

fix openpai yaml format

92cab3a

fix k8s yaml schema

0674d88

rename forward training service

e5b9665

to router training service for better understanding.

Merge remote-tracking branch 'official/master' into 2391-reuse-job

67ef648

QuanluZhang requested review from chicm-ms and SparkSnail June 8, 2020 03:17

Chi Song and others added 13 commits June 9, 2020 09:48

add trialService

1e626fd

trialService is used to support different submission types like AML.

not send stop for single node

c6b6061

rename environmentManager to trialDispatcher

b8e47be

TrialDispatcher is easier to understand its purpose.

support no central storage service

0ee933a

improve development support

c094057

Merge remote-tracking branch 'official/master' into 2391-reuse-job

c2735d3

use latest storage component

d0b2504

add gpu info

c8d4696

separate channel and add gpu collector in runner

3f4c177

add more GPU information, and improve debugging.

d0768b0

fix GPU info collector

8dff16f

channel support single file

500c1cb

refine code, and implement command channel

d880512

SparkSnail reviewed Jun 29, 2020


SparkSnail mentioned this pull request Jun 29, 2020

【Released】Iteration plan for June #2507

Closed

38 tasks

fix incorrect check logic

c1f0239

chicm-ms reviewed Jun 29, 2020


Chi Song added 3 commits June 29, 2020 12:33

make license header consistent

af97bb1

Merge remote-tracking branch 'official/master' into 2391-reuse-job

10feb6a

add missed await.

c00cd31

SparkSnail approved these changes Jun 29, 2020


support log level in UT

586d6ac

chicm-ms reviewed Jun 29, 2020


chicm-ms approved these changes Jun 29, 2020


Chi Song added 6 commits June 29, 2020 15:51

refine interface to support aml better.

2db8ff8

fix runtime error on exit

f631e4c

Merge remote-tracking branch 'official/master' into 2391-reuse-job

5982fb3

fix eslint error

f687a6e

send metric data from channel

476ffec

support version check

0f2367c

squirrelsc changed the title ~~[DRAFT] Reuse OpenPAI jobs to run multiple trials~~ Reuse OpenPAI jobs to run multiple trials Jun 29, 2020

Chi Song added 2 commits June 29, 2020 21:51

fix pylint errors

9d7bd3c

fix non-local failed ITs

130ed27

squirrelsc linked an issue Jun 30, 2020 that may be closed by this pull request

Advancing job performance: working pool to reuse job #2391

Closed

8 tasks

Chi Song added 4 commits June 30, 2020 10:42

remove some useless changes.

f64d60c

format code

7e9245c

Merge remote-tracking branch 'official/master' into 2391-reuse-job

b9ab306

add document

76c7d5a

squirrelsc merged commit 0b9d6ce into microsoft:master Jun 30, 2020

squirrelsc deleted the 2391-reuse-job branch June 30, 2020 04:56

chicm-ms mentioned this pull request Jul 1, 2020

June release end game #2621

Closed

24 tasks
