Support pai and paiYarn trainingservice #1853
merge master
Update evolution doc (microsoft#1493)
augment pylintrc (microsoft#1643)
fix console.log (microsoft#1636)
Filter prune algo implementation (microsoft#1655)
* Training Service implementation for OpenPAI (Open Platform for AI)
* Refer https://github.com/Microsoft/pai for more info about OpenPAI
*/
@component.Singleton
Remove this @component.Singleton
fixed.
@Inject
private readonly paiTrainingService: PAITrainingService;
protected readonly paiTrainingService: PAITrainingService;
@Inject should be removed if the field is assigned in the constructor.
fixed.
private isMultiPhase: boolean = false;
private authFileHdfsPath: string | undefined = undefined;
private portList?: string | undefined;
abstract class PAITrainingService implements TrainingService {
Remove this @component.Singleton
fixed.
return deferred.promise;
}

public getClusterMetadata(key: string): Promise<string> {
It seems getClusterMetadata is not implemented; if so, it should throw a not-implemented error.
fixed.
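A minimal sketch of what the comment above asks for, assuming a plain Error is acceptable. The class name SketchTrainingService and the message text are hypothetical, and the project may well have its own dedicated error type for this case.

```typescript
// Hypothetical stand-in for a training service; not the PR's actual class.
class SketchTrainingService {
    // The method exists only to satisfy the interface, so it fails loudly
    // instead of returning a promise that never settles.
    public getClusterMetadata(_key: string): Promise<string> {
        throw new Error('getClusterMetadata is not implemented');
    }
}
```

Because the method is not declared async, the error is thrown synchronously at the call site rather than surfacing as a rejected promise.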
try {
await restServer.stop();
await this.paiJobRestServer.stop();
Remove this unnecessary deferred
fixed.
if (restServer.getErrorMessage !== undefined) {
throw new Error(restServer.getErrorMessage);
if (this.paiJobRestServer === undefined) {
throw new Error('paiBaseJobRestServer not implemented!');
The error message is not accurate; it should say that paiJobRestServer is undefined.
fixed.
}

public async setClusterMetadata(key: string, value: string): Promise<void> {
const deferred: Deferred<void> = new Deferred<void>();
Remove the deferred
fixed.
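To illustrate why the deferred is redundant in an async method, here is a rough sketch. The Deferred class below is a simplified stand-in written for this example, and setWithDeferred, setWithoutDeferred, and the Map-based store are hypothetical names, not code from the PR.

```typescript
// Simplified stand-in for a Deferred helper (assumption, not the project's class).
class Deferred<T> {
    public resolve!: (value: T) => void;
    public reject!: (reason?: unknown) => void;
    public readonly promise: Promise<T>;

    constructor() {
        this.promise = new Promise<T>((resolve, reject) => {
            this.resolve = resolve;
            this.reject = reject;
        });
    }
}

// Before: the async function already returns a promise, yet wraps a second one.
async function setWithDeferred(store: Map<string, string>, key: string, value: string): Promise<void> {
    const deferred: Deferred<void> = new Deferred<void>();
    store.set(key, value);
    deferred.resolve();

    return deferred.promise;
}

// After: the implicit promise of the async function is enough.
async function setWithoutDeferred(store: Map<string, string>, key: string, value: string): Promise<void> {
    store.set(key, value);
}
```

Both variants behave the same for callers that await them; the second simply has less state to get wrong, for example a code path that forgets to resolve or reject.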
await validateCodeDir(this.paiTrialConfig.codeDir);
} catch (error) {
this.log.error(error);
deferred.reject(new Error(error));
This error should already have been logged by the REST server, so there is no need to catch and re-throw it. Please check.
fixed.
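As a sketch of the point above: when an awaited call rejects inside an async function, the rejection already propagates to the caller, so a catch block that only logs and re-throws adds nothing. The function names and the file-count limit below are hypothetical and only loosely modelled on the diff.

```typescript
// Hypothetical validator: rejects when the directory holds too many files.
async function validateCodeDirSketch(fileCount: number): Promise<number> {
    const maxFiles = 1000; // assumed limit, for illustration only
    if (fileCount > maxFiles) {
        throw new Error(`too many files: ${fileCount}`);
    }

    return fileCount;
}

// No try/catch needed: a rejection from the awaited call rejects
// the promise returned by submitSketch with the original Error.
async function submitSketch(fileCount: number): Promise<string> {
    await validateCodeDirSketch(fileCount);

    return 'submitted';
}
```

A caller that handles the rejection, such as a REST layer with central error logging, receives the original Error object with its stack intact, whereas wrapping it in new Error(error) would flatten it to a string.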
if (this.paiClusterConfig === undefined) {
throw new Error(`paiClusterConfig not initialized!`);
}
const deferred: Deferred<PAITrialJobDetail> = new Deferred<PAITrialJobDetail>();
Remove this deferred
fixed.
return Promise.reject(`trial job ${trialJobId} not found`);
}

return Promise.resolve(paiTrialJob);
return Promise.resolve(paiBaseTrialJob);
Return paiBaseTrialJob directly.
fixed.
if (paiTrialJob === undefined) {
if (paiBaseTrialJob === undefined) {
Use throw new Error here to keep the style consistent.
fixed.
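The two suggestions above (return the value directly, and throw an Error rather than reject with a string) might combine as in the following sketch. It assumes the lookup method is declared async; TrialJobDetailSketch and TrialJobStoreSketch are hypothetical names, not types from the PR.

```typescript
// Hypothetical, trimmed-down job record for illustration.
interface TrialJobDetailSketch {
    id: string;
    status: string;
}

class TrialJobStoreSketch {
    private readonly trialJobsMap: Map<string, TrialJobDetailSketch> = new Map();

    public add(job: TrialJobDetailSketch): void {
        this.trialJobsMap.set(job.id, job);
    }

    public async getTrialJob(trialJobId: string): Promise<TrialJobDetailSketch> {
        const job: TrialJobDetailSketch | undefined = this.trialJobsMap.get(trialJobId);
        if (job === undefined) {
            // Throwing inside an async function rejects the returned promise
            // with a real Error object instead of a bare string.
            throw new Error(`trial job ${trialJobId} not found`);
        }

        // An async function wraps its return value in a promise already.
        return job;
    }
}
```

Rejecting with an Error instead of a string also gives callers a stack trace and lets them rely on instanceof checks.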
trialJobDetail.status = 'FAILED'; // eslint-disable-line require-atomic-updates
deferred.resolve(true);

return deferred.promise;
Return true directly.
fixed.
tools/nni_cmd/launcher_utils.py
@@ -261,7 +262,7 @@ def validate_machine_list(experiment_config):

def validate_pai_trial_conifg(experiment_config):
'''validate the trial config in pai platform'''
if experiment_config.get('trainingServicePlatform') == 'pai':
if experiment_config.get('trainingServicePlatform') in ['pai', 'PAIYarn']:
It seems PAIYarn should be paiYarn.
fixed.
if (this.paiClusterConfig === undefined) {
throw new Error(`paiBaseClusterConfig not initialized!`);
}
const deferred: Deferred<PAITrialJobDetail> = new Deferred<PAITrialJobDetail>();
Remove this deferred.
fixed.
// Validate to make sure codeDir doesn't have too many files
try {
await validateCodeDir(this.paiTrialConfig.codeDir);
} catch (error) {
Remove this try catch
fixed.
}

public async setClusterMetadata(key: string, value: string): Promise<void> {
const deferred: Deferred<void> = new Deferred<void>();
Remove this deferred.
fixed.
protected postParameterFileMeta(parameterFileMeta: ParameterFileMeta): Promise<void> {
const deferred: Deferred<void> = new Deferred<void>();
if (this.paiJobRestServer === undefined) {
throw new Error('paiBaseJobRestServer not implemented!');
The error message is inaccurate.
fixed.
test image
sparksnail/nni:paiYarn