Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Support pai and paiYarn trainingservice #1853

Merged
merged 41 commits into from
Dec 23, 2019

Conversation

SparkSnail
Copy link
Contributor

@SparkSnail SparkSnail commented Dec 15, 2019

test image sparksnail/nni:paiYarn

SparkSnail and others added 28 commits August 6, 2019 18:58
Filter prune algo implementation (microsoft#1655)
@SparkSnail SparkSnail changed the title Support PAI on K8S version Support pai and paiYarn trainingservice Dec 15, 2019
* Training Service implementation for OpenPAI (Open Platform for AI)
* Refer https://github.com/Microsoft/pai for more info about OpenPAI
*/
@component.Singleton
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this @component.Singleton

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


@Inject
private readonly paiTrainingService: PAITrainingService;
protected readonly paiTrainingService: PAITrainingService;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Inject should be removed if it is assigned in constructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

private isMultiPhase: boolean = false;
private authFileHdfsPath: string | undefined = undefined;
private portList?: string | undefined;
abstract class PAITrainingService implements TrainingService {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this @component.Singleton

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


return deferred.promise;
}

public getClusterMetadata(key: string): Promise<string> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this getClusterMetadata is not implemented, then throw not implemented error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

try {
await restServer.stop();
await this.paiJobRestServer.stop();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this unnecessary deferred

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

if (restServer.getErrorMessage !== undefined) {
throw new Error(restServer.getErrorMessage);
if (this.paiJobRestServer === undefined) {
throw new Error('paiBaseJobRestServer not implemented!');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error message is not accurate, should be paiJobRestServer is undefined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

}

public async setClusterMetadata(key: string, value: string): Promise<void> {
const deferred: Deferred<void> = new Deferred<void>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the deferred

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

await validateCodeDir(this.paiTrialConfig.codeDir);
} catch (error) {
this.log.error(error);
deferred.reject(new Error(error));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this error should have already been logged by rest server, no need to catch and re-throw, please check it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

if (this.paiClusterConfig === undefined) {
throw new Error(`paiClusterConfig not initialized!`);
}
const deferred: Deferred<PAITrialJobDetail> = new Deferred<PAITrialJobDetail>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this deferred

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

return Promise.reject(`trial job ${trialJobId} not found`);
}

return Promise.resolve(paiTrialJob);
return Promise.resolve(paiBaseTrialJob);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return paiBaseTrialJob directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


if (paiTrialJob === undefined) {
if (paiBaseTrialJob === undefined) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw new Error to keep style consistent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

trialJobDetail.status = 'FAILED'; // eslint-disable-line require-atomic-updates
deferred.resolve(true);

return deferred.promise;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return true directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@@ -261,7 +262,7 @@ def validate_machine_list(experiment_config):

def validate_pai_trial_conifg(experiment_config):
'''validate the trial config in pai platform'''
if experiment_config.get('trainingServicePlatform') == 'pai':
if experiment_config.get('trainingServicePlatform') in ['pai', 'PAIYarn']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems PAIYarn should be paiYarn

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

if (this.paiClusterConfig === undefined) {
throw new Error(`paiBaseClusterConfig not initialized!`);
}
const deferred: Deferred<PAITrialJobDetail> = new Deferred<PAITrialJobDetail>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this deferred.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

// Validate to make sure codeDir doesn't have too many files
try {
await validateCodeDir(this.paiTrialConfig.codeDir);
} catch (error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this try catch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

}

public async setClusterMetadata(key: string, value: string): Promise<void> {
const deferred: Deferred<void> = new Deferred<void>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this deferred.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

protected postParameterFileMeta(parameterFileMeta: ParameterFileMeta): Promise<void> {
const deferred: Deferred<void> = new Deferred<void>();
if (this.paiJobRestServer === undefined) {
throw new Error('paiBaseJobRestServer not implemented!');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error messge inaccurate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@SparkSnail SparkSnail merged commit 9cbbf6f into microsoft:master Dec 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants