Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Add retry policy for azureStorage #1480

Merged
merged 11 commits into from
Aug 29, 2019

Conversation

SparkSnail
Copy link
Contributor

No description provided.

@suiguoxin suiguoxin requested review from QuanluZhang and yds05 August 21, 2019 02:39
fileServerClient.createDirectoryIfNotExists(azureShare, azureFoler, (error: any, result: any, response: any) => {
if (error) {
getLogger()
.error(`Create directory failed:, ${error}`);
deferred.reject(error);
deferred.resolve(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

`${this.azureStorageShare}/${path.join('nni', getExperimentId(), trialJobId, 'output')}`;
while (retryCount >= 0) {
//upload local files, including scripts for running the trial and configuration (e.g., hyperparameters) for the trial, to azure storage
let resultUploadDir1: boolean = await AzureStorageClientUtility.uploadDirectory(this.azureStorageClient,
Copy link
Contributor

@yds05 yds05 Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define meaningful name, how about resultUploadNNIScript

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, rename to resultUploadNNIScript

`nni/${getExperimentId()}/${trialJobId}`, this.azureStorageShare,
`${trialLocalTempFolder}`);
//upload code files to azure storage
let resultUploadDir2: boolean = await AzureStorageClientUtility.uploadDirectory(this.azureStorageClient,
Copy link
Contributor

@yds05 yds05 Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same, resultUploadCodeFile

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, rename to resultUploadCodeFile


trialJobOutputUrl = `https://${this.azureStorageAccountName}.file.core.windows.net/` +
`${this.azureStorageShare}/${path.join('nni', getExperimentId(), trialJobId, 'output')}`;
while (retryCount >= 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little confusion here because even retryCount == 0, we will enter the loop.

I think change to do...while is better to understand:

retryCount = 1
do
{
} while( retyrCount-- > 0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, changed logic.

break;
} else {
//wait for 5 seconds to re-upload files
await delay(5000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you differentiate the failure is caused by false resultUploadCodeFile, false resultUploadNNIScript, or both? rather than re-upload both regardless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, re-upload two folders separately.

this.log.info('Re-upload files to azure-storage');
retryCount -= 1;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these code is duplicated with frameworklauncher training service, consider to do a refactor to remove dup code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, extract duplicated code to kubernetesTrainingService.

await fileServerClient.createFileFromLocalFile(azureShare, azureDirectory, azureFileName, localFilePath,
(error: any, result: any, response: any) => {
if (error) {
getLogger()
.error(`Upload file failed:, ${error}`);
deferred.reject(error);
deferred.resolve(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@@ -211,16 +211,28 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
if (this.azureStorageClient === undefined) {
throw new Error('azureStorageClient is not initialized');
}
let retryCount: number = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retryCount always equals to 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add config file for this kind of configuration.

} else {
//wait for 5 seconds to re-upload files
await delay(5000);
this.log.info('Re-upload files to azure-storage');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'Upload failed, Retry: upload files to azure-storage'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

this.log.info('Re-upload files to azure-storage');
retryCount -= 1;
}
}
} catch (error) {
this.log.error(error);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no error anymore (becomes false or true). what if the upload is failed after all the retries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If trialJobOutputUrl is '', the trial will be set to FAILED

@suiguoxin suiguoxin mentioned this pull request Aug 21, 2019
36 tasks
//return a empty url when got error
return Promise.resolve("");
}
return Promise.resolve(trialJobOutputUrl);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after try retryCount times, the files are still not uploaded, what does this function return?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function will return an empty path, and submitTrial function will set trial status FAILED.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to write log to tell the FAILED is induced by data upload.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add log Retry-count is used up, upload files to azureStorage for trial ${trialJobId} failed!

@@ -0,0 +1 @@
azureStorageUploadRetryCount: 1 #If upload files to azure storage failed, NNI will retry the process of uploading, this field will specify the number of attempts to re-upload files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

putting it here is strange, but let's keep it for now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to put it in config

@@ -120,7 +120,8 @@ export namespace ValidationSchemas {
azureStorage: joi.object({
accountName: joi.string().regex(/^([0-9]|[a-z]|[A-Z]|-){3,31}$/),
azureShare: joi.string().regex(/^([0-9]|[a-z]|[A-Z]|-){3,63}$/)
})
}),
uploadRetryCount: joi.number().min(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest to make it a configuration of azureStorage

@SparkSnail SparkSnail merged commit 2d252c9 into microsoft:v1.0 Aug 29, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants