-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Add retry policy for azureStorage #1480
Add retry policy for azureStorage #1480
Conversation
merge master
…-fix-azureUpload
fileServerClient.createDirectoryIfNotExists(azureShare, azureFoler, (error: any, result: any, response: any) => { | ||
if (error) { | ||
getLogger() | ||
.error(`Create directory failed:, ${error}`); | ||
deferred.reject(error); | ||
deferred.resolve(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra space
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
`${this.azureStorageShare}/${path.join('nni', getExperimentId(), trialJobId, 'output')}`; | ||
while (retryCount >= 0) { | ||
//upload local files, including scripts for running the trial and configuration (e.g., hyperparameters) for the trial, to azure storage | ||
let resultUploadDir1: boolean = await AzureStorageClientUtility.uploadDirectory(this.azureStorageClient, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define meaningful name, how about resultUploadNNIScript
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, rename to resultUploadNNIScript
`nni/${getExperimentId()}/${trialJobId}`, this.azureStorageShare, | ||
`${trialLocalTempFolder}`); | ||
//upload code files to azure storage | ||
let resultUploadDir2: boolean = await AzureStorageClientUtility.uploadDirectory(this.azureStorageClient, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same, resultUploadCodeFile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, rename to resultUploadCodeFile
|
||
trialJobOutputUrl = `https://${this.azureStorageAccountName}.file.core.windows.net/` + | ||
`${this.azureStorageShare}/${path.join('nni', getExperimentId(), trialJobId, 'output')}`; | ||
while (retryCount >= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little confusion here because even retryCount == 0, we will enter the loop.
I think change to do...while is better to understand:
retryCount = 1
do
{
} while( retyrCount-- > 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, changed logic.
break; | ||
} else { | ||
//wait for 5 seconds to re-upload files | ||
await delay(5000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you differentiate the failure is caused by false resultUploadCodeFile, false resultUploadNNIScript, or both? rather than re-upload both regardless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, re-upload two folders separately.
this.log.info('Re-upload files to azure-storage'); | ||
retryCount -= 1; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these code is duplicated with frameworklauncher training service, consider to do a refactor to remove dup code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, extract duplicated code to kubernetesTrainingService.
await fileServerClient.createFileFromLocalFile(azureShare, azureDirectory, azureFileName, localFilePath, | ||
(error: any, result: any, response: any) => { | ||
if (error) { | ||
getLogger() | ||
.error(`Upload file failed:, ${error}`); | ||
deferred.reject(error); | ||
deferred.resolve(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra space too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
@@ -211,16 +211,28 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple | |||
if (this.azureStorageClient === undefined) { | |||
throw new Error('azureStorageClient is not initialized'); | |||
} | |||
let retryCount: number = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
retryCount always equals to 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add config file for this kind of configuration.
} else { | ||
//wait for 5 seconds to re-upload files | ||
await delay(5000); | ||
this.log.info('Re-upload files to azure-storage'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Upload failed, Retry: upload files to azure-storage'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
this.log.info('Re-upload files to azure-storage'); | ||
retryCount -= 1; | ||
} | ||
} | ||
} catch (error) { | ||
this.log.error(error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no error anymore (becomes false or true). what if the upload is failed after all the retries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If trialJobOutputUrl is '', the trial will be set to FAILED
//return a empty url when got error | ||
return Promise.resolve(""); | ||
} | ||
return Promise.resolve(trialJobOutputUrl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after try retryCount
times, the files are still not uploaded, what does this function return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function will return an empty path, and submitTrial function will set trial status FAILED.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to write log to tell the FAILED is induced by data upload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add log Retry-count is used up, upload files to azureStorage for trial ${trialJobId} failed!
src/nni_manager/config/config.yaml
Outdated
@@ -0,0 +1 @@ | |||
azureStorageUploadRetryCount: 1 #If upload files to azure storage failed, NNI will retry the process of uploading, this field will specify the number of attempts to re-upload files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
putting it here is strange, but let's keep it for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to put it in config
@@ -120,7 +120,8 @@ export namespace ValidationSchemas { | |||
azureStorage: joi.object({ | |||
accountName: joi.string().regex(/^([0-9]|[a-z]|[A-Z]|-){3,31}$/), | |||
azureShare: joi.string().regex(/^([0-9]|[a-z]|[A-Z]|-){3,63}$/) | |||
}) | |||
}), | |||
uploadRetryCount: joi.number().min(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest to make it a configuration of azureStorage
No description provided.