-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set min replicate to 0 and unpin file if we fail to replicate a file #17865
Conversation
@jja725 some background context, a user saw repeated replication jobs because the file has been removed from the ufs, and was still pinned in alluxio. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM
try { | ||
JobUtils.loadBlock(status, context.getFsContext(), config.getBlockId(), null, false); | ||
} catch (IOException e) { | ||
LOG.warn("Replication of {} failed, reduce min replication to 1 and unpin.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOG.warn("Replication of {} failed, reduce min replication to 1 and unpin.", | |
LOG.warn("Replication of {} failed, reduce min replication to 0 and unpin.", |
status.getPath()); | ||
SetAttributePOptions.Builder optionsBuilder = | ||
SetAttributePOptions.newBuilder(); | ||
context.getFileSystem().setAttribute(new AlluxioURI(config.getPath()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this throw exception? how do we handle it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed as a supressed exception
Automated checks report:
Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks. |
context.getFileSystem().setAttribute(new AlluxioURI(config.getPath()), | ||
optionsBuilder.setReplicationMin(0).setPinned(false).build()); | ||
} catch (Throwable e2) { | ||
e.addSuppressed(e2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add LOG.warn
here so at least we can find this suppressed issues in logs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
try { | ||
JobUtils.loadBlock(status, context.getFsContext(), config.getBlockId(), null, false); | ||
} catch (IOException e) { | ||
LOG.warn("Replication of {} failed, reduce min replication to 0 and unpin.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the failure reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
alluxio-bot, check this please |
Automated checks report:
Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks. |
Automated checks report:
All checks passed! |
alluxio-bot, merge this please |
merge failed: |
alluxio-bot, merge this please |
alluxio-bot, cherry-pick this to branch-2.10 please |
Auto cherry-pick to branch |
### What changes are proposed in this pull request? Now we unpin and set min replication to 0 if we failed due to IO related reasons. (such as UFS not available, file does not exist etc). ### Why are the changes needed? Previously if a replication job fails, it would be rescheduled again and again, causing Denial of Service. ### Does this PR introduce any user facing changes? Yes, the user would expect certain files unpinned or adjusted if it becomes unavailable. pr-link: #17865 change-id: cid-a2cde6da1e7a95fb361157307bd026303c0164ad
### What changes are proposed in this pull request? Merge missing commits from master-2.x to main. The commits in 2023/07/01~2023/11/08 from main...master-2.x will be included by this PR. We do this merge to catch missing fixes from `master-2.x` and catch the train before `main` cuts a release. #17747 is not cherry picked because tencent cloud EMR doc is removed #17755 is not cherry picked because DistLoadCliRunner has been removed in 3.x #17758 is not cherry picked because MonoBlockStore has been removed in 3.x #17641 is not cherry picked because the PR has already been in main #17781 is not cherry picked because the PR has already been in main #17722 is not cherry picked because the alluxio-fuse command has been changed a lot #17489 is not cherry picked because audit log on master is no longer in 3.x #17865 is not cherry picked because replication on job service is no longer in 3.x #17858 is not cherry picked because it is already in main #18090 is not cherry picked because generate-tarball has been rewritten in 3.x #18091 is not cherry picked because the change is already in main #17474 is not cherry picked because reconfiguration feature is not defined in 3.x #17735 is not cherry picked because MonoBlockStore is no longer in 3.x #18133 is not cherry picked because the issue is about master metadata and no longer relevant in 3.x #17910 is not cherry picked because I prefer to do that manually #17983 is not cherry picked because the web UI has been reworked #17984 is not cherry picked because Mount/Unmount commands have been reworked in 3.x #18103 is not cherry picked because worker cache metrics have been reworked in 3.x #18185 is not cherry picked because the report command has been reworked in 3.x #18222 is not cherry picked because Mount/Unmount operations have been reworked in 3.x #18143 is not cherry picked because the change is already in main #18303 is not cherry picked because the change is already in main #18208 is not cherry picked because cache metrics have been reworked in 3.x #17002 is not cherry picked because the owner has been notified separately #18334 is not cherry picked because the bash scripts have been reworked in 3.x #18326 is not cherry picked because the owner has been notified separately pr-link: #18397 change-id: cid-dbf8cbb2d9e721a5a0a1e5028a3c9577438a2ac0
What changes are proposed in this pull request?
Now we unpin and set min replication to 0 if we failed due to IO related reasons. (such as UFS not available, file does not exist etc).
Why are the changes needed?
Previously if a replication job fails, it would be rescheduled again and again, causing Denial of Service.
Does this PR introduce any user facing changes?
Yes, the user would expect certain files unpinned or adjusted if it becomes unavailable.