Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set min replicate to 0 and unpin file if we fail to replicate a file #17865

Merged
merged 4 commits into from
Aug 1, 2023

Conversation

yuzhu
Copy link
Contributor

@yuzhu yuzhu commented Jul 31, 2023

What changes are proposed in this pull request?

Now we unpin and set min replication to 0 if we failed due to IO related reasons. (such as UFS not available, file does not exist etc).

Why are the changes needed?

Previously if a replication job fails, it would be rescheduled again and again, causing Denial of Service.

Does this PR introduce any user facing changes?

Yes, the user would expect certain files unpinned or adjusted if it becomes unavailable.

@yuzhu yuzhu requested a review from jja725 July 31, 2023 21:11
@yuzhu
Copy link
Contributor Author

yuzhu commented Jul 31, 2023

@jja725 some background context, a user saw repeated replication jobs because the file has been removed from the ufs, and was still pinned in alluxio.

Copy link
Contributor

@jja725 jja725 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM

try {
JobUtils.loadBlock(status, context.getFsContext(), config.getBlockId(), null, false);
} catch (IOException e) {
LOG.warn("Replication of {} failed, reduce min replication to 1 and unpin.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LOG.warn("Replication of {} failed, reduce min replication to 1 and unpin.",
LOG.warn("Replication of {} failed, reduce min replication to 0 and unpin.",

status.getPath());
SetAttributePOptions.Builder optionsBuilder =
SetAttributePOptions.newBuilder();
context.getFileSystem().setAttribute(new AlluxioURI(config.getPath()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this throw exception? how do we handle it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed as a supressed exception

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: FAIL
    • The title of the PR does not pass all the checks. Please fix the following issues:
      • Title must not end with punctuation
      • First word of title ("Unpin") is not an imperative verb. Please use one of the valid words
  • Commits associated with Github account: PASS

Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks.

@yuzhu yuzhu changed the title Unpin a file and set min replicate to 0 if we fail to replicate a file. Set min replicate to 0 and unpin file if we fail to replicate a file. Jul 31, 2023
context.getFileSystem().setAttribute(new AlluxioURI(config.getPath()),
optionsBuilder.setReplicationMin(0).setPinned(false).build());
} catch (Throwable e2) {
e.addSuppressed(e2);
Copy link
Contributor

@apc999 apc999 Jul 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add LOG.warn here so at least we can find this suppressed issues in logs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

try {
JobUtils.loadBlock(status, context.getFsContext(), config.getBlockId(), null, false);
} catch (IOException e) {
LOG.warn("Replication of {} failed, reduce min replication to 0 and unpin.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the failure reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jja725 jja725 self-requested a review July 31, 2023 23:07
Copy link
Contributor

@jja725 jja725 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yuzhu
Copy link
Contributor Author

yuzhu commented Aug 1, 2023

alluxio-bot, check this please

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: FAIL
    • The title of the PR does not pass all the checks. Please fix the following issues:
      • Title must not end with punctuation
  • Commits associated with Github account: PASS

Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks.

@yuzhu yuzhu changed the title Set min replicate to 0 and unpin file if we fail to replicate a file. Set min replicate to 0 and unpin file if we fail to replicate a file Aug 1, 2023
@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: PASS
  • Commits associated with Github account: PASS

All checks passed!

@yuzhu
Copy link
Contributor Author

yuzhu commented Aug 1, 2023

alluxio-bot, merge this please

@alluxio-bot
Copy link
Contributor

merge failed:
Merge refused because pull request does not have label start with type-

@yuzhu yuzhu added the type-bug This issue is about a bug label Aug 1, 2023
@yuzhu
Copy link
Contributor Author

yuzhu commented Aug 1, 2023

alluxio-bot, merge this please

@alluxio-bot alluxio-bot merged commit c600830 into Alluxio:master-2.x Aug 1, 2023
17 checks passed
@yuzhu
Copy link
Contributor Author

yuzhu commented Aug 1, 2023

alluxio-bot, cherry-pick this to branch-2.10 please

@alluxio-bot
Copy link
Contributor

Auto cherry-pick to branch branch-2.10 successfully opened PR: #17867

alluxio-bot pushed a commit that referenced this pull request Aug 1, 2023
### What changes are proposed in this pull request?

Now we unpin and set min replication to 0 if we failed due to IO related reasons. (such as UFS not available, file does not exist etc).

### Why are the changes needed?
Previously if a replication job fails, it would be rescheduled again and again, causing Denial of Service.


### Does this PR introduce any user facing changes?
Yes, the user would expect certain files unpinned or adjusted if it becomes unavailable. 

			pr-link: #17865
			change-id: cid-a2cde6da1e7a95fb361157307bd026303c0164ad
alluxio-bot added a commit that referenced this pull request Aug 4, 2023
Cherry-pick of existing commit.
orig-pr: #17865
orig-commit: c600830
orig-commit-author: David Zhu <david@alluxio.com>

			pr-link: #17867
			change-id: cid-a2cde6da1e7a95fb361157307bd026303c0164ad
alluxio-bot added a commit that referenced this pull request Nov 8, 2023
### What changes are proposed in this pull request?
Merge missing commits from master-2.x to main. The commits in 2023/07/01~2023/11/08 from main...master-2.x will be included by this PR.

We do this merge to catch missing fixes from `master-2.x` and catch the train before `main` cuts a release.

#17747 is not cherry picked because tencent cloud EMR doc is removed
#17755 is not cherry picked because DistLoadCliRunner has been removed in 3.x
#17758 is not cherry picked because MonoBlockStore has been removed in 3.x
#17641 is not cherry picked because the PR has already been in main
#17781 is not cherry picked because the PR has already been in main
#17722 is not cherry picked because the alluxio-fuse command has been changed a lot
#17489 is not cherry picked because audit log on master is no longer in 3.x
#17865 is not cherry picked because replication on job service is no longer in 3.x
#17858 is not cherry picked because it is already in main
#18090 is not cherry picked because generate-tarball has been rewritten in 3.x
#18091 is not cherry picked because the change is already in main
#17474 is not cherry picked because reconfiguration feature is not defined in 3.x
#17735 is not cherry picked because MonoBlockStore is no longer in 3.x
#18133 is not cherry picked because the issue is about master metadata and no longer relevant in 3.x
#17910 is not cherry picked because I prefer to do that manually
#17983 is not cherry picked because the web UI has been reworked
#17984 is not cherry picked because Mount/Unmount commands have been reworked in 3.x
#18103 is not cherry picked because worker cache metrics have been reworked in 3.x
#18185 is not cherry picked because the report command has been reworked in 3.x
#18222 is not cherry picked because Mount/Unmount operations have been reworked in 3.x
#18143 is not cherry picked because the change is already in main
#18303 is not cherry picked because the change is already in main
#18208 is not cherry picked because cache metrics have been reworked in 3.x
#17002 is not cherry picked because the owner has been notified separately
#18334 is not cherry picked because the bash scripts have been reworked in 3.x
#18326 is not cherry picked because the owner has been notified separately

			pr-link: #18397
			change-id: cid-dbf8cbb2d9e721a5a0a1e5028a3c9577438a2ac0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug This issue is about a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants