Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix file fingerprint to use atomic get content hash of uploaded file #16597

Merged
merged 19 commits into from
Feb 16, 2023

Conversation

tcrain
Copy link
Contributor

@tcrain tcrain commented Nov 30, 2022

What changes are proposed in this pull request?

Currently when complete is called on a file in Alluxio, a fingerprint of the file will be created by performing a GetStauts on the file on the UFS. If due to a concurrent write, the state of the file is different than what was written through Alluxio, the fingerprint will not actually match the content of the file in Alluxio. If this happens the state of the file in Alluxio will always be out of sync with the UFS, and the file will never be updated to the most recent version.
This is because metadata sync uses the fingerprint to see if the file needs synchronization, and if the fingerprint does not match the file in Alluxio there will be inconsistencies.

This PR fixes this by having the contentHash field of the fingerprint be computed while the file is actually written on the UFS. For object stores, this means the hash is taken from the result of the call to PutObject. Unfortunately HDFS does not have a similar interface, so the content hash is taken just after the output stream is closed to complete the write. There could be a small chance that someone changes the file in this window between the two operations.

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • Commits associated with Github account: PASS
  • PR title follows the conventions: FAIL
    • The title of the PR does not pass all the checks. Please fix the following issues:
      • First word of title ("Atomic") is not an imperative verb. Please use one of the valid words

Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks.

@@ -1699,7 +1699,8 @@ void completeFileInternal(RpcContext rpcContext, LockedInodePath inodePath,
try (CloseableResource<UnderFileSystem> ufsResource = resolution.acquireUfsResource()) {
UnderFileSystem ufs = ufsResource.get();
if (ufsStatus == null) {
ufsFingerprint = ufs.getParsedFingerprint(ufsPath).serialize();
ufsFingerprint = ufs.getParsedFingerprint(ufsPath, context.getOptions().hasContentHash()
? context.getOptions().getContentHash() : null).serialize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will we have a case where input contentHash is diff than the ufs status contenthash, meaning at time of completefile ufs has changed to some other content, and we are updating fileinode with our version of contenthash but with other field uptodate with ufs like owner ? or at least we could reload if we find contenthash mismatch later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I am going for here is to have the data in Alluxio match the fingerprint created. The first part is to match the contents of the file which is done here. The next part will be to match the metadata (this PR is still in progress). I might be able to remove the UFS access here all together.

I am thinking that even if the data in Alluxio is out of sync with the UFS it is fine here, as later synchronizations will happen based on the metadata sync policy. The main thing that matters is that the fingerprint matches the actual data so that if there is an updated file on the UFS it will be synced whenever the next sync happens based on the policy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So after internal discussion, only the content hash of the file will be kept consistent.

This is due to other metadata in the fingerprint being computed in so many different ways, with all kinds of strange behaviors that clients might rely on, it was better not to mess with this as it could break things.

@github-actions
Copy link

github-actions bot commented Jan 9, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The PR/Issue does not have recent activities and will be closed automatically label Jan 9, 2023
@LuQQiu
Copy link
Contributor

LuQQiu commented Jan 9, 2023

@lucyge2022 @tcrain PTAL too see if the changes are still valid? Please help work on the PR or close it otherwise

@github-actions github-actions bot removed the stale The PR/Issue does not have recent activities and will be closed automatically label Jan 10, 2023
@jja725 jja725 self-requested a review January 12, 2023 00:17
@tcrain tcrain changed the title Atomic get content hash of uploaded file Fix file fingerprint to use atomic get content hash of uploaded file Jan 25, 2023
@alluxio-bot
Copy link
Contributor

Automated checks report:

  • Commits associated with Github account: PASS
  • PR title follows the conventions: PASS

All checks passed!

@LuQQiu
Copy link
Contributor

LuQQiu commented Feb 14, 2023

@tcrain is this PR ready for another round of review?

@tcrain
Copy link
Contributor Author

tcrain commented Feb 14, 2023

@tcrain is this PR ready for another round of review?

@LuQQiu If you review, I'll also make another PR that will help with #16833

Copy link
Contributor

@LuQQiu LuQQiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, left two comments about the code arrangement

* content written when the stream is closed. The content hash will then be used as part of
* the metadata fingerprint when the file is completed on the Alluxio master.
*/
public interface UnderFileSystemOutputStream {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename it to more concrete?

@@ -42,7 +44,7 @@
* local disk and copied as a complete file when the {@link #close()} method is called.
*/
@NotThreadSafe
public final class OBSOutputStream extends OutputStream {
public final class OBSOutputStream extends OutputStream implements UnderFileSystemOutputStream {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try to move the contentHash function into OutputStream and have default implementation, the logics seem similar

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't figure out a good way to do this, since different classes extend different types out output streams I think I have to use an interface. Unless you can think of a better way?

Copy link
Contributor

@LuQQiu LuQQiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left one minor comment, please fix the test failures

@@ -20,7 +20,7 @@
* content written when the stream is closed. The content hash will then be used as part of
* the metadata fingerprint when the file is completed on the Alluxio master.
*/
public interface UnderFileSystemOutputStream {
public interface ContentHashableOutputStream {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without OutputStream? similar to Seekable?

@LuQQiu
Copy link
Contributor

LuQQiu commented Feb 16, 2023

alluxio-bot, merge this please

@alluxio-bot alluxio-bot merged commit 00da77c into Alluxio:master Feb 16, 2023
@Xenorith Xenorith added the area-master Alluxio master label Feb 28, 2023
YangchenYe323 pushed a commit to YangchenYe323/alluxio that referenced this pull request Apr 16, 2023
Currently when complete is called on a file in Alluxio, a fingerprint of
the file will be created by performing a GetStauts on the file on the
UFS. If due to a concurrent write, the state of the file is different
than what was written through Alluxio, the fingerprint will not actually
match the content of the file in Alluxio. If this happens the state of
the file in Alluxio will always be out of sync with the UFS, and the
file will never be updated to the most recent version.
This is because metadata sync uses the fingerprint to see if the file
needs synchronization, and if the fingerprint does not match the file in
Alluxio there will be inconsistencies.

This PR fixes this by having the contentHash field of the fingerprint be
computed while the file is actually written on the UFS. For object
stores, this means the hash is taken from the result of the call to
PutObject. Unfortunately HDFS does not have a similar interface, so the
content hash is taken just after the output stream is closed to complete
the write. There could be a small chance that someone changes the file
in this window between the two operations.

pr-link: Alluxio#16597
change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
jja725 pushed a commit to jja725/alluxio that referenced this pull request May 1, 2023
Currently when complete is called on a file in Alluxio, a fingerprint of
the file will be created by performing a GetStauts on the file on the
UFS. If due to a concurrent write, the state of the file is different
than what was written through Alluxio, the fingerprint will not actually
match the content of the file in Alluxio. If this happens the state of
the file in Alluxio will always be out of sync with the UFS, and the
file will never be updated to the most recent version.
This is because metadata sync uses the fingerprint to see if the file
needs synchronization, and if the fingerprint does not match the file in
Alluxio there will be inconsistencies.

This PR fixes this by having the contentHash field of the fingerprint be
computed while the file is actually written on the UFS. For object
stores, this means the hash is taken from the result of the call to
PutObject. Unfortunately HDFS does not have a similar interface, so the
content hash is taken just after the output stream is closed to complete
the write. There could be a small chance that someone changes the file
in this window between the two operations.

pr-link: Alluxio#16597
change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
jja725 pushed a commit to jja725/alluxio that referenced this pull request May 1, 2023
Currently when complete is called on a file in Alluxio, a fingerprint of
the file will be created by performing a GetStauts on the file on the
UFS. If due to a concurrent write, the state of the file is different
than what was written through Alluxio, the fingerprint will not actually
match the content of the file in Alluxio. If this happens the state of
the file in Alluxio will always be out of sync with the UFS, and the
file will never be updated to the most recent version.
This is because metadata sync uses the fingerprint to see if the file
needs synchronization, and if the fingerprint does not match the file in
Alluxio there will be inconsistencies.

This PR fixes this by having the contentHash field of the fingerprint be
computed while the file is actually written on the UFS. For object
stores, this means the hash is taken from the result of the call to
PutObject. Unfortunately HDFS does not have a similar interface, so the
content hash is taken just after the output stream is closed to complete
the write. There could be a small chance that someone changes the file
in this window between the two operations.

pr-link: Alluxio#16597
change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
@jja725 jja725 mentioned this pull request May 1, 2023
jja725 pushed a commit to jja725/alluxio that referenced this pull request May 2, 2023
Currently when complete is called on a file in Alluxio, a fingerprint of
the file will be created by performing a GetStauts on the file on the
UFS. If due to a concurrent write, the state of the file is different
than what was written through Alluxio, the fingerprint will not actually
match the content of the file in Alluxio. If this happens the state of
the file in Alluxio will always be out of sync with the UFS, and the
file will never be updated to the most recent version.
This is because metadata sync uses the fingerprint to see if the file
needs synchronization, and if the fingerprint does not match the file in
Alluxio there will be inconsistencies.

This PR fixes this by having the contentHash field of the fingerprint be
computed while the file is actually written on the UFS. For object
stores, this means the hash is taken from the result of the call to
PutObject. Unfortunately HDFS does not have a similar interface, so the
content hash is taken just after the output stream is closed to complete
the write. There could be a small chance that someone changes the file
in this window between the two operations.

pr-link: Alluxio#16597
change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
alluxio-bot pushed a commit that referenced this pull request May 3, 2023
Cherry-pick of existing commit.
orig-pr: #16597
orig-commit: 00da77c
orig-commit-author: Tyler Crain <tycrain@gmail.com>

			pr-link: #17361
			change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
jiacheliu3 pushed a commit to jiacheliu3/alluxio that referenced this pull request May 16, 2023
… get content hash of uploaded file

Currently when complete is called on a file in Alluxio, a fingerprint of
the file will be created by performing a GetStauts on the file on the
UFS. If due to a concurrent write, the state of the file is different
than what was written through Alluxio, the fingerprint will not actually
match the content of the file in Alluxio. If this happens the state of
the file in Alluxio will always be out of sync with the UFS, and the
file will never be updated to the most recent version.
This is because metadata sync uses the fingerprint to see if the file
needs synchronization, and if the fingerprint does not match the file in
Alluxio there will be inconsistencies.

This PR fixes this by having the contentHash field of the fingerprint be
computed while the file is actually written on the UFS. For object
stores, this means the hash is taken from the result of the call to
PutObject. Unfortunately HDFS does not have a similar interface, so the
content hash is taken just after the output stream is closed to complete
the write. There could be a small chance that someone changes the file
in this window between the two operations.

pr-link: Alluxio#16597
change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
jiacheliu3 pushed a commit to jiacheliu3/alluxio that referenced this pull request May 16, 2023
… get content hash of uploaded file

Currently when complete is called on a file in Alluxio, a fingerprint of
the file will be created by performing a GetStauts on the file on the
UFS. If due to a concurrent write, the state of the file is different
than what was written through Alluxio, the fingerprint will not actually
match the content of the file in Alluxio. If this happens the state of
the file in Alluxio will always be out of sync with the UFS, and the
file will never be updated to the most recent version.
This is because metadata sync uses the fingerprint to see if the file
needs synchronization, and if the fingerprint does not match the file in
Alluxio there will be inconsistencies.

This PR fixes this by having the contentHash field of the fingerprint be
computed while the file is actually written on the UFS. For object
stores, this means the hash is taken from the result of the call to
PutObject. Unfortunately HDFS does not have a similar interface, so the
content hash is taken just after the output stream is closed to complete
the write. There could be a small chance that someone changes the file
in this window between the two operations.

pr-link: Alluxio#16597
change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
Jackson-Wang-7 pushed a commit to Jackson-Wang-7/alluxio that referenced this pull request Nov 24, 2023
Currently when complete is called on a file in Alluxio, a fingerprint of
the file will be created by performing a GetStauts on the file on the
UFS. If due to a concurrent write, the state of the file is different
than what was written through Alluxio, the fingerprint will not actually
match the content of the file in Alluxio. If this happens the state of
the file in Alluxio will always be out of sync with the UFS, and the
file will never be updated to the most recent version.
This is because metadata sync uses the fingerprint to see if the file
needs synchronization, and if the fingerprint does not match the file in
Alluxio there will be inconsistencies.

This PR fixes this by having the contentHash field of the fingerprint be
computed while the file is actually written on the UFS. For object
stores, this means the hash is taken from the result of the call to
PutObject. Unfortunately HDFS does not have a similar interface, so the
content hash is taken just after the output stream is closed to complete
the write. There could be a small chance that someone changes the file
in this window between the two operations.

pr-link: Alluxio#16597
change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Change Changes covering public API area-master Alluxio master
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants