-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix file fingerprint to use atomic get content hash of uploaded file #16597
Conversation
Automated checks report:
Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks. |
@@ -1699,7 +1699,8 @@ void completeFileInternal(RpcContext rpcContext, LockedInodePath inodePath, | |||
try (CloseableResource<UnderFileSystem> ufsResource = resolution.acquireUfsResource()) { | |||
UnderFileSystem ufs = ufsResource.get(); | |||
if (ufsStatus == null) { | |||
ufsFingerprint = ufs.getParsedFingerprint(ufsPath).serialize(); | |||
ufsFingerprint = ufs.getParsedFingerprint(ufsPath, context.getOptions().hasContentHash() | |||
? context.getOptions().getContentHash() : null).serialize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will we have a case where input contentHash is diff than the ufs status contenthash, meaning at time of completefile ufs has changed to some other content, and we are updating fileinode with our version of contenthash but with other field uptodate with ufs like owner ? or at least we could reload if we find contenthash mismatch later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I am going for here is to have the data in Alluxio match the fingerprint created. The first part is to match the contents of the file which is done here. The next part will be to match the metadata (this PR is still in progress). I might be able to remove the UFS access here all together.
I am thinking that even if the data in Alluxio is out of sync with the UFS it is fine here, as later synchronizations will happen based on the metadata sync policy. The main thing that matters is that the fingerprint matches the actual data so that if there is an updated file on the UFS it will be synced whenever the next sync happens based on the policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So after internal discussion, only the content hash of the file will be kept consistent.
This is due to other metadata in the fingerprint being computed in so many different ways, with all kinds of strange behaviors that clients might rely on, it was better not to mess with this as it could break things.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions. |
@lucyge2022 @tcrain PTAL too see if the changes are still valid? Please help work on the PR or close it otherwise |
Automated checks report:
All checks passed! |
@tcrain is this PR ready for another round of review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, left two comments about the code arrangement
* content written when the stream is closed. The content hash will then be used as part of | ||
* the metadata fingerprint when the file is completed on the Alluxio master. | ||
*/ | ||
public interface UnderFileSystemOutputStream { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename it to more concrete?
@@ -42,7 +44,7 @@ | |||
* local disk and copied as a complete file when the {@link #close()} method is called. | |||
*/ | |||
@NotThreadSafe | |||
public final class OBSOutputStream extends OutputStream { | |||
public final class OBSOutputStream extends OutputStream implements UnderFileSystemOutputStream { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try to move the contentHash function into OutputStream and have default implementation, the logics seem similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't figure out a good way to do this, since different classes extend different types out output streams I think I have to use an interface. Unless you can think of a better way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, left one minor comment, please fix the test failures
@@ -20,7 +20,7 @@ | |||
* content written when the stream is closed. The content hash will then be used as part of | |||
* the metadata fingerprint when the file is completed on the Alluxio master. | |||
*/ | |||
public interface UnderFileSystemOutputStream { | |||
public interface ContentHashableOutputStream { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without OutputStream? similar to Seekable?
alluxio-bot, merge this please |
Currently when complete is called on a file in Alluxio, a fingerprint of the file will be created by performing a GetStauts on the file on the UFS. If due to a concurrent write, the state of the file is different than what was written through Alluxio, the fingerprint will not actually match the content of the file in Alluxio. If this happens the state of the file in Alluxio will always be out of sync with the UFS, and the file will never be updated to the most recent version. This is because metadata sync uses the fingerprint to see if the file needs synchronization, and if the fingerprint does not match the file in Alluxio there will be inconsistencies. This PR fixes this by having the contentHash field of the fingerprint be computed while the file is actually written on the UFS. For object stores, this means the hash is taken from the result of the call to PutObject. Unfortunately HDFS does not have a similar interface, so the content hash is taken just after the output stream is closed to complete the write. There could be a small chance that someone changes the file in this window between the two operations. pr-link: Alluxio#16597 change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
Currently when complete is called on a file in Alluxio, a fingerprint of the file will be created by performing a GetStauts on the file on the UFS. If due to a concurrent write, the state of the file is different than what was written through Alluxio, the fingerprint will not actually match the content of the file in Alluxio. If this happens the state of the file in Alluxio will always be out of sync with the UFS, and the file will never be updated to the most recent version. This is because metadata sync uses the fingerprint to see if the file needs synchronization, and if the fingerprint does not match the file in Alluxio there will be inconsistencies. This PR fixes this by having the contentHash field of the fingerprint be computed while the file is actually written on the UFS. For object stores, this means the hash is taken from the result of the call to PutObject. Unfortunately HDFS does not have a similar interface, so the content hash is taken just after the output stream is closed to complete the write. There could be a small chance that someone changes the file in this window between the two operations. pr-link: Alluxio#16597 change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
Currently when complete is called on a file in Alluxio, a fingerprint of the file will be created by performing a GetStauts on the file on the UFS. If due to a concurrent write, the state of the file is different than what was written through Alluxio, the fingerprint will not actually match the content of the file in Alluxio. If this happens the state of the file in Alluxio will always be out of sync with the UFS, and the file will never be updated to the most recent version. This is because metadata sync uses the fingerprint to see if the file needs synchronization, and if the fingerprint does not match the file in Alluxio there will be inconsistencies. This PR fixes this by having the contentHash field of the fingerprint be computed while the file is actually written on the UFS. For object stores, this means the hash is taken from the result of the call to PutObject. Unfortunately HDFS does not have a similar interface, so the content hash is taken just after the output stream is closed to complete the write. There could be a small chance that someone changes the file in this window between the two operations. pr-link: Alluxio#16597 change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
Currently when complete is called on a file in Alluxio, a fingerprint of the file will be created by performing a GetStauts on the file on the UFS. If due to a concurrent write, the state of the file is different than what was written through Alluxio, the fingerprint will not actually match the content of the file in Alluxio. If this happens the state of the file in Alluxio will always be out of sync with the UFS, and the file will never be updated to the most recent version. This is because metadata sync uses the fingerprint to see if the file needs synchronization, and if the fingerprint does not match the file in Alluxio there will be inconsistencies. This PR fixes this by having the contentHash field of the fingerprint be computed while the file is actually written on the UFS. For object stores, this means the hash is taken from the result of the call to PutObject. Unfortunately HDFS does not have a similar interface, so the content hash is taken just after the output stream is closed to complete the write. There could be a small chance that someone changes the file in this window between the two operations. pr-link: Alluxio#16597 change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
… get content hash of uploaded file Currently when complete is called on a file in Alluxio, a fingerprint of the file will be created by performing a GetStauts on the file on the UFS. If due to a concurrent write, the state of the file is different than what was written through Alluxio, the fingerprint will not actually match the content of the file in Alluxio. If this happens the state of the file in Alluxio will always be out of sync with the UFS, and the file will never be updated to the most recent version. This is because metadata sync uses the fingerprint to see if the file needs synchronization, and if the fingerprint does not match the file in Alluxio there will be inconsistencies. This PR fixes this by having the contentHash field of the fingerprint be computed while the file is actually written on the UFS. For object stores, this means the hash is taken from the result of the call to PutObject. Unfortunately HDFS does not have a similar interface, so the content hash is taken just after the output stream is closed to complete the write. There could be a small chance that someone changes the file in this window between the two operations. pr-link: Alluxio#16597 change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
… get content hash of uploaded file Currently when complete is called on a file in Alluxio, a fingerprint of the file will be created by performing a GetStauts on the file on the UFS. If due to a concurrent write, the state of the file is different than what was written through Alluxio, the fingerprint will not actually match the content of the file in Alluxio. If this happens the state of the file in Alluxio will always be out of sync with the UFS, and the file will never be updated to the most recent version. This is because metadata sync uses the fingerprint to see if the file needs synchronization, and if the fingerprint does not match the file in Alluxio there will be inconsistencies. This PR fixes this by having the contentHash field of the fingerprint be computed while the file is actually written on the UFS. For object stores, this means the hash is taken from the result of the call to PutObject. Unfortunately HDFS does not have a similar interface, so the content hash is taken just after the output stream is closed to complete the write. There could be a small chance that someone changes the file in this window between the two operations. pr-link: Alluxio#16597 change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
Currently when complete is called on a file in Alluxio, a fingerprint of the file will be created by performing a GetStauts on the file on the UFS. If due to a concurrent write, the state of the file is different than what was written through Alluxio, the fingerprint will not actually match the content of the file in Alluxio. If this happens the state of the file in Alluxio will always be out of sync with the UFS, and the file will never be updated to the most recent version. This is because metadata sync uses the fingerprint to see if the file needs synchronization, and if the fingerprint does not match the file in Alluxio there will be inconsistencies. This PR fixes this by having the contentHash field of the fingerprint be computed while the file is actually written on the UFS. For object stores, this means the hash is taken from the result of the call to PutObject. Unfortunately HDFS does not have a similar interface, so the content hash is taken just after the output stream is closed to complete the write. There could be a small chance that someone changes the file in this window between the two operations. pr-link: Alluxio#16597 change-id: cid-64723be309bdb14b05613864af3b6a1bb30cba6d
What changes are proposed in this pull request?
Currently when complete is called on a file in Alluxio, a fingerprint of the file will be created by performing a GetStauts on the file on the UFS. If due to a concurrent write, the state of the file is different than what was written through Alluxio, the fingerprint will not actually match the content of the file in Alluxio. If this happens the state of the file in Alluxio will always be out of sync with the UFS, and the file will never be updated to the most recent version.
This is because metadata sync uses the fingerprint to see if the file needs synchronization, and if the fingerprint does not match the file in Alluxio there will be inconsistencies.
This PR fixes this by having the contentHash field of the fingerprint be computed while the file is actually written on the UFS. For object stores, this means the hash is taken from the result of the call to PutObject. Unfortunately HDFS does not have a similar interface, so the content hash is taken just after the output stream is closed to complete the write. There could be a small chance that someone changes the file in this window between the two operations.