-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-38704: [C++] Implement Azure FileSystem Move() via Azure DataLake Storage Gen 2 API #39904
Conversation
…and blobs/files even though they are expected to behave mostly the same
There is a second use for it in later commits.
…PI logs" This reverts commit b2134bb7752354b7d73420fb850e2e3f7b6f6baa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look at all changes but I don't fully understand yet. I'll review again later.
// "subdir0/file-at-subdir" exists | ||
|
||
// src is a directory and dest does not exists | ||
CreateDirectory(adlfs_client, "subdir0"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this is needless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's necessary because I'm testing the scenario where the src exists. The next line moves the subdir0
to subdir1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that subdir0
created at https://github.com/apache/arrow/pull/39904/files#diff-7d2cbf9c6abf1ee980b8cf5a79e222543c17b3d93b12c4c0c9fa123b1bb200b6R1238 still exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Right. Removing it.
CreateDirectory(adlfs_client, "subdir1"); | ||
CreateDirectory(adlfs_client, "subdir2"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that they are needless.
latest < break_or_expires_at_ && | ||
!latest_known_expiry_time_.compare_exchange_weak(latest, break_or_expires_at_)) { | ||
} | ||
DCHECK_GE(latest_known_expiry_time_.load(), break_or_expires_at_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe?
I think that latest_known_expiry_time_
may be changed between latest_known_expiry_time_.compare_exchange_weak()
and latest_known_expiry_time_.load()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's safe because latest_known_expiry_time_
monotonically increases (it never goes down). So even if it's changed, the [G]reater than or [E]qual
will always work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see!
/// doesn't exist, otherwise a PathNotFound(location) error is produced right away | ||
/// \return A BlobLeaseClient is wrapped as a unique_ptr so it's moveable and | ||
/// optional (nullptr denotes blob not found) | ||
Result<std::unique_ptr<Blobs::BlobLeaseClient>> AcquireBlobLease( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that most codes are duplicated with AcquireContainerLease()
. Can we unify them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They use different SDK classes and the error handling is subtly different between each. Unifying these would require a lot of templating that would obfuscate the code more than clarify it.
static constexpr std::chrono::seconds kMaxLeaseDuration{60}; | ||
|
||
public: | ||
LeaseGuard(std::unique_ptr<Blobs::BlobLeaseClient> lease_client, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it really make sense for this to be controlled by a consumer vs being controlled more internally? Is this a common pattern with Azure outside of our usage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you mean by "consumer" and "more internally" here? The Arrow implementation is internal compared to the software using Arrow.
Leases are a common Distributed Systems pattern [1] and the multi-step operations being performed here would have almost unpredictable outcomes in the presence of concurrent clients. Without concurrent mutators, they are very cheap (lead to no delays at all) and with concurrent mutators, they lead to outcomes we and users can reason about. Note that I often use the lease acquisition as an existence check I would have to do anyways.
[1] https://martinfowler.com/articles/patterns-of-distributed-systems/lease.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you mean by "consumer" and "more internally" here? The Arrow implementation is internal compared to the software using Arrow.
I'm referring to anyone using arrow::filesystem::AzureFileSystem
directly, whether inside the arrow library (datasets) or not.
Leases are a common Distributed Systems pattern [1] and the multi-step operations being performed here would have almost unpredictable outcomes in the presence of concurrent clients
Yup, I know. I'm just referring to where the control of the lease is managed. But I also just realized that this entire leaseguard class isn't publicly exposed haha. Making my entire question here moot. So we're all good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A-ha. Yes, the class is totally private. It will get more use-cases but they will all be within this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might move it to a separate internal.h/cc but I would prefer doing it later to reduce noise in these PR as a lot would be moving.
// | ||
// NOTE: The initial constant values were chosen conservatively. If we learn, | ||
// from experience, that they are causing issues, we can increase them. And if | ||
// broadly applicable values aren't possible, we can make them configurable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zeroshade There isn't much to these numbers, but what I can say is that they work well for a client running in Brazil talking to a storage account in a US east coast zone replicated across the US.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might make sense to make them configurable right off the bat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be premature. My plan if these constants are not good enough:
Raise lease time to 30s.
Raise operation times to 15s.
Network slow downs are unbounded, but failing without data loss risk and allowing a retry would be the way to go here IMO.
(I hardcoded GetUrl on the SDK class to debug my changes :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
ARROW_ASSIGN_OR_RAISE(auto src_lease_client, | ||
AcquireContainerLease(src, kLeaseDuration)); | ||
LeaseGuard src_lease_guard{std::move(src_lease_client), kLeaseDuration}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We must specify the same lease duration to AcquireContainer()
and LeaseGuard::LeaseGuard()
, right?
It may be misused.
Can we return std::unique_ptr<LeaseGuard>
by AcquireContainerLease()
to avoid creating a LeaseGuard
manually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered this, but the problem is that the LeaseGuard
should often live in an outer scope relative to the AcquireContainerLease
call, so I considered this misuse trap less bad than writing code that declares the guard far from where it's used -- I do that now sometime with optionals and I think that communicates the intent more clearly.
try { | ||
auto src_list_response = src_container_client.ListBlobs(list_blobs_options); | ||
if (!src_list_response.Blobs.empty()) { | ||
return Status::IOError("Unable to replace empty container: '", dest.all, "'"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return Status::IOError("Unable to replace empty container: '", dest.all, "'"); | |
return Status::IOError("Unable to replace by non empty container: '", src.all, "'"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment explaining why dest
is the correct here.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
} | ||
try { | ||
src_lease_guard.BreakBeforeDeletion(kTimeNeededForContainerDeletion); | ||
src_container_client.DeleteIfExists(options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to use DeleteIfExists()
here? Can we use Delete()
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Delete
is enough. I'm changing.
|
||
// These functions are marked ARROW_NOINLINE because they are called from | ||
// multiple locations, but are not performance-critical. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is about the 3 functions below, not only the immediately next one.
// TODO(felipecrv): investigate why this can't be false | ||
select.allow_not_found = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to solve this in this PR?
If you want to defer this to a follow-up task, could you create an issue for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GTEST_SKIP() | ||
<< "The rest of TestMovePaths is not implemented for non-HNS scenarios"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just return
here?
Should we use GTEST_SKIP()
here? We have some tests for non-HNS case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to more loudly communicate that MOST of the tests are not in fact running. The return
is too subtle.
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 0ce54b6. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…aLake Storage Gen 2 API (apache#39904) ### Rationale for this change We need to move directories and files via the `arrow::FileSystem` interface. ### What changes are included in this PR? - A few filesystem error reporting improvements - A helper class to deal with Azure Storage leases [1] - The `Move()` implementation that can move files and directories within the same container on storage accounts with Hierarchical Namespace Support enabled - Lots of tests [1]: https://learn.microsoft.com/en-us/rest/api/storageservices/lease-blob ### Are these changes tested? Yes, by existing and a huge number of tests added by this PR. The test code introduced here should be extracted to a reusable test module that we can use to test move in other file system implementations. ### Are there any user-facing changes? No breaking changes, only new functionality. * Closes: apache#38704 Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
…aLake Storage Gen 2 API (apache#39904) ### Rationale for this change We need to move directories and files via the `arrow::FileSystem` interface. ### What changes are included in this PR? - A few filesystem error reporting improvements - A helper class to deal with Azure Storage leases [1] - The `Move()` implementation that can move files and directories within the same container on storage accounts with Hierarchical Namespace Support enabled - Lots of tests [1]: https://learn.microsoft.com/en-us/rest/api/storageservices/lease-blob ### Are these changes tested? Yes, by existing and a huge number of tests added by this PR. The test code introduced here should be extracted to a reusable test module that we can use to test move in other file system implementations. ### Are there any user-facing changes? No breaking changes, only new functionality. * Closes: apache#38704 Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
…aLake Storage Gen 2 API (apache#39904) ### Rationale for this change We need to move directories and files via the `arrow::FileSystem` interface. ### What changes are included in this PR? - A few filesystem error reporting improvements - A helper class to deal with Azure Storage leases [1] - The `Move()` implementation that can move files and directories within the same container on storage accounts with Hierarchical Namespace Support enabled - Lots of tests [1]: https://learn.microsoft.com/en-us/rest/api/storageservices/lease-blob ### Are these changes tested? Yes, by existing and a huge number of tests added by this PR. The test code introduced here should be extracted to a reusable test module that we can use to test move in other file system implementations. ### Are there any user-facing changes? No breaking changes, only new functionality. * Closes: apache#38704 Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Rationale for this change
We need to move directories and files via the
arrow::FileSystem
interface.What changes are included in this PR?
Move()
implementation that can move files and directories within the same container on storage accounts with Hierarchical Namespace Support enabledAre these changes tested?
Yes, by existing and a huge number of tests added by this PR. The test code introduced here should be extracted to a reusable test module that we can use to test move in other file system implementations.
Are there any user-facing changes?
No breaking changes, only new functionality.
Move()
#38704