Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Don't allow the inclusion of passwords (storage account keys) in Auzre ABFS URLs #43197

Closed
sugibuchi opened this issue Jul 9, 2024 · 12 comments
Assignees
Milestone

Comments

@sugibuchi
Copy link

sugibuchi commented Jul 9, 2024

Describe the enhancement requested

Outline

The Azure Blob File System (ABFS) support in Apache Arrow, implemented in C++ API by #18014 and integrated into Python API by #39968, currently allows embedding a storage account key as a password in an ABFS URL.

https://github.com/apache/arrow/blob/r-16.1.0/cpp/src/arrow/filesystem/azurefs.h#L138-L144

However, I strongly recommend stopping this practice for two reasons.

Security

An access key of a storage account is practically a "root password," giving full access to the data in the storage account.

Microsoft repeatedly emphasises this point in various places in the documentation and encourages the protection of account keys in a secure place like Azure Key Vault.

Protect your access keys

Storage account access keys provide full access to the storage account data and the ability to generate SAS tokens. Always be careful to protect your access keys. Use Azure Key Vault to manage and rotate your keys securely.
https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string#protect-your-access-keys

Embedding a storage account key in an ABFS URL, which is usually not considered confidential information, may lead to unexpected exposure of the key.

Interoperability with other file system implementations

For historical reasons, the syntax of the Azure Blob File System (ABFS) URL is inconsistent between different file system implementations.

Original implementations by Apache Hadoop's hadoop-azure package link

  • adfs[s]://<container>@<account>.dsf.core.windows.net/path/to/file

This syntax is widely used, particularly by Apache Spark.

Python adlfs for fsspec link

  • Hadoop-compatible URL syntax , and
  • az://<container>/path/to/file
  • adfs[s]://<container>/path/to/file

Rust object_store::azure link

  • Hadoop-compatible URL syntax
  • adlfs-compatible URL syntax , and
  • azure://<container>/path/to/file
  • https://<account>.blob.core.windows.net/<container>/path/to/file
  • https://<account>.dfs.core.windows.net/<container>/path/to/file

DuckDB azure extension link

  • adfss://<container>/path/to/file
  • adfss://<account>.dsf.core.windows.net/<container>/path/to/file

Apache Arrow link

  • Hadoop-compatible URL syntax , and
  • abfs[s]://[:<password>@]<account>.blob.core.windows.net/<container>/path/to/file
  • abfs[s]://<container>[:<password>]@<account>.dfs.core.windows.net/path/to/file
  • abfs[s]://[<account[:<password>]@]<host[.domain]>[<:port>]/<container>/path/to/file
  • abfs[s]://[<account[:<password>]@]<container>[/path]

This inconsistency of the syntax already causes problems in applications using different frameworks, including additional overhead to translate ABFS URLs between different syntax. It may also lead to unexpected behaviours due to misinterpretation of the same URL by different file system implementations.

I believe a new file system implementation should respect the existing syntax of a URL scheme and SHOULD NOT invent new ones. As far as I understand, no other ABFS file system implementation allows embedding storage account keys in ABFS URLs.

Component(s)

C++, Python

@kou kou changed the title Don't allow the inclusion of passwords (storage account keys) in Auzre ABFS URLs [C++] Don't allow the inclusion of passwords (storage account keys) in Auzre ABFS URLs Jul 9, 2024
@kou
Copy link
Member

kou commented Jul 9, 2024

Could you clarify your suggestion?

The followings?

  • We should reject URLs that have :password in the "userinfo" URI part
  • We should reduce supported URL patterns

BTW, it seems that you misinterpreted supported formats:

Apache Arrow link

* Hadoop-compatible URL schemes, and
* adfs[s]://<container>:<password>@<account>.dsf.core.windows.net/path/to/file
* adfs[s]://<account>.dsf.core.windows.net/<container>/path/to/file
* adfs[s]://<password>@<account>.dsf.core.windows.net/<container>/path/to/file
  • Hadoop-compatible URL schemes (2. in the Arrow source), and
  • adfs[s]://<container>:<password>@<account>.dsf.core.windows.net/path/to/file (2. in the Arrow source)
  • adfs[s]://<account>.dsf.core.windows.net/<container>/path/to/file (1. in the Arrow source and Azure Data Lake Storage Gen2 compatible)
  • adfs[s]://:<password>@<account>.dsf.core.windows.net/<container>/path/to/file (: before <password> is missing and this is same as the above pattern)

All of them listed by you are compatible with https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri .
"3." and "4." in the Arrow source are missed.
"3." is for Azurite. It's helpful for development.
"4." is for compatibility with adlfs, object_store::azure and DuckDB.

@kou
Copy link
Member

kou commented Jul 9, 2024

@Tom-Newton Could you take a look at this?

@sugibuchi
Copy link
Author

sugibuchi commented Jul 9, 2024

@kou
Thank you very much for your answer and correction.

  • We should reject URLs that have :password in the "userinfo" URI part
  • We should reduce supported URL patterns

I suggest only the first point. Regarding the second point, I don't have a strong objection to the existing patterns as long as a file system implementation supports the Hadoop-compatible syntax, as the Hadoop-compatible syntax is the only one commonly supported by most file system implementations (except DuckDB...).

However, as having storage account keys in ABFS URLs is a new practice invented by Apache Arrow and has security concerns, I suggest removing this feature.

BTW, it seems that you misinterpreted supported formats:

You are right. I have updated the issue description, and let me copy-paste the original comment in the source code.

  /// 1. abfs[s]://[:\<password\>@]\<account\>.blob.core.windows.net
  ///    [/\<container\>[/\<path\>]]
  /// 2. abfs[s]://\<container\>[:\<password\>]@\<account\>.dfs.core.windows.net
  ///     [/path]
  /// 3. abfs[s]://[\<account[:\<password\>]@]\<host[.domain]\>[\<:port\>]
  ///    [/\<container\>[/path]]
  /// 4. abfs[s]://[\<account[:\<password\>]@]\<container\>[/path]

In my view, only the second pattern (without password) is Hadoop-compatible. 1 and 3 do not include the container (file system) name in the authority part of the patterns. 4 has a different syntax for the authority part.

But as I explained above, I don't have strong objections against these patterns except for having passwords in URLs.

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Jul 9, 2024

To be honest I think I will struggle to find time to work on this. Personally I'm in favour of the proposed changes: removing some of this extra URL parsing functionality, including passwords. However I'm also not especially concerned about it. Certainly account key is more concerning but it is common practice to use SAS tokens in blob storage URLs.

@kou
Copy link
Member

kou commented Jul 11, 2024

I'm neutral for this proposal.

If we reject the password value, users can't use account key based authentication with the URI interface. It'll useful for local development with Azurite.
FYI: The Arrow's Azure filesystem doesn't send the URI specified by an user to somewhere and Azure SDK for C++ doesn't send raw account key to Azure. Azure SDK for C++ uses account key to generate hash value for authentication.

Could you share bad scenarios you think?

@kou
Copy link
Member

kou commented Jul 11, 2024

Could you start a discussion on the dev@arrow.apache.org mailing list too to collect more feedback for this?
See also: https://arrow.apache.org/community/#mailing-lists

@sugibuchi
Copy link
Author

sugibuchi commented Jul 12, 2024

If we reject the password value, users can't use account key based authentication with the URI interface. It'll useful for local development with Azurite.

I use Apache Arrow mainly in Python code. Let me explain by using PyArrow as an example.

When working with the PyArrow API, we have two methods to specify a file system.

  1. Create a FileSystem instance and explicitly set it as an argument.
  2. Let PyArrow infer a file system from a file path URL.
import pyarrow.parquet as pq

# Explicitly set
s3 = fs.S3FileSystem(..)
pq.read_table("my-bucket/data.parquet", filesystem=s3)

# Infer from a URL
pq.read_table("s3://my-bucket/data.parquet")

For 1, we don't need to embed a storage account key or any other credentials for file system access in a file path URL as long as we can set them when we create a file system instance.

s3 = fs.S3FileSystem(access_key=...)

For 2, many existing file system libraries provide an interface to configure credentials for file system access in a global context.

  • Hadoop HDFS: core-site.xml etc. in a classpath
  • Python fsspec: fsspec.config.conf link
  • Rust object_store: environment variables
  • DuckDB: secret manager link
  • Apache Arrow: S3FileSystem and GcsFileSystem try to obtain credentials from environment variables or standardized locations in the local file system

Even if it looks convenient, embedding credentials in a file path URL is generally unnecessary. Other file system implementations work well without this method.

In Azure SDK, EnvironmentCredential used by DefaultAzureCredential defines a set of environment variables to configure credentials

There is no standardized environment variable for setting storage account keys but AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY are widely used (example).

We should consider these common practices instead of inventing new ABFS URL syntax.

@sugibuchi
Copy link
Author

sugibuchi commented Jul 12, 2024

Could you share bad scenarios you think?

There are several Azure Blob File System implementations in Python, and we frequently need to use multiple implementations in the same code. However,

  1. Most ABFS implementations, except Arrow's AzureFileSystem, do not assume that ABFS URLs can contain confidential information like storage account keys.
  2. It is not always clear which implementation is actually used in a library.
    • PyArrow has native AzureFileSystem implementation since Arrow 16.0.0. However, delta-io implemented on Arrow uses Rust object_store.
    • Pandas initially used the fsspec's ABFS implementation but silently started using Arrow's native implementation after the release of Arrow 16.0.0 ([Python][Azure][Doc] Add documentation about AzureFilesystem #41496).
    • DuckDB has native ABFS support. But Rust object_store is eventually used when reading Delta Lake into DuckDB by using ibis API.

Because of 2, an Arrow's style ABFS URL containing a storage account key can be accidentally passed to a different ABFS implementation. However, the different implementation usually does not assume the passed URL contains a storage account key, as explained in 1.

This leads to the rejection of the URL and an error message like the one below that can be exposed to error logs, HTTP error responses, etc.

Invalid file path: abfs://my-container:(plain text of a storage account key)@mycontainer.dfs.core.windows.net/...

Having storage account keys in ABFS URLs can cause this kind of interoperability issue with other ABFS implementations and unexpected exposure of keys.

@sugibuchi
Copy link
Author

sugibuchi commented Jul 12, 2024

Certainly account key is more concerning but it is common practice to use SAS tokens in blob storage URLs.

This is true, but a blob URL containing an SAS token is usually (and should be) treated as confidential information that must be carefully handled to avoid unexpected leaks.

We cannot apply the same practice to file paths in general that appear in many places in a code.

@kou
Copy link
Member

kou commented Jul 19, 2024

Could you start a discussion on the dev@arrow.apache.org mailing list too to collect more feedback for this? See also: https://arrow.apache.org/community/#mailing-lists

@sugibuchi Could you do this with the additional information?

@kou
Copy link
Member

kou commented Sep 10, 2024

The discussion thread on dev@arrow.apache.org: https://lists.apache.org/thread/mhwr2lvtxvjcqos12k7hr4cqkdofrxxo

kou added a commit to kou/arrow that referenced this issue Sep 25, 2024
kou added a commit that referenced this issue Sep 30, 2024
### Rationale for this change

Other Azure Blob Storage based filesystem API implementations don't use password field in URI.

We don't use it too for compatibility.

### What changes are included in this PR?

Ignore password field.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

**This PR includes breaking changes to public APIs.**
* GitHub Issue: #43197

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@kou kou added this to the 18.0.0 milestone Sep 30, 2024
@kou
Copy link
Member

kou commented Sep 30, 2024

Issue resolved by pull request 44220
#44220

@kou kou closed this as completed Sep 30, 2024
@amoeba amoeba added the Breaking Change Includes a breaking change to the API label Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants