
Support for specifying token directly #20

Open
aersam opened this issue Nov 6, 2023 · 17 comments

Comments

@aersam

aersam commented Nov 6, 2023

Hi there

Thanks for this cool extension; it will enable lots of use cases for us.

If you acquire the token outside DuckDB, it would be nice to be able to do something like this:

SET azure_storage_bearer_token = '<your_token>';

This is especially useful if you use Managed Identity / Interactive Browser Credentials or the like.

@aersam changed the title from "Support for specifing token directly" to "Support for specifying token directly" on Nov 6, 2023
@samansmink
Collaborator

Hi @aersam, thanks for reporting. There are some changes coming up to how DuckDB manages credentials; when that gets merged, I will look into adding this to it.

@djouallah

It would be nice to have this. OneLake, which is based on Azure, uses tokens by default, and today DuckDB can't use them directly :(

@quentingodeau
Contributor

Hello,

Just wondering if the issue is still open, now that the extension is capable of handling some credential types?
If yes, would you mind explaining the workflow a bit? I do not understand the idea of the bearer token. You would have to renew it manually each time it expires, no?

@aersam
Author

aersam commented Feb 23, 2024

Yes, you have to renew it manually. The main use case is when you already have a token in Python or similar and want to use it; e.g. you could have a token from a user context in a Python backend and want to pass that along. In such cases the lifetime is not an issue: your Python library would handle renewal, and just before executing something you would update the DuckDB variable.
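
A minimal Python sketch of the workflow described above, under stated assumptions: `TokenHolder` and `set_token_statement` are hypothetical helpers, not part of DuckDB or the Azure SDK, and `azure_storage_bearer_token` is the setting proposed in this issue, which does not exist today. The `fetch` callable stands in for whatever library acquires the token; with azure-identity it could return the `token` and `expires_on` fields of the `AccessToken` that a credential's `get_token` gives back.

```python
import time
from typing import Callable, Optional, Tuple


class TokenHolder:
    """Caches a bearer token and re-fetches it shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        margin_s: float = 300.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch        # returns (token, expiry as Unix seconds)
        self._margin_s = margin_s  # refresh this many seconds before expiry
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_on = 0.0

    def get(self) -> str:
        # Fetch on first use, or when the cached token is inside the refresh margin.
        if self._token is None or self._clock() >= self._expires_on - self._margin_s:
            self._token, self._expires_on = self._fetch()
        return self._token


def set_token_statement(token: str) -> str:
    # ASSUMPTION: 'azure_storage_bearer_token' is the setting requested in this
    # issue; it is not an existing DuckDB option.
    escaped = token.replace("'", "''")  # keep the SQL string literal valid
    return f"SET azure_storage_bearer_token = '{escaped}';"
```

Just before running a query, the caller would execute `set_token_statement(holder.get())` on its DuckDB connection, so an expired token is never handed over.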

@quentingodeau
Contributor

quentingodeau commented Feb 23, 2024

OK, one more question: the token comes from an SPN, a managed identity, a workload identity, or an env variable, no?
Why not pass this information to DuckDB as a secret and let it get a new token for you?
(I can take a look at implementing your request. I think it is not very complex, but I wonder whether this is a common use case or a really specific one.)

@samansmink
Collaborator

Yeah, I agree with @quentingodeau. The implementation would be something along the lines of:

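#include <azure/core/credentials/credentials.hpp>
#include <utility>

class RawTokenCredential : public Azure::Core::Credentials::TokenCredential {
public:
	// Wraps an externally acquired token; the caller is responsible for renewing it.
	explicit RawTokenCredential(Azure::Core::Credentials::AccessToken token)
	    : Azure::Core::Credentials::TokenCredential("RawTokenCredential"), raw_token(std::move(token)) {
	}
	Azure::Core::Credentials::AccessToken GetToken(
	    Azure::Core::Credentials::TokenRequestContext const& tokenRequestContext,
	    Azure::Core::Context const& context) const override {
	    // No refresh logic: always return the token supplied at construction.
	    return raw_token;
	}

private:
	Azure::Core::Credentials::AccessToken raw_token;
};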

But it is a little hacky and probably not desirable if one of the other credential provider methods can be used. Note that the Azure SDK does not provide this RawTokenCredential, so to me that feels like a hint that this is not a common path.

@aersam
Author

aersam commented Feb 26, 2024

Not very common, but sometimes required. I'd say it's just the more low-level approach for advanced use cases

@aersam
Author

aersam commented Feb 26, 2024

Also there are so many ways to use Microsoft's Entra ID that I don't think you want to handle every edge case

@djouallah

It is common. For example, today I can't write to Fabric OneLake using DuckDB.

@quentingodeau
Contributor

@djouallah do you know how Fabric authenticates? Does it use app registration?

@j-r77

j-r77 commented Apr 16, 2024

Just chiming in here: this is also standard usage at our company. Basically we do something analogous to DeviceCodeCredential and then store the results in a custom class. The code is very similar to what samansmink suggested above, except it also keeps the refresh_token and refreshes the access token whenever needed.

The goal is to authenticate with a username/password without either re-authenticating constantly or storing the username/password somewhere. Creating a service principal or managed identity per user is too difficult to manage/govern.

I'm not up to the task of writing it in DuckDB/C++ myself; we previously used Python and adlfs to authenticate this way. But if I can help with anything, e.g. testing, I'd be happy to do so.

@quentingodeau
Contributor

Sorry, I have been away for a bit. I will try to see if I can find a way to automate some testing on this.
But just for info, I may deprioritize this PR in order to add write capability first.

@aersam
Author

aersam commented Jul 3, 2024

> Sorry, I have been away for a bit. I will try to see if I can find a way to automate some testing on this. But just for info, I may deprioritize this PR in order to add write capability first.

OK, but good that it's still on the radar. I'm currently missing support for user-assigned managed identities in DuckDB, which I could work around with direct token support.

@mmaitre314
Contributor

It looks like that could be a small change, so likely something I could contribute a PR for.

In my case I hit a couple of issues with the current auth setup in the extension:

  • Azure Synapse uses a non-standard way to get access tokens: code needs to call mssparkutils.credentials.getToken("Storage").
  • In "How to get verbose logs?" (#63), one possible cause of auth failure is that Az CLI gives tokens for the wrong user identity (I use multiple user identities on my machine).

Those feel like a long tail of edge cases, so likely not something worth having built-in support for, but something it would be nice to unblock by allowing custom access-token generation.

Re: 'that feels like a hint that this is not a common path' -- in my experience it is actually fairly common to derive custom classes from TokenCredential to abstract away non-standard auth mechanisms from the Azure SDK. For instance, in Python, auth on Azure Synapse can be done like this:

import sys
from typing import Any, Optional

from azure.core.credentials import AccessToken, TokenCredential

class StorageCredential(TokenCredential):
    def get_token(self, *scopes: str, claims: Optional[str] = None, tenant_id: Optional[str] = None, **kwargs: Any) -> AccessToken:
        # mssparkutils is provided by the Synapse notebook runtime; sys.maxsize marks the token as never expiring.
        return AccessToken(mssparkutils.credentials.getToken("Storage"), sys.maxsize)

Couple of potential issues:

  • Ideally tokens would be refreshed to avoid auth failures when tokens expire. This is typically achieved through some form of callback. Not sure if this is feasible in duckdb. Alternative might be for the caller to update the token on a timer.
  • Token expiration needs to be provided in Azure::Core::Credentials::AccessToken via the ExpiresOn field. One option could be to parse that from the exp claim in access tokens. Another could be to have the client provide that.

Is there a preference on how to solve those?
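
As an illustration of the first option for the expiration question (reading it from the `exp` claim), here is a small Python sketch. It assumes the access token is a JWT, which is an assumption about the token format rather than something the Azure SDK guarantees, and it does not verify the signature. The function name `jwt_expiry` is hypothetical.

```python
import base64
import json
from datetime import datetime, timezone


def jwt_expiry(access_token: str) -> datetime:
    """Return the expiry taken from the `exp` claim of a JWT (signature NOT verified)."""
    parts = access_token.split(".")
    if len(parts) != 3:
        raise ValueError("not a JWT: expected three dot-separated segments")
    payload_b64 = parts[1]
    # JWT segments are base64url-encoded without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    # `exp` is expressed in seconds since the Unix epoch (RFC 7519).
    return datetime.fromtimestamp(claims["exp"], tz=timezone.utc)
```

The second option, having the client provide the expiry, avoids depending on the token format at all, at the cost of one more value for the caller to pass.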

@samansmink
Collaborator

@mmaitre314 we currently don't have a mechanism in DuckDB to handle token expiry (yet), so that would probably be the place to start on this.

Otherwise, I think we can just add this and document the fact that manual secret refreshing is required. That way this can serve as a workaround until we have proper secret expiration.

@mmaitre314
Contributor

One workaround which works with the extension as-is, albeit a convoluted one:

  • Start with an Entra access token (from device code, managed identity, etc.)
  • Exchange it for a user-delegation Storage key (similar to regular Storage keys, but tied to Entra auth and temporary)
  • Generate a user-delegation SAS from the key
  • Wrap the SAS in a connection string
  • Set the connection string as DuckDB secret

User-delegation keys/SAS can live for up to 7 days, and it looks like DuckDB allows refreshing them using CREATE OR REPLACE SECRET.

Python sample code using a mix of Managed Identity and Interactive Browser credentials:

import duckdb
from datetime import datetime, timezone, timedelta
from azure.identity import ChainedTokenCredential, ManagedIdentityCredential, InteractiveBrowserCredential
from azure.storage.blob import BlobServiceClient, generate_container_sas

tenant_id='11111111-2222-3333-4444-555555555555'
account_name = "myaccount"
container_name = "mycontainer"
blob_path = "path/to/blobs/*.parquet"

credential = ChainedTokenCredential(ManagedIdentityCredential(), InteractiveBrowserCredential(tenant_id=tenant_id))

def create_user_delegation_sas() -> str:

    start_time = datetime.now(timezone.utc)
    expiry_time = start_time + timedelta(days=1)

    client = BlobServiceClient(f"https://{account_name}.blob.core.windows.net", credential=credential)

    return generate_container_sas(
        account_name = account_name,
        container_name = container_name,
        user_delegation_key = client.get_user_delegation_key(key_start_time=start_time, key_expiry_time=expiry_time),
        resource_types = "sco",
        permission = "rl",
        start = start_time,
        expiry = expiry_time,
    )

duckdb.sql(f"""
    CREATE OR REPLACE SECRET {account_name} (
        TYPE AZURE,
        CONNECTION_STRING 'DefaultEndpointsProtocol=https;AccountName={account_name};EndpointSuffix=core.windows.net;SharedAccessSignature={create_user_delegation_sas()}',
        SCOPE 'az://{account_name}.blob.core.windows.net/'
    )
    """)

duckdb.sql(f"SELECT COUNT(*) FROM 'az://{account_name}.blob.core.windows.net/{container_name}/{blob_path}'")
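
Since the SAS expires, the secret has to be re-created periodically. One way to keep that manageable is to factor the statement into a helper that can be called again with a fresh SAS. The sketch below only builds the SQL text from the sample above; `build_secret_sql` is a hypothetical name, not a DuckDB or Azure API.

```python
def build_secret_sql(account_name: str, sas_token: str) -> str:
    """Build the CREATE OR REPLACE SECRET statement used in the sample above."""
    connection_string = (
        "DefaultEndpointsProtocol=https;"
        f"AccountName={account_name};"
        "EndpointSuffix=core.windows.net;"
        f"SharedAccessSignature={sas_token}"
    )
    # Double any single quotes so the value stays a valid SQL string literal.
    escaped = connection_string.replace("'", "''")
    return (
        f"CREATE OR REPLACE SECRET {account_name} (\n"
        "    TYPE AZURE,\n"
        f"    CONNECTION_STRING '{escaped}',\n"
        f"    SCOPE 'az://{account_name}.blob.core.windows.net/'\n"
        ")"
    )
```

Refreshing then amounts to calling `duckdb.sql(build_secret_sql(account_name, create_user_delegation_sas()))` again, for example on a timer set well inside the SAS lifetime.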
