From cb0338a40c84dac8f7623dae97601f6256560e25 Mon Sep 17 00:00:00 2001
From: Scott Lyons
Date: Fri, 15 Nov 2024 10:08:37 -0800
Subject: [PATCH] Enhancing and Standardizing ABFS docs (#603)

* Standardizing ABFS docs
* Making document more concise and consistent with #580
* Adding `file_format` to params section
* Updating secrets section
* Rephrasing secrets section
* Update spiceaidocs/docs/components/data-connectors/abfs.md
  Co-authored-by: Phillip LeBlanc
* Update spiceaidocs/docs/components/data-connectors/abfs.md
  Co-authored-by: Phillip LeBlanc
* Update spiceaidocs/docs/components/data-connectors/abfs.md
  Co-authored-by: Phillip LeBlanc
* Update spiceaidocs/docs/components/data-connectors/abfs.md
  Co-authored-by: Phillip LeBlanc
* Updating ABFS docs with suggestions
* Re-adding `hive_partitioning_enabled` parameter
* Clarifying `disable_tagging`

---------

Co-authored-by: Phillip LeBlanc
---
 .../docs/components/data-connectors/abfs.md | 144 +++++++++---------
 .../docs/components/secret-stores/index.md | 2 +-
 2 files changed, 76 insertions(+), 70 deletions(-)

diff --git a/spiceaidocs/docs/components/data-connectors/abfs.md b/spiceaidocs/docs/components/data-connectors/abfs.md
index d254c75d..0a97304e 100644
--- a/spiceaidocs/docs/components/data-connectors/abfs.md
+++ b/spiceaidocs/docs/components/data-connectors/abfs.md
@@ -4,12 +4,9 @@ sidebar_label: 'Azure BlobFS Data Connector'
 description: 'Azure BlobFS Data Connector Documentation'
 ---
 
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
+The Azure BlobFS (ABFS) Data Connector enables federated/accelerated SQL queries on files stored in Azure Blob-compatible endpoints. This includes Azure BlobFS (`abfss://`) and Azure Data Lake (`adl://`) endpoints.
 
-The Azure BlobFS (ABFS) Data Connector enables federated SQL query on files stored in Azure Blob-compatible endpoints. This includes Azure BlobFS (`abfss://`) and Azure Data Lake (`adl://`) endpoints.
-
-If a folder path is provided, all child files will be loaded.
+When a folder path is provided, all the contained files will be loaded.
 
 File formats are specified using the `file_format` parameter, as described in [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).
 
@@ -18,36 +15,35 @@ datasets:
   - from: abfs://foocontainer/taxi_sample.csv
     name: azure_test
     params:
-      azure_account: spiceadls
-      azure_access_key: abc123==
+      abfs_account: spiceadls
+      abfs_access_key: ${ secrets:access_key }
       file_format: csv
 ```
 
-## Dataset Schema Reference
+## Configuration
 
 ### `from`
 
-The ABFS-compatible URI to a folder or object in one of two forms:
+Defines the ABFS-compatible URI to a folder or object:
 
 - `from: abfs://<container>/<path>` with the account name configured using `abfs_account` parameter, or
 - `from: abfs://<container>@<account_name>.dfs.core.windows.net/<path>`
 
-:::note
-
-A valid URI must always be specified in the `from` field, even if you are setting the account or container name using [secrets](/components/secret-stores/index.md). When using secrets use a dummy account/container name and the values will be replaced with the values contained by the secrets at runtime.
-
-See the example [below](#using-secrets-for-container-and-account-name).
-
-:::
-
 ### `name`
 
-The dataset name. This will be used as the table name within Spice.
+Defines the dataset name, which is used as the table name within Spice.
 
-Example: `name: cool_dataset`
+Example:
+```yaml
+datasets:
+  - from: abfs://foocontainer/taxi_sample.csv
+    name: cool_dataset
+    params:
+      ...
+```
 
 ```sql
-SELECT COUNT(*) FROM cool_dataset
+SELECT COUNT(*) FROM cool_dataset;
 ```
 
 ```shell
@@ -62,54 +58,62 @@ SELECT COUNT(*) FROM cool_dataset
 
 #### Basic parameters
 
-| Parameter name | Description |
-| --------------------------- | --------------------------------------------------------------------------------------- |
-| `abfs_account` | Azure storage account name |
-| `abfs_container_name` | Azure storage container name |
-| `abfs_sas_string` | SAS Token to use for authorization |
-| `abfs_endpoint` | Storage endpoint to connect to. Defaults to `https://{account}.blob.core.windows.net` |
-| `abfs_use_emulator` | Connect to a locally-running Azure Storage emulator. Valid values are `true` or `false` |
-| `abfs_allow_http` | Allow insecure HTTP connections |
-| `abfs_authority_host` | Use an alternative authority host. Defaults to `https://login.microsoftonline.com` |
-| `abfs_proxy_url` | Proxy URL to use when connecting |
-| `abfs_proxy_ca_certificate` | A trusted CA certificate for the proxy |
-| `abfs_proxy_exludes` | A list of hosts to exclude from proxy connections |
-| `abfs_disable_tagging` | Ignore any tags provided to `put_opts` |
+| Parameter name | Description |
+| --------------------------- | ------------------------------------------------------------------------------------------------ |
+| `file_format` | Specifies the data format. Required if not inferrable from `from`. Options: `parquet`, `csv`. |
+| `abfs_account` | Azure storage account name |
+| `abfs_sas_string` | SAS (Shared Access Signature) Token to use for authorization |
+| `abfs_endpoint` | Storage endpoint, default: `https://{account}.blob.core.windows.net` |
+| `abfs_use_emulator` | Use `true` or `false` to connect to a local emulator |
+| `abfs_allow_http` | Allow insecure HTTP connections |
+| `abfs_authority_host` | Alternative authority host, default: `https://login.microsoftonline.com` |
+| `abfs_proxy_url` | Proxy URL |
+| `abfs_proxy_ca_certificate` | CA certificate for the proxy |
+| `abfs_proxy_exludes` | A list of hosts to exclude from proxy connections |
+| `abfs_disable_tagging` | Disable tagging objects. Use this if your backing store doesn't support tags |
 | `hive_partitioning_enabled` | Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false` |
+
 
 #### Authentication parameters
 
-The following parameters are used when authenticating with Azure. Only one of `abfs_access_key`, `abfs_bearer_token`, `abfs_client_secret` or `abfs_skip_signature` can be set at the same time. If none of these are set the connector will default to using a [managed identity](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview)
+The following parameters are used when authenticating with Azure. Only one of these parameters can be used at a time:
 
-| Parameter name | Description |
-| --------------------------- | ------------------------------------------------------------------------------------------------ |
-| `abfs_access_key` | Secret access key to use when authenticating |
-| `abfs_bearer_token` | `BEARER` token to use when authenticating |
-| `abfs_client_id` | Client ID to use with the client authentication flow |
-| `abfs_client_secret` | Client Secret to use with the client authentication flow |
-| `abfs_tenant_id` | Tenant ID to use with client authentication flow |
-| `abfs_skip_signature` | Skip fetching credentials and skip signing requests. Used for interacting with public containers |
-| `abfs_msi_endpoint` | The endpoing to use for acquiring managed identity tokens |
-| `abfs_federated_token_file` | File path for acquiring Azure federated identity token in Kubernetes |
-| `abfs_use_cli` | Set to `true` to use the Azure CLI to acquire access tokens |
+* `abfs_access_key`
+* `abfs_bearer_token`
+* `abfs_client_secret`
+* `abfs_skip_signature`
+
+If none of these are set the connector will default to using a [managed identity](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview)
+
+| Parameter name | Description |
+| --------------------------- | ----------------------------------------------------------- |
+| `abfs_access_key` | Secret access key |
+| `abfs_bearer_token` | `BEARER` token |
+| `abfs_client_id` | Client ID for client authentication flow |
+| `abfs_client_secret` | Client Secret to use for client authentication flow |
+| `abfs_tenant_id` | Tenant ID to use for client authentication flow |
+| `abfs_skip_signature` | Skip credentials and request signing for public containers |
+| `abfs_msi_endpoint` | Endpoint for managed identity tokens |
+| `abfs_federated_token_file` | File path for federated identity token in Kubernetes |
+| `abfs_use_cli` | Set to `true` to use the Azure CLI to acquire access tokens |
 
 #### Retry parameters
 
-| Parameter name | Description |
-| ------------------------------- | -------------------------------------------------------------------------------------------- |
-| `abfs_max_retries` | Maximum number of retries |
-| `abfs_retry_timeout` | Timeout for all retries. Accepts any duration string (i.e `5s`, `1m`, etc) |
-| `abfs_backoff_initial_duration` | How long to wait before the initial retry. Accepts any duration string (i.e `5s`, `1m`, etc) |
-| `abfs_backoff_max_duration` | Maximum length to wait for a retry. Accepts any duration string (i.e `5s`, `1m`, etc) |
-| `abfs_backoff_base` | Floating-point base of the exponential to use when backing off retries |
+| Parameter name | Description |
+| ------------------------------- | -------------------------------------------- |
+| `abfs_max_retries` | Maximum retries |
+| `abfs_retry_timeout` | Total timeout for retries (e.g., `5s`, `1m`) |
+| `abfs_backoff_initial_duration` | Initial retry delay (e.g., `5s`) |
+| `abfs_backoff_max_duration` | Maximum retry delay (e.g., `1m`) |
+| `abfs_backoff_base` | Exponential backoff base (e.g., `0.1`) |
 
-#### File format parameters
+## Supported file formats
 
-File formats are specified using the `file_format` parameter, as described in [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).
+Specify the file format using `file_format` parameter. More details in [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).
 
 ## Examples
 
-### Reading a CSV file using an Access Key
+### Reading a CSV file with an Access Key
 
 ```yaml
 datasets:
@@ -117,11 +121,11 @@ datasets:
     name: azure_test
     params:
       abfs_account: spiceadls
-      abfs_access_key: abc123==
+      abfs_access_key: ${ secrets:ACCESS_KEY }
       file_format: csv
 ```
 
-### Reading from a public container
+### Using Public Containers
 
 ```yaml
 datasets:
@@ -133,27 +137,25 @@ datasets:
       file_format: csv
 ```
 
-### Using secrets for container and account name
+### Connecting to the Storage Emulator
 
 ```yaml
 datasets:
-  # dummy_container will be overridden by the value in `abfs_container`
-  - from: abfs://dummy_container/my_csv.csv
-    name: prod_data
+  - from: abfs://test_container/test_csv.csv
+    name: test_data
     params:
-      abfs_account: ${ secrets:PROD_ACCOUNT }
-      abfs_container: ${ secrets:PROD_CONTAINER }
+      abfs_use_emulator: true
       file_format: csv
 ```
 
-### Connecting to the Storage Emulator
+### Using secrets for Account name
 
 ```yaml
 datasets:
-  - from: abfs://test_container/test_csv.csv
-    name: test_data
+  - from: abfs://my_container/my_csv.csv
+    name: prod_data
     params:
-      abfs_use_emulator: true
+      abfs_account: ${ secrets:PROD_ACCOUNT }
       file_format: csv
 ```
 
@@ -164,7 +166,11 @@ datasets:
   - from: abfs://my_data/input.parquet
     name: my_data
     params:
-      abfs_tentant_id: B3E1A8F4-9D5B-4D3B-8D2E-1F4A9D5B4D3B
+      abfs_tenant_id: ${ secrets:MY_TENANT_ID }
       abfs_client_id: ${ secrets:MY_CLIENT_ID }
       abfs_client_secret: ${ secrets:MY_CLIENT_SECRET }
-```
\ No newline at end of file
+```
+
+## Secrets
+
+Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the [secret stores documentation](/components/secret-stores). Additionally, learn how to use referenced secrets in component parameters by visiting the [using referenced secrets guide](/components/secret-stores#using-secrets).
\ No newline at end of file
diff --git a/spiceaidocs/docs/components/secret-stores/index.md b/spiceaidocs/docs/components/secret-stores/index.md
index 491b3f21..736f4309 100644
--- a/spiceaidocs/docs/components/secret-stores/index.md
+++ b/spiceaidocs/docs/components/secret-stores/index.md
@@ -39,7 +39,7 @@ secrets:
     name: env
 ```
 
-## Using referenced secrets in component parameters
+## Using referenced secrets in component parameters {#using-secrets}
 
 Secrets may be used by components with the syntax `${<secret_store_name>:<key_name>}`. For example, to reference a secret stored as an environment variable named `MY_SECRET` in the `env` secret store, use `${env:MY_SECRET}`.
 