Enhancing and Standardizing ABFS docs (#603)
* Standardizing ABFS docs

* Making document more concise and consistent with #580

* Adding `file_format` to params section

* Updating secrets section

* Rephrasing secrets section

* Update spiceaidocs/docs/components/data-connectors/abfs.md

Co-authored-by: Phillip LeBlanc <phillip@spiceai.io>

* Update spiceaidocs/docs/components/data-connectors/abfs.md

Co-authored-by: Phillip LeBlanc <phillip@spiceai.io>

* Update spiceaidocs/docs/components/data-connectors/abfs.md

Co-authored-by: Phillip LeBlanc <phillip@spiceai.io>

* Update spiceaidocs/docs/components/data-connectors/abfs.md

Co-authored-by: Phillip LeBlanc <phillip@spiceai.io>

* Updating ABFS docs with suggestions

* Re-adding `hive_partitioning_enabled` parameter

* Clarifying `disable_tagging`

---------

Co-authored-by: Phillip LeBlanc <phillip@spiceai.io>
slyons and phillipleblanc authored Nov 15, 2024
1 parent c925a74 commit cb0338a
Showing 2 changed files with 76 additions and 70 deletions.
144 changes: 75 additions & 69 deletions spiceaidocs/docs/components/data-connectors/abfs.md
---
sidebar_label: 'Azure BlobFS Data Connector'
description: 'Azure BlobFS Data Connector Documentation'
---

The Azure BlobFS (ABFS) Data Connector enables federated/accelerated SQL queries on files stored in Azure Blob-compatible endpoints. This includes Azure BlobFS (`abfss://`) and Azure Data Lake (`adl://`) endpoints.

When a folder path is provided, all the contained files will be loaded.

File formats are specified using the `file_format` parameter, as described in [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).

```yaml
datasets:
  - from: abfs://foocontainer/taxi_sample.csv
    name: azure_test
    params:
      abfs_account: spiceadls
      abfs_access_key: ${ secrets:access_key }
      file_format: csv
```
## Configuration

### `from`

Defines the ABFS-compatible URI to a folder or object:

- `from: abfs://<container>/<path>` with the account name configured using the `abfs_account` parameter, or
- `from: abfs://<container>@<account_name>.dfs.core.windows.net/<path>`
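
For example, a minimal sketch of the fully qualified form, which embeds the account name directly in the URI (the container, account, and file names reuse the illustrative values from the example above):

```yaml
datasets:
  # The account name `spiceadls` is part of the URI, so `abfs_account` is not needed
  - from: abfs://foocontainer@spiceadls.dfs.core.windows.net/taxi_sample.csv
    name: azure_test
    params:
      file_format: csv
```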

:::note

A valid URI must always be specified in the `from` field, even when the account or container name is set using [secrets](/components/secret-stores/index.md). When using secrets, supply a placeholder account/container name in the URI; it is replaced at runtime with the value from the secret.

See the example [below](#using-secrets-for-account-name).

:::

### `name`

Defines the dataset name, which is used as the table name within Spice.

Example:
```yaml
datasets:
  - from: abfs://foocontainer/taxi_sample.csv
    name: cool_dataset
    params:
      ...
```

```sql
SELECT COUNT(*) FROM cool_dataset;
```

### `params`

#### Basic parameters

| Parameter name | Description |
| --------------------------- | ------------------------------------------------------------------------------------------------ |
| `file_format` | Specifies the data format. Required if not inferrable from `from`. Options: `parquet`, `csv`. |
| `abfs_account` | Azure storage account name |
| `abfs_sas_string` | SAS (Shared Access Signature) Token to use for authorization |
| `abfs_endpoint` | Storage endpoint, default: `https://{account}.blob.core.windows.net` |
| `abfs_use_emulator` | Use `true` or `false` to connect to a local emulator |
| `abfs_allow_http` | Allow insecure HTTP connections |
| `abfs_authority_host` | Alternative authority host, default: `https://login.microsoftonline.com` |
| `abfs_proxy_url` | Proxy URL |
| `abfs_proxy_ca_certificate` | CA certificate for the proxy |
| `abfs_proxy_excludes`      | A list of hosts to exclude from proxy connections                                                  |
| `abfs_disable_tagging` | Disable tagging objects. Use this if your backing store doesn't support tags |
| `hive_partitioning_enabled` | Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false` |
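
A minimal sketch combining several of these options (the account, endpoint, and folder path are illustrative):

```yaml
datasets:
  # Point `from` at a folder to load all contained files
  - from: abfs://foocontainer/trips/
    name: partitioned_trips
    params:
      abfs_account: spiceadls
      abfs_endpoint: https://spiceadls.blob.core.windows.net
      hive_partitioning_enabled: true
      file_format: parquet
```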


#### Authentication parameters

The following parameters are used when authenticating with Azure. Only one of these parameters can be used at a time:

* `abfs_access_key`
* `abfs_bearer_token`
* `abfs_client_secret`
* `abfs_skip_signature`

If none of these parameters are set, the connector defaults to using a [managed identity](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview).

| Parameter name | Description |
| --------------------------- | ----------------------------------------------------------- |
| `abfs_access_key` | Secret access key |
| `abfs_bearer_token` | `BEARER` token |
| `abfs_client_id` | Client ID for client authentication flow |
| `abfs_client_secret` | Client Secret to use for client authentication flow |
| `abfs_tenant_id` | Tenant ID to use for client authentication flow |
| `abfs_skip_signature` | Skip credentials and request signing for public containers |
| `abfs_msi_endpoint` | Endpoint for managed identity tokens |
| `abfs_federated_token_file` | File path for federated identity token in Kubernetes |
| `abfs_use_cli` | Set to `true` to use the Azure CLI to acquire access tokens |
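
For example, a sketch that reuses credentials from an existing `az login` session (the account and file names are illustrative):

```yaml
datasets:
  - from: abfs://foocontainer/taxi_sample.csv
    name: azure_test
    params:
      abfs_account: spiceadls
      # Acquire access tokens via the Azure CLI instead of supplying a key
      abfs_use_cli: true
      file_format: csv
```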

#### Retry parameters

| Parameter name | Description |
| ------------------------------- | -------------------------------------------- |
| `abfs_max_retries` | Maximum retries |
| `abfs_retry_timeout` | Total timeout for retries (e.g., `5s`, `1m`) |
| `abfs_backoff_initial_duration` | Initial retry delay (e.g., `5s`) |
| `abfs_backoff_max_duration` | Maximum retry delay (e.g., `1m`) |
| `abfs_backoff_base` | Exponential backoff base (e.g., `0.1`) |
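
A sketch of a retry configuration (the values are illustrative, not recommendations):

```yaml
datasets:
  - from: abfs://foocontainer/taxi_sample.csv
    name: azure_test
    params:
      abfs_account: spiceadls
      abfs_max_retries: 5
      abfs_retry_timeout: 1m
      abfs_backoff_initial_duration: 5s
      abfs_backoff_max_duration: 30s
      file_format: csv
```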

## Supported file formats

File formats are specified using the `file_format` parameter, as described in [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).
Specify the file format using the `file_format` parameter. For more details, see [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats).

## Examples

### Reading a CSV file with an Access Key

```yaml
datasets:
  - from: abfs://foocontainer/taxi_sample.csv
    name: azure_test
    params:
      abfs_account: spiceadls
      abfs_access_key: ${ secrets:ACCESS_KEY }
      file_format: csv
```

### Using Public Containers

```yaml
datasets:
  - from: abfs://<public_container>/<path>
    name: public_data
    params:
      abfs_skip_signature: true
      file_format: csv
```

### Connecting to the Storage Emulator

```yaml
datasets:
  - from: abfs://test_container/test_csv.csv
    name: test_data
    params:
      abfs_use_emulator: true
      file_format: csv
```

### Using secrets for Account name

```yaml
datasets:
  - from: abfs://my_container/my_csv.csv
    name: prod_data
    params:
      abfs_account: ${ secrets:PROD_ACCOUNT }
      file_format: csv
```

### Using Client Authentication

```yaml
datasets:
  - from: abfs://my_data/input.parquet
    name: my_data
    params:
      abfs_tenant_id: ${ secrets:MY_TENANT_ID }
      abfs_client_id: ${ secrets:MY_CLIENT_ID }
      abfs_client_secret: ${ secrets:MY_CLIENT_SECRET }
```

## Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the [secret stores documentation](/components/secret-stores). Additionally, learn how to use referenced secrets in component parameters by visiting the [using referenced secrets guide](/components/secret-stores#using-secrets).
2 changes: 1 addition & 1 deletion spiceaidocs/docs/components/secret-stores/index.md
```yaml
secrets:
  - from: env
    name: env
```

## Using referenced secrets in component parameters {#using-secrets}

Secrets may be used by components with the syntax `${<secret_store_name>:<key_name>}`. For example, to reference a secret stored as an environment variable named `MY_SECRET` in the `env` secret store, use `${env:MY_SECRET}`.
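
As a sketch, referencing that secret from a dataset parameter (the dataset and key names are illustrative):

```yaml
datasets:
  - from: abfs://mycontainer/data.csv
    name: my_data
    params:
      # Resolves to the value of the MY_SECRET environment variable at runtime
      abfs_access_key: ${ env:MY_SECRET }
      file_format: csv
```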

