Add type mapping documentation for databricks delta_lake mode (#669)
* Add type mapping documentation for databricks delta_lake mode

* Update to use Databricks SQL Type
Sevenannn authored Dec 6, 2024
1 parent 31cbbdf commit ceb1c11
Showing 1 changed file with 81 additions and 56 deletions: spiceaidocs/docs/components/data-connectors/databricks.md
Databricks as a connector for federated SQL query against Databricks, using Spark Connect or querying Delta Lake tables directly.

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table # A reference to a table in the Databricks unity catalog
    name: my_delta_lake_table
    params:
      mode: delta_lake
```
The `from` field for the Databricks connector takes the form `databricks:catalog.schema.table`.
The `name` field defines the dataset name, which is used as the table name within Spice.

Example:

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: cool_dataset
    params: ...
```

```sql
SELECT COUNT(*) FROM cool_dataset;
```

Use the [secret replacement syntax](../secret-stores/index.md) to reference a secret, e.g. `${secrets:my_token}`.

| Parameter Name | Description |
| ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `mode` | The execution mode for querying against Databricks. The default is `spark_connect`. Possible values:<br /> <ul><li>`spark_connect`: Use Spark Connect to query against Databricks. Requires a Spark cluster to be available.</li><li>`delta_lake`: Query directly from Delta Tables. Requires the object store credentials to be provided.</li></ul> |
| `databricks_endpoint` | The endpoint of the Databricks instance. Required for both modes. |
| `databricks_cluster_id` | The ID of the compute cluster in Databricks to use for the query. Only valid when `mode` is `spark_connect`. |
| `databricks_use_ssl` | If true, use a TLS connection to connect to the Databricks endpoint. Default is `true`. |
| `client_timeout`        | Optional. Applicable only in `delta_lake` mode. Specifies the timeout for object store operations. Default is `30s`. E.g. `client_timeout: 60s`.                                                                                                                                                                                                       |
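
For example, a `delta_lake` dataset that raises the object store timeout and pulls the access token from a secret store might be configured as in the following sketch (the endpoint, secret name, and timeout value are illustrative):

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_awesome_table
    name: my_delta_lake_table
    params:
      mode: delta_lake
      databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com # Illustrative workspace endpoint
      databricks_token: ${secrets:my_token} # Access token resolved via secret replacement syntax
      client_timeout: 60s # Raise the object store timeout from the 30s default
```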

## Delta Lake object store parameters

Configure the connection to the object store when using `mode: delta_lake`. Use the [secret replacement syntax](../secret-stores/index.md) to reference secrets.
:::info Note
**One** of the following auth values must be provided for Azure Blob:

- `databricks_azure_storage_account_key`,
- `databricks_azure_storage_client_id` and `databricks_azure_storage_client_secret`, or
- `databricks_azure_storage_sas_key`.
:::

| Parameter Name | Description |
| ---------------------------------------- | ---------------------------------------------------------------------- |
| `databricks_azure_storage_account_name` | The Azure Storage account name. |
| `databricks_azure_storage_account_key`   | The Azure Storage key for accessing the storage account.               |
| `databricks_azure_storage_client_id` | The Service Principal client ID for accessing the storage account. |
| `databricks_azure_storage_client_secret` | The Service Principal client secret for accessing the storage account. |
| `databricks_azure_storage_sas_key` | The shared access signature key for accessing the storage account. |
### Spark Connect

```yaml
- from: databricks:spiceai.datasets.my_spark_table # A reference to a table in the Databricks unity catalog
  name: my_delta_lake_table
  params:
    mode: spark_connect
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_cluster_id: 1234-567890-abcde123
    databricks_token: ${secrets:my_token}
```

### Delta Lake (S3)

```yaml
- from: databricks:spiceai.datasets.my_delta_table # A reference to a table in the Databricks unity catalog
  name: my_delta_lake_table
  params:
    mode: delta_lake
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_token: ${secrets:my_token}
    databricks_aws_region: us-west-2 # Optional
    databricks_aws_access_key_id: ${secrets:aws_access_key_id}
    databricks_aws_secret_access_key: ${secrets:aws_secret_access_key}
    databricks_aws_endpoint: s3.us-west-2.amazonaws.com # Optional
```

### Delta Lake (Azure Blobs)

```yaml
- from: databricks:spiceai.datasets.my_adls_table # A reference to a table in the Databricks unity catalog
  name: my_delta_lake_table
  params:
    mode: delta_lake
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_token: ${secrets:my_token}
    # Account Name + Key
    databricks_azure_storage_account_name: my_account
    databricks_azure_storage_account_key: ${secrets:my_key}
    # OR Service Principal + Secret
    databricks_azure_storage_client_id: my_client_id
    databricks_azure_storage_client_secret: ${secrets:my_secret}
    # OR SAS Key
    databricks_azure_storage_sas_key: my_sas_key
```

### Delta Lake (GCP)

```yaml
- from: databricks:spiceai.datasets.my_gcp_table # A reference to a table in the Databricks unity catalog
  name: my_delta_lake_table
  params:
    mode: delta_lake
    databricks_endpoint: dbc-a1b2345c-d6e7.cloud.databricks.com
    databricks_token: ${secrets:my_token}
    databricks_google_service_account_path: /path/to/service-account.json
```

## Types

### mode: delta_lake

The table below shows the Databricks data types supported in `delta_lake` mode, along with their mapping to Apache Arrow types in Spice.

| Databricks SQL Type | Arrow Type |
| ------------------- | ------------------------------------- |
| `STRING` | `Utf8` |
| `BIGINT` | `Int64` |
| `INT` | `Int32` |
| `SMALLINT` | `Int16` |
| `TINYINT` | `Int8` |
| `FLOAT` | `Float32` |
| `DOUBLE` | `Float64` |
| `BOOLEAN` | `Boolean` |
| `BINARY` | `Binary` |
| `DATE` | `Date32` |
| `TIMESTAMP` | `Timestamp(Microsecond, Some("UTC"))` |
| `TIMESTAMP_NTZ` | `Timestamp(Microsecond, None)` |
| `DECIMAL` | `Decimal128` |
| `ARRAY` | `List` |
| `STRUCT` | `Struct` |
| `MAP` | `Map` |
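
As a sketch of how these mappings surface at query time, a query against a registered `delta_lake` dataset might select timestamp and decimal columns as follows (the `created_at` and `amount` column names are hypothetical):

```sql
-- Hypothetical columns: created_at (TIMESTAMP) is returned as an Arrow
-- Timestamp(Microsecond, Some("UTC")); amount (DECIMAL) as a Decimal128.
SELECT created_at, amount
FROM my_delta_lake_table
LIMIT 10;
```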

## Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the [secret stores documentation](/components/secret-stores). Additionally, learn how to use referenced secrets in component parameters by visiting the [using referenced secrets guide](/components/secret-stores#using-secrets).
