Docs update on data connectors & data accelerators #693

Draft: wants to merge 2 commits into base `trunk`
4 changes: 4 additions & 0 deletions spiceaidocs/docs/components/data-accelerators/arrow.md
@@ -47,3 +47,7 @@ When accelerating a dataset using the In-Memory Arrow Data Accelerator, some or
In-memory limitations can be mitigated by storing acceleration data on disk, which is supported by [`duckdb`](./duckdb.md) and [`sqlite`](./sqlite.md) accelerators by specifying `mode: file`.

:::

## Quickstarts and Samples

- A quickstart tutorial to configure In-Memory Arrow as a data accelerator in Spice. [Arrow Accelerator quickstart](https://github.com/spiceai/quickstarts/tree/trunk/arrow)
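
A minimal sketch of a spicepod dataset accelerated with the In-Memory Arrow engine; the source and dataset name below are placeholders, not taken from the quickstart:

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks # placeholder source; use your own dataset
    name: recent_blocks
    acceleration:
      enabled: true
      engine: arrow # in-memory Arrow acceleration
```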
22 changes: 14 additions & 8 deletions spiceaidocs/docs/components/data-accelerators/data-refresh.md
@@ -85,11 +85,11 @@ Typically only a working subset of an entire dataset is used in an application o

### Refresh SQL

| | |
| --------------------------- | ----- |
| Supported in `refresh_mode` | Any |
| Required | No |
| Default Value | Unset |

Refresh SQL supports specifying filters for data accelerated from the connected source using arbitrary SQL.
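
As a brief sketch (the source and filter below are illustrative, echoing the `city = 'Seattle'` example later on this page):

```yaml
datasets:
  - from: databricks:spiceai.datasets.my_table # illustrative source
    name: my_table
    acceleration:
      enabled: true
      refresh_sql: |
        SELECT * FROM my_table WHERE city = 'Seattle'
```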

@@ -158,7 +158,7 @@ In this example, `refresh_data_window` is converted into an effective Refresh SQ

This parameter relies on the `time_column` dataset parameter specifying a column that is a timestamp type. Optionally, the `time_format` can be specified to instruct the Spice runtime on how to interpret timestamps in the `time_column`.

_Example with `refresh_sql`:_

```yaml
datasets:
@@ -176,7 +176,7 @@

This example will only accelerate data from the federated source that matches the filter `city = 'Seattle'` and is less than 1 day old.

_Example with `on_zero_results`:_

```yaml
datasets:
@@ -446,7 +446,13 @@ This acceleration configuration applies a number of different behaviors:
1. A `refresh_data_window` was specified. When Spice starts, it will apply this `refresh_data_window` to the `refresh_sql`, and retrieve only the last day's worth of logs with an `asset = 'asset_id'`.
2. Because a `refresh_sql` is specified, every refresh (including initial load) will have the filter applied to the refresh query.
3. 10 minutes after loading, as specified by the `refresh_check_interval`, the first refresh will occur - retrieving new rows where `asset = 'asset_id'`.
4. Running a query to retrieve logs with an `asset` that is _not_ `asset_id` will fall back to the source, because of the `on_zero_results: use_source` parameter.
5. Running a query to retrieve a log older than 1 day will fall back to the source, because of the `on_zero_results: use_source` parameter.
6. Running a query to retrieve logs in a range spanning from now back to more than 1 day ago will only return logs from the last day. This is because the `refresh_data_window` accelerates only the last day's worth of logs, which returns some results. Because results are returned, Spice will not fall back to the source even though `on_zero_results: use_source` is specified.
7. Spice will retain newly appended log rows for 7 days before discarding them, as specified by the `retention_*` parameters.
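
A sketch of an acceleration block consistent with the behaviors above; the source, `time_column`, and the exact retention parameter names and values are assumptions, so check the refresh and retention reference pages for the authoritative set:

```yaml
datasets:
  - from: s3://my-bucket/logs/ # illustrative source
    name: logs
    time_column: created_at # assumed timestamp column
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 10m
      refresh_sql: SELECT * FROM logs WHERE asset = 'asset_id'
      refresh_data_window: 1d
      on_zero_results: use_source
      retention_period: 7d # assumed retention parameter names
      retention_check_interval: 1h
      retention_check_enabled: true
```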

## Quickstarts and Samples

- Configure an accelerated dataset retention policy. [Accelerated Dataset Retention Policy Quickstart](https://github.com/spiceai/quickstarts/blob/trunk/retention/README.md)
- Dynamically refresh specific data at runtime by programmatically updating `refresh_sql` and triggering data refreshes. [Advanced Data Refresh Quickstart](https://github.com/spiceai/quickstarts/blob/trunk/acceleration/data-refresh/README.md)
- Configure `refresh_data_window` to filter refreshed data to recent data. [Refresh Data Window Quickstart](https://github.com/spiceai/quickstarts/blob/trunk/refresh-data-window/README.md)
4 changes: 4 additions & 0 deletions spiceaidocs/docs/components/data-accelerators/duckdb.md
@@ -57,3 +57,7 @@ When accelerating a dataset using `mode: memory` (the default), some or all of t
In-memory limitations can be mitigated by storing acceleration data on disk, which is supported by [`duckdb`](./duckdb.md) and [`sqlite`](./sqlite.md) accelerators by specifying `mode: file`.

:::

## Quickstarts and Samples

- A quickstart tutorial to configure DuckDB as a data accelerator in Spice. [DuckDB Accelerator quickstart](https://github.com/spiceai/quickstarts/tree/trunk/duckdb/accelerator)
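
A minimal sketch of DuckDB acceleration persisted to disk with `mode: file`; the source and dataset name are placeholders:

```yaml
datasets:
  - from: s3://my-bucket/events/ # placeholder source
    name: events
    acceleration:
      enabled: true
      engine: duckdb
      mode: file # store acceleration data on disk instead of in memory
```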
@@ -110,3 +110,7 @@ The table below lists the supported [Apache Arrow data types](https://arrow.apac
| `Duration` | `BigInteger` | `bigint` |
| `List` / `LargeList` / `FixedSizeList` | `Array` | `array` |
| `Struct` | `N/A` | `Composite` (Custom type) |

## Quickstarts and Samples

- A quickstart tutorial to configure PostgreSQL as a data accelerator in Spice. [PostgreSQL Accelerator quickstart](https://github.com/spiceai/quickstarts/tree/trunk/postgres/accelerator)
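
A hedged sketch of PostgreSQL acceleration; the `params` keys below are assumptions based on common Spice parameter naming and may not match the accelerator's actual parameter set:

```yaml
datasets:
  - from: spice.ai/eth.recent_blocks # placeholder source
    name: recent_blocks
    acceleration:
      enabled: true
      engine: postgres
      params:
        pg_host: localhost # assumed parameter names; verify against
        pg_port: 5432      # the PostgreSQL accelerator reference
        pg_db: spice_accel
        pg_user: postgres
        pg_pass: ${secrets:pg_pass} # assumed secret reference syntax
```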
6 changes: 5 additions & 1 deletion spiceaidocs/docs/components/data-accelerators/sqlite.md
@@ -42,7 +42,7 @@ datasets:
- The SQLite accelerator only supports Arrow `List` types containing primitive data types; lists with structs are not supported.
- The SQLite accelerator doesn't support advanced grouping features such as `ROLLUP` and `GROUPING`.
- In SQLite, `CAST(value AS DECIMAL)` doesn't convert an integer to a floating-point value if the cast value is an integer. Operations like `CAST(1 AS DECIMAL) / CAST(2 AS DECIMAL)` will be treated as integer division, resulting in 0 instead of the expected 0.5.
  Use `FLOAT` to ensure conversion to a floating-point value: `CAST(1 AS FLOAT) / CAST(2 AS FLOAT)`.
- Updating a dataset with SQLite acceleration while the Spice Runtime is running (hot-reload) will cause SQLite accelerator query federation to be disabled until the Runtime is restarted.

:::
@@ -54,3 +54,7 @@ When accelerating a dataset using `mode: memory` (the default), some or all of t
In-memory limitations can be mitigated by storing acceleration data on disk, which is supported by [`duckdb`](./duckdb.md) and [`sqlite`](./sqlite.md) accelerators by specifying `mode: file`.

:::

## Quickstarts and Samples

- A quickstart tutorial to configure SQLite as a data accelerator in Spice. [SQLite Accelerator quickstart](https://github.com/spiceai/quickstarts/tree/trunk/sqlite/accelerator)
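
A minimal sketch of SQLite acceleration using `mode: file` to avoid the in-memory limitations noted above; names are placeholders:

```yaml
datasets:
  - from: postgres:my_table # placeholder source
    name: my_table
    acceleration:
      enabled: true
      engine: sqlite
      mode: file # persist acceleration data on disk
```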
30 changes: 15 additions & 15 deletions spiceaidocs/docs/components/data-connectors/abfs.md
@@ -4,7 +4,7 @@ sidebar_label: 'Azure BlobFS Data Connector'
description: 'Azure BlobFS Data Connector Documentation'
---

The Azure BlobFS (ABFS) Data Connector enables federated SQL queries on files stored in Azure Blob-compatible endpoints. This includes Azure BlobFS (`abfss://`) and Azure Data Lake (`adl://`) endpoints.

When a folder path is provided, all the contained files will be loaded.

@@ -58,20 +58,20 @@ SELECT COUNT(*) FROM cool_dataset;

#### Basic parameters

| Parameter name | Description |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `file_format` | Specifies the data format. Required if not inferrable from `from`. Options: `parquet`, `csv`. Refer to [Object Store File Formats](/components/data-connectors/index.md#object-store-file-formats) for details. |
| `abfs_account` | Azure storage account name |
| `abfs_sas_string` | SAS (Shared Access Signature) Token to use for authorization |
| `abfs_endpoint` | Storage endpoint, default: `https://{account}.blob.core.windows.net` |
| `abfs_use_emulator` | Use `true` or `false` to connect to a local emulator |
| `abfs_allow_http` | Allow insecure HTTP connections |
| `abfs_authority_host` | Alternative authority host, default: `https://login.microsoftonline.com` |
| `abfs_proxy_url` | Proxy URL |
| `abfs_proxy_ca_certificate` | CA certificate for the proxy |
| `abfs_proxy_exludes` | A list of hosts to exclude from proxy connections |
| `abfs_disable_tagging` | Disable tagging objects. Use this if your backing store doesn't support tags |
| `hive_partitioning_enabled` | Enable partitioning using hive-style partitioning from the folder structure. Defaults to `false` |
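
As a sketch of how these basic parameters compose in a dataset definition; the container, path, and account are placeholders, and the secret reference syntax is an assumption:

```yaml
datasets:
  - from: abfs://my-container/reports/ # placeholder container and path
    name: cool_dataset
    params:
      file_format: parquet
      abfs_account: my_storage_account
      abfs_sas_string: ${secrets:abfs_sas_string} # assumed secret reference
```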

#### Authentication parameters
