Commit

docs
rickyschools committed May 5, 2024
1 parent c4c54a0 commit 558f029
Showing 2 changed files with 80 additions and 2 deletions.
6 changes: 4 additions & 2 deletions docs/_static/examples.md
@@ -9,14 +9,16 @@ Examples are stratified by increasing levels of complexity.
- [Moderate, Multi-Method Decoration](examples/medium.md)
- A moderate example with multiple methods decorated with dlt.
- [Streaming, Append example](examples/streaming_append.md)
- An simple example showcasing how to write streaming append dlt pipelines.
- Streaming, CDC example
- A simple example showcasing how to write streaming append dlt pipelines.
- [Streaming, CDC example](./examples/streaming_cdc.md)
- A simple example showcasing how to write streaming apply changes into dlt pipelines.
- Complex, Multi-Step example

::: {toctree}
:maxdepth: 1
:hidden:
examples/simple.md
examples/medium.md
examples/streaming_append.md
examples/streaming_cdc.md
:::
76 changes: 76 additions & 0 deletions docs/_static/examples/streaming_cdc.md
@@ -0,0 +1,76 @@
# Streaming Apply Changes Example

This article shows how to get started with `dltflow` when authoring DLT streaming apply changes pipelines. In
this sample, we will walk through the following steps:

- [Code](#code)
- [Configuration](#configuration)
- [Workflow Spec](#workflow-spec)
- [Deployment](#deployment)

:::{include} ./base.md
:::

## Code

For this example, we will show a simple pipeline with a queue streaming reader. We will:

- Import `DLTMetaMixin` from `dltflow.quality` and have our sample pipeline inherit from it.
- Generate the example data on the fly and put it into a Python queue.
- Transform the data by coercing data types.

You should see that there are no direct calls to `dlt`. This is the beauty and intentional simplicity of `dltflow`:
it does not want to get in your way. Rather, it wants you to focus on your transformation logic, keeping your code
simple and easy to share with other team members.

:::{literalinclude} ../../../examples/pipelines/streaming_cdc.py
:::
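
If you don't have the example module in front of you, the sketch below illustrates the general shape such a pipeline can take. Treat it as a sketch only: the class name, method names, and the `rate` source standing in for the queue reader are illustrative assumptions, not the contents of `examples/pipelines/streaming_cdc.py`, and the exact arguments `DLTMetaMixin` expects should be taken from the included example and configuration, not from this snippet.

```python
# Minimal, illustrative sketch -- NOT the shipped examples/pipelines/streaming_cdc.py.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

from dltflow.quality import DLTMetaMixin


class StreamingApplyChangesPipeline(DLTMetaMixin):
    """Toy streaming pipeline; `dltflow` wraps the configured method with `dlt` at runtime."""

    def __init__(self, spark: SparkSession, init_conf: dict):
        self.spark = spark
        self.conf = init_conf
        # Assumption: the mixin is driven by the same config described in the
        # Configuration section below; check the included example for the real signature.
        super().__init__(init_conf)

    def read(self) -> DataFrame:
        # Stand-in for the queue-backed reader: the `rate` source yields an
        # unbounded streaming DataFrame with `timestamp` and `value` columns.
        return self.spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    def transform(self, df: DataFrame) -> DataFrame:
        # Coerce data types, mirroring the transformation described above.
        return df.select(
            F.col("value").cast("int").alias("id"),
            F.col("timestamp").alias("event_ts"),
        )

    def orchestrate(self) -> DataFrame:
        # The method named by `func_name` in the dlt config; dltflow decorates it.
        return self.transform(self.read())
```

As described above, the class never calls `dlt` directly; `dltflow` decorates the configured method (here, `orchestrate`) based on your configuration.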

## Configuration

Now that we have our example code, we need to write our configuration to tell the `DLTMetaMixin` how to wrap our codebase.

Under the hood, `dltflow` uses `pydantic` to validate configuration. `dltflow` requires your configuration to adhere
to a specific structure; namely, the file should have the following sections:

- `reader`: This is helpful for telling your pipeline where to read data from.
- `writer`: Used to define where your data is written to after being processed.
- `dlt`: Defines how `dlt` will be used in the project. We use this to dynamically wrap your code with `dlt` commands.

With this brief overview out of the way, let's review our configuration for this sample.

:::{literalinclude} ../../../examples/conf/streaming_apply_changes_dlt.yml
:::

The `dlt` section has the following keys (a sketch pulling them together follows the list below); note that this configuration can also be a list of `dlt` configs.

- `func_name`: The name of the function/method we want `dlt` to decorate.
- `kind`: Tells `dlt` whether this query should be materialized as a `table` or a `view`.
- `expectation_action`: Tells `dlt` how to handle the expectations. `drop`, `fail`, and `allow` are all supported.
- `expectations`: These are a list of constraints we want to apply to our data.
- `is_streaming`: This tells `dltflow` this is a streaming query.
- `apply_chg_config`: This tells `dltflow` we're in a streaming apply-changes pipeline and fills out the necessary `dlt` params.
  - `target`: Tells `dltflow` which table the data will be written to. This should be a streaming table definition created ahead of time.
  - `source`: Tells `dltflow` where to read data from.
- `keys`: The primary key(s) of the dataset.
- `sequence_by`: The column(s) to use when ordering the dataset.
  - `stored_as_scd_type`: Tells `dltflow` how to materialize the table: `1` (the default) for SCD Type 1, `2` for SCD Type 2.
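
To make the structure concrete, here is a minimal sketch of how those keys could fit together for an apply-changes pipeline. The table, column, and expectation values are placeholders (not the shipped `streaming_apply_changes_dlt.yml`), and the exact shape of the `reader`, `writer`, and `expectations` entries is an assumption; defer to the included example config above.

```yaml
reader: {}   # see the included config above for the real reader settings
writer: {}   # see the included config above for the real writer settings
dlt:
  func_name: orchestrate            # method dltflow should decorate
  kind: table                       # materialize as a table (or view)
  expectation_action: drop          # drop / fail / allow
  expectations:                     # assumption: name -> SQL constraint pairs
    valid_id: "id IS NOT NULL"
  is_streaming: true
  apply_chg_config:
    target: my_catalog.my_schema.customers_scd   # streaming table created ahead of time
    source: customers_updates
    keys:
      - id
    sequence_by: event_ts
    stored_as_scd_type: 1
```

When decorating multiple methods, the `dlt` section can instead be a list of such configs, as noted above.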

## Workflow Spec

Now that we've gone through the code and configuration, we need to define the workflow we want to deploy to
Databricks so that our pipeline can be registered as a DLT Pipeline. This structure largely follows the [Databricks
Pipeline API]() with the addition of a `tasks` key, which is used during deployment to transition your Python
module into a Notebook that can be deployed as a DLT Pipeline.

:::{literalinclude} ../../../examples/workflows/streaming_append_changes_wrkflw.yml
:::
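
For orientation, a spec of this kind could take roughly the following shape. The pipeline settings (`name`, `development`, `continuous`, `clusters`) come from the Databricks Pipelines API; the contents of the `tasks` key are `dltflow`-specific and are deliberately left empty here, since its exact schema is defined by the included workflow file above.

```yaml
# Rough shape only -- values are placeholders, not the shipped workflow file.
name: streaming-cdc-example
development: true
continuous: false
clusters:
  - label: default
    num_workers: 1
tasks: []   # dltflow-specific; used to turn your Python module into a Notebook at deploy time
```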

## Deployment

We're at the final step of this simple example: deploying our assets to a Databricks workspace. To do so, we'll use
the `dltflow` CLI.

:::{literalinclude} ../../../examples/deployment/deploy_streaming_apply_pipeline.sh
:::
