From 558f0298d6ad6268c6886802d09ed702704bd873 Mon Sep 17 00:00:00 2001
From: Ricky Schools
Date: Sun, 5 May 2024 00:52:50 -0400
Subject: [PATCH] docs

---
 docs/_static/examples.md               |  6 +-
 docs/_static/examples/streaming_cdc.md | 76 ++++++++++++++++++++++++++
 2 files changed, 80 insertions(+), 2 deletions(-)
 create mode 100644 docs/_static/examples/streaming_cdc.md

diff --git a/docs/_static/examples.md b/docs/_static/examples.md
index 55ccce0..41325f2 100644
--- a/docs/_static/examples.md
+++ b/docs/_static/examples.md
@@ -9,8 +9,9 @@ Examples are stratified by increasing levels of complexity.
 - [Moderate, Multi-Method Decoration](examples/medium.md)
   - A moderate example multiple methods decorated with dlt.
 - [Streaming, Append example](examples/streaming_append.md)
-  - An simple example showcasing how to write streaming append dlt pipelines.
-- Streaming, CDC example
+  - A simple example showcasing how to write streaming append dlt pipelines.
+- [Streaming, CDC example](./examples/streaming_cdc.md)
+  - A simple example showcasing how to write streaming apply changes into dlt pipelines.
 - Complex, Multi-Step example
 
 ::: {toctree}
@@ -18,5 +19,6 @@ Examples are stratified by increasing levels of complexity.
 examples/simple.md
 examples/medium.md
 examples/streaming_append.md
+examples/streaming_cdc.md
 :hidden:
 :::
diff --git a/docs/_static/examples/streaming_cdc.md b/docs/_static/examples/streaming_cdc.md
new file mode 100644
index 0000000..b54bb73
--- /dev/null
+++ b/docs/_static/examples/streaming_cdc.md
@@ -0,0 +1,76 @@
# Streaming Apply Changes Example

This article shows how to get started with `dltflow` when authoring DLT streaming apply changes pipelines. In this
sample, we will go through the following steps:

- [Code](#code)
- [Configuration](#configuration)
- [Workflow Spec](#workflow_spec)
- [Deployment](#deployment)

:::{include} ./base.md
:::

### Example Pipeline Code

For this example, we will show a simple pipeline with a queue streaming reader. We will:

- Import `DLTMetaMixin` from `dltflow.quality` and tell our sample pipeline to inherit from it.
- Generate the example data on the fly and put it into a Python queue.
- Transform the data by coercing data types.

You should see that there are no direct calls to `dlt`. This is the beauty and intentional simplicity of `dltflow`. It
does not want to get in your way. Rather, it wants you to focus on your transformation logic to help keep your code
simple and easy to share with other team members.

:::{literalinclude} ../../../examples/pipelines/streaming_cdc.py
:::

## Configuration

Now that we have our example code, we need to write our configuration to tell the `DLTMetaMixin` how to wrap our
codebase.

Under the hood, `dltflow` uses `pydantic` to validate configuration. When working with `dltflow`, your configuration
must adhere to a specific structure. Namely, the file should have the following sections:

- `reader`: This is helpful for telling your pipeline where to read data from.
- `writer`: Used to define where your data is written to after being processed.
- `dlt`: Defines how `dlt` will be used in the project. We use this to dynamically wrap your code with `dlt` commands.

With this brief overview out of the way, let's review our configuration for this sample.

:::{literalinclude} ../../../examples/conf/streaming_apply_changes_dlt.yml
:::
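If you are reading the raw source and cannot render the include above, the configuration has roughly the shape
sketched below. This is an illustrative sketch only; the values, the `reader`/`writer` details, and the exact schema of
the `expectations` entries are assumptions, and the included `streaming_apply_changes_dlt.yml` is the authoritative
example.

```yaml
# Illustrative sketch only -- values, reader/writer details, and the expectation
# entry schema are assumptions, not copied from the shipped example file.
reader:
  format: json                    # where the pipeline reads data from
  path: /some/input/location
writer:
  catalog: dev                    # where processed data is written
  schema: silver
dlt:
  func_name: transform            # function/method dlt should decorate
  kind: table                     # materialize as a `table` or `view`
  expectation_action: drop        # drop | fail | allow
  expectations:                   # list of constraints applied to the data
    - name: valid_id
      constraint: id IS NOT NULL
  is_streaming: true              # marks this as a streaming query
  apply_chg_config:
    target: silver.events         # pre-created streaming table to write into
    source: bronze_events         # where change data is read from
    keys: [id]                    # primary key(s) of the dataset
    sequence_by: event_ts         # column(s) used to order changes
    stored_as_scd_type: 1         # 1 (default) for SCD Type 1, 2 for SCD Type 2
```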
The `dlt` section has the following keys; note that this section can also be given as a list of `dlt` configs.

- `func_name`: The name of the function/method we want `dlt` to decorate.
- `kind`: Tells `dlt` whether this query should be materialized as a `table` or a `view`.
- `expectation_action`: Tells `dlt` how to handle the expectations. `drop`, `fail`, and `allow` are all supported.
- `expectations`: A list of constraints we want to apply to our data.
- `is_streaming`: Tells `dltflow` this is a streaming query.
- `apply_chg_config`: Tells `dltflow` we are in a streaming apply changes (CDC) pipeline and fills out the necessary
  `dlt` parameters.
  - `target`: Tells `dltflow` what table data will be written to. This should be a streaming table definition created
    ahead of time.
  - `source`: Tells `dltflow` where to read and get data from.
  - `keys`: The primary key(s) of the dataset.
  - `sequence_by`: The column(s) to use when ordering the dataset.
  - `stored_as_scd_type`: Tells `dltflow` how to materialize the table: `1` (default) for SCD Type 1, `2` for SCD Type 2.

## Workflow Spec

Now that we've gone through the code and configuration, we need to start defining the workflow that we want to deploy
to Databricks so that our pipeline can be registered as a DLT Pipeline. This structure largely follows the [Databricks
Pipeline API]() with the addition of a `tasks` key. This key is used during deployment to transition your Python module
into a notebook that can be deployed as a DLT Pipeline.

:::{literalinclude} ../../../examples/workflows/streaming_append_changes_wrkflw.yml
:::

## Deployment

We're at the final step of this simple example. The last piece of the puzzle is to deploy our assets to a Databricks
workspace. To do so, we'll use the `dltflow` CLI.

:::{literalinclude} ../../../examples/deployment/deploy_streaming_apply_pipeline.sh
:::
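As a closing reference for the workflow spec described earlier, the sketch below shows one possible shape for a
pipeline spec with a `tasks` key. It is a hypothetical outline; the field names under `tasks`, the cluster settings,
and the paths are assumptions rather than the project's actual schema, and `streaming_append_changes_wrkflw.yml`
remains the authoritative example.

```yaml
# Hypothetical workflow spec -- the `tasks` block and all values are
# illustrative assumptions, not the project's actual schema.
name: streaming_cdc_example           # DLT pipeline name
development: true
continuous: false
clusters:
  - label: default
    num_workers: 1
tasks:                                # dltflow-specific addition: which Python
  - python_file: pipelines/streaming_cdc.py      # module becomes the notebook
    config_file: conf/streaming_apply_changes_dlt.yml
```

Per the sections above, deployment then uses this spec, via the `dltflow` CLI, to turn the Python module into a
notebook and register it as a DLT pipeline.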