A minimal modern data stack with working data pipelines in a single Docker container.
Explore the docs »
Report Bug
·
Request Feature
Table of Contents
A minimal modern data stack with working data pipelines in a single Docker container. Useful for an exploration of the tools involved:
- Meltano - ELT+ data engineering workflow tool
- Airflow - workflow platform, including scheduling
- dbt - data transformation tool
- DuckDB - local SQL OLAP database managment system
- Superset - data visualization and exploration platform
- Sample data pipelines with USGS Earthquake data and European Gas Inventory levels.
Explore the functionality of the tools by using the examples as-is; modify and expand on the examples for further exploration.
This is a convenient starting point for exploration. The project is not a showcase of all or even the best functionality that each tool has to offer.
Have Docker Desktop installed.
- Create a container from the published docker image:
docker create -p8093:8088 -p8085:8080 -p8094:8089 --name mimodast ghcr.io/ejoosterop/mimodast
- Optionally (required for the European Gas Inventory dataset; this can safely be done at a later stage) copy a
.env
file containing the API key as explained below:docker cp .env mimodast:/project/mimodast/.env
- Start the container (using the Docker Desktop UI or command line).
- For starters:
- Open the docker container terminal and peruse the meltano.yml file and other files/folders at
/project/mimodast/
. - Navigate to localhost:8085 to see the Airflow orhestrator (incl scheduler) interface. Use admin/admin as username/password.
- Navigate to localhost:8093 to see the Superset dashboard. Use admin/admin as username/password.
- Navigate to localhost:8094 to see data pipeline documentation (from dbt).
- Open the docker container terminal and peruse the meltano.yml file and other files/folders at
- NOTE: allow for some time (~1 minute) for the container to start up all processes. On first startup wait for the completion of the first run of the USGS pipeline before reviewing Superset.
The image contains ELT pipelines for two data sets. The USGS Earthquake dataset can be used right out of the box. This dataset is a good enough dataset to explore the core functionalities.
For the GIE Gas Inventory dataset an API key is required. Create a free and immediate GIE account to obtain the key.
This key needs to be available as an environment variable (ENV_GIE_XKEY) in the Docker container (it is referenced in the meltano.yml
configuration file) in order to run the GIE pipelines. One way to accomplish this is by creating a .env
file in the /projet/mimodast/
folder containing:
ENV_GIE_XKEY="YOUR-API-KEY"
Two basic examples are used to illustrate an extensive date pipeline encompassing the following components:
- obtaining the data from source
- capturing the data in a database
- transformation
- scheduling
- testing of data
- reporting
- integrated documentation of the process
The two pipelines involve USGS Earthquake data and European Gas Inventory data.
Below we highlight the core configuration for these components. For (much) more additional information refer to the respective documentation of the tools.
The data pipelines are fully defined in a set of files. This includes the source definitions, schedules, dependencies, transformation logic, tests and documentation. The reporting/dashboards in Superset are defined within Superset, but can be exported from there.
These files are all found in the /project/mimodast/
folder in the Docker container. It is best practice to capture this folder in a version control tool. Git is included in the Docker image.
Some of the core files include:
/project/mimodast/meltano.yml
- this includes configurations for data source specification, destination database, schedule and more./project/mimodast/orhestration/dags/gie_dag.py
- python code defining how to orchestrate a data pipeline in Airflow. Note that the GIE data uses this manually created file, whereas the USGS data orhestration relies purely on logic defined inmeltano.yml
./project/mimodast/tranformation/
- this folder contains transformation logic (undermodels/
). It also includes configuration for dbt, tests, documentation and various other items.
The meltano.yml
file specifies how to obtain the data from source. Specifically, configuring the connection to the data source is centered around this tap
section in meltano.yml
(To be found at /project/mimodast/meltano.yml
in the Docker container.):
- name: stg_usgs
inherit_from: tap-rest-api-msdk
config:
api_url: https://earthquake.usgs.gov/fdsnws
In the example, this tap is inherited from a publicly available tap for a REST API:
plugins:
extractors:
- name: tap-rest-api-msdk
variant: widen
pip_url: tap-rest-api-msdk
pip_url
specifies the location of the source code for the tap plug-in. In this case the actual location is derived by Meltano based on the tap name. Note that the tap for the other data pipeline (Gas Inventory data) also inherits from the same tap, but is clearly configured differently.
meltano.yml
contains the ful configuration of the tap. For example the following element specifies to select only earthquakes with a minimum magnitute:
minmagnitude: 6
Many more items are configured for this tap.
The following section in meltano.yml
configures the database where the data will be stored:
- name: target-duckdb
pip_url: target-duckdb~=0.4
config:
add_metadata_columns: true
default_target_schema: main
filepath: $DB_LOCATION
data_flattening_max_level: 10
A duckdb database is used in this example. This database contains the data captured from source and is accessed for reporting. The database file is located at /project/data/dev/data.duckdb
. This database location is specified in meltano.yml
.
The image contains the DuckDB command line interface. Use the command:
/project/duckdb_cli/duckdb /project/data/dev/data.duckdb
to start the CLI and browse the database using SQL (the CLI has a help function and is documented online).
Note that the database maybe unavailable if another process (pipeline, reporting) is already accessing it, resulting in an error message.
Transformation is performed by dbt and specified by a set of SQL
files. In the container these are located at: /project/mimodast/transform/models/usgs_rpt/
.
For this example pipeline the transformation is fully captured in rpt_usgs_events.sql
. In this case the 'transformation' simply copies some source attributes into a reporting table.
The following section of meltano.yml
configures the pipeline's run schedule:
schedules:
- name: USGS-Earthquake
interval: 35 */1 * * *
job: usgs-to-duckdb-rpt
start_date: 2023-01-01 15:40:21.295936
The scheduling interval
uses the Cron job format. In this case the pipeline is schedulted to run 35 minutes after every hour.
This section refers to a job
, wich is configured in the same file. This job consists of a tap
to obtain data from source, a loader specifying where the data needs to be stored and a transformation component:
jobs:
- name: usgs-to-duckdb-rpt
tasks:
- stg_usgs target-duckdb-usgs dbt-duckdb:usgs
In this docker container Airflow is used as the scheduler/orchestrator. Meltano ensures that the specified schedule is set in Airflow. Note that the schedule can also be set directly in Airflow, as is the case for our GIE Gas Inventory pipeline.
To monitor job progress access the Airflow UI from localhost:8085
, with admin/admin
as user/password. There are three pipelines: one for USGS and two for GIE.
Data tests are specified in two locations:
- SQL files in the
/project/mimodast/transform/models/tests/
folder. (In this specific case just a basic test that the magnitude of an earthquake is not unrealistically high.) Tests are performed by dbt. - yml files in the
/project/mimodast/transform/models/usgs_rpt/
folder. For example the following tests that theid
field is not null:
- name: id
description: Unique ID of earthquake or related event assigned by USGS.
tests:
- not_null
Tests can be initiated manually from the command line in the Docker container:
cd /project/mimodast/
meltano invoke dbt-duckdb:test
Superset is used to visualize/dashboard the data. The Superset UI can be accessed from localhost:8093
, with admin/admin
as user/password.
Initially the database in mimodast is empty so make sure to run the datapipeline(s) before reviewing the dashboards.
Due to an outstanding issue in mimodast, individual charts in a dashboard frequently do not show. For now:
- Refresh each chart by selecting this option in the right top corner of each chart.
- Possibly wait for other processes accessing the database (say the data pipelines) to complete
dbt is used for documentation of data and processes. Documentation is maitained in .yml files and overview.md
in the /project/mimodast/transform/models/
folder.
This documentation can be consulted at localhost:8094
.
The Gas pipeline is setup in a similar way as the USGS earthquake pipeline.
The following differences are noteworthy:
- The pipeline requires an account to be setup. This is free, quick and avaiable to everyone. Refer here for instructions.
- Note that for this reasong the jobs/DAGs for this pipeline are initially paused in Airflow. Once the API key is obtained an setup in the docker container use the Airflow UI to unpause the jobs.
- The
meltano.yml
file is configured to obtain this key from an environment variable as follows:
headers:
x-key: $ENV_GIE_XKEY
- Schedule/orchestration is not configured using
meltano.yml
but instead with two manually coded Airflow DAGs. The Python file containing the code for these can be found at/project/mimodast/orchestrate/dags/gie_dag.py
.- The backfill dag captures historic data from source. To specify the date range, two Airflow variables are used. These values can be changed using the Airflow UI.
- It takes some time (<1 minute) for the new date range to be reflected in the DAG.
- Note that using Airflow variables in a DAG in this way is not a best practice design but is used for simplicity.
- Meltano's inline data mapping functionality is used to create hashed ID fields when obtaining data from source using the following configuration in
meltano.yml
:
stream_maps:
stg_gie_storage:
key_hash: md5(config['hash_seed'] + (gasDayStart + code))
- Additional test types are included. For example (from
rpt_gie_storage.yml
):
tests:
- dbt_utils.unique_combination_of_columns:
combination_of_columns:
- sso_eic
- gasdaystart
See the open issues for a full list of proposed features and known issues.
Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (
git checkout -b feature/AmazingFeature
) - Commit your Changes (
git commit -m 'Add some AmazingFeature'
) - Push to the Branch (
git push origin feature/AmazingFeature
) - Open a Pull Request
Distributed under the MIT License. See LICENSE.txt
for more information. The tools and the sample data are subject to their own respective licenses.
Project Link: https://github.com/EJOOSTEROP/mimodast