Future data architecture for OWID #356
① Use MySQL as an ongoing ingredient in the data catalog

We can accept that the ETL doesn't have the whole picture, and blend its output with what we have in MySQL to create a combined API. MySQL can override any metadata in the ETL.

```mermaid
graph LR
etl --> local[local catalog] -->|data| published[published catalog]
local -.->|register| mysql
admin --> mysql -->|metadata + fast track| published --> cloudflare --> browser
mysql -->|content| baker --> netlify --> browser
published -->|data| baker
```
Pros:
Cons:
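As a rough illustration of what "MySQL can override any metadata in the ETL" could mean in practice, here is a minimal sketch of a merge step. The function and field names are assumptions for illustration, not existing code.

```python
from typing import Any


def merge_variable_metadata(etl_meta: dict[str, Any], mysql_meta: dict[str, Any]) -> dict[str, Any]:
    """Combine metadata from the ETL catalog with overrides stored in MySQL.

    Any field explicitly set in MySQL wins; fields left as None fall back to
    the ETL value. Field names here are hypothetical.
    """
    merged = dict(etl_meta)
    for key, value in mysql_meta.items():
        if value is not None:
            merged[key] = value
    return merged


# Example: an editor has overridden the display name in the admin (MySQL),
# but the unit and description still come from the ETL.
etl_meta = {"name": "gdp_per_capita", "unit": "constant 2011 int-$", "description": "From Maddison"}
mysql_meta = {"name": "GDP per capita", "unit": None, "description": None}
print(merge_variable_metadata(etl_meta, mysql_meta))
```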
② Continue to use MySQL as our data catalog

We can repurpose MySQL as our new data catalog, keeping it as the source of truth. Instead of publishing the output of the ETL, we do a publishing step from MySQL.

```mermaid
graph LR
upstream --> etl --> local[local catalog] --> mysql
upstream --> author --> admin --> mysql
mysql --> baker --> netlify --> browser
mysql -.-> published[published catalog] --> api --> browser
```
Pros:
Cons:
Variant A: one table per data frame

Our current data model for the ETL is built around tables, where the on-disk format is based on one table per file. This is more compact than storing every variable individually, since variables on the same table share the same primary key. We could make a new database in MySQL and use this data model, essentially having a huge number of tables. This has two advantages over the large `data_values` table. It is basically the approach we use for data-api at the moment, only with DuckDB. The only concern is that performance for reading whole tables is worse than in columnar databases.

Variant B: some tables have remote data

We keep the existing MySQL model, except that we have two flavours of table: one that keeps its data in `data_values`, and one whose data lives in remote parquet files. In this model, the parquet files can be built by the ETL, and only for ETL-specific datasets, but legacy data and new fast-track data can remain in `data_values`. This is pretty similar to variant A, but you need a reliable remote store of data that's controlled by the ETL rather than by the owid site codebase.

Variant C: postgres with parquet tables

Postgres is able to register on-disk parquet files as tables. So this solution could look a lot like a hybrid of A and B, but we'd also need to do a substantial migration to Postgres.
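To make variant A concrete, here is a minimal sketch of publishing one ETL table to its own MySQL table, assuming each ETL table is available as a pandas DataFrame with a natural primary key. The connection string and naming scheme are illustrative assumptions, not existing code.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; in practice this would point at the new
# "one table per data frame" database in MySQL.
engine = create_engine("mysql+pymysql://user:pass@localhost/etl_catalog")


def publish_table(df: pd.DataFrame, dataset: str, table: str) -> None:
    """Write one ETL table to its own MySQL table.

    The table keeps its natural primary key (e.g. country, year), so variables
    sharing that key are stored once per row rather than as separate
    (entity, year, variable, value) rows.
    """
    mysql_name = f"{dataset}__{table}"  # naming scheme is an assumption
    df.to_sql(mysql_name, engine, if_exists="replace", index=True)


# Example: a garden table with two variables sharing the same primary key.
df = pd.DataFrame(
    {
        "country": ["France", "France"],
        "year": [2019, 2020],
        "gdp": [2.7e12, 2.6e12],
        "population": [67.0e6, 67.4e6],
    }
).set_index(["country", "year"])
publish_table(df, "maddison_gdp", "gdp_per_capita")
```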
③ Move metadata outside of MySQL

The ETL has a conundrum: it is both upstream and downstream of MySQL today, basically because of the fast-track and because of metadata editing. But if metadata is instead moved to the owid-content repo, that conundrum goes away:

```mermaid
graph LR
upstream --> walden --> etl --> catalog[static catalog] --> cloudflare --> browser
upstream --> author --> admin -->|edit metadata or explorer| content[owid-content repo] --> etl
content -->|list data for editing| admin
admin --> walden
mysql --> baker --> netlify --> browser
admin -->|edit charts| mysql
```
Pros:
Cons:
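A minimal sketch of what "metadata in owid-content" could look like from the ETL's side: editor-maintained overrides live as YAML files in the repo and get applied during an ETL step. The file layout and field names are assumptions for illustration.

```python
from pathlib import Path

import yaml

# Hypothetical checkout of the owid-content repo alongside the ETL.
CONTENT_DIR = Path("owid-content/metadata")


def load_metadata_overrides(dataset: str) -> dict:
    """Read editor-maintained metadata for one dataset from owid-content."""
    path = CONTENT_DIR / f"{dataset}.yml"
    if not path.exists():
        return {}
    with open(path) as f:
        return yaml.safe_load(f) or {}


def apply_overrides(variable_meta: dict, overrides: dict) -> dict:
    """Apply per-variable overrides on top of metadata produced by the ETL."""
    merged = dict(variable_meta)
    merged.update(overrides.get("variables", {}).get(variable_meta["short_name"], {}))
    return merged
```

Because the overrides live in a git repo, edits made in the admin would land as commits, and the ETL would pick them up on its next run.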
④ Maintain dual sources of truth

Consider MySQL and the data catalog as eventually consistent systems that will sometimes, briefly, disagree, and deal with that lag.

```mermaid
graph LR
upstream --> author --> admin --> mysql -->|content| baker --> netlify --> browser
upstream --> walden --> etl --> catalog[static catalog] --> mysql
catalog --> cloudflare --> browser
catalog -.-> api -.->|data| baker
mysql --> catalog
```
Pros:
Cons:
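If we accept eventual consistency, we would probably want a way to measure the lag. A minimal sketch of a reconciliation check comparing per-variable checksums between MySQL and the static catalog; the table and column names (e.g. `dataChecksum`) are assumptions, not the existing schema.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:pass@localhost/owid")  # placeholder


def find_stale_variables(catalog_checksums: pd.DataFrame) -> pd.DataFrame:
    """Return variables whose data checksum differs between MySQL and the catalog.

    `catalog_checksums` is assumed to have columns [variable_id, checksum],
    written out when the ETL publishes the static catalog; the `dataChecksum`
    column on the MySQL side is hypothetical.
    """
    mysql_checksums = pd.read_sql(
        "SELECT id AS variable_id, dataChecksum AS checksum FROM variables", engine
    )
    merged = catalog_checksums.merge(
        mysql_checksums, on="variable_id", suffixes=("_catalog", "_mysql")
    )
    return merged[merged["checksum_catalog"] != merged["checksum_mysql"]]
```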
⑤ No fast track, no data in MySQL

```mermaid
graph LR
mysql -->|content| baker --> netlify --> browser
upstream --> walden --> etl --> catalog[static catalog]
catalog --> cloudflare --> browser
catalog ---> api --->|data| baker
```
⑥ No fast track, ETL generates static data values

```mermaid
graph LR
mysql -->|content| baker --> netlify --> browser
etl --> catalog[static catalog] --> parquet[grapher parquet] -->|register variables| mysql
mysql --> api -->|data| baker
parquet --> api
parquet --> cloudflare --> browser
mysql --> admin --> mysql
```
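A minimal sketch of the "grapher parquet" idea in this diagram: a step that writes each dataset's grapher-shaped data to parquet files and registers the variables (metadata only, no values) in MySQL. The function name, column list, and schema are assumptions, not existing code.

```python
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

PARQUET_DIR = Path("etl/data/grapher")  # matches the path in the diagram
engine = create_engine("mysql+pymysql://user:pass@localhost/owid")  # placeholder


def grapher_parquet_step(df: pd.DataFrame, dataset_id: int, dataset_path: str) -> None:
    """Write grapher-shaped data to parquet and register its variables in MySQL.

    `df` is assumed to be in long format with columns
    [entity_id, year, variable_name, value]; only variable *metadata* goes to
    MySQL, the values stay on disk as parquet.
    """
    out_dir = PARQUET_DIR / dataset_path
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, data in df.groupby("variable_name"):
        data.to_parquet(out_dir / f"{name}.parquet", index=False)
        # Register the variable in MySQL (hypothetical minimal schema).
        with engine.begin() as conn:
            conn.execute(
                text("INSERT INTO variables (name, datasetId) VALUES (:name, :dataset_id)"),
                {"name": name, "dataset_id": dataset_id},
            )
```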
Here are my notes for an architecture similar to the one above.

Diagram

```mermaid
graph LR
ETL --> etl/data/garden/...
ETL --> etl/data/meadow/...
admin --> data_values
ETL --> GrapherStep --> data_values
GrapherStep --> variables
ETL --> GrapherParquetStep --> |grapher channel| etl/data/grapher/.../*.parquet
GrapherParquetStep --> variables
data_values -->|data| API
variables --> |metadata| API
etl/data/grapher/.../*.parquet -->|data| API
API --> baker
API --> browsable-catalog
API -->|metadata| future-dynamic-data-fetching
etl/data/grapher/.../*.parquet -->|data| future-dynamic-data-fetching
admin --> variables
```
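To make the read path in this diagram concrete: the API would take metadata from `variables` and data from either `data_values` or the grapher parquet files. A minimal sketch under those assumptions; the endpoint shape and the `catalogPath` column are hypothetical, not an existing API.

```python
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pass@localhost/owid")  # placeholder
PARQUET_DIR = Path("etl/data/grapher")


def get_variable(variable_id: int) -> dict:
    """Return metadata plus data for one variable, roughly as the API might serve it."""
    with engine.connect() as conn:
        meta = conn.execute(
            text("SELECT id, name, unit, catalogPath FROM variables WHERE id = :id"),
            {"id": variable_id},
        ).mappings().one()

    if meta["catalogPath"]:
        # ETL-produced variable: values live in a parquet file built by the ETL.
        data = pd.read_parquet(PARQUET_DIR / meta["catalogPath"])
    else:
        # Legacy or fast-track variable: values still live in data_values.
        data = pd.read_sql(
            text("SELECT entityId, year, value FROM data_values WHERE variableId = :id"),
            engine,
            params={"id": variable_id},
        )
    return {"metadata": dict(meta), "data": data.to_dict(orient="list")}
```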
New columns in
Dimensions

Variables with dimensions would now be saved as a single variable with new columns.

Notes
Incremental steps toward our goals
Questions:
A discussion around how we see our data management evolving at OWID, and how it relates to our site.
Status: discussion on hold while Mojmir is on leave (as of 2022-08-08)
Problems
Key questions

… (`entityId`, `year`, `variable`, `value`)?

How it's been done in the past
Our first data catalog is in MySQL; it powers the Our World in Data site.
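For reference on the key question above: the long (`entityId`, `year`, `variable`, `value`) layout stores one row per observation, whereas the ETL's per-table format keeps variables as columns sharing a primary key. A small pandas sketch of the difference, with made-up data:

```python
import pandas as pd

# Long format: one row per (entity, year, variable) observation,
# as in the existing data_values table.
long = pd.DataFrame(
    {
        "entityId": [13, 13, 13, 13],
        "year": [2019, 2019, 2020, 2020],
        "variable": ["gdp", "population", "gdp", "population"],
        "value": [2.7e12, 67.0e6, 2.6e12, 67.4e6],
    }
)

# Per-table ("wide") format, as in the ETL catalog: variables become columns
# that share the (entityId, year) primary key.
wide = long.pivot(index=["entityId", "year"], columns="variable", values="value")
print(wide)
```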
There are two main paths for data:
Where we are today
Since October 2021, we've been building a new data catalog with the goal of having a more transparent and reproducible process for data (our `etl`). This adds a third path for data to get in, one which is intended to replace `importers`.

Actually, this is a simplification, because we want to be using the new catalog to power a range of new data tooling. That means it should have a copy of all of our existing data and all our metadata. We backport the existing data and new fast-track data to the new catalog, whilst still sending some of the data from the catalog back to MySQL. This way both MySQL and the static catalog are up to date.
A substantial downside of this setup is that there is no longer a single source of truth for our data.
We are also in the process of building an API on top of our new catalog, which would then build the OWID site. This makes the work-in-progress picture a bit messier, more like this:
The architectural challenge
We are trying to decouple our data management and content management. The sticking points appear to be:
Below this post, I'll add some proposals we've discussed, so that we can consider them separately.
Read/Write scenarios for metadata & data
This section outlines the main use cases for reading and writing data that we have in our system (or that we think would be valuable additions in the future). The idea is that this list should help us evaluate the various scenario ideas and think about acceptable delays, etc.