Future data architecture for OWID #356
① Use MySQL as an ongoing ingredient in the data catalog

We can accept that the ETL doesn't have the whole picture, and blend its output with what we have in MySQL to create a combined API. MySQL can override any metadata in the ETL.

```mermaid
graph LR
etl --> local[local catalog] -->|data| published[published catalog]
local -.->|register| mysql
admin --> mysql -->|metadata + fast track| published --> cloudflare --> browser
mysql -->|content| baker --> netlify --> browser
published -->|data| baker
```
Pros:
Cons:
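As a rough illustration of what "MySQL can override any metadata in the ETL" could mean in practice, here is a minimal sketch of a merge step. The function and field names are assumptions for illustration, not existing code.

```python
from typing import Any


def merge_variable_metadata(etl_meta: dict[str, Any], mysql_meta: dict[str, Any]) -> dict[str, Any]:
    """Combine metadata from the ETL catalog with overrides stored in MySQL.

    Any field explicitly set in MySQL wins; fields left as None fall back to
    the ETL value. Field names here are hypothetical.
    """
    merged = dict(etl_meta)
    for key, value in mysql_meta.items():
        if value is not None:
            merged[key] = value
    return merged


# Example: an editor has overridden the display name in the admin (MySQL),
# but the unit and description still come from the ETL.
etl_meta = {"name": "gdp_per_capita", "unit": "constant 2011 int-$", "description": "From Maddison"}
mysql_meta = {"name": "GDP per capita", "unit": None, "description": None}
print(merge_variable_metadata(etl_meta, mysql_meta))
```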
② Continue to use MySQL as our data catalog

We can repurpose MySQL as our new data catalog, keeping it as the source of truth. Instead of publishing the output of the ETL, we do a publishing step from MySQL.

```mermaid
graph LR
upstream --> etl --> local[local catalog] --> mysql
upstream --> author --> admin --> mysql
mysql --> baker --> netlify --> browser
mysql -.-> published[published catalog] --> api --> browser
```
Pros:
Cons:
Variant A: one table per data frame

Our current data model for the ETL is built around tables, where the on-disk format is based on one table per file. This is more compact than storing every variable individually, since variables on the same table share the same primary key. We could make a new database in MySQL and use this data model, essentially having a huge number of tables. This has two advantages over the large `data_values` table. It is basically the approach we use for data-api at the moment, only with DuckDB. The only concern is that performance for reading whole tables is worse than in columnar databases.

Variant B: some tables have remote data

We keep the existing MySQL model, except that we have two flavours of table: one that keeps its data in `data_values`, and one whose data lives in remote parquet files. In this model, the parquet files can be built by the ETL, and only for ETL-specific datasets, but legacy data and new fast-track data can remain in `data_values`. This is pretty similar to variant A, but you need a reliable remote store of data that's controlled by the ETL rather than by the owid site codebase.

Variant C: postgres with parquet tables

Postgres is able to register on-disk parquet files as tables. So this solution could look a lot like a hybrid of A and B, but we'd also need to do a substantial migration to Postgres.
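To make variant A concrete, here is a minimal sketch of publishing one ETL table to its own MySQL table, assuming each ETL table is available as a pandas DataFrame with a natural primary key. The connection string and naming scheme are illustrative assumptions, not existing code.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; in practice this would point at the new
# "one table per data frame" database in MySQL.
engine = create_engine("mysql+pymysql://user:pass@localhost/etl_catalog")


def publish_table(df: pd.DataFrame, dataset: str, table: str) -> None:
    """Write one ETL table to its own MySQL table.

    The table keeps its natural primary key (e.g. country, year), so variables
    sharing that key are stored once per row rather than as separate
    (entity, year, variable, value) rows.
    """
    mysql_name = f"{dataset}__{table}"  # naming scheme is an assumption
    df.to_sql(mysql_name, engine, if_exists="replace", index=True)


# Example: a garden table with two variables sharing the same primary key.
df = pd.DataFrame(
    {
        "country": ["France", "France"],
        "year": [2019, 2020],
        "gdp": [2.7e12, 2.6e12],
        "population": [67.0e6, 67.4e6],
    }
).set_index(["country", "year"])
publish_table(df, "maddison_gdp", "gdp_per_capita")
```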
③ Move metadata outside of MySQL

The ETL has a conundrum: it is both upstream and downstream of MySQL today, basically because of the fast-track and because of metadata editing. But if metadata is instead moved to the owid-content repo, that conundrum goes away:

```mermaid
graph LR
upstream --> walden --> etl --> catalog[static catalog] --> cloudflare --> browser
upstream --> author --> admin -->|edit metadata or explorer| content[owid-content repo] --> etl
content -->|list data for editing| admin
admin --> walden
mysql --> baker --> netlify --> browser
admin -->|edit charts| mysql
```
Pros:
Cons:
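A minimal sketch of what "metadata in owid-content" could look like from the ETL's side: editor-maintained overrides live as YAML files in the repo and get applied during an ETL step. The file layout and field names are assumptions for illustration.

```python
from pathlib import Path

import yaml

# Hypothetical checkout of the owid-content repo alongside the ETL.
CONTENT_DIR = Path("owid-content/metadata")


def load_metadata_overrides(dataset: str) -> dict:
    """Read editor-maintained metadata for one dataset from owid-content."""
    path = CONTENT_DIR / f"{dataset}.yml"
    if not path.exists():
        return {}
    with open(path) as f:
        return yaml.safe_load(f) or {}


def apply_overrides(variable_meta: dict, overrides: dict) -> dict:
    """Apply per-variable overrides on top of metadata produced by the ETL."""
    merged = dict(variable_meta)
    merged.update(overrides.get("variables", {}).get(variable_meta["short_name"], {}))
    return merged
```

Because the overrides live in a git repo, edits made in the admin would land as commits, and the ETL would pick them up on its next run.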
④ Maintain dual sources of truth

Consider MySQL and the data catalog as eventually consistent systems that will sometimes, briefly, disagree, and deal with that lag.

```mermaid
graph LR
upstream --> author --> admin --> mysql -->|content| baker --> netlify --> browser
upstream --> walden --> etl --> catalog[static catalog] --> mysql
catalog --> cloudflare --> browser
catalog -.-> api -.->|data| baker
mysql --> catalog
```
Pros:
Cons:
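If we accept eventual consistency, we would probably want a way to measure the lag. A minimal sketch of a reconciliation check comparing per-variable checksums between MySQL and the static catalog; the table and column names (e.g. `dataChecksum`) are assumptions, not the existing schema.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:pass@localhost/owid")  # placeholder


def find_stale_variables(catalog_checksums: pd.DataFrame) -> pd.DataFrame:
    """Return variables whose data checksum differs between MySQL and the catalog.

    `catalog_checksums` is assumed to have columns [variable_id, checksum],
    written out when the ETL publishes the static catalog; the `dataChecksum`
    column on the MySQL side is hypothetical.
    """
    mysql_checksums = pd.read_sql(
        "SELECT id AS variable_id, dataChecksum AS checksum FROM variables", engine
    )
    merged = catalog_checksums.merge(
        mysql_checksums, on="variable_id", suffixes=("_catalog", "_mysql")
    )
    return merged[merged["checksum_catalog"] != merged["checksum_mysql"]]
```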
⑤ No fast track, no data in MySQL

```mermaid
graph LR
mysql -->|content| baker --> netlify --> browser
upstream --> walden --> etl --> catalog[static catalog]
catalog --> cloudflare --> browser
catalog ---> api --->|data| baker
```
⑥ No fast track, ETL generates static data values

```mermaid
graph LR
mysql -->|content| baker --> netlify --> browser
etl --> catalog[static catalog] --> parquet[grapher parquet] -->|register variables| mysql
mysql --> api -->|data| baker
parquet --> api
parquet --> cloudflare --> browser
mysql --> admin --> mysql
```
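A minimal sketch of the "grapher parquet" idea in this diagram: a step that writes each dataset's grapher-shaped data to parquet files and registers the variables (metadata only, no values) in MySQL. The function name, column list, and schema are assumptions, not existing code.

```python
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

PARQUET_DIR = Path("etl/data/grapher")  # matches the path in the diagram
engine = create_engine("mysql+pymysql://user:pass@localhost/owid")  # placeholder


def grapher_parquet_step(df: pd.DataFrame, dataset_id: int, dataset_path: str) -> None:
    """Write grapher-shaped data to parquet and register its variables in MySQL.

    `df` is assumed to be in long format with columns
    [entity_id, year, variable_name, value]; only variable *metadata* goes to
    MySQL, the values stay on disk as parquet.
    """
    out_dir = PARQUET_DIR / dataset_path
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, data in df.groupby("variable_name"):
        data.to_parquet(out_dir / f"{name}.parquet", index=False)
        # Register the variable in MySQL (hypothetical minimal schema).
        with engine.begin() as conn:
            conn.execute(
                text("INSERT INTO variables (name, datasetId) VALUES (:name, :dataset_id)"),
                {"name": name, "dataset_id": dataset_id},
            )
```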
Here are my notes for an architecture similar to the one above.

Diagram

```mermaid
graph LR
ETL --> etl/data/garden/...
ETL --> etl/data/meadow/...
admin --> data_values
ETL --> GrapherStep --> data_values
GrapherStep --> variables
ETL --> GrapherParquetStep --> |grapher channel| etl/data/grapher/.../*.parquet
GrapherParquetStep --> variables
data_values -->|data| API
variables --> |metadata| API
etl/data/grapher/.../*.parquet -->|data| API
API --> baker
API --> browsable-catalog
API -->|metadata| future-dynamic-data-fetching
etl/data/grapher/.../*.parquet -->|data| future-dynamic-data-fetching
admin --> variables
```
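To make the read path in this diagram concrete: the API would take metadata from `variables` and data from either `data_values` or the grapher parquet files. A minimal sketch under those assumptions; the endpoint shape and the `catalogPath` column are hypothetical, not an existing API.

```python
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pass@localhost/owid")  # placeholder
PARQUET_DIR = Path("etl/data/grapher")


def get_variable(variable_id: int) -> dict:
    """Return metadata plus data for one variable, roughly as the API might serve it."""
    with engine.connect() as conn:
        meta = conn.execute(
            text("SELECT id, name, unit, catalogPath FROM variables WHERE id = :id"),
            {"id": variable_id},
        ).mappings().one()

    if meta["catalogPath"]:
        # ETL-produced variable: values live in a parquet file built by the ETL.
        data = pd.read_parquet(PARQUET_DIR / meta["catalogPath"])
    else:
        # Legacy or fast-track variable: values still live in data_values.
        data = pd.read_sql(
            text("SELECT entityId, year, value FROM data_values WHERE variableId = :id"),
            engine,
            params={"id": variable_id},
        )
    return {"metadata": dict(meta), "data": data.to_dict(orient="list")}
```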
New columns in
Dimensions

Variables with dimensions would now be saved as a single variable with new columns.

Notes
Incremental steps toward our goals
Questions:
A discussion around how we see our data management evolving at OWID, and how it relates to our site.
Status: discussion on hold while Mojmir is on leave (as of 2022-08-08)
Problems
Key questions

… (`entityId`, `year`, `variable`, `value`)?

How it's been done in the past
Our first data catalog is in MySQL; it powers the Our World in Data site.
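For reference on the key question above: the long (`entityId`, `year`, `variable`, `value`) layout stores one row per observation, whereas the ETL's per-table format keeps variables as columns sharing a primary key. A small pandas sketch of the difference, with made-up data:

```python
import pandas as pd

# Long format: one row per (entity, year, variable) observation,
# as in the existing data_values table.
long = pd.DataFrame(
    {
        "entityId": [13, 13, 13, 13],
        "year": [2019, 2019, 2020, 2020],
        "variable": ["gdp", "population", "gdp", "population"],
        "value": [2.7e12, 67.0e6, 2.6e12, 67.4e6],
    }
)

# Per-table ("wide") format, as in the ETL catalog: variables become columns
# that share the (entityId, year) primary key.
wide = long.pivot(index=["entityId", "year"], columns="variable", values="value")
print(wide)
```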
There are two main paths for data:
Where we are today
Since October 2021, we've been building a new data catalog with the goal of having a more transparent and reproducible process for data (our `etl`). This adds a third path for data to get in, one which is intended to replace `importers`.

Actually, this is a simplification, because we want to be using the new catalog to power a range of new data tooling. That means it should have a copy of all of our existing data and all our metadata. We backport the existing data and new fast-track data to the new catalog, whilst still sending some of the data from the catalog back to MySQL. This way both MySQL and the static catalog are up to date.
A substantial downside of this setup is that there is no longer a single source of truth for our data.
We are also in the process of building an API on top of our new catalog, which would then build the OWID site. This makes the work-in-progress picture a bit messier, more like this:
The architectural challenge
We are trying to decouple our data management and content management. The sticking points appear to be:
Below this post, I'll add some proposals we've discussed, so that we can consider them separately.
Read/Write scenarios for metadata & data
This section outlines the main use cases for reading and writing data that we have in our system (or that we think would be valuable additions in the future). The idea is that this list should help us evaluate the various scenario ideas and think about acceptable delays, etc.