feat: new dataset/table/column models #17543

betodealmeida · 2021-11-24T18:54:52Z

SUMMARY

This PR implements new models for SIP-68 (#14909):

Column
Table
Dataset

The associated tables are created with a migration, and the existing datasets (SqlaTable, SqlMetric, TableColumn) are migrated to the new models:

A physical dataset is mapped into a Table instance and a Dataset instance.
A virtual dataset is mapped into a Dataset instance.
Metrics and columns are mapped into Column instances (in the new model, a Column represents physical columns, derived columns, and metrics).
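
The mapping above can be sketched in plain Python. This is a minimal illustration, not the migration code from this PR: the dataclasses and the `map_legacy_dataset` helper are hypothetical, and treating a missing `sql` value as the marker of a physical dataset is an assumption made for the example. Only the `is_physical` and `is_aggregation` names come from the queries shown under TESTING INSTRUCTIONS.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class LegacyDataset:
    """Hypothetical stand-in for a SqlaTable with its TableColumn/SqlMetric rows."""
    table_name: str
    sql: Optional[str] = None  # assumption: only virtual datasets carry SQL
    columns: List[str] = field(default_factory=list)  # physical column names
    metrics: List[str] = field(default_factory=list)  # metric expressions


@dataclass
class Column:
    name: str
    expression: str
    is_aggregation: bool = False


@dataclass
class Table:
    name: str
    columns: List[Column]


@dataclass
class Dataset:
    name: str
    is_physical: bool
    columns: List[Column]


def map_legacy_dataset(legacy: LegacyDataset) -> Tuple[Optional[Table], Dataset]:
    """Return (Table or None, Dataset) for one legacy dataset."""
    columns = [Column(name=c, expression=c) for c in legacy.columns]
    metrics = [
        Column(name=m, expression=m, is_aggregation=True) for m in legacy.metrics
    ]
    is_physical = legacy.sql is None
    # Only physical datasets get a Table; virtual ones map to a Dataset alone.
    table = Table(legacy.table_name, columns) if is_physical else None
    dataset = Dataset(legacy.table_name, is_physical, columns + metrics)
    return table, dataset


table, dataset = map_legacy_dataset(
    LegacyDataset("birth_names", columns=["num"], metrics=["SUM(num)"])
)
virtual_table, virtual_dataset = map_legacy_dataset(
    LegacyDataset("video_game_sales", sql="SELECT 1")
)
```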

The models are kept up-to-date via hooks added to SqlaTable, SqlMetric, and TableColumn. Every time these models are created, deleted, or updated, the new models are updated correspondingly. The sync is unidirectional, from the old models to the new ones.
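
A rough sketch of the one-way sync, using only plain Python. The PR attaches its hooks to the ORM models; the hook registry, the dictionaries standing in for database tables, and every function name below are hypothetical and only illustrate the idea that writes to the old models propagate to the new ones and never the other way around.

```python
from typing import Callable, Dict, List

Hook = Callable[[int], None]

# Hypothetical in-memory stores standing in for the old and new tables.
old_datasets: Dict[int, str] = {}
new_datasets: Dict[int, str] = {}

_hooks: Dict[str, List[Hook]] = {
    "after_insert": [],
    "after_update": [],
    "after_delete": [],
}


def on(event: str) -> Callable[[Hook], Hook]:
    """Register a function to run after the given event on the old model."""
    def register(func: Hook) -> Hook:
        _hooks[event].append(func)
        return func
    return register


def _fire(event: str, dataset_id: int) -> None:
    for hook in _hooks[event]:
        hook(dataset_id)


@on("after_insert")
@on("after_update")
def copy_to_new(dataset_id: int) -> None:
    # Old -> new only: the new store is overwritten from the old one.
    new_datasets[dataset_id] = old_datasets[dataset_id]


@on("after_delete")
def drop_from_new(dataset_id: int) -> None:
    new_datasets.pop(dataset_id, None)


def save_old(dataset_id: int, name: str) -> None:
    """Create or update an old-model row, then fire the matching hook."""
    event = "after_update" if dataset_id in old_datasets else "after_insert"
    old_datasets[dataset_id] = name
    _fire(event, dataset_id)


def delete_old(dataset_id: int) -> None:
    """Delete an old-model row, then fire the delete hook."""
    del old_datasets[dataset_id]
    _fire("after_delete", dataset_id)
```

Calling `save_old(1, "birth_names")` populates both dictionaries, while writing directly to `new_datasets` leaves `old_datasets` untouched, which is the unidirectional property described above.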

The next step is to modify the backend to read from the new models. Initially the API will remain unmodified, but in the future we may be able to clean it up and modernize it (some places have not yet been migrated to the v1 API).

After that, the backend will be modified to write to the new models. Once the backend reads from and writes to the new models, we can get rid of the old models (including the Druid models).

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

No visual changes.

TESTING INSTRUCTIONS

Reviewers can run superset load-examples and see the new models being populated:

mysql> SELECT `schema`, name FROM sl_tables;
+----------+------------------------+
| schema   | name                   |
+----------+------------------------+
| examples | wb_health_population   |
| examples | birth_names            |
| examples | long_lat               |
| examples | birth_france_by_region |
| examples | sf_population_polygons |
| examples | flights                |
| examples | bart_lines             |
| examples | channel_members        |
| examples | messages               |
| examples | channels               |
| examples | threads                |
| examples | unicode_test           |
| examples | users                  |
| examples | users_channels         |
| examples | cleaned_sales_data     |
+----------+------------------------+
15 rows in set (0.00 sec)

mysql> SELECT name, is_physical FROM sl_datasets;
+---------------------------+-------------+
| name                      | is_physical |
+---------------------------+-------------+
| wb_health_population      |           1 |
| birth_names               |           1 |
| long_lat                  |           1 |
| birth_france_by_region    |           1 |
| sf_population_polygons    |           1 |
| flights                   |           1 |
| bart_lines                |           1 |
| FCC 2018 Survey           |           0 |
| messages_channels         |           0 |
| channel_members           |           1 |
| covid_vaccines            |           0 |
| messages                  |           1 |
| channels                  |           1 |
| members_channels_2        |           0 |
| exported_stats            |           0 |
| threads                   |           1 |
| new_members_daily         |           0 |
| unicode_test              |           1 |
| users                     |           1 |
| video_game_sales          |           0 |
| users_channels            |           1 |
| cleaned_sales_data        |           1 |
| users_channels-uzooNNtSRO |           0 |
+---------------------------+-------------+
23 rows in set (0.00 sec)

mysql> SELECT expression FROM sl_columns WHERE is_aggregation;
+-----------------------+
| expression            |
+-----------------------+
| sum(`SP_POP_TOTL`)    |
| sum(`SH_DYN_AIDS`)    |
| sum(`SP_RUR_TOTL_ZS`) |
| sum(`SP_DYN_LE00_IN`) |
| sum(`SP_RUR_TOTL`)    |
| COUNT(*)              |
| COUNT(*)              |
| SUM(num)              |
| COUNT(*)              |
| AVG(`2004`)           |
| COUNT(*)              |
| COUNT(*)              |
| COUNT(*)              |
| COUNT(*)              |
| count(*)              |
+-----------------------+
15 rows in set (0.02 sec)

The PR has unit tests checking that the sync works on creation, deletion, and updates.

ADDITIONAL INFORMATION

Has associated issue: [SIP-68] A better model for Datasets #14909
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

github-actions · 2021-11-26T10:29:19Z

⚠️ @betodealmeida Your base branch master has just also updated superset/migrations.

❗ Please consider rebasing your branch to avoid db migration conflicts.

codecov · 2021-12-01T16:03:34Z

Codecov Report

Merging #17543 (2fcdeef) into master (dafc841) will increase coverage by 0.12%.
The diff coverage is 98.79%.

❗ Current head 2fcdeef differs from pull request most recent head 142d625. Consider uploading reports for the commit 142d625 to get more accurate results

@@            Coverage Diff             @@
##           master   #17543      +/-   ##
==========================================
+ Coverage   66.21%   66.33%   +0.12%     
==========================================
  Files        1633     1638       +5     
  Lines       63210    63454     +244     
  Branches     6409     6409              
==========================================
+ Hits        41852    42092     +240     
- Misses      19698    19702       +4     
  Partials     1660     1660

Flag	Coverage Δ
hive	`52.45% <73.09%> (+0.16%)`	⬆️
mysql	`81.68% <98.79%> (+0.13%)`	⬆️
postgres	`81.73% <98.79%> (+0.13%)`	⬆️
presto	`52.29% <73.09%> (+0.16%)`	⬆️
python	`82.16% <98.79%> (+0.12%)`	⬆️
sqlite	`81.42% <98.79%> (+0.13%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
superset/connectors/sqla/models.py	`90.04% <98.02%> (+1.28%)`	⬆️
superset/columns/models.py	`100.00% <100.00%> (ø)`
superset/columns/schemas.py	`100.00% <100.00%> (ø)`
superset/datasets/models.py	`100.00% <100.00%> (ø)`
superset/datasets/schemas.py	`97.18% <100.00%> (+0.14%)`	⬆️
superset/models/helpers.py	`90.90% <100.00%> (+0.24%)`	⬆️
superset/security/manager.py	`94.40% <100.00%> (+0.02%)`	⬆️
superset/tables/models.py	`100.00% <100.00%> (ø)`
superset/tables/schemas.py	`100.00% <100.00%> (ø)`
superset/datasets/commands/create.py	`98.11% <0.00%> (-1.89%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update dafc841...142d625.

github-actions · 2021-12-03T18:29:20Z

⚠️ @betodealmeida Your base branch master has just also updated superset/migrations.

❗ Please consider rebasing your branch to avoid db migration conflicts.

john-bodley

Thanks @betodealmeida for making this change—and adding a slew of unit tests to boot. I think your dual write approach is the right approach and allows us (the community) to iron out any kinks in the system before making the switch.

john-bodley · 2021-12-09T21:09:24Z

superset/columns/models.py

+    name = sa.Column(sa.Text)
+    type = sa.Column(sa.Text)
+
+    # Columns are defined by expressions. For tables, these are the actual columns names,


This is the one thing I find somewhat atypical, i.e., that a physical table/view column would require an expression which needs to match the name. Why not just make expression nullable in that case, so that the column would only represent actual SQL expressions?

betodealmeida · 2021-12-10

I'm hoping that having expression even for physical columns and tables will make things easier because of the consistency. One thing the new model does is automatically quote table/column names that are also identifiers; as an example, we'd have:

column = Column(
    name="select",
    expression="`select`",  # or "select", depending on the DB
    ...
)

So to select that column we just need to use its expression. Otherwise, we'd have to check if expression is null, and then potentially encode the name in order to select that column.
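
To make the trade-off concrete, here is a minimal sketch comparing the two designs. The `Column` dataclass, both `select_*` functions, and the `quote_identifier` helper are hypothetical names introduced for illustration, and the backtick quoting is an assumption that only suits MySQL-style databases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    name: str
    expression: Optional[str]


def quote_identifier(name: str) -> str:
    """Hypothetical MySQL-style quoting; other databases quote differently."""
    return "`" + name.replace("`", "``") + "`"


def select_with_expression(column: Column, table: str) -> str:
    # Design discussed in the reply: expression is always set, so it is
    # used as-is with no branching and no quoting at query time.
    return f"SELECT {column.expression} FROM {table}"


def select_with_nullable_expression(column: Column, table: str) -> str:
    # Alternative design: expression may be None for physical columns,
    # so the caller has to fall back to quoting the name itself.
    if column.expression is not None:
        target = column.expression
    else:
        target = quote_identifier(column.name)
    return f"SELECT {target} FROM {table}"
```

With the always-populated design the quoting decision is made once, when the column is stored; with the nullable design every reader has to repeat the null check and the dialect-specific quoting.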

github-actions · 2021-12-13T21:05:37Z

⚠️ @betodealmeida Your base branch master has just also updated superset/migrations.

❗ Please consider rebasing your branch to avoid db migration conflicts.

github-actions · 2021-12-13T22:29:11Z

⚠️ @betodealmeida Your base branch master has just also updated superset/migrations.

❗ Please consider rebasing your branch to avoid db migration conflicts.

villebro · 2022-02-25T10:33:01Z

This is impressive work @betodealmeida ! 🚀 Talk about swapping out the airplane's engines mid flight! ✈️

ktmud · 2022-03-31T07:45:56Z

superset/tables/models.py

+    # not exist in the migrations. The reason it does not physically exist is MySQL,
+    # PostgreSQL, etc. have a different interpretation of uniqueness when it comes to NULL
+    # which is problematic given the catalog and schema are optional.
+    __table_args__ = (UniqueConstraint("database_id", "catalog", "schema", "name"),)


@betodealmeida It's also not possible to apply this constraint in MySQL because TEXT cannot be used in index keys without specifying a length: https://stackoverflow.com/questions/1827063/mysql-error-key-specification-without-a-key-length
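
The NULL issue mentioned in the code comment can be reproduced with a small sketch. It uses SQLite from the Python standard library purely as a convenient stand-in, so it only demonstrates SQLite's behavior (NULLs count as distinct in a unique constraint); the `tables_demo` table is hypothetical, and other databases may behave differently. It does not reproduce the MySQL TEXT key-length error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tables_demo (
        database_id INTEGER NOT NULL,
        "catalog" TEXT,
        "schema" TEXT,
        name TEXT NOT NULL,
        UNIQUE (database_id, "catalog", "schema", name)
    )
    """
)

# Same table registered twice with catalog and schema left as NULL.
row = (1, None, None, "users")
conn.execute("INSERT INTO tables_demo VALUES (?, ?, ?, ?)", row)
conn.execute("INSERT INTO tables_demo VALUES (?, ?, ?, ?)", row)  # not rejected
null_duplicates = conn.execute("SELECT COUNT(*) FROM tables_demo").fetchone()[0]

# With every column populated, the second insert violates the constraint.
full_row = (1, "c", "s", "users")
conn.execute("INSERT INTO tables_demo VALUES (?, ?, ?, ?)", full_row)
try:
    conn.execute("INSERT INTO tables_demo VALUES (?, ?, ?, ?)", full_row)
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Because catalog and schema are optional, the constraint alone cannot prevent the first kind of duplicate here, which is consistent with the concern raised in the comment.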

betodealmeida requested a review from a team as a code owner November 24, 2021 18:54

pull-request-size bot added the size/XXL label Nov 24, 2021

betodealmeida force-pushed the ch26712 branch 2 times, most recently from 0d8ed9c to 20a5c90 Compare November 24, 2021 20:16

mistercrunch added the risk:db-migration label Nov 29, 2021

betodealmeida force-pushed the ch26712 branch 8 times, most recently from d4fa1e0 to 9b3a015 Compare December 1, 2021 01:32

betodealmeida force-pushed the ch26712 branch 3 times, most recently from a0e65cb to 4ace8da Compare December 1, 2021 18:44

betodealmeida requested review from dpgaspar, john-bodley and mistercrunch and removed request for mistercrunch December 1, 2021 19:10

craig-rueda reviewed Dec 8, 2021

superset/connectors/sqla/models.py

craig-rueda reviewed Dec 8, 2021

superset/connectors/sqla/models.py

craig-rueda reviewed Dec 8, 2021

superset/migrations/versions/b8d3a24d9131_new_dataset_models.py (outdated)

john-bodley reviewed Dec 9, 2021

betodealmeida added 10 commits February 23, 2022 12:54

Update hook

b636645

Add metric test

a711c6d

Do not rename tables

09b7f6a

Small fixes

f9a9c84

Fix migration

77994cb

Fix tests

ba676f2

Fix more tests

2f7332b

Even more tests

eb2d588

Addressing comments (WIP)

ba2ddb0

Fix migration

b1e9f18

betodealmeida force-pushed the ch26712 branch from 3f234fb to b1e9f18 Compare February 23, 2022 20:54

Rebase and update head

6a5f815

betodealmeida force-pushed the ch26712 branch from a7ff90e to e0c616f Compare February 23, 2022 21:31

Fix unit tests

553aa89

betodealmeida force-pushed the ch26712 branch from e0c616f to 553aa89 Compare February 23, 2022 21:47

Add external management cols

2fcdeef

betodealmeida force-pushed the ch26712 branch from 7b92605 to 2fcdeef Compare February 23, 2022 22:53

Small fixes

92b76ef

betodealmeida force-pushed the ch26712 branch from 142d625 to 92b76ef Compare February 23, 2022 23:43

pkdotson reviewed Feb 24, 2022

tests/unit_tests/columns/__init__.py

hughhhh approved these changes Feb 24, 2022

betodealmeida merged commit 00c99c9 into apache:master Feb 24, 2022

ktmud mentioned this pull request Mar 29, 2022

perf(alembic): paginize db migration for new dataset models #19406

Closed

9 tasks

ktmud reviewed Mar 31, 2022

ktmud mentioned this pull request Mar 31, 2022

perf: refactor SIP-68 db migrations with INSERT SELECT FROM #19421

Merged

9 tasks

ktmud mentioned this pull request Apr 9, 2022

chore: skip SIP-68 shadow writing for LTS branches #19636

Closed

9 tasks

mistercrunch added the 🏷️ bot and 🚢 1.5.0 labels Mar 13, 2024

mistercrunch mentioned this pull request May 27, 2024

chore: remove sl_ tables #28704

Merged
