Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: new dataset/table/column models #17543

Merged
merged 22 commits into from
Feb 24, 2022
Merged

Conversation

betodealmeida
Copy link
Member

@betodealmeida betodealmeida commented Nov 24, 2021

SUMMARY

This PR implements new models for SIP-68 (#14909):

  • Column
  • Table
  • Dataset

The associated tables are created with a migration, and the existing datasets (SqlaTable, SqlMetric, TableColumn) are migrated to the new models:

  • A physical dataset is mapped into a Table instance and a Dataset instance.
  • A virtual dataset is mapped into a Dataset instance.
  • Metrics and columns are mapped into Column instances (in the new model it represents physical columns, derived columns, and metrics).

The models are kept up-to-date via hooks added to SqlaTable, SqlMetric, and TableColumn. Every time these models are created, deleted, or updated, the new models are updated correspondingly. The sync is unidirectional, from the old models to the new ones.

The next step is to modify the backend to read from the new models. Initially the API will remain unmodified, but hopefully in the future we might be able to clean it up and modernize it (in some places we haven't migrated to the v1 API yet).

After that, the backend will be modified to write to the new models. Once the backend is reading and writing from the new models we can get rid of the old models (including the Druid models).

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

No visual changes.

TESTING INSTRUCTIONS

Reviewers can run superset load-examples and see the new models being populated:

mysql> SELECT `schema`, name FROM sl_tables;
+----------+------------------------+
| schema   | name                   |
+----------+------------------------+
| examples | wb_health_population   |
| examples | birth_names            |
| examples | long_lat               |
| examples | birth_france_by_region |
| examples | sf_population_polygons |
| examples | flights                |
| examples | bart_lines             |
| examples | channel_members        |
| examples | messages               |
| examples | channels               |
| examples | threads                |
| examples | unicode_test           |
| examples | users                  |
| examples | users_channels         |
| examples | cleaned_sales_data     |
+----------+------------------------+
15 rows in set (0.00 sec)

mysql> SELECT name, is_physical FROM sl_datasets;
+---------------------------+-------------+
| name                      | is_physical |
+---------------------------+-------------+
| wb_health_population      |           1 |
| birth_names               |           1 |
| long_lat                  |           1 |
| birth_france_by_region    |           1 |
| sf_population_polygons    |           1 |
| flights                   |           1 |
| bart_lines                |           1 |
| FCC 2018 Survey           |           0 |
| messages_channels         |           0 |
| channel_members           |           1 |
| covid_vaccines            |           0 |
| messages                  |           1 |
| channels                  |           1 |
| members_channels_2        |           0 |
| exported_stats            |           0 |
| threads                   |           1 |
| new_members_daily         |           0 |
| unicode_test              |           1 |
| users                     |           1 |
| video_game_sales          |           0 |
| users_channels            |           1 |
| cleaned_sales_data        |           1 |
| users_channels-uzooNNtSRO |           0 |
+---------------------------+-------------+
23 rows in set (0.00 sec)

mysql> SELECT expression FROM sl_columns WHERE is_aggregation;
+-----------------------+
| expression            |
+-----------------------+
| sum(`SP_POP_TOTL`)    |
| sum(`SH_DYN_AIDS`)    |
| sum(`SP_RUR_TOTL_ZS`) |
| sum(`SP_DYN_LE00_IN`) |
| sum(`SP_RUR_TOTL`)    |
| COUNT(*)              |
| COUNT(*)              |
| SUM(num)              |
| COUNT(*)              |
| AVG(`2004`)           |
| COUNT(*)              |
| COUNT(*)              |
| COUNT(*)              |
| COUNT(*)              |
| count(*)              |
+-----------------------+
15 rows in set (0.02 sec)

The PR has unit tests checking that the sync works on creation, deletion, and updates.

ADDITIONAL INFORMATION

  • Has associated issue: [SIP-68] A better model for Datasets #14909
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@betodealmeida betodealmeida requested a review from a team as a code owner November 24, 2021 18:54
@betodealmeida betodealmeida force-pushed the ch26712 branch 2 times, most recently from 0d8ed9c to 20a5c90 Compare November 24, 2021 20:16
@github-actions
Copy link
Contributor

⚠️ @betodealmeida Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@mistercrunch mistercrunch added the risk:db-migration PRs that require a DB migration label Nov 29, 2021
@betodealmeida betodealmeida force-pushed the ch26712 branch 8 times, most recently from d4fa1e0 to 9b3a015 Compare December 1, 2021 01:32
@codecov
Copy link

codecov bot commented Dec 1, 2021

Codecov Report

Merging #17543 (2fcdeef) into master (dafc841) will increase coverage by 0.12%.
The diff coverage is 98.79%.

❗ Current head 2fcdeef differs from pull request most recent head 142d625. Consider uploading reports for the commit 142d625 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17543      +/-   ##
==========================================
+ Coverage   66.21%   66.33%   +0.12%     
==========================================
  Files        1633     1638       +5     
  Lines       63210    63454     +244     
  Branches     6409     6409              
==========================================
+ Hits        41852    42092     +240     
- Misses      19698    19702       +4     
  Partials     1660     1660              
Flag Coverage Δ
hive 52.45% <73.09%> (+0.16%) ⬆️
mysql 81.68% <98.79%> (+0.13%) ⬆️
postgres 81.73% <98.79%> (+0.13%) ⬆️
presto 52.29% <73.09%> (+0.16%) ⬆️
python 82.16% <98.79%> (+0.12%) ⬆️
sqlite 81.42% <98.79%> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/connectors/sqla/models.py 90.04% <98.02%> (+1.28%) ⬆️
superset/columns/models.py 100.00% <100.00%> (ø)
superset/columns/schemas.py 100.00% <100.00%> (ø)
superset/datasets/models.py 100.00% <100.00%> (ø)
superset/datasets/schemas.py 97.18% <100.00%> (+0.14%) ⬆️
superset/models/helpers.py 90.90% <100.00%> (+0.24%) ⬆️
superset/security/manager.py 94.40% <100.00%> (+0.02%) ⬆️
superset/tables/models.py 100.00% <100.00%> (ø)
superset/tables/schemas.py 100.00% <100.00%> (ø)
superset/datasets/commands/create.py 98.11% <0.00%> (-1.89%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dafc841...142d625. Read the comment docs.

@betodealmeida betodealmeida force-pushed the ch26712 branch 3 times, most recently from a0e65cb to 4ace8da Compare December 1, 2021 18:44
@betodealmeida betodealmeida requested review from dpgaspar, john-bodley and mistercrunch and removed request for mistercrunch December 1, 2021 19:10
@github-actions
Copy link
Contributor

github-actions bot commented Dec 3, 2021

⚠️ @betodealmeida Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

Copy link
Member

@john-bodley john-bodley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @betodealmeida for making this change—and adding a slew of unit tests to boot. I think your dual write approach is the right approach and allows us (the community) to iron out any kinks in the system before making the switch.

superset/columns/models.py Show resolved Hide resolved
superset/columns/models.py Outdated Show resolved Hide resolved
superset/connectors/sqla/models.py Show resolved Hide resolved
superset/tables/models.py Show resolved Hide resolved
tests/unit_tests/tables/test_models.py Outdated Show resolved Hide resolved
tests/unit_tests/tables/test_models.py Outdated Show resolved Hide resolved
tests/unit_tests/tables/test_models.py Outdated Show resolved Hide resolved
name = sa.Column(sa.Text)
type = sa.Column(sa.Text)

# Columns are defined by expressions. For tables, these are the actual columns names,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the one thing I find somewhat atypical, i.e., that a physical table/view column would require require and expression which needs to match the name. Why not just make expression nullable in that case and thus the column would only represent actual SQL expressions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hoping that having expression even for physical columns and tables will make things easier because of the consistency. One thing the new model does is automatically quote table/column names that are also identifiers; as an example, we'd have:

column = Column(
    name="select",
    expression="`select`",  # or "select", depending on the DB
    ...
)

So to select that column we just need to use its expression. Otherwise, we'd have to check if expression is null, and then potentially encode the name in order to select that column.

@github-actions
Copy link
Contributor

⚠️ @betodealmeida Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

1 similar comment
@github-actions
Copy link
Contributor

⚠️ @betodealmeida Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@betodealmeida betodealmeida merged commit 00c99c9 into apache:master Feb 24, 2022
@villebro
Copy link
Member

This is impressive work @betodealmeida ! 🚀 Talk about swapping out the airplane's engines mid flight! ✈️

# not exist in the migrations. The reason it does not physically exist is MySQL,
# PostgreSQL, etc. have a different interpretation of uniqueness when it comes to NULL
# which is problematic given the catalog and schema are optional.
__table_args__ = (UniqueConstraint("database_id", "catalog", "schema", "name"),)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@betodealmeida It's also not possible to apply this constraint in MySQL because TEXT cannot be used in index keys without specifying a length: https://stackoverflow.com/questions/1827063/mysql-error-key-specification-without-a-key-length

@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.5.0 labels Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels risk:db-migration PRs that require a DB migration sip Superset Improvement Proposal size/XXL 🚢 1.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants