feat: add UUID column to ImportMixin #11098

Merged
merged 18 commits into from
Oct 7, 2020

Conversation

betodealmeida
Member

SUMMARY

This PR is a simple rewrite of #7829, adding a UUID column to the ImportMixin. This initial work will be used to improve the import/export functionality in Superset by producing artifacts that are not dependent on the primary keys of a particular database.
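
To illustrate why artifacts keyed by UUID are not tied to one database's primary keys, here is a minimal sketch. The `export_chart` and `import_chart` helpers and the dict-based "tables" are hypothetical and are not Superset's import/export code; they only show a reference being resolved by UUID instead of by `id`.

```python
import uuid


def export_chart(chart, datasets_by_id):
    """Serialize a chart, referencing its dataset by UUID instead of primary key."""
    dataset = datasets_by_id[chart["datasource_id"]]
    return {
        "uuid": chart["uuid"],
        "slice_name": chart["slice_name"],
        "dataset_uuid": dataset["uuid"],
    }


def import_chart(artifact, datasets_by_uuid, next_id):
    """Rebuild a chart row in the target DB, resolving the dataset by UUID."""
    dataset = datasets_by_uuid[artifact["dataset_uuid"]]
    return {
        "id": next_id,
        "uuid": artifact["uuid"],
        "slice_name": artifact["slice_name"],
        "datasource_id": dataset["id"],
    }


# Source database: the dataset has primary key 4.
dataset_uuid = str(uuid.uuid4())
source_datasets = {4: {"id": 4, "uuid": dataset_uuid}}
source_chart = {
    "id": 82,
    "uuid": str(uuid.uuid4()),
    "slice_name": "Unicode Cloud",
    "datasource_id": 4,
}

artifact = export_chart(source_chart, source_datasets)

# Target database: the same dataset already exists under primary key 17.
target_datasets = {dataset_uuid: {"id": 17, "uuid": dataset_uuid}}
imported = import_chart(artifact, target_datasets, next_id=5)
```

The exported artifact carries no primary keys, so the chart ends up pointing at the right dataset even though the `id` values differ between the two databases.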

TEST PLAN

$ superset db upgrade 
$ superset db downgrade e5ef6828ac4e

Confirmed that columns are created and populated.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

@mistercrunch mistercrunch added the risk:db-migration PRs that require a DB migration label Sep 29, 2020
Member

@mistercrunch mistercrunch left a comment

Overall looks simpler than I thought it would be. In the interest of minimizing the number of database migrations, do we want to alter Dashboard.json_metadata in the same migration?

I'm open to more smaller PRs too. Unclear to me how much work migrating Dashboard.json_metadata is, but it shouldn't be too bad.

@betodealmeida
Member Author

> Overall looks simpler than I thought it would be. In the interest of minimizing the number of database migrations, do we want to alter Dashboard.json_metadata in the same migration?

I'm not sure... on one hand, doing it in this PR would reduce the number of migrations, which is great. On the other hand, keeping PRs with DB migrations as small as possible makes them easier to cherry-pick.

Do you have any preferences? I think updating the column Dashboard.position_json to use UUIDs is simple, but I'm worried about updating the logic that touches it. We'd also have to update the examples.

@mistercrunch
Member

One problem with DB migrations is that they cannot be cherry-picked out of order. Or rather, they can if both reference the same parent, but they then ultimately need a converging migration. All this is fairly confusing. Fewer migrations is probably better. Either way, I'd advise release managers to avoid cherry-picking anything with a migration.

@betodealmeida
Member Author

> One problem with DB migrations is that they cannot be cherry-picked out of order. Or rather, they can if both reference the same parent, but they then ultimately need a converging migration. All this is fairly confusing. Fewer migrations is probably better. Either way, I'd advise release managers to avoid cherry-picking anything with a migration.

Right. The case I was thinking of was when you want to cherry-pick feature B, and it has a database migration Mb that comes after another migration Ma that also needs to be cherry-picked. If the PR implementing Ma and the feature A is big it's harder to cherry-pick it, but you're forced to do it because of the Alembic DAG.

If instead you separate feature A into one PR and the migration Ma into another, someone interested in cherry-picking feature B can cherry-pick just the migration Ma and skip the PR implementing A, assuming that the DB migration is non-disruptive (e.g., adding a column, as we're doing here).

But in this case it doesn't matter, because the migration changing position_json is disruptive, and can't be separated from the changes in the logic to read the new schema. So I'll go ahead and implement the migration of position_json in this PR to consolidate the migrations.
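
The Ma/Mb dependency described above can be sketched as a walk over Alembic-style `down_revision` links. The revision names and the `required_migrations` helper are hypothetical; this is not Alembic's API, only a toy model of why Mb drags Ma along.

```python
# Hypothetical revisions: each one maps to its down_revision (its parent).
DOWN_REVISION = {
    "base": None,
    "Ma": "base",  # migration shipped with feature A
    "Mb": "Ma",    # migration shipped with feature B
}


def required_migrations(target, already_applied):
    """Follow down_revision links from `target` and return, oldest first,
    every migration that the release branch does not have yet."""
    chain = []
    rev = target
    while rev is not None and rev not in already_applied:
        chain.append(rev)
        rev = DOWN_REVISION[rev]
    return list(reversed(chain))


# The release branch only has "base"; cherry-picking feature B needs Mb.
needed = required_migrations("Mb", already_applied={"base"})
print(needed)
```

If Ma lives in its own small PR, picking it up is cheap; if it is bundled with all of feature A, the whole PR has to come along.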

Thanks, Max!


# add uniqueness constraint
with op.batch_alter_table(model.__tablename__) as batch_op:
    batch_op.create_unique_constraint("uq_uuid", ["uuid"])
Member

are we planning to do any lookups by uuid? Should we add an index on those columns if so?

Member Author

Good point. I don't think so, at least not in the near future; we're using it only to ensure consistent relationships.
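
On the index question: a unique constraint is typically backed by an index already, so lookups by `uuid` may not need a separate one. The sketch below checks this on SQLite only, using an inline `UNIQUE` column in a simplified, hypothetical `slices` table instead of the batch migration above; other databases should be verified separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE slices ("
    "id INTEGER PRIMARY KEY, slice_name TEXT, uuid CHAR(32) UNIQUE)"
)
conn.execute(
    "INSERT INTO slices (slice_name, uuid) VALUES (?, ?)",
    ("slice 1", "706c8c3c175b460690164ef7e2ebff09"),
)

# Ask SQLite how it would execute a lookup by uuid.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM slices WHERE uuid = ?",
    ("706c8c3c175b460690164ef7e2ebff09",),
).fetchall()

# The last column of each plan row is the human-readable detail.
plan_detail = " ".join(row[-1] for row in plan)
print(plan_detail)
```

The plan should mention an automatically created index on `slices` instead of a full table scan.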

@betodealmeida
Member Author

betodealmeida commented Sep 29, 2020

-- BEFORE migration
sqlite> SELECT position_json FROM dashboards WHERE id=8;
{
    "CHART-Hkx6154FEm": {
        "children": [],
        "id": "CHART-Hkx6154FEm",
        "meta": {
            "chartId": 82,
            "height": 30,
            "sliceName": "slice 1",
            "width": 4
        },
        "type": "CHART"
    },
    "GRID_ID": {
        "children": [
            "ROW-SyT19EFEQ"
        ],
        "id": "GRID_ID",
        "type": "GRID"
    },
    "ROOT_ID": {
        "children": [
            "GRID_ID"
        ],
        "id": "ROOT_ID",
        "type": "ROOT"
    },
    "ROW-SyT19EFEQ": {
        "children": [
            "CHART-Hkx6154FEm"
        ],
        "id": "ROW-SyT19EFEQ",
        "meta": {
            "background": "BACKGROUND_TRANSPARENT"
        },
        "type": "ROW"
    },
    "DASHBOARD_VERSION_KEY": "v2"
}
-- AFTER migration
sqlite> SELECT position_json FROM dashboards WHERE id=8;
{
    "CHART-Hkx6154FEm": {
        "children": [],
        "id": "CHART-Hkx6154FEm",
        "meta": {
            "chartId": 82,
            "height": 30,
            "sliceName": "slice 1",
            "width": 4,
            "uuid": "706c8c3c-175b-4606-9016-4ef7e2ebff09"
        },
        "type": "CHART"
    },
    "GRID_ID": {
        "children": [
            "ROW-SyT19EFEQ"
        ],
        "id": "GRID_ID",
        "type": "GRID"
    },
    "ROOT_ID": {
        "children": [
            "GRID_ID"
        ],
        "id": "ROOT_ID",
        "type": "ROOT"
    },
    "ROW-SyT19EFEQ": {
        "children": [
            "CHART-Hkx6154FEm"
        ],
        "id": "ROW-SyT19EFEQ",
        "meta": {
            "background": "BACKGROUND_TRANSPARENT"
        },
        "type": "ROW"
    },
    "DASHBOARD_VERSION_KEY": "v2"
}
sqlite> SELECT * FROM slices WHERE uuid=REPLACE('706c8c3c-175b-4606-9016-4ef7e2ebff09', '-', '');
2020-09-23 12:21:27.651468|2020-09-23 12:43:27.098505|82|Unicode Cloud|table||word_cloud|{"granularity_sqla": "dttm", "groupby": [], "limit": "100", "metric": {"aggregate": "SUM", "column": {"column_name": "value"}, "expressionType": "SIMPLE", "label": "Value"}, "rotation": "square", "row_limit": 50000, "series": "short_phrase", "since": "100 years ago", "size_from": "10", "size_to": "70", "until": "now", "viz_type": "word_cloud", "remote_id": 33, "datasource_name": "unicode_test", "schema": null, "database_name": "examples", "import_time": 1600890207}|2|2|||[examples].[unicode_test](id:4)|4||706c8c3c175b460690164ef7e2ebff09
sqlite> 
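
The before/after difference shown above can be sketched as a small transformation. The `add_chart_uuids` helper and the `uuid_by_chart_id` mapping are hypothetical and are not the migration's actual code; the sample data is a trimmed copy of the dashboard above.

```python
import json
import uuid


def add_chart_uuids(position_json, uuid_by_chart_id):
    """Return position_json with a `uuid` added to the meta of each CHART node."""
    position = json.loads(position_json)
    for node in position.values():
        # Entries such as DASHBOARD_VERSION_KEY are plain strings, not nodes.
        if not isinstance(node, dict) or node.get("type") != "CHART":
            continue
        chart_id = node.get("meta", {}).get("chartId")
        if chart_id in uuid_by_chart_id:
            node["meta"]["uuid"] = str(uuid_by_chart_id[chart_id])
    return json.dumps(position, indent=4)


before = json.dumps({
    "CHART-Hkx6154FEm": {
        "children": [],
        "id": "CHART-Hkx6154FEm",
        "meta": {"chartId": 82, "height": 30, "sliceName": "slice 1", "width": 4},
        "type": "CHART",
    },
    "DASHBOARD_VERSION_KEY": "v2",
})

chart_uuid = uuid.UUID("706c8c3c-175b-4606-9016-4ef7e2ebff09")
after = json.loads(add_chart_uuids(before, {82: chart_uuid}))

# The slices row stores the value without dashes, hence the REPLACE() in the query above.
stored_form = chart_uuid.hex
```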

@betodealmeida betodealmeida changed the title Add UUID column to ImportMixin feat: add UUID column to ImportMixin Sep 30, 2020
Member

@villebro villebro left a comment

A minor perf comment/question.

Comment on lines 119 to 124
sa.Column(
    "uuid",
    UUIDType(binary=False),
    primary_key=False,
    default=uuid.uuid4,
)
Member

I had compatibility issues with sqlalchemy_utils.UUIDType across different databases some time ago (I believe I was mixing Postgres and SQLite at the time). I believe the resolution back then was to use binary=False, as you've done, but I believe that eliminates the performance benefits of using UUIDType over a traditional CHAR/VARCHAR implementation. Did you try running it with binary=True, and did that cause CI trouble on SQLite vs Postgres vs MySQL?

Member Author

Thanks for the feedback, @villebro! I haven't tried running with binary=True; I'll give it a try as soon as I fix the unit tests that are not passing.
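
For a rough sense of what is at stake in the binary question, the sketch below compares the size of the same UUID in its raw and textual forms using only the standard library. It assumes the non-binary mode stores the dashless hex form, as the slices row earlier in this conversation suggests; it measures nothing about actual query performance.

```python
import uuid

value = uuid.UUID("706c8c3c-175b-4606-9016-4ef7e2ebff09")

as_binary = value.bytes  # raw bytes, what a binary column would hold
as_hex = value.hex       # hex characters without dashes
as_text = str(value)     # canonical form with dashes

sizes = {
    "binary": len(as_binary),
    "hex": len(as_hex),
    "text": len(as_text),
}
print(sizes)

# The binary form round-trips to the same UUID.
restored = uuid.UUID(bytes=as_binary)
```

The binary form is half the size of the hex form, which also keeps any index on the column smaller.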

@codecov-io

codecov-io commented Oct 6, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@94d4d55).
The diff coverage is 14.81%.

@@            Coverage Diff            @@
##             master   #11098   +/-   ##
=========================================
  Coverage          ?   61.60%           
=========================================
  Files             ?      829           
  Lines             ?    39195           
  Branches          ?     3688           
=========================================
  Hits              ?    24145           
  Misses            ?    14869           
  Partials          ?      181           
Flag          Coverage Δ
#javascript   62.30% <ø> (?)
#python       61.18% <14.81%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files                                          Coverage Δ
...ns/b56500de1855_add_uuid_column_to_import_mixin.py   0.00% <0.00%> (ø)
superset/models/slice.py                                88.02% <ø> (ø)
superset/views/core.py                                  74.26% <0.00%> (ø)
superset/dashboards/dao.py                              94.38% <100.00%> (ø)
superset/datasets/schemas.py                            94.28% <100.00%> (ø)
superset/models/helpers.py                              87.44% <100.00%> (ø)
superset/utils/core.py                                  89.50% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 94d4d55...921960a.

Member

@mistercrunch mistercrunch left a comment

I did another pass and everything LGTM

Member

@villebro villebro left a comment

LGTM, a great step forward for imports/exports! ❤️

@betodealmeida betodealmeida merged commit 9785667 into apache:master Oct 7, 2020
@betodealmeida
Member Author

betodealmeida commented Oct 7, 2020

For some reason this PR broke master (9785667 errored); I'm working on a fix in #11196.

Base = declarative_base()


class ImportMixin:
Member

@betodealmeida this migration is very slow, which is worth mentioning in the changelog. For example, it took ~30 min on our staging env, and that often means extra downtime for the service.

Member Author

Good point, @bkyryliuk! I'll add it today.

Member

@bkyryliuk @betodealmeida I've managed to rewrite the uuid generation with native SQL queries and sped up the migration process by more than 100x. The whole migration job can now complete in under 5 minutes for our Superset db of more than 200k slices and 1 million table_columns. Do you mind taking a look and maybe testing it on your Superset deployments as well?
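
The SQL used in that rewrite is not shown in this conversation. As an illustration of the general idea only, the sketch below fills a `uuid` column with a single set-based `UPDATE` on SQLite instead of one statement per row. The `UUID4_HEX_SQL` expression and the simplified `slices` table are assumptions made for this example; Postgres and MySQL would need their own UUID expressions.

```python
import sqlite3
import uuid

# SQLite-only expression producing a random version-4 UUID as 32 hex characters.
UUID4_HEX_SQL = """
    lower(hex(randomblob(6)))
    || '4' || substr(lower(hex(randomblob(2))), 2)
    || substr('89ab', (random() & 3) + 1, 1) || substr(lower(hex(randomblob(2))), 2)
    || lower(hex(randomblob(6)))
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE slices (id INTEGER PRIMARY KEY, slice_name TEXT, uuid CHAR(32))"
)
conn.executemany(
    "INSERT INTO slices (slice_name) VALUES (?)",
    [(f"slice {i}",) for i in range(1000)],
)

# One statement fills every row; the random functions are evaluated per row.
conn.execute(f"UPDATE slices SET uuid = {UUID4_HEX_SQL} WHERE uuid IS NULL")

values = [row[0] for row in conn.execute("SELECT uuid FROM slices")]
```

Keeping the work inside the database avoids a round trip per row, which is where a row-by-row backfill tends to spend most of its time.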

@ktmud ktmud mentioned this pull request Oct 9, 2020
6 tasks
betodealmeida added a commit to betodealmeida/incubator-superset that referenced this pull request Oct 13, 2020
betodealmeida added a commit that referenced this pull request Oct 14, 2020
* Add note about #11098

* Update UPDATING.md

Better description

Co-authored-by: Jesse Yang <jesse.yang@airbnb.com>

Co-authored-by: Jesse Yang <jesse.yang@airbnb.com>
auxten pushed a commit to auxten/incubator-superset that referenced this pull request Nov 20, 2020
* Add UUID column to ImportMixin

* Fix default value

* Fix lint

* Fix order of downgrade

* Add logging when downgrade fails

* Migrate position_json to contain UUIDs, and add schedule tables

* Save UUID when adding charts to dashboard

* Fix heads

* Rename migration file

* Fix dashboard serialization

* Fix migration script with Postgres

* Fix unique contraint name

* Handle UUID when exporting dashboard

* Fix Dataset PUT

* Add UUID JSON serialization

* Fix tests

* Simplify logic

* Try binary=True
auxten pushed a commit to auxten/incubator-superset that referenced this pull request Nov 20, 2020
…1256)

* Add note about apache#11098

* Update UPDATING.md

Better description

Co-authored-by: Jesse Yang <jesse.yang@airbnb.com>

Co-authored-by: Jesse Yang <jesse.yang@airbnb.com>
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.0.0 labels Mar 12, 2024
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels risk:db-migration PRs that require a DB migration size/L 🚢 1.0.0
7 participants