
[SIP-59] Proposal for Database migration standards #13351

Closed
robdiciuccio opened this issue Feb 25, 2021 · 17 comments
Labels: preset-io, sip (Superset Improvement Proposal)

robdiciuccio (Member) commented Feb 25, 2021

[SIP] Proposal for database migration standards

Motivation

Reduce pain around metadata database migrations by ensuring standards are followed and appropriate reviews are obtained before merging.

Proposed Change

SIP-57 (Semantic Versioning) introduced standards for avoiding breaking changes and general best practices for database migrations. The proposed changes below are in addition to those standards:

  • All migrations must support rollbacks. Migrations must have a functional downgrade method to effectively roll back schema changes introduced in the upgrade method. If a migration makes changes to data that are not easily undone (e.g. fix: Retroactively add granularity param to charts #12960), the changes introduced must be non-breaking and idempotent.
  • Migrations should be atomic and configured to complete fully in a single run, using a single transaction where appropriate. Any failures should trigger a rollback to the previous state. Partial migrations should be avoided.
  • Any constraints added within a migration should include an explicit name, e.g. sa.ForeignKeyConstraint(["user_id"], ["ab_user.id"], name='fk_user_id').
  • PRs introducing database migrations must include runtime estimates and downtime expectations.
  • Care should be taken to not introduce expensive DDL operations such as adding unnecessary constraints/indexes or setting column default values on tables potentially containing thousands of rows. [1][2]
    • Indexes in Postgres tables should be added and removed CONCURRENTLY.
  • Migrations for breaking changes and cleanup (e.g. removal of columns) that should be held for the next major version, per the guidelines in SIP-57, should be accumulated in /superset/migrations/next/ for evaluation and inclusion in a future release.
  • Establish GitHub code owners on the superset/migrations directory to ensure PMC members are notified of new or updated migrations.
  • Require two approvals for PRs that include database migrations, including committers from multiple organizations.
  • PRs including database migrations should be kept open for a minimum review period of two business days to allow for adequate review, unless circumstances such as a critical vulnerability or breakage require faster turnaround.

New or Changed Public Interfaces

None.

New dependencies

No additional package dependencies.

Migration Plan and Compatibility

Workflow changes only. PR template will be updated with guidelines. Process for running migrations unchanged.

Rejected Alternatives

The status quo, which has resulted in quite a bit of thrash, deployment roadblocks and external discussions between Superset users.

robdiciuccio added the sip Superset Improvement Proposal label on Feb 25, 2021
etr2460 (Member) commented Feb 26, 2021

Love the suggestions, thanks for driving this! A couple pieces of feedback:

Migration files must have a functional down method to effectively rollback changes introduced in the up method.

This isn't always possible, especially when the migration is repairing metadata (vs. adding/removing columns and tables). See #12960 for an example of a migration for which a down method is impossible to write. Maybe this can be made more precise by saying that all migrations that modify the structure of the DB or its columns must have a down method?

PRs introducing database migrations must include runtime estimates and downtime expectations.

Love this, let's plan to add these as fields in the PR template?

Establish Github code owners

This will be our first use of code owners I think, do you have any thoughts about using this more broadly across the repo? Or have you only thought about the migration use case so far?

Nothing here besides my first point should be considered blocking though, and I'll happily vote +1 on this initiative once the thread is created!

john-bodley (Member) commented:

Should we also consider how we could provide near zero-downtime for migrations which involve DDL operations or is this outside the scope of this SIP?

mistercrunch (Member) commented Mar 2, 2021

I was just talking to an engineer today (Arash @ Preset) about the idea of using ExtraJSONMixin or a similar pattern to accumulate/delay database migrations. In his case he wanted to add a few new fields to the highly contentious Query model, and I pointed him to it.
https://github.com/apache/superset/blob/master/superset/models/helpers.py#L456-L477

It seems like ExtraJSONMixin could be further improved to be more seamless if we wanted to, but I'm not sure how people feel about it.
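For readers unfamiliar with the pattern, here is a minimal self-contained sketch of the general "extra JSON" idea — not Superset's actual ExtraJSONMixin (linked above), just an illustration of how a single JSON text column can absorb new attributes without a schema migration:

```python
import json


class ExtraJSONMixin:
    """Sketch of the extra-JSON pattern (not Superset's implementation):
    loosely structured attributes live inside one JSON text column, so
    adding a new attribute requires no DDL and no Alembic migration."""

    def __init__(self):
        self.extra_json = "{}"  # in a real model this is a sa.Text column

    @property
    def extra(self):
        return json.loads(self.extra_json or "{}")

    def set_extra_json_key(self, key, value):
        extra = self.extra
        extra[key] = value
        self.extra_json = json.dumps(extra)


class Query(ExtraJSONMixin):
    """Hypothetical model gaining ad-hoc fields without schema changes."""


q = Query()
q.set_extra_json_key("progress", 42)
assert q.extra == {"progress": 42}
```

The trade-off, of course, is that JSON-blob fields are not indexed or constrained by the database, so the pattern suits metadata rather than queryable columns.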

A few other ideas around this SIP:

  • I would recommend using an accumulation pattern for cleanup that is not immediately needed. If, for instance, we want to remove columns from the database, we can remove the field from the model but delay the related database cleanup in some sort of migrations/next.py, where we accumulate those cleanup migration scripts and defer them until the next major release (say 2.0.0), where downtime may be expected.
  • It'd be so nice to have blue/green forward-compatible stamps on migrations, meaning the previous version of the app is guaranteed to work with a future version of the database. If a migration is not blue/green compatible, it should be clearly identified as requiring downtime. I'd recommend really pushing PRs to meet this requirement, and pushing toward the accumulation pattern when that's not the case.

mistercrunch (Member) commented Mar 2, 2021

Migration files must have a functional down method to effectively rollback changes introduced in the up method.

This isn't always possible, especially when the migration is doing some repair of the metadata (vs. adding/removing columns and tables).

It could be possible in some cases by keeping the data as a backup (e.g. renaming the column) to enable just that. Of course that doesn't always work: new objects get created that may be missing the backup, and it can get very tricky to provide that guarantee, since you may have to maintain both the old and new field along with the related old/new logic. Probably over-complicated, but we can decide on a case-by-case basis whether it makes sense to try to guarantee the down-migration. If it's not possible, we may want to delay that migration until a bigger release, if possible.

rusackas (Member) commented Mar 2, 2021

must include runtime estimates and downtime expectations

We've seen instances in the past where one contributor thought runtime/downtime would be minimal based on their perceived use cases. When merged, other orgs had significantly/exponentially more data that needed migration, and the execution time was a pain point. How can we most accurately provide realistic/reasonable estimates given the fairly disparate use cases and datasets of Superset users/institutions?

robdiciuccio (Member, Author) commented:

Migration files must have a functional down method to effectively rollback changes introduced in the up method.

This isn't always possible, especially when the migration is doing some repair of the metadata (vs. adding/removing columns and tables).

Good point. The primary goal here is to be able to successfully roll back from any migration. The example you provided is idempotent and additive, which fits the criteria. How about this updated language?

All migrations must support rollbacks. Migrations must have a functional downgrade method to effectively roll back schema changes introduced in the upgrade method. If a migration makes changes to data that are not easily undone (e.g. #12960), the changes introduced must be non-breaking and idempotent.
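As an illustration of the "non-breaking and idempotent" criterion, a data-repair step can be written so that re-running it changes nothing and never overwrites existing values. This is a hypothetical helper, not the actual #12960 logic:

```python
import json


def repair_params(params_json, default_granularity="P1D"):
    # Fill the key only when it is missing: running the migration twice
    # yields the same result (idempotent), and existing user-set values
    # are never overwritten (non-breaking, purely additive).
    params = json.loads(params_json or "{}")
    params.setdefault("granularity_sqla", default_granularity)
    return json.dumps(params, sort_keys=True)


once = repair_params('{"metric": "count"}')
assert repair_params(once) == once  # idempotent: second run is a no-op
```

Because the change is additive, a no-op downgrade is acceptable: older app versions simply ignore the extra key.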

robdiciuccio (Member, Author) commented:

This will be our first use of code owners I think, do you have any thoughts about using this more broadly across the repo? Or have you only thought about the migration use case so far?

Another use case I'm thinking about for code owners is the new ephemeral test environment workflow code: adding Preset code owners to ensure AWS resources are not changed without account owner approval.
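For reference, the migrations use case might look like the following hypothetical CODEOWNERS fragment; the team handle is illustrative, not an actual Superset team:

```
# Hypothetical fragment of .github/CODEOWNERS; PRs touching anything
# under superset/migrations/ would automatically request review from
# the named team.
/superset/migrations/  @apache/superset-committers
```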

robdiciuccio (Member, Author) commented:

We've seen instances in the past where one contributor thought runtime/downtime would be minimal based on their perceived use cases. When merged, other orgs had significantly/exponentially more data that needed migration, and the execution time was a pain point. How can we most accurately provide realistic/reasonable estimates given the fairly disparate use cases and datasets of Superset users/institutions?

Yeah, that's a bit tricky. One idea is to provide run times for different row counts, which could then be reasonably extrapolated for larger datasets. In general, committers notified via the proposed GitHub code owners should know if the tables being altered will incur significant migration overhead.

Require two approvals for PRs that include database migrations, including committers from multiple organizations.

Should we also require that the PR be open for review for a minimum period of time (48h?) to ensure committers from different orgs have time to review?
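One rough way to produce the per-row-count numbers suggested above is sketched below. This is only an illustration: SQLite stands in for the real metadata database, and the DDL and backfill statements are stand-ins for an actual migration, so real estimates would run the actual Alembic migration against MySQL/Postgres snapshots of varying size:

```python
import sqlite3
import time


def time_migration(row_count):
    """Time a stand-in schema change + backfill against row_count rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE charts (id INTEGER PRIMARY KEY, params TEXT)")
    conn.executemany(
        "INSERT INTO charts (params) VALUES (?)", [("{}",)] * row_count
    )
    conn.commit()

    start = time.perf_counter()
    conn.execute("ALTER TABLE charts ADD COLUMN granularity TEXT")  # stand-in DDL
    conn.execute("UPDATE charts SET granularity = 'P1D'")           # stand-in backfill
    conn.commit()
    return time.perf_counter() - start


# Report timings at a few sizes so reviewers can extrapolate.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} rows: {time_migration(n):.4f}s")
```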

robdiciuccio (Member, Author) commented:

Should we also consider how we could provide near zero-downtime for migrations which involve DDL operations or is this outside the scope of this SIP?

Making this work for all metadata DB types will be difficult, as the pitfalls and tooling are different for each. We could add some guidance around things like setting default values and creating indexes on tables with many rows, but DDL is going to potentially cause downtime on some systems unless you're using a tool like pt-online-schema-change (for MySQL).

robdiciuccio (Member, Author) commented:

Ran across this guidance in the Alembic docs about naming constraints. Thoughts on including this as a requirement for migrations?

craig-rueda (Member) commented:

To build on Rob's point above, I've noticed several migrations that do things like call commit() on their current session multiple times (usually in a loop), which breaks the atomicity guarantee of migrations. I'm sure Alembic wraps the current session and intercepts calls to commit() under the covers, but we should still be checking for this sort of thing.
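The commit-in-a-loop anti-pattern and its atomic rewrite can be sketched like so (SQLite for illustration; a real migration would go through the Alembic/SQLAlchemy session):

```python
import sqlite3


def migrate_rows_atomically(conn, transform):
    """Apply transform to every row's params in ONE transaction.

    Committing inside the loop would leave partial state behind if a
    later row failed; with a single commit, any failure rolls the whole
    migration back to the previous state.
    """
    rows = conn.execute("SELECT id, params FROM charts").fetchall()
    try:
        for row_id, params in rows:
            conn.execute(
                "UPDATE charts SET params = ? WHERE id = ?",
                (transform(params), row_id),
            )
        conn.commit()  # one commit for the whole migration
    except Exception:
        conn.rollback()  # failure restores the previous state
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE charts (id INTEGER PRIMARY KEY, params TEXT)")
conn.executemany("INSERT INTO charts (params) VALUES (?)", [("a",), ("b",)])
migrate_rows_atomically(conn, str.upper)
rows = [r[0] for r in conn.execute("SELECT params FROM charts ORDER BY id")]
assert rows == ["A", "B"]
```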

robdiciuccio (Member, Author) commented:

I would recommend using an accumulation pattern around cleanup that are not immediately needed, meaning if for instance we want to remove columns in the database, we can remove the field from the model, but delay the related database field cleanup in some sort of migrations/next.py where we accumulate those cleanup migration scripts and defer until the next big number release (say 2.0.0) where downtime may be expected

@mistercrunch agreed, I added an item for accumulating breaking/cleanup migrations for the next major release

It'd be so nice to have blue/green forward compatible stamps on migrations, meaning previous version of the app is guaranteed to work with future version of the database. In many cases if the migration is not blue-green compatible it should be clearly identified as it requires downtime. I'd recommend really pushing PRs to meet this req and pushing to using the accumulation pattern when that's not the case.

I think the standards set forth in SIP-57 re: breaking changes should accomplish this goal, unless you have something else in mind?

robdiciuccio (Member, Author) commented:

@craig-rueda I added some detail around atomicity of migrations

robdiciuccio (Member, Author) commented:

Updated the SIP above based on feedback in this thread. Will send for a vote on Friday if there's no other discussion items.

betodealmeida (Member) commented:

@robdiciuccio @evans regarding "PRs introducing database migrations must include runtime estimates and downtime expectations", I'm working on a script to run benchmarks on migrations that pre-populates the models:

#13561

robdiciuccio (Member, Author) commented:

The SIP has been approved with nine binding +1 votes, four non-binding +1 votes, zero 0 votes and zero -1 votes.

Status: Implemented / Done