restarting multiple workers at once risks applying database migrations multiple times #8006
I thought we used to have a thing that stopped migrations from running on anything other than the main process. In any case, empirically it doesn't work any more.
If the solution to this ends up being any sort of database-level locking, it would be nice to consider #6467 at the same time.
I think we should be able to just not run the prepare-database step if we're not on master?
Is there a risk of those workers starting and then erroring because the migrations haven't finished yet?
Hmm, true. We (matrix.org) do try and ensure master starts up first, but I'm not sure that's documented anywhere.
That would be preferable to the current situation imho. But I also wouldn't object to the other workers going into a sleep/check loop until the db got upgraded.
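For illustration, the sleep/check loop idea might look something like the sketch below. This is a minimal sketch, assuming a `schema_version` table with a `version` column and a psycopg2 connection; the `SCHEMA_VERSION` constant and the timeout values are made up.

```python
import time

import psycopg2

SCHEMA_VERSION = 58  # hypothetical: the schema version this build expects


def wait_for_schema(conn, timeout=300, interval=5):
    """Poll until the database schema reaches SCHEMA_VERSION, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with conn.cursor() as cur:
            cur.execute("SELECT version FROM schema_version")
            row = cur.fetchone()
        conn.rollback()  # end the implicit transaction between polls
        if row is not None and row[0] >= SCHEMA_VERSION:
            return  # master has finished migrating; safe to start up
        time.sleep(interval)
    raise RuntimeError("timed out waiting for database migrations to complete")
```

A worker would call something like this instead of running the migrations itself, which turns the "restart master first" convention into something the code actually enforces.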
I'm also not convinced it's true that we wait for the master to restart before we restart the workers on the other server.
We have historically said that we restart the server master runs on before the other servers when upgrading. That advice has probably gotten a bit lost over time too.
@reivilibre suggests maybe we could lock the schema version table while we do the migration.
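As a rough sketch of that idea (not what Synapse actually does): take an exclusive lock on the schema version table inside the same transaction as the upgrade, so a second process blocks until the first commits and then sees the new version. The table and column names, `SCHEMA_VERSION`, and `run_migrations` are illustrative assumptions.

```python
import psycopg2

SCHEMA_VERSION = 58  # hypothetical target version for this build


def run_migrations(cur, current):
    """Hypothetical stand-in for the real per-version upgrade scripts."""


def upgrade_with_lock(conn):
    # One transaction around the lock, the version check, and the
    # migrations; psycopg2's `with conn:` commits on success.
    with conn:
        with conn.cursor() as cur:
            # ACCESS EXCLUSIVE is held until commit, so a concurrent
            # upgrader blocks here and re-reads the version afterwards.
            cur.execute("LOCK TABLE schema_version IN ACCESS EXCLUSIVE MODE")
            cur.execute("SELECT version FROM schema_version")
            (current,) = cur.fetchone()
            if current < SCHEMA_VERSION:
                run_migrations(cur, current)
                cur.execute(
                    "UPDATE schema_version SET version = %s", (SCHEMA_VERSION,)
                )
```

Note the lock only helps if the whole upgrade really does run inside that one transaction, which is exactly the #6467 point raised below.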
Well, this might be true if we actually made the upgrades respect transactions correctly (cf. #6467).
Well, that feels like a prerequisite to this if we're not correctly wrapping things in transactions?
Depends, but sure, it would be a good thing to fix :)
Looking at this more closely, because psycopg starts a new transaction for every statement, concurrent attempts to upgrade the database will probably fail for the reason Erik gave previously. I still think it's confusing, though.
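For what it's worth, a small illustration of the implicit transactions being referred to (the connection string is a placeholder): psycopg2 has autocommit off by default, so a transaction is opened on the first statement and nothing is visible to other connections until `commit()`.

```python
import psycopg2

conn = psycopg2.connect("dbname=synapse")  # placeholder DSN

cur = conn.cursor()
# Autocommit is off by default, so this implicitly opens a transaction;
# other connections keep seeing the old version until commit().
cur.execute("UPDATE schema_version SET version = 59")
# A concurrent upgrader trying to update the same row blocks on our
# transaction here, which is why simultaneous upgrade attempts tend to
# fail or stall rather than silently interleave.
conn.commit()
```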
Fixed by #8266.
Each worker independently checks whether the schema is up to date and applies the migrations if not. At best this fails with exceptions; at worst it could result in data corruption.
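The race in miniature, as a sketch (the helper names and version numbers are hypothetical stand-ins for the real prepare-database code):

```python
SCHEMA_VERSION = 58  # hypothetical version expected by this build


def get_schema_version(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT version FROM schema_version")
        return cur.fetchone()[0]


def apply_migrations(conn, current):
    """Hypothetical stand-in for running the upgrade scripts."""


def prepare_database(conn):
    current = get_schema_version(conn)   # workers A and B both read 57
    if current < SCHEMA_VERSION:         # both observe 57 < 58
        apply_migrations(conn, current)  # both run the same DDL
```

Nothing synchronises the check with the apply, so the outcome depends on which statements happen to conflict at the database level.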