This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

synapse takes ages to restart #7968

Open
richvdh opened this issue Jul 28, 2020 · 26 comments
Labels
T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

@richvdh
Member

richvdh commented Jul 28, 2020

Why does it take 70 seconds to initialise the database? It might be some sort of database connection contention, but it would be nice if there were some log messages or something.

2020-07-28 09:56:32,141 - synapse.storage.data_stores - 50 - INFO - None - Preparing database 'master'...
2020-07-28 09:56:56,412 - synapse.storage.prepare_database - 325 - INFO - None - Upgrading schema to v58
2020-07-28 09:56:56,421 - synapse.storage.data_stores - 60 - INFO - None - Starting 'main' data store
2020-07-28 09:57:42,940 - synapse.config.appservice - 87 - INFO - None - Loaded application service: ApplicationService: ...

what's going on there?

@richvdh richvdh changed the title from "master blocks for a long tie during startup" to "master blocks for a long time during startup" on Jul 28, 2020
@richvdh
Member Author

richvdh commented Aug 19, 2020

keywords: restart

richvdh added a commit that referenced this issue Sep 6, 2020
I'm hoping this will provide some pointers for debugging
#7968.
@richvdh richvdh self-assigned this Sep 6, 2020
richvdh added a commit that referenced this issue Sep 7, 2020
I'm hoping this will provide some pointers for debugging
#7968.
@richvdh
Member Author

richvdh commented Sep 7, 2020

A lot of the problem here is this query:

SELECT instance_name, MAX(stream_id) FROM cache_invalidation_stream_by_instance GROUP BY instance_name

... which takes six seconds on a relatively idle database.
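
For anyone trying to reproduce this, here is a minimal sketch (assuming psycopg2 and a "dbname=synapse" DSN, neither of which is specified in this issue) that times the query and prints its plan, to check whether postgres is answering the GROUP BY from an index or by scanning the whole table:

# Hypothetical reproduction script, not part of synapse: time the slow
# startup query and dump its plan with EXPLAIN ANALYZE.
import time

import psycopg2

QUERY = (
    "SELECT instance_name, MAX(stream_id) "
    "FROM cache_invalidation_stream_by_instance GROUP BY instance_name"
)

with psycopg2.connect("dbname=synapse") as conn, conn.cursor() as cur:
    start = time.monotonic()
    cur.execute(QUERY)
    rows = cur.fetchall()
    print(f"{len(rows)} rows in {time.monotonic() - start:.1f}s")

    # EXPLAIN ANALYZE re-runs the query and reports the plan postgres chose.
    cur.execute("EXPLAIN ANALYZE " + QUERY)
    for (line,) in cur.fetchall():
        print(line)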

@richvdh
Member Author

richvdh commented Sep 7, 2020

this one takes 11 seconds:

SELECT COUNT(*) FROM users WHERE name NOT LIKE '%:matrix.org'

@richvdh
Member Author

richvdh commented Sep 7, 2020

hopefully the latter will be fixed by #8271.

@richvdh
Member Author

richvdh commented Sep 7, 2020

I've also manually cleared out cache_invalidation_stream_by_instance, which seems to have mostly resolved this problem, so I'm going to close this and leave #8269 to track a more permanent fix.

@richvdh richvdh closed this as completed Sep 7, 2020
@richvdh
Member Author

richvdh commented Sep 30, 2020

it's still pretty slow, so I'm not sure it's fixed.

@richvdh richvdh reopened this Sep 30, 2020
@richvdh
Member Author

richvdh commented Sep 30, 2020

shutting down also takes ages: that seems to be mostly because python runs a final GC at shutdown, which can take many seconds. Maybe we can skip that GC. (or maybe it improves if we fix #7176)

@richvdh
Member Author

richvdh commented Oct 1, 2020

(or maybe it improves if we fix #7176)

given we're running zope.interface-4.7.1 on matrix.org, that's sadly unlikely.

@richvdh
Member Author

richvdh commented Oct 1, 2020

(or maybe it improves if we fix #7176)

given we're running zope.interface-4.7.1 on matrix.org, that's sadly unlikely.

Per https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172, we could skip the final GC with an early atexit.register(os._exit, 0).
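
A minimal sketch of that trick (illustrative only, not synapse code): atexit handlers run in last-in, first-out order, so a handler registered very early runs last, and os._exit() then terminates the process without the final GC pass or the rest of interpreter teardown.

# Sketch of the Instagram-style shutdown shortcut.  Because atexit handlers
# run in LIFO order, registering this one first means it runs after every
# other atexit handler; os._exit() then exits without the final garbage
# collection pass.
import atexit
import os

atexit.register(os._exit, 0)  # registered first, therefore runs last

# ... the rest of startup continues as normal.  Any atexit handlers
# registered after this point (e.g. flushing logs) still run before
# os._exit() is reached.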

@richvdh
Member Author

richvdh commented Oct 1, 2020

I think I want to make this a release blocker; it feels like we've had a big regression here in the last couple of weeks.

I just tried restarting an idle worker:

Oct 01 15:21:52 hippogriff.matrix.org systemd[1882]: Stopping Synapse test_worker1 worker...
Oct 01 15:21:53 hippogriff.matrix.org systemd[1882]: synapse-worker@test_worker1.service: Succeeded.
Oct 01 15:21:53 hippogriff.matrix.org systemd[1882]: Stopped Synapse test_worker1 worker.
Oct 01 15:21:53 hippogriff.matrix.org systemd[1882]: Starting Synapse test_worker1 worker...
Oct 01 15:22:09 hippogriff.matrix.org systemd[1882]: synapse-worker@test_worker1.service: Main process exited, code=killed, status=15/TERM
Oct 01 15:22:09 hippogriff.matrix.org systemd[1882]: synapse-worker@test_worker1.service: Succeeded.
Oct 01 15:22:09 hippogriff.matrix.org systemd[1882]: Stopped Synapse test_worker1 worker.
Oct 01 15:22:09 hippogriff.matrix.org systemd[1882]: Starting Synapse test_worker1 worker...
Oct 01 15:22:35 hippogriff.matrix.org synctl[8957]: started synapse.app.generic_worker('/home/synapse/config/workers/test_worker1.yaml')
Oct 01 15:22:35 hippogriff.matrix.org systemd[1882]: Started Synapse test_worker1 worker.

42 seconds to restart an idle process is no good.

@richvdh
Member Author

richvdh commented Oct 1, 2020

I've also seen the startup sequence take more than 60 seconds (in addition to schema migrations).

@richvdh
Member Author

richvdh commented Oct 2, 2020

This is now much better thanks to #8447. I still think there's work to be done here (on matrix.org, shutdown takes 12 seconds; startup takes 11) so will leave this open.

@richvdh richvdh changed the title from "master blocks for a long time during startup" to "synapse takes ages to restart" on Jan 5, 2021
@richvdh
Member Author

richvdh commented Jan 5, 2021

I just had another restart which took 83 seconds:

  • 11:35:36: send SIGTERMs to workers
  • 11:35:46: send SIGKILLs to workers
  • 11:35:51: last worker finally dies
  • 11:35:51: send SIGTERM to master
  • 11:36:01: send SIGKILL to master
  • 11:36:02: start master
  • 11:36:29: master is up; start workers
  • 11:36:59: workers started

A few thoughts:

  • do we (still) need to enforce the ordering of shutting down workers before starting the master, and starting the master before the workers? One reason for doing so used to be to ensure the DB was updated before starting the workers, but since Refuse to upgrade database on worker processes #8266 the workers will simply refuse to start in this situation. Breaking this link would roughly halve the restart time.
  • During startup, the workers all pause for 16 seconds on:
    2021-01-05 11:36:40,797 - synapse.storage.util.id_generators - 57 - INFO - None - initialising stream generator for presence_stream(stream_id)

    can we make that faster? (A sketch of the underlying query shape follows after this list.)
  • The master, on the other hand, seems to pause for some different stream generators:
    2021-01-05 11:36:14,787 - synapse.storage.util.id_generators - 57 - INFO - None - initialising stream generator for push_rules_stream(stream_id)
    2021-01-05 11:36:21,874 - synapse.storage.util.id_generators - 57 - INFO - None - initialising stream generator for access_tokens(id)
    2021-01-05 11:36:29,288 - synapse.storage.databases - 83 - INFO - None - [database config 'master']: Starting 'state' database
    
  • I still think we should skip the final GC, per the comment above on this issue (#7968 (comment)).
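
On the second bullet: the postgres logs quoted in later comments on this issue suggest that the "initialising stream generator" step coincides with a query of roughly the following shape, run once per stream at startup. This is only a sketch (assuming psycopg2; the example numbers are taken from the logs further down, and the real id-generator code in synapse is more involved):

# Rough shape of the per-stream startup query, inferred from the postgres
# logs quoted later in this issue; not synapse's actual implementation.
import psycopg2

def max_ids_in_recent_window(conn, table, key_col, id_col, current_max, window=100_000):
    """Max stream id per key, restricted to a recent window of stream ids."""
    with conn.cursor() as cur:
        cur.execute(
            f"SELECT {key_col}, MAX({id_col}) FROM {table} "
            f"WHERE {id_col} > %s GROUP BY {key_col}",
            (current_max - window,),
        )
        return dict(cur.fetchall())

with psycopg2.connect("dbname=synapse") as conn:
    positions = max_ids_in_recent_window(
        conn, "presence_stream", "user_id", "stream_id", current_max=757_284_961
    )
    print(positions)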

@richvdh
Member Author

richvdh commented May 26, 2021

apparently the stream generator for push_rules_stream now takes 30 seconds to initialise:

2021-05-26 09:57:14,498 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for pushers(id)
2021-05-26 09:57:14,502 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for deleted_pushers(stream_id)
2021-05-26 09:57:14,503 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for push_rules_stream(stream_id)
2021-05-26 09:57:44,476 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for access_tokens(id)
2021-05-26 09:57:45,615 - synapse.storage.databases - 85 - INFO - main - [database config 'master']: Starting 'state' database

@erikjohnston erikjohnston added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Jul 26, 2021
@richvdh richvdh removed their assignment Aug 19, 2021
@erikjohnston
Member

(or maybe it improves if we fix #7176)

given we're running zope.interface-4.7.1 on matrix.org, that's sadly unlikely.

Per https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172, we could skip the final GC with an early atexit.register(os._exit, 0).

We need to be careful that we still run the functions we registered to run when the reactor stops.

@squahtx
Contributor

squahtx commented Aug 27, 2021

I've made a PR to skip the final GC on shutdown which ought to shave off a few seconds: #10712
Sadly while testing on a homeserver in Matrix HQ, I saw lots of variation in the shutdown time due to reactor stalls, ranging from a few seconds to 2 minutes.

@squahtx
Contributor

squahtx commented Sep 2, 2021

We again saw the same 30-second delay after initialising the stream generator for push_rules_stream:

2021-08-31 10:36:32,594 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for pushers(id)
2021-08-31 10:36:32,606 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for deleted_pushers(stream_id)
2021-08-31 10:36:32,613 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for push_rules_stream(stream_id)
2021-08-31 10:37:07,958 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for access_tokens(id)
2021-08-31 10:37:08,106 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for refresh_tokens(id)
2021-08-31 10:37:09,514 - synapse.storage.databases - 85 - INFO - main - [database config 'master']: Starting 'state' database

Looking at the postgres logs, we identified two slow queries on startup, accounting for ~30 seconds.

2021-08-31 10:36:54.115 UTC [matrix main] LOG:  duration: 17652.320 ms  statement: SELECT user_id, state, last_active_ts, last_federation_update_ts, last_user_sync_ts, status_msg, currently_active FROM presence_stream WHERE state != 'offline'
[...]
2021-08-31 10:37:07.905 UTC [matrix main] LOG:  duration: 13210.316 ms  statement: SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 2273929705 - 100000 GROUP BY room_id

From discussions with @richvdh and @erikjohnston:

  • The first query picks out 0 rows from an 800,000-row table using a sequential scan; there is one row for each user synapse has seen.
    For now we plan to add an index to the table (a sketch of such an index follows after this list).

  • The second query uses an index and ought to be fast. When run manually, the query does complete reasonably quickly.
    It's suspected that the concurrent restart of all the synapse workers was giving postgres a bad time and that rolling restarts would improve things.
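
On the indexing plan mentioned above, here is a sketch of the kind of partial index that would let postgres answer the non-offline presence query without a sequential scan. The index name and column choice are hypothetical and the actual schema change may differ; note that CREATE INDEX CONCURRENTLY has to run outside a transaction:

# Hypothetical one-off maintenance script, not the actual synapse schema
# delta.  A partial index covering only non-offline rows means the startup
# query "SELECT ... FROM presence_stream WHERE state != 'offline'" no longer
# has to scan ~800,000 mostly-offline rows.
import psycopg2

conn = psycopg2.connect("dbname=synapse")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run in a transaction
with conn.cursor() as cur:
    cur.execute(
        """
        CREATE INDEX CONCURRENTLY IF NOT EXISTS presence_stream_not_offline_idx
            ON presence_stream (user_id)
            WHERE state != 'offline'
        """
    )
conn.close()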

@squahtx
Contributor

squahtx commented Sep 3, 2021

Our next step is to check the impact on restart times once these two changes have made it to matrix.org.

@richvdh
Member Author

richvdh commented Sep 10, 2021

It's suspected that the concurrent restart of all the synapse workers was giving postgres a bad time and that rolling restarts would improve things.

Related to rolling restarts: #10793

@squahtx
Contributor

squahtx commented Sep 14, 2021

In today's restart, we saw that the 30-second delay (actually 35 seconds) has been roughly halved, to 18 seconds:

2021-09-14 10:20:51,840 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for pushers(id)
2021-09-14 10:20:51,842 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for deleted_pushers(stream_id)
2021-09-14 10:20:51,844 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for push_rules_stream(stream_id)
2021-09-14 10:21:09,785 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for access_tokens(id)
2021-09-14 10:21:09,786 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for refresh_tokens(id)
2021-09-14 10:21:11,279 - synapse.storage.databases - 85 - INFO - main - [database config 'master']: Starting 'state' database

The corresponding postgres logs:

2021-09-14 10:20:55.415 UTC [matrix main] LOG:  duration: 3556.734 ms  statement: SELECT user_id, MAX(stream_id) FROM push_rules_stream WHERE stream_id > 2616199 - 100000 GROUP BY user_id
2021-09-14 10:20:56.270 UTC [matrix main] LOG:  duration: 639.503 ms  statement: SELECT user_id, MAX(stream_id) FROM presence_stream WHERE stream_id > 757284961 - 100000 GROUP BY user_id
2021-09-14 10:21:09.736 UTC [matrix main] LOG:  duration: 13282.137 ms  statement: SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 2308446758 - 100000 GROUP BY room_id

These 3 previously seen queries were responsible for most of those 18 seconds, as expected.

@Yoric
Contributor

Yoric commented Oct 4, 2021

Anecdata: as an end-user, what I witnessed today was 3+ minutes (I think closer to 6-7 minutes, but I didn't pay that much attention after the 3-minute mark) during which the dialog displayed that the connection was lost. Maybe there are also problems somewhere else in the chain.

@richvdh
Member Author

richvdh commented Oct 4, 2021

I've made a PR to skip the final GC on shutdown which ought to shave off a few seconds: #10712
Sadly while testing on a homeserver in Matrix HQ, I saw lots of variation in the shutdown time due to reactor stalls, ranging from a few seconds to 2 minutes.

The reactor really shouldn't be stalling for 2 minutes, even during shutdown. If it is, that is another bug we should investigate.

On a recent restart of matrix.org, almost all of the workers shut down immediately (within a second of the shutdown request). Exceptions were:

  • the client readers, which took a couple of seconds
  • background_worker1.service and the main process, whose shutdowns both timed out after 10 seconds and were SIGKILLed.

I'd love to know what the bg worker and the main process were doing for those 10 seconds. I still also think we should remove the dependency between main process and workers: there is no need to wait for all the workers to shut down before beginning to stop the main process.

@richvdh
Member Author

richvdh commented Oct 4, 2021

Anecdata: as an end-user, what I witnessed today was 3+ minutes (I think closer to 6-7 minutes, but I didn't pay that much attention after the 3-minute mark) during which the dialog displayed that the connection was lost. Maybe there are also problems somewhere else in the chain.

so yes, it looks like the first /sync request after restart took over 3 minutes to generate a response. Breaking this down:

  • 13:34:46: shutdown begins
  • 13:34:56: background_worker1 shutdown complete
  • 13:35:06: main process shutdown complete. Startup begins
  • 13:35:50: main process startup complete; worker startup begins
  • 13:36:45: synchrotron1 starts processing requests
  • 13:39:48: synchrotron1 finally responds to @Yoric:
    2021-10-04 13:39:48,726 - synapse.access.http.8083 - 421 - INFO - GET-288 - ... - 8083 - {@dotyoric:matrix.org} Processed request: 182.045sec/-1.976sec (0.280sec, 0.030sec) (208.099sec/5.580sec/166) 0B 200! "GET /_matrix/client/r0/sync?filter=0&timeout=0&since=....
    

I think the very slow response, and a lot of the slow worker startup time, is just because we have tens of synapse processes hammering the database after a restart. Rolling restarts should help considerably here.

@squahtx
Contributor

squahtx commented Oct 4, 2021

I saw lots of variation in the shutdown time due to reactor stalls, ranging from a few seconds to 2 minutes.

The reactor really shouldn't be stalling for 2 minutes, even during shutdown. If it is, that is another bug we should investigate.

At the time I made that observation, #10703 was still an issue, so it may very well have been the cause.

@DMRobertson
Contributor

Data point: @erikjohnston and @reivilibre were surprised that the restart to upgrade to 1.47.1 was noticeably quicker than expected.

@DMRobertson
Contributor

Deploying 1.59rc2 to matrix.org today took 42 minutes for 141 workers to sequentially shut down, restart, and notify systemd. I make that roughly 18 seconds per worker.
