
Feature: add synapse worker support #642

Closed

Conversation

maxklenk
Contributor

@maxklenk maxklenk commented Sep 9, 2020

The PR by @eMPee584 has been stale for quite a while. I would like to take over and implement this feature.

Docs:

Usage

To enable worker support, set the matrix_synapse_workers_enabled flag to true. The defaults for how many workers of which kind are started are defined in matrix_synapse_workers_enabled_list. In addition, your database should allow more connections to support all the workers.

#
# Synapse config
#
matrix_synapse_workers_enabled: true

#
# Postgres config
#
matrix_postgres_process_extra_arguments: [
  "-c 'max_connections=200'"
]
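
The worker list itself can be overridden as well. A hedged sketch (the entry format and these particular worker names/ports follow examples used later in this thread; check matrix_synapse_workers_enabled_list in the role defaults for the full set):

#
# Example: run a reduced set of workers instead of the defaults
#
matrix_synapse_workers_enabled_list:
  - { worker: generic_worker, port: 18101 }
  - { worker: federation_sender, port: 18401 }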

Tasks:

  • use new worker names
  • use new worker routes
  • deploy redis when workers are enabled
  • add redis configuration for workers (see the sketch after this list)
  • check cleanup of worker processes/config when they are changed/renamed
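
On the two redis tasks: a rough sketch of the homeserver.yaml fragment the playbook would need to template out when workers are enabled, per the Synapse worker documentation (the host value is an assumption based on the playbook's matrix-redis container naming):

redis:
  enabled: true
  host: matrix-redis  # assumption: the playbook's redis container on the shared matrix network
  port: 6379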

eMPee584 and others added 12 commits April 19, 2020 19:05
· needs documentation; no checks yet for port clashes or typos in worker name
· according to https://github.com/matrix-org/synapse/wiki/Workers-setup-with-nginx#results
  about 90% of requests go to the synchrotron endpoint
· thus, the synchrotron worker is especially suited to be load-balanced
· most of the other workers are documented to support only a single instance
· https://github.com/matrix-org/synapse/blob/master/docs/workers.md
· 😅 How to keep this in sync with the matrix-synapse documentation?
· regex location matching is expensive
· nginx syntax limit: one location only per block / statement
· thus, lots of duplicate statements in this file
FIXME: horrid duplication in template file
@eMPee584
Contributor

eMPee584 commented Sep 9, 2020

Thanks for chiming in, Max! Some of the checkboxes (redis) can be ticked; please have a look at the original PR #456 .. I had been working on this last week but got too deep into trying to automatically parse the synapse workers docs .. converting my awk script* into perl and such nonsense. I also have a bunch of other stuff going on, and a tmux session with >60 windows (so maybe 150 terminal commands/apps running) really gets a bit confusing when combined with a sleep deficit 🤪
Anyway, I just force-pushed my rebased branch with some fragments of the necessary updates and will try to have it up and running ASAP.. there's not much more missing.

[*] parse-workers-docs.awk

#!/usr/bin/awk -f
# ^ the most hackish approach to get a machine-readable list
# of current matrix REST API endpoint regular expressions from the »source«
# invoke in shell with:
# URL=https://github.com/matrix-org/synapse/raw/master/docs/workers.md
# curl -L ${URL} | awk -f parse-workers-docs.awk -

# Enable further processing only at the end of the preamble.
# Read each synapse worker section as record and its lines as fields.
/Available worker applications/ {
    enable_parsing = 1
    # set record separator to markdown sub-header
    RS = "### "
    # set field separator to newline
    FS = "\n"
}

# Each worker record should start with a synapse.app.X headline
enable_parsing && /synapse.app/ {
    # get rid of the backticks, then store the worker type
    gsub("`", "", $1)
    worker_type = $1

    # loop through the lines (1 - number of fields in record)
    for (i = 1; i <= NF; i++) {
        # and parse those looking like URLs
        if ($i ~ /^ +[\^/].*\//) {
            gsub(/^ +/, "", $i)
            api_endpoint_regex = $i
            print worker_type ": '" api_endpoint_regex "'"
        }
    }
}

# vim: tabstop=4 shiftwidth=4 expandtab autoindent
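
For reference, the script prints one line per endpoint regex, along the lines of (illustrative values, not an actual capture):

synapse.app.generic_worker: '^/_matrix/client/(v2_alpha|r0)/sync$'
synapse.app.federation_reader: '^/_matrix/federation/v1/event/'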

@maxklenk
Contributor Author

Hey @eMPee584, nice to see that you picked it up again. I would like to help out; e.g. your nginx configuration still contains the old workers/routes.

@eMPee584
Contributor

eMPee584 commented Sep 10, 2020

You mean, [it still contains the old routes judging from the parts I have committed so far..]
I usually try to have proper, atomic & tested commits.. so sometimes, things lie around in my work tree for a while before they get committed, even though they are practically ready..
Ah well, redundant implementation may be effort duplication in some cases, but it also provides the benefit of validation. So keep going if you like, even though I might already have done part of a task and it just remains to be committed and pushed.. 😉

@maxklenk maxklenk changed the title [WIP] Feature: add synapse worker support Feature: add synapse worker support Sep 11, 2020
@PC-Admin
Contributor

PC-Admin commented Sep 19, 2020

Big thanks to @eMPee584 and @maxklenk for trying to get this working, it's a very important addition to this deploy script.

@ptman
Contributor

ptman commented Sep 23, 2020

How can we help you guys push this over the finish line?

@maxklenk
Contributor Author

I am running this branch in a beta test with about 15,000 users right now to see if everything runs smoothly under load.
I tried to include the feedback from PR #456, but it would be great to get a review of my changes to finish this.

@eMPee584
Contributor

I'll have another go at it today.. 😅

@eMPee584 eMPee584 mentioned this pull request Sep 29, 2020
@PC-Admin
Contributor

PC-Admin commented Oct 6, 2020

An idiot-proof guide on how to get the latest workers pull request:

$ git clone https://github.com/spantaleev/matrix-docker-ansible-deploy.git
$ cd ./matrix-docker-ansible-deploy/
$ git fetch origin pull/642/head:maxtesting
$ git checkout maxtesting

Then add this to your vars.yml:

matrix_synapse_workers_enabled: true

Then just re-run the playbook with the setup-all,start tags:
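
For example (the inventory path below follows the playbook's documented layout; adjust to your setup):

$ ansible-playbook -i inventory/hosts setup.yml --tags=setup-all,start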

Seems to run okay so far, good work Max.

Edit: This setup caused CPU load to spike badly, probably because 10 workers for like 2 dozen people is overkill. xD Tried reducing the number of workers like so:

matrix_synapse_workers_enabled: true
matrix_synapse_workers_enabled_list:
  - { worker: generic_worker, port: 18101 }
  - { worker: generic_worker, port: 18102 }
  - { worker: federation_sender, port: 18401 }

This caused the service file for the postgres container to break like so:

-- A start job for unit matrix-postgres.service has finished successfully.
-- 
-- The job identifier is 78513.
Oct 07 01:29:56 pagrus matrix-postgres[1688]: chmod: /var/run/postgresql: Operation not permitted
Oct 07 01:29:56 pagrus matrix-postgres[1688]: PostgreSQL Database directory appears to contain a database; Skipping initialization
Oct 07 01:29:56 pagrus matrix-postgres[1688]: postgres: invalid argument: "ExecStop=-/usr/bin/env"
Oct 07 01:29:56 pagrus matrix-postgres[1688]: Try "postgres --help" for more information.
Oct 07 01:30:10 pagrus systemd[1]: matrix-postgres.service: Main process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited

Which of course meant the synapse container wouldn't load. I ended up reverting to the no-worker setup and had to delete the image+container for redis manually. Back online now, I'll have another play tomorrow.
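
(For anyone repeating that manual redis cleanup, a rough sketch, assuming the playbook's naming conventions; verify the actual names with docker ps -a and docker image ls first:)

$ systemctl stop matrix-redis.service
$ docker rm -f matrix-redis                      # if the container is still around
$ docker image rm $(docker image ls -q redis)    # remove whatever redis image is listed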

Edit 2: Half of these should have been shut off when I deployed the second time with fewer workers, so it doesn't seem to be cleaning them up properly:

matrix-redis.service            loaded active running Matrix Redis server
matrix-synapse-worker@frontend_proxy:18701.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@generic_worker:18101.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@generic_worker:18102.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@generic_worker:18103.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@generic_worker:18104.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@generic_worker:18105.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@generic_worker:18106.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@media_repository:18501.service loaded active running Synapse Matrix Worker 

It's possible these workers are 'all-or-nothing': you can set them all up or remove them all, but you can't pick and choose. I didn't run the same PR with workers disabled, so that might have cleaned all of these up in one swoop.

@PC-Admin
Contributor

PC-Admin commented Oct 7, 2020

Had a bit more of a play, here is how it breaks the postgres service file:

ExecStart=/usr/bin/env docker run --rm --name matrix-postgres \
                        --log-driver=none \
                        --user=1001:1001 \
                        --cap-drop=ALL \
                        --read-only \
                        --tmpfs=/tmp:rw,noexec,nosuid,size=100m \
                        --tmpfs=/run/postgresql:rw,noexec,nosuid,size=100m \
                        --network=matrix \
                        --env-file=/matrix/postgres/env-postgres-server \
                        -v /matrix/postgres/data:/var/lib/postgresql/data:rw \
                        -v /etc/passwd:/etc/passwd:ro \
                        postgres:12.4-alpine \
                        postgres \

The bottom line and the trailing backslash before it are what break it; this might just be because of my crude merging though. xD
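
For what it's worth: in a systemd unit file a trailing backslash continues the directive onto the next line, so a backslash after the final postgres argument folds the following directive into the ExecStart= command line, which is exactly what the postgres: invalid argument: "ExecStop=-/usr/bin/env" error above shows. The fix is simply to drop the last backslash:

                        postgres:12.4-alpine \
                        postgres

(no continuation after the final postgres argument, so the next directive is parsed on its own).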

Also noticed that with these variables the Docker containers start up, but clients won't start syncing:

matrix_synapse_workers_enabled: true
matrix_synapse_workers_enabled_list:
  - { worker: generic_worker, port: 18101 }
  - { worker: generic_worker, port: 18102 }
  - { worker: federation_sender, port: 18401 }

At least it does a good job of cleaning everything up when you set matrix_synapse_workers_enabled to false.

@bobbythefett

I tested this PR with my homeserver. It did not work with the defaults.

  • My server went from a load average of 2 to 20+ (maybe too many workers for a small/medium server by default?)
  • The nginx configuration did not work with my own nginx. The lines added for the workers were not working, even after the load had gone down.

@PC-Admin
Contributor

Have been running 3 workers for a few days now with no real issues:

# Synapse/Redis Workers
matrix_synapse_workers_enabled: true
matrix_synapse_workers_enabled_list:
  - { worker: generic_worker, port: 18101 }
  - { worker: generic_worker, port: 18102 }
  - { worker: federation_sender, port: 18401 }

Just ran with these settings, then fixed the broken postgres service file. CPU time seems to be climbing on the synapse processes, but the service is very responsive now. :) Thanks again.

@PC-Admin
Contributor

PC-Admin commented Oct 11, 2020

One error I think might be related: about 2/3rds of the time, uploads fail to complete with the error message "The file 'XXX' failed to upload." in Element.

I believe this is the synapse workers failing to handle the request, and uploads only complete when the main synapse process handles the request. I wasn't able to find a relevant error message in the worker logs though. :(

@PC-Admin
Contributor

[screenshot: generic_workers_overloaded]

The generic workers seem to be causing runaway CPU load; I wasn't able to locate the cause. Reverting to 1 federation sender for now.

@PC-Admin
Contributor

PC-Admin commented Oct 12, 2020

I tried to revert to 1 worker like so, yet I still see the 2 generic workers in htop:

# Synapse/Redis Workers
matrix_synapse_workers_enabled: false
matrix_synapse_workers_enabled_list:
  - { worker: generic_worker, port: 18101 }
  - { worker: generic_worker, port: 18102 }
  - { worker: federation_sender, port: 18401 }

~THEN~

# Synapse/Redis Workers
matrix_synapse_workers_enabled: true
matrix_synapse_workers_enabled_list:
  - { worker: federation_sender, port: 18401 }

Interestingly enough, the port mapping is correct, so these 2 generic workers shouldn't be able to access the web:

CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS              PORTS                                                                                                                                  NAMES
76d2e06cf4a5        matrixdotorg/synapse:v1.19.1    "python -m synapse.a…"   3 minutes ago       Up 3 minutes        8008-8009/tcp, 8448/tcp, 0.0.0.0:18401->18401/tcp, 0.0.0.0:9000->9100/tcp                                                              matrix-synapse

We also see that the services are still available:

# systemctl list-units --type=service --state=running
matrix-synapse-worker@federation_sender:18401.service loaded active running Synapse Matrix Worker                       
matrix-synapse-worker@generic_worker:18101.service    loaded active running Synapse Matrix Worker                       
matrix-synapse-worker@generic_worker:18102.service    loaded active running Synapse Matrix Worker                       
matrix-synapse-worker@user_dir:18601.service          loaded active running Synapse Matrix Worker

I ended up manually removing these services.
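
(A sketch of that manual removal, assuming the unit names from the listing above; if the playbook installed static unit files, those would need deleting and a daemon-reload as well:)

# systemctl stop 'matrix-synapse-worker@generic_worker:18101.service'
# systemctl stop 'matrix-synapse-worker@generic_worker:18102.service'
# systemctl stop 'matrix-synapse-worker@user_dir:18601.service'
# systemctl daemon-reload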

Edit: In retrospect, I wouldn't be surprised if the 2 generic workers were constantly climbing in load because they weren't port-mapped before either! derp :) Still, they don't get cleaned up properly when you reduce the number of workers in this list.

@localguru

@PC-Admin

One error i think might be related is 2/3rds of the time uploads fail to complete with error message: "The file 'XXX' failed to upload." in Element.

Have you tried adding a media_repository worker to split uploads off from the generic workers?
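
(If so, presumably something like this in vars.yml; the media_repository port matches the entry seen earlier in this thread, and the other entries follow the 3-worker config above:)

matrix_synapse_workers_enabled_list:
  - { worker: generic_worker, port: 18101 }
  - { worker: generic_worker, port: 18102 }
  - { worker: federation_sender, port: 18401 }
  - { worker: media_repository, port: 18501 }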

@eMPee584
Contributor

I rebased my branch on this PR and added proper (hopefully) cleansing of left-over workers; please check whether the problem persists with the recent commits on #456 ..

@maxklenk
Contributor Author

maxklenk commented Nov 9, 2020

I'll close this PR in favour of finishing the work in #456.

@maxklenk maxklenk closed this Nov 9, 2020
@maxklenk maxklenk deleted the feature/add-worker-support branch January 5, 2021 09:21
@spantaleev spantaleev mentioned this pull request Feb 26, 2021
@PC-Admin PC-Admin mentioned this pull request Mar 9, 2021