Description
The immediate symptom I've been tracking is my postgres database server getting OOMkilled every 24-28 hours.
After setting up metrics for my database box and synapse, and keeping an eye on them for several days, I noticed a clear association between requests to the RoomMessageList servlet and jumps in memory usage on the database server. A stark example:
Furthermore, during times of high RAM usage, I looked at the postgres connections with the highest resident set size (RSS) as reported by ps and compared them with the pg_stat_activity table to see which worker each connection was associated with. When nearing the server's memory limit, the connections for the worker handling this endpoint were using nearly double the RAM of any other connection:
RSS (KB)    application_name
361792      homeserver
362364      homeserver
362500      homeserver
366628      homeserver
368696      homeserver
372620      homeserver
373136      homeserver
375660      homeserver
376176      homeserver
385900      homeserver
605928      room_message_lister
712208      room_message_lister
732016      room_message_lister
734624      room_message_lister
760680      room_message_lister
768392      room_message_lister
779268      room_message_lister
852376      room_message_lister
876140      room_message_lister
922448      room_message_lister
Specifically, this worker is handling all requests for ^/_matrix/client/(api/v1|r0|unstable)/rooms/.*/messages$. I'm using redis replication on the homeserver.
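For reference, that comparison can be reproduced with something along the following lines (a sketch rather than the exact commands used; it assumes passwordless psql access as the postgres OS user):

sudo -u postgres psql -At -c "SELECT pid, application_name FROM pg_stat_activity" |
while IFS='|' read -r pid app; do
    # ps reports the resident set size in KB; skip backends that have already exited.
    rss=$(ps -o rss= -p "$pid") || continue
    printf '%s\t%s\n' "$rss" "$app"
done | sort -n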
Steps to reproduce
I haven't tried to reproduce this in a clean environment, but here's the nginx block for this worker:
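(The block itself wasn't preserved in this copy of the issue. The sketch below shows the typical shape of such a proxy rule; the worker port 8083 and the header directives are illustrative assumptions, not necessarily the real values.)

# Illustrative sketch only: the worker port (8083) and headers are assumptions.
location ~ ^/_matrix/client/(api/v1|r0|unstable)/rooms/.*/messages$ {
    proxy_pass http://localhost:8083;
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Host $host;
}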
And here's the worker's config file:
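(Also not preserved in this copy; a representative sketch for a Synapse 1.20-era generic_worker follows, with the listener port, replication port, and log config path as illustrative assumptions.)

# Representative sketch: ports and the log config path are assumptions.
worker_app: synapse.app.generic_worker
worker_name: room_message_lister

worker_replication_host: 127.0.0.1
worker_replication_http_port: 9093

worker_listeners:
  - type: http
    port: 8083
    resources:
      - names: [client]

worker_log_config: /etc/matrix-synapse/room_message_lister.log.config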
and the worker's systemd unit:
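(Likewise a sketch only; the user, install paths, and virtualenv location are assumptions.)

# Sketch only: user, paths, and virtualenv location are assumptions.
[Unit]
Description=Synapse room_message_lister worker
After=network.target matrix-synapse.service

[Service]
Type=simple
User=synapse
WorkingDirectory=/home/synapse/synapse
ExecStart=/home/synapse/synapse/env/bin/python -m synapse.app.generic_worker \
    --config-path=/home/synapse/synapse/homeserver.yaml \
    --config-path=/home/synapse/synapse/workers/room_message_lister.yaml
Restart=on-failure

[Install]
WantedBy=multi-user.target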
I can provide portions of the homeserver config or metrics that might help with debugging, on request.
Version information
Homeserver: https://matrix.cybre.space
Version: {"server_version":"1.20.1 (b=master,86a72d1)","python_version":"3.6.8"}
Install method: pip
Platform: Ubuntu 18.04 VPS, not containerized.
It seems a bit surprising that those are using so much memory. Maybe someone is making a request to retrieve an extremely large amount of messages? It would be useful to see INFO logs for the room_message_lister worker.