
Repeated exceptions in background_process_metrics #7840

Closed
kpfleming opened this issue Jul 13, 2020 · 13 comments
Labels
z-bug (Deprecated Label) z-p2 (Deprecated Label)

Comments

@kpfleming

Description

Looks like the same issue as #6405, but this is happening with 1.17.0. Frequent (every ~10 seconds) exceptions thrown from notify_new_event operations.

Steps to reproduce

Not sure; this just started happening recently.

Log is at https://gist.github.com/kpfleming/b2fdf66e545b33c18200361689e0341e

Version information

  • Homeserver: km6g.us

If not matrix.org:

  • Version: 1.17.0
  • Install method: pip into a virtual environment
  • Platform: Debian Buster on an amd64 machine, running in a systemd-nspawn container

@anoadragon453
Member

According to the logs, this seems to be happening due to stream orderings getting out of sync in the calculation of the room directory.

A temporary fix would be to switch stats.enabled to false in the homeserver config; however, this is certainly not a long-term solution.

It's also interesting that the error seems to be triggered while handling a transaction containing user presence EDUs. I assume presence (config option use_presence) is enabled on your server @kpfleming? Have you ever disabled it?

@kpfleming
Author

use_presence has never appeared in my configuration file, and the default appears to be True, so I believe that means it is enabled (and that I've never disabled it).

@anoadragon453
Member

@kpfleming Ok, I had a hunch that #7508's ignoring of incoming updates might have been breaking things if presence had been toggled on and off, but that doesn't sound like the case here.

@anoadragon453 added z-bug (Deprecated Label) z-p2 (Deprecated Label) labels Jul 20, 2020
@kpfleming
Author

If it helps, the only 'unusual' activity on this homeserver recently was an attempt to create a direct-chat room with a user on matrix.org while the federation configuration for km6g.us was broken (the .well-known file was not reachable).

@anoadragon453
Member

Do you know how long ago that was?

I could see that causing matrix.org to stop sending events to you for a little bit, but I don't think it would cause event streams to become unordered. This seems like a bug internal to the homeserver.

@kpfleming
Author

It would have been on July 11 or 12.

@richvdh
Member

richvdh commented Jul 23, 2020

to help with searching, the exception is:

    Traceback (most recent call last):
      File "/opt/matrix-synapse/lib/python3.7/site-packages/synapse/metrics/background_process_metrics.py", line 214, in run
        return (yield result)
      File "/opt/matrix-synapse/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "/opt/matrix-synapse/lib/python3.7/site-packages/synapse/handlers/stats.py", line 69, in process
        await self._unsafe_process()
      File "/opt/matrix-synapse/lib/python3.7/site-packages/synapse/handlers/stats.py", line 94, in _unsafe_process
        self.pos, room_max_stream_ordering
      File "/opt/matrix-synapse/lib/python3.7/site-packages/synapse/storage/data_stores/main/state_deltas.py", line 53, in get_current_state_deltas
        assert prev_stream_id <= max_stream_id
    AssertionError
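
In other words, the assertion fires because the stats handler's saved stream position (self.pos) is ahead of the current maximum room stream ordering, which shouldn't happen in normal operation. If you want to see the two values being compared, something along these lines should show them (a sketch only; the stats_incremental_position table and column names are my best recollection of the 1.17 schema, so double-check before relying on it):

    -- Position the stats handler believes it has processed up to
    SELECT stream_id FROM stats_incremental_position;

    -- Current maximum stream ordering of persisted events
    SELECT MAX(stream_ordering) FROM events;

If the first value is larger than the second, the assertion will keep firing on every run of the background process.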

@richvdh
Member

richvdh commented Jul 23, 2020

I saw this previously, and it was caused by somebody migrating their database from one postgres instance to another without correctly copying the sequence generators over. Is that possible?
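
For anyone wanting to check whether their own migration carried the sequences across, comparing the old and new databases is straightforward (a sketch, assuming PostgreSQL 10 or later for the pg_sequences view):

    -- Run against both the old and the new database and compare the results.
    SELECT schemaname, sequencename, last_value
      FROM pg_sequences
     ORDER BY sequencename;

On a restore that missed the sequences, last_value will be NULL (never used) or far below the corresponding table's maximum id.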

@kpfleming
Author

Yes! I did indeed do that, and did not think about copying the sequence generators. I ran a full dump of the database and imported it on the new machine.

@richvdh
Member

richvdh commented Jul 23, 2020

well, I'd venture to suggest you did not run a full dump of the database :).

Try this:

SELECT MAX(id)+1 AS next_id FROM state_groups \gset
ALTER SEQUENCE state_group_id_seq RESTART WITH :next_id;

and then restart synapse.

However, a warning: your database may well now be somewhat corrupt. If you see other odd behaviour, it could be caused by this.
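
(Note that \gset is a psql meta-command, so the snippet above needs to be run from psql rather than another client.) To confirm before and after that this sequence really is the one lagging behind, a quick check is (again just a sketch, run against the synapse database):

    -- The sequence should be at or above the highest state group id in use.
    SELECT last_value FROM state_group_id_seq;
    SELECT MAX(id) FROM state_groups;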

@richvdh
Member

richvdh commented Jul 23, 2020

opened #7938 to add a check for this

@kpfleming
Author

Thanks; this homeserver only serves two users and we don't have much data in it, so in the worst case I can rebuild from an empty database.

I used pg_dump to dump the entire schema, transferred the resulting SQL file to the new server, and used pg_restore to restore it before starting synapse for the first time on that server. I can't see anything in the pg_dump/pg_restore documentation which indicates that sequences require special handling, but it's possible I missed something.
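
For what it's worth, a plain-SQL dump of the full database normally does carry sequence state: pg_dump emits setval() statements alongside the table data, roughly like the sketch below (the actual value is whatever the sequence had reached on the old server). If those statements are missing from the dump, or were skipped during the restore, every sequence starts over on the new machine:

    -- Example of the sequence-restoring statements a full pg_dump normally emits
    SELECT pg_catalog.setval('state_group_id_seq', 123456, true);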

In any case, I know the cause now and can address it, so thanks!

@kpfleming
Author

I dropped the database, created a new one, let synapse set up the schema, and re-registered my user account. It's all working as expected, so this confirms that the problem was an improper transfer of the database.
