
appservice_stream_position can get into such a state that ASes break silently #1834

Closed
ara4n opened this issue Jan 19, 2017 · 30 comments · Fixed by #10107
Labels
  • A-Application-Service: Related to AS support
  • S-Major: Major functionality / product severely impaired, no satisfactory workaround.
  • T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@ara4n
Member

ara4n commented Jan 19, 2017

I wasted 3h on arasphere last night trying to work out why it wouldn't send stuff to ASes. It turns out the value in that table was somehow mangled; deleting it didn't help, but manually reinserting the max stream position did. Plus it's inconsistently named wrt all the other AS stuff :(

@ara4n
Member Author

ara4n commented Apr 19, 2017

So on my HS atm I have:

sqlite> select max(stream_ordering) from events;
811112
sqlite> select * from appservice_stream_position;
X|809618

It somehow got into a state where the value in appservice_stream_position was mangled (missing entirely, iirc), and I unstuck it by manually inserting max(stream_ordering) from events into it.

@varesa

varesa commented Apr 19, 2017

I've also been hit by this, now for the second time during the last week.

The first time I "solved" this by destroying the DBs and starting from scratch, as I was still setting things up anyway.

Now on the second time my DB looked like this:

sqlite> select max(stream_ordering) from events;
82774
sqlite> select * from appservice_stream_position;
X|6481

I was able to fix it by setting the stream_ordering column in appservice_stream_position to max(stream_ordering) from events, like @ara4n said, even though I was not missing the row.

sqlite> update appservice_stream_position set stream_ordering=82774;

@ara4n
Member Author

ara4n commented Apr 19, 2017

@kegsay any idea?

@Spydar007

I also had this issue, but correcting the values fixed it.

@kegsay
Member

kegsay commented Jul 14, 2017

@ara4n no - I didn't do any of the stream position stuff when I did the original Synapse AS API implementation. You'd need to ask @erikjohnston.

@Sorunome
Contributor

I had the same issue, and fixed it via this.

Then I joined a room (#riot:matrix.org) that no one on the synapse server was in yet, and it got off-sync again, so I had to re-apply the fix. Perhaps this'll help to hunt down the bug.

@turt2live
Member

This is still very much an issue. There's absolutely no information in the appservice worker's log: it seems to successfully send a transaction to an appservice, then never tries again. It does still receive rdata caches/events, and the following appears between rdata updates, but eventually it also stops receiving rdata:

appservice - 2017-10-11 02:55:54,347 - synapse.storage.TIME - 219 - INFO - - Total database time: 0.000% {set_appservice_last_pos(0): 0.000%, get_users_in_room(0): 0.000%, get_user_by_id(0): 0.000%} {}

A restart of the appservice worker did not help too much (it started receiving rdata again, but not sending transactions to appservices). I had to do the update query manually and restart the entire synapse instance.

Are there particular flags/logging I can leave on to try and debug this if it happens again? (this is the 3rd time in ~5 months it has happened).

@turt2live
Member

This happened once again for me (t2bot.io), and I don't see anything useful in the logs. I have exported them for viewing if desired though (they have tokens and such, so I'd rather not post them publicly).

@turt2live
Member

and again :(

synapse=# select max(stream_ordering) from events;
   max
----------
 65145242
(1 row)

synapse=# select * from appservice_stream_position;
 lock | stream_ordering
------+-----------------
 X    |        65097572
(1 row)

Logs are also available for this event.

@turt2live
Member

This is becoming a trend: every couple of months I'm having to meddle with the database. It has once again happened.

Is there a lead on a possible cause for this happening?

@richvdh
Member

richvdh commented May 9, 2018

ok so let me just try to understand the symptoms here:

  • synapse stops sending events to appservices
  • inspection of the db shows that appservice_stream_position is stuck at an old event
  • restarting synapse does not fix it
  • updating appservice_stream_position to a later value (and restarting synapse) does fix it

Is that correct?

If so, the only plausible conclusion is that synapse's appservice sender gets stuck on particular events; by updating appservice_stream_position, we skip the problematic event and things get going again.

Since synapse handles events in batches of 100, the problem might lie anywhere in the next 100 events after the sticking point (i.e., stream_orderings 65097573 to 65097672 in @turt2live's latest example). I'm not quite sure how we can go about narrowing down the problem further :/.

It's possible that some recent improvements to error handling in synapse (in particular, https://github.com/matrix-org/synapse/pull/3138/files#diff-df9c8550f76f9aca4912eaa72d55379eR121) will reveal the problem here.
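
For anyone wanting to narrow it down by hand, something along these lines should list that candidate batch for inspection (a sketch only, assuming direct SQLite access and the table layout shown earlier in the thread; the database path is illustrative):

import sqlite3

con = sqlite3.connect("homeserver.db")  # hypothetical path; adjust for your deployment

# Where the appservice sender is stuck (treat a missing row as position 0).
row = con.execute("select stream_ordering from appservice_stream_position").fetchone()
stuck_pos = row[0] if row else 0

# The next batch of (up to) 100 events after the sticking point is where the
# problematic event should live.
rows = con.execute(
    "select stream_ordering, event_id, room_id, type from events"
    " where stream_ordering > ? order by stream_ordering limit 100",
    (stuck_pos,),
).fetchall()

for stream_ordering, event_id, room_id, event_type in rows:
    print(stream_ordering, event_id, room_id, event_type)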

@turt2live
Member

That's pretty much the problem. The additional issue is that there's no logging whatsoever for an issue occurring: it just... stops.

Improved error handling may catch it. I'm also looking at adding much more verbose logging in the area to see if it can catch what's going on.

@Half-Shot
Collaborator

We got bitten by this again today: some events existed in the events table, yet DataStore._get_events couldn't find any of them, so the stream position couldn't advance. I believe we should skip over any events we couldn't fetch.

@TimePath

Ran into this again today; fast-forwarded with this query:

update appservice_stream_position set stream_ordering = (select max(stream_ordering) from events);

No synapse restart required

@pedro-nonfree

Went crazy setting up mautrix-telegram; thanks @TimePath for the PostgreSQL command.

[20:37:52] root@matrix:~# sudo -u postgres psql synapse
could not change directory to "/root": Permission denied
psql (11.7 (Debian 11.7-0+deb10u1))
Type "help" for help.

synapse=# update appservice_stream_position set stream_ordering = (select max(stream_ordering) from events);

After applying this, I see transactions between synapse and mautrix-telegram happening. Phew!

@richvdh
Member

richvdh commented Aug 26, 2020

Please, can people who observe this issue share logs from the homeserver?

@pedro-nonfree

I checked my logs and I don't see anything relevant. We can coordinate privately about the logs you need.

@richvdh
Member

richvdh commented Oct 20, 2020

Closing this for lack of useful response from the reporters.

@richvdh richvdh closed this as completed Oct 20, 2020
@turt2live
Member

It's because the logs are silent; this still happens. What logs are you hoping to see?

@turt2live turt2live reopened this Oct 20, 2020
@richvdh
Member

richvdh commented Oct 20, 2020

ok, well that is useful information in itself. Last I heard we'd made some logging changes and were waiting to hear if that showed anything up.

If you could share INFO-level logs from the time of the last successful transaction to a point where you would definitely have expected there to be another transaction, that would be a good starting point to check we're all on the same page.

@turt2live
Member

My logs have since rolled over (it happened last week), but any day now it's due to happen again.

@Twi1ightSparkle

Twi1ightSparkle commented Oct 23, 2020

I had this on my homeserver. Updating appservice_stream_position solved the problem.

I can share logs from today, but not from around when it stopped working. I don't use Telegram much, so based on my Telegram/Matrix rooms, all I know is that it broke sometime between April 16th and August 30th. I only have about two weeks' worth of logs on my HS.
Based on the timestamp of the event that appservice_stream_position was stuck on, it broke on May 20th.

Edit: I hacked together a script to catch this next time it breaks: https://gist.github.com/Twi1ightSparkle/53d33ce50c813f84fcccb3abe8dcdfba
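
The gist isn't reproduced here, but the core check is simple: compare appservice_stream_position with max(stream_ordering) from events and alert when the gap keeps growing. A minimal sketch of that idea (SQLite, with a hypothetical database path and threshold; not the gist itself):

import sqlite3

GAP_THRESHOLD = 500  # hypothetical; tune for your server's event rate

con = sqlite3.connect("homeserver.db")  # hypothetical path
max_stream = con.execute("select max(stream_ordering) from events").fetchone()[0]
row = con.execute("select stream_ordering from appservice_stream_position").fetchone()
as_pos = row[0] if row else 0  # a missing row is itself the broken state

# A persistent, growing gap suggests the appservice sender is stuck again.
if max_stream - as_pos > GAP_THRESHOLD:
    print(f"appservice_stream_position is {max_stream - as_pos} events behind")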

@turt2live
Member

@richvdh This just happened and I have logs this time. Would you like them personally or should I send them to someone else?

@erikjohnston
Member

Looking at logs and Prometheus metrics from @turt2live (thanks!), I'm seeing the following background process in flight:

async def _notify_interested_services(self, max_token: RoomStreamToken):
    with Measure(self.clock, "notify_interested_services"):
        self.is_processing = True
        try:
            limit = 100
            while True:
                (
                    upper_bound,
                    events,
                ) = await self.store.get_new_events_for_appservice(
                    self.current_max, limit
                )

                if not events:
                    break

                events_by_room = {}  # type: Dict[str, List[EventBase]]
                for event in events:
                    events_by_room.setdefault(event.room_id, []).append(event)

                async def handle_event(event):
                    # Gather interested services
                    services = await self._get_services_for_event(event)
                    if len(services) == 0:
                        return  # no services need notifying

                    # Do we know this user exists? If not, poke the user
                    # query API for all services which match that user regex.
                    # This needs to block as these user queries need to be
                    # made BEFORE pushing the event.
                    await self._check_user_exists(event.sender)
                    if event.type == EventTypes.Member:
                        await self._check_user_exists(event.state_key)

                    if not self.started_scheduler:

                        async def start_scheduler():
                            try:
                                return await self.scheduler.start()
                            except Exception:
                                logger.error("Application Services Failure")

                        run_as_background_process("as_scheduler", start_scheduler)
                        self.started_scheduler = True

                    # Fork off pushes to these services
                    for service in services:
                        self.scheduler.submit_event_for_as(service, event)

                    now = self.clock.time_msec()
                    ts = await self.store.get_received_ts(event.event_id)
                    synapse.metrics.event_processing_lag_by_event.labels(
                        "appservice_sender"
                    ).observe((now - ts) / 1000)

                async def handle_room_events(events):
                    for event in events:
                        await handle_event(event)

                await make_deferred_yieldable(
                    defer.gatherResults(
                        [
                            run_in_background(handle_room_events, evs)
                            for evs in events_by_room.values()
                        ],
                        consumeErrors=True,
                    )
                )

                await self.store.set_appservice_last_pos(upper_bound)

                now = self.clock.time_msec()
                ts = await self.store.get_received_ts(events[-1].event_id)

                synapse.metrics.event_processing_positions.labels(
                    "appservice_sender"
                ).set(upper_bound)

                events_processed_counter.inc(len(events))

                event_processing_loop_room_count.labels("appservice_sender").inc(
                    len(events_by_room)
                )

                event_processing_loop_counter.labels("appservice_sender").inc()

                synapse.metrics.event_processing_lag.labels(
                    "appservice_sender"
                ).set(now - ts)
                synapse.metrics.event_processing_last_ts.labels(
                    "appservice_sender"
                ).set(ts)
        finally:
            self.is_processing = False

which is the loop that reads events from the events table and dispatches them to the various appservice classes for sending out. However, the outer loop is stuck, as the metric counts at the end of the loop aren't progressing:

events_processed_counter.inc(len(events))

But the inner loop is progressing, at a rate of one iteration per minute, based on:

synapse.metrics.event_processing_lag_by_event.labels(
    "appservice_sender"
).observe((now - ts) / 1000)

During that time we see requests in the logs and metrics timing out after one minute:

INFO - notify_interested_services-743543- Error sending request to  GET https://<discord_appservice>/users/<user_id>?access_token=<redacted>: RequestTimedOutError 504: Timeout waiting for response from remote server

which is probably being triggered from the loop by:

await self._check_user_exists(event.sender)
if event.type == EventTypes.Member:
    await self._check_user_exists(event.state_key)

Since we won't process the next batch of events until we've finished the current batch, this one-minute lag per event processed is enough to cause all appservices to fall behind. (Side note: @turt2live has a patch to drop "old" events from being sent out, so we end up not seeing any transactions being sent out.)

Other observations:

  • Things quite obviously get stuck going slow at a particular instant.
  • At first we see events going out to the various different appservices, but then for the rest of the time we only see membership events for the discord bridge.
  • It looks like the user query hit to the discord appservice doesn't at first

I'm not entirely sure what's going on here, but there seems to suddenly be a large number of discord membership events all at once? Does the discord bridge then fail to handle the user querying correctly? Perhaps all user queries to the discord appservice are slow, and it's only a noticeable problem due to the sheer number of membership requests all at once.

@richvdh richvdh added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. and removed z-info-needed labels Apr 9, 2021
@waclaw66

It still seems to be happening; Synapse 1.31.0 suddenly stopped sending events to appservice rooms.
appservice_stream_position didn't match select max(stream_ordering) from events.
Fixed by #1834 (comment).

@chagai95
Contributor

chagai95 commented May 3, 2021

Thanks so much, just ran into this issue!

@14mRh4X0r
Contributor

We ran into this for the Telegram bridge at SNT, and were able to identify the problem (at least for us): if all events that would be returned by get_new_events_for_appservice are rejected, the events list will be empty.

PR incoming.
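
To make that failure mode concrete, here is a minimal self-contained sketch (toy stand-ins for the storage calls, not Synapse's real API): the store filters out rejected events, so a window containing only rejected events comes back empty, the "if not events: break" in the loop quoted above fires, and the stored position never advances past the bad window.

import asyncio
from typing import List, Tuple

# Toy model: stream positions 101-200 are all rejected events, which the
# storage layer silently filters out of its results.
REJECTED = set(range(101, 201))
CURRENT_MAX = 300

async def get_new_events_for_appservice(last_pos: int, limit: int) -> Tuple[int, List[int]]:
    upper_bound = min(last_pos + limit, CURRENT_MAX)
    events = [so for so in range(last_pos + 1, upper_bound + 1) if so not in REJECTED]
    return upper_bound, events

async def notify_loop(last_pos: int) -> int:
    while True:
        upper_bound, events = await get_new_events_for_appservice(last_pos, 100)
        if not events:
            # All 100 events in this window were rejected, so we give up here
            # without persisting upper_bound: the position is stuck for good.
            break
        # ... dispatch events to interested appservices ...
        last_pos = upper_bound  # set_appservice_last_pos(upper_bound)
    return last_pos

print(asyncio.run(notify_loop(100)))  # prints 100: never gets past the rejected window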

14mRh4X0r added a commit to 14mRh4X0r/synapse that referenced this issue Jun 2, 2021
Fixes matrix-org#1834

Signed-off-by: Willem Mulder <14mRh4X0r@gmail.com>
erikjohnston pushed a commit that referenced this issue Jun 7, 2021
Fixes #1834.

`get_new_events_for_appservice` internally calls `get_events_as_list`, which will filter out any rejected events. If all returned events are filtered out, `_notify_interested_services` will return without updating the last handled stream position. If there are 100 consecutive such events, processing will halt altogether.

Breaking the loop is now done by checking whether we're up-to-date with `current_max` in the loop condition, instead of relying on an empty `events` list.


Signed-off-by: Willem Mulder <14mRh4X0r@gmail.com>
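
Reusing the toy stand-ins from the sketch above, the shape of that fix looks roughly like this (a sketch of the idea, not the literal PR diff): the upper bound is persisted every iteration and the loop runs until it catches up with current_max, so a window of nothing-but-rejected events no longer stalls processing.

import asyncio
from typing import List, Tuple

# Same toy model as before: positions 101-200 are rejected events that the
# storage layer filters out before returning.
REJECTED = set(range(101, 201))
CURRENT_MAX = 300

async def get_new_events_for_appservice(last_pos: int, limit: int) -> Tuple[int, List[int]]:
    upper_bound = min(last_pos + limit, CURRENT_MAX)
    events = [so for so in range(last_pos + 1, upper_bound + 1) if so not in REJECTED]
    return upper_bound, events

async def notify_loop_fixed(last_pos: int) -> int:
    # Stop only once we have caught up with current_max, rather than as soon
    # as one window of events comes back empty.
    while last_pos < CURRENT_MAX:
        upper_bound, events = await get_new_events_for_appservice(last_pos, 100)
        for _event in events:
            pass  # ... dispatch to interested appservices ...
        last_pos = upper_bound  # set_appservice_last_pos(upper_bound) runs every iteration
    return last_pos

print(asyncio.run(notify_loop_fixed(100)))  # prints 300: the rejected window is skipped over
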
@MadLittleMods MadLittleMods added the A-Application-Service Related to AS support label Nov 10, 2021
@t3chguy
Member

t3chguy commented Dec 21, 2021

I hit this and had to manually resolve it; it would be nice if modern versions of Synapse were able to self-repair in this scenario.

@richvdh
Member

richvdh commented Dec 22, 2021

@t3chguy this issue is supposedly fixed - if you have evidence to the contrary, could you open a new issue with more details?

@t3chguy
Member

t3chguy commented Dec 22, 2021

#11629
