Admin visibility into federation status #7982

chr-1x · 2020-07-29T19:39:43Z

As admin of a large synapse server (3000 users) I frequently end up in situations where users are reporting issues sending or receiving messages from other homeservers. (Often, this is matrix.org, but sometimes it's other large homeservers such as kde.org or pine64.org). I currently have little visibility into what could be causing these issues. Is it an issue with my homeserver, or the remote? If it's on our end, where should I be looking for problems?

In particular, there are a few key questions I don't currently have a way to answer:

Is my homeserver returning errors to remote homeservers? (If so, which homeservers and in what rooms?)
Is a particular remote homeserver returning errors / even online (from the perspective of my homeserver)?
When a message arrives late, what caused the delay?

I would love if this information was exposed in some kind of dashboard, but failing that an addition to the admin API would be acceptable. (Note though that I don't really have any insight into what's currently included in the admin API or how I would access it). Looking around in the docs folder of the repo, I've found an unfinished looking document on room statistics and some information on prometheus metrics, which is unhelpful to me as I don't use prometheus (maybe I should, but it's not mentioned anywhere in the README or setup instructions).

anoadragon453 · 2020-07-30T00:27:25Z

I agree that it would be nice to link these things from somewhere, install guide or elsewhere, during Synapse setup. They're on the matrix.org Synapse guides page, but a nice bullet point list of "Now you've got Synapse installed, here's the various things you can do from here!" somewhere would be great. There is work underway to set up different guides for different Synapse use cases at the moment, which may address this problem somewhat.

As for yourself, yes prometheus metrics will give you insight into the overall health of your instance, but it's by no means a quick, simple digest of problems. That doesn't quite exist yet I'm afraid. There is a wiki article on understanding some of these graphs. Other than that though, we often recommend one search through the logs. Yes, it's obviously not the most friendly user interface, but it'll get the job done.

I suspect what you want, as well as a graphical installation tool for Synapse, will be part of our current goal of making setup and maintenance of Synapse easier for sysadmins over the coming months.

> Looking around in the docs folder of the repo, I've found an unfinished looking document on room statistics

room_and_user_statistics.md is more development documentation than what's useful for sysadmins.

> I would love if this information was exposed in some kind of dashboard, but failing that an addition to the admin API would be acceptable.

A dashboard like that would probably be built on top of the admin API. There is already a third-party project doing that (https://github.com/Awesome-Technologies/synapse-admin), and Element Matrix Services uses the admin API extensively for its dashboards as well. I don't think any of these dashboards will present internal errors and causes though.

A starting point for a project doing so could come from just parsing Synapse's logs, as they give you request information, timings, errors, etc.
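As a minimal sketch of that idea, the Python below tallies ERROR lines per logger so that the noisiest component stands out. It assumes each log line follows a `timestamp - logger - lineno - LEVEL - request - message` layout; the pattern, the helper name `count_by_logger`, and the sample usage are illustrative assumptions and would need adjusting to the formatter in your own log config.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed line layout (adjust to your log config):
#   "<date> <time> - <logger> - <lineno> - <LEVEL> - <request> - <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) - (?P<logger>\S+) - (?P<lineno>\d+) - "
    r"(?P<level>[A-Z]+) - (?P<request>\S*) - (?P<message>.*)$"
)


def count_by_logger(lines: Iterable[str], level: str = "ERROR") -> Counter:
    """Count lines at the given level, grouped by logger name.

    Lines that do not match the assumed layout (for example, traceback
    continuation lines) are skipped.
    """
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match and match.group("level") == level:
            counts[match.group("logger")] += 1
    return counts
```

Passing an open file handle for the homeserver log as `lines` and calling `.most_common(10)` on the result would list the ten loggers producing the most errors, which narrows down where to read more closely.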

chr-1x · 2020-07-30T01:49:09Z

> Other than that though, we often recommend one search through the logs. Yes, it's obviously not the most friendly user interface, but it'll get the job done.

Unfortunately, given the activity level of my server, it's often impractical or extremely tedious, if not outright impossible, to find relevant errors in the logs unless I already know what I'm looking for. Some kind of automated parsing would certainly help.

lovelaced · 2020-07-30T08:21:49Z

It's not kept up to date to my knowledge but maybe something like https://github.com/turt2live/matrix-monitor-bot could be extended for these purposes? Would love to see a project like this revived/modernized.

anoadragon453 · 2020-07-30T16:43:42Z

@chr-1x A quick tip for that is back-paginating through a search of "ERROR ".

On another topic, Synapse does have (very) limited support for structured logging, which outputs log lines as JSON objects rather than text. That may help with building a parsing tool: https://github.com/matrix-org/synapse/blob/master/docs/structured_logging.md
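To illustrate why JSON output is easier to work with, here is a hedged sketch that filters a stream of JSON log lines by severity. The key name `level` and the helper `iter_records` are assumptions made for this example; check the linked document for the fields your configuration actually emits.

```python
import json
from typing import Iterable, Iterator

# Standard Python logging severities, used to compare levels numerically.
LEVEL_RANK = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40, "CRITICAL": 50}


def iter_records(lines: Iterable[str], min_level: str = "ERROR") -> Iterator[dict]:
    """Yield parsed log records at or above min_level.

    Assumes one JSON object per line with a "level" key (an assumption,
    not a documented guarantee). Non-JSON lines are skipped.
    """
    threshold = LEVEL_RANK[min_level]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON line, ignore it
        if not isinstance(record, dict):
            continue
        level = str(record.get("level", "")).upper()
        if LEVEL_RANK.get(level, 0) >= threshold:
            yield record
```

Because each record is already a dictionary, grouping by any other field (logger, request ID, remote server name, if present) becomes a one-line change rather than another regular expression.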

hex-m · 2020-09-10T18:40:35Z

This could be part of the requested Admin UI.

MadLittleMods · 2021-11-10T22:09:02Z

Related to #10562 and #10553, which also list some relevant existing Prometheus metrics to look at.

dklimpel · 2021-11-19T14:55:21Z

I'm thinking about starting on an API for this. Does anybody have a hint on where to start, or where to find more information?
For example, something like this:

synapse/synapse/storage/databases/main/transactions.py

Lines 429 to 444 in 2b82ec4

    async def get_catch_up_outstanding_destinations(
        self, after_destination: Optional[str]
    ) -> List[str]:
        """
        Gets at most 25 destinations which have outstanding PDUs to be caught up,
        and are not being backed off from
        Args:
            after_destination:
                If provided, all destinations must be lexicographically greater
                than this one.
        Returns:
            list of up to 25 destinations with outstanding catch-up.
                These are the lexicographically first destinations which are
                lexicographically greater than after_destination (if provided).
        """

synapse/synapse/storage/databases/main/transactions.py

Lines 156 to 168 in 2b82ec4

    async def get_destination_retry_timings(
        self,
        destination: str,
    ) -> Optional[DestinationRetryTimings]:
        """Gets the current retry timings (if any) for a given destination.
        Args:
            destination (str)
        Returns:
            None if not retrying
            Otherwise a dict for the retry scheme
        """

anoadragon453 added A-Admin-API enhancement z-p2 (Deprecated Label) labels Jul 30, 2020

MadLittleMods added A-Metrics A-Federation T-Enhancement labels Nov 10, 2021

dklimpel mentioned this issue Nov 19, 2021

We need an admin API to reset federation retries for a given HS (SYN-222) #1266

Closed

dklimpel mentioned this issue Nov 22, 2021

Add admin API to get some information about federation status #11407

Merged

4 tasks

DMRobertson removed the z-enhancement label Aug 25, 2022

matrixbot mentioned this issue Dec 21, 2023

Admin visibility into federation status element-hq/synapse#7982

Open
