This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Admin visibility into federation status #7982

Open
chr-1x opened this issue Jul 29, 2020 · 7 comments
Labels
A-Admin-API, A-Federation, A-Metrics (metrics, measures, stuff we put in Prometheus), T-Enhancement (New features, changes in functionality, improvements in performance, or user-facing enhancements), z-p2 (Deprecated Label)

Comments

@chr-1x

chr-1x commented Jul 29, 2020

As the admin of a large Synapse server (3000 users), I frequently end up in situations where users report issues sending or receiving messages from other homeservers. (Often this is matrix.org, but sometimes it's other large homeservers such as kde.org or pine64.org.) I currently have little visibility into what could be causing these issues. Is it an issue with my homeserver, or the remote? If it's on our end, where should I be looking for problems?

In particular, there are a few key questions I don't currently have a way to answer:

  • Is my homeserver returning errors to remote homeservers? (If so, which homeservers and in what rooms?)
  • Is a particular remote homeserver returning errors / even online (from the perspective of my homeserver)?
  • When a message arrives late, what caused the delay?

I would love if this information was exposed in some kind of dashboard, but failing that an addition to the admin API would be acceptable. (Note, though, that I don't really have any insight into what's currently included in the admin API or how I would access it.) Looking around in the docs folder of the repo, I've found an unfinished looking document on room statistics and some information on Prometheus metrics, which is unhelpful to me as I don't use Prometheus (maybe I should, but it's not mentioned anywhere in the README or setup instructions).

@anoadragon453
Member

I agree that it would be nice to link these things from somewhere, whether the install guide or elsewhere, during Synapse setup. They're on the matrix.org Synapse guides page, but a nice bullet point list of "Now you've got Synapse installed, here's the various things you can do from here!" somewhere would be great. There is work underway to set up different guides for different Synapse use cases at the moment, which may address this problem somewhat.

As for yourself, yes, Prometheus metrics will give you insight into the overall health of your instance, but they're by no means a quick, simple digest of problems. That doesn't quite exist yet, I'm afraid. There is a wiki article on understanding some of these graphs. Other than that though, we often recommend one search through the logs. Yes, it's obviously not the most friendly user interface, but it'll get the job done.

I suspect what you want, as well as a graphical installation tool for Synapse, will be part of our current goal of making setup and maintenance of Synapse easier for sysadmins over the coming months.

> Looking around in the docs folder of the repo, I've found an unfinished looking document on room statistics

room_and_user_statistics.md is more development documentation than what's useful for sysadmins.

> I would love if this information was exposed in some kind of dashboard, but failing that an addition to the admin API would be acceptable.

A dashboard like that would probably be built on top of the admin API. There actually is a third-party project already doing that here: https://github.com/Awesome-Technologies/synapse-admin. Element Matrix Services uses the admin API extensively for its dashboards as well. I don't think any of these dashboards will present internal errors and their causes, though.

A starting point for a project doing so could come from just parsing Synapse's logs, as they give you request information, timings, errors, etc.
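As a rough illustration of the log-parsing idea, here is a minimal sketch that counts WARNING and ERROR lines per logger. The line layout in the regular expression (`timestamp - logger - line number - LEVEL - request - message`) is an assumption and has to be adapted to whatever formatter your log config actually uses. The sample lines and the logger names in them are made up for the demo and are not real Synapse output.

```python
import re
from collections import Counter
from typing import Iterable

# ASSUMPTION: each line looks like
#   "<date> <time> - <logger> - <lineno> - <LEVEL> - <request> - <message>"
# Adjust this pattern to match the formatter in your own log config.
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) - (?P<logger>\S+) - (?P<lineno>\d+) - "
    r"(?P<level>[A-Z]+) - (?P<request>\S*) - (?P<message>.*)$"
)


def count_problems_by_logger(lines: Iterable[str]) -> Counter:
    """Count WARNING/ERROR lines, keyed by (logger, level)."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        # Lines that do not fit the assumed layout (e.g. tracebacks) are skipped.
        if match and match["level"] in ("WARNING", "ERROR"):
            counts[(match["logger"], match["level"])] += 1
    return counts


# Made-up sample lines, purely for illustration.
sample = [
    "2020-07-30 12:00:01,123 - demo.sender - 100 - ERROR - req-1 - send to hs.example failed",
    "2020-07-30 12:00:02,456 - demo.http - 311 - INFO - req-2 - processed request",
    "2020-07-30 12:00:03,789 - demo.sender - 100 - ERROR - req-3 - send to hs.example failed",
    "    traceback continuation line that does not match",
]

problem_counts = count_problems_by_logger(sample)
for (logger, level), n in problem_counts.most_common():
    print(f"{n:5d}  {level:7s}  {logger}")
```

On a busy server, feeding the file object directly (`count_problems_by_logger(open(path))`) keeps memory use flat because lines are processed one at a time.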

@chr-1x
Author

chr-1x commented Jul 30, 2020

> Other than that though, we often recommend one search through the logs. Yes, it's obviously not the most friendly user interface, but it'll get the job done.

Unfortunately, given the activity level of my server, it's often impractical or extremely tedious, if not outright impossible, to find relevant errors in the logs unless I already know what I'm looking for. Some kind of automated parsing would certainly help.

@lovelaced

To my knowledge it's not kept up to date, but maybe something like https://github.com/turt2live/matrix-monitor-bot could be extended for these purposes? Would love to see a project like this revived/modernized.

@anoadragon453
Member

@chr-1x A quick tip for that is back-paginating through a search of "ERROR ".

On another topic, Synapse does have (very) limited support for structured logging, which outputs log lines as JSON objects rather than text and may help in building a parsing tool: https://github.com/matrix-org/synapse/blob/master/docs/structured_logging.md
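If log lines are emitted as one JSON object per line, filtering becomes a matter of decoding each line and checking a field. The sketch below assumes that layout. The key names (`level`, `namespace`, `log`) are assumptions passed in or used only in the made-up sample data, so check the linked document for the fields your version actually writes.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple


def iter_error_records(
    lines: Iterable[str],
    level_key: str = "level",  # ASSUMPTION: name of the severity field
    wanted: Tuple[str, ...] = ("ERROR", "CRITICAL"),
) -> Iterator[Dict[str, Any]]:
    """Yield decoded JSON log records whose level is in `wanted`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Not a JSON line (e.g. mixed output); ignore it.
            continue
        if isinstance(record, dict) and record.get(level_key) in wanted:
            yield record


# Made-up sample records, purely for illustration.
sample_lines = [
    '{"level": "INFO", "namespace": "demo.http", "log": "processed request"}',
    "this line is not JSON",
    '{"level": "ERROR", "namespace": "demo.sender", "log": "send failed"}',
]

errors = list(iter_error_records(sample_lines))
for record in errors:
    print(record)
```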

@hex-m

hex-m commented Sep 10, 2020

This could be part of the requested Admin UI.

@MadLittleMods
Contributor

Related to #10562 and #10553, which also lists some relevant existing Prometheus metrics to look at.

@MadLittleMods added the A-Metrics, A-Federation, and T-Enhancement labels on Nov 10, 2021
@dklimpel
Contributor

I'm thinking about starting on an API for this. Does anybody have a hint on where to start, or where to find more information?
For example, something like this:

async def get_catch_up_outstanding_destinations(
    self, after_destination: Optional[str]
) -> List[str]:
    """
    Gets at most 25 destinations which have outstanding PDUs to be caught up,
    and are not being backed off from

    Args:
        after_destination:
            If provided, all destinations must be lexicographically greater
            than this one.

    Returns:
        list of up to 25 destinations with outstanding catch-up.
        These are the lexicographically first destinations which are
        lexicographically greater than after_destination (if provided).
    """

async def get_destination_retry_timings(
    self,
    destination: str,
) -> Optional[DestinationRetryTimings]:
    """Gets the current retry timings (if any) for a given destination.

    Args:
        destination (str)

    Returns:
        None if not retrying
        Otherwise a dict for the retry scheme
    """
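The first docstring describes cursor-style paging: each call returns up to 25 destinations that sort after `after_destination`. A caller could walk the full list by passing the last destination of each page back in as the cursor. The sketch below shows that loop against `FakeStore`, a hypothetical in-memory stand-in written only for this example. It is not Synapse's storage class, and only the method name and the paging contract are taken from the docstring above.

```python
import asyncio
from bisect import bisect_right
from typing import List, Optional


class FakeStore:
    """Hypothetical stand-in that mimics the documented paging contract."""

    PAGE_SIZE = 25

    def __init__(self, destinations: List[str]) -> None:
        self._destinations = sorted(destinations)

    async def get_catch_up_outstanding_destinations(
        self, after_destination: Optional[str]
    ) -> List[str]:
        # Return the first PAGE_SIZE entries strictly greater than the cursor.
        start = (
            0
            if after_destination is None
            else bisect_right(self._destinations, after_destination)
        )
        return self._destinations[start : start + self.PAGE_SIZE]


async def list_all_outstanding(store: FakeStore) -> List[str]:
    """Collect every destination by following the cursor until a page is empty."""
    found: List[str] = []
    cursor: Optional[str] = None
    while True:
        page = await store.get_catch_up_outstanding_destinations(cursor)
        if not page:
            break
        found.extend(page)
        cursor = page[-1]  # the next page starts after the last item seen
    return found


# Made-up server names, purely for illustration.
destinations = [f"hs{i:03d}.example" for i in range(60)]
all_found = asyncio.run(list_all_outstanding(FakeStore(destinations)))
print(len(all_found), all_found[0], all_found[-1])
```

An admin endpoint built this way could expose the same cursor to clients instead of collecting everything server-side, which would keep each response bounded.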
