Unreachable servers lead to poor performance #6895

evilham · 2020-02-12T00:20:18Z

Preamble

This is similar to #5373 and I think it would get very bad if #3765 is not "properly dealt with" and lingering homeservers start being widespread (with e.g. "this domain is for sale" wildcard DNS that drop connections on 8448 or something), or if network splits start happening for whatever reason.
But it is particular enough to be worthy of explaining in its own issue, since this instance of this more general problem may be mitigated somewhat easily.

Description

There is a high number of Matrix/Synapse servers with bogus DNS data. By bogus DNS data in this case, I mean something isomorphic to(*):
Matrix domain: example.org
Delegating to: https://matrix.example.org:443
Resolving matrix.example.org as:

A record for an IPv4
AAAA record for an IPv6

Where actually that IPv6 is not listening on port 443, or the port is being blocked by the firewall or they have networking issues, .... (insert plethora of reasons to get IPv6 wrong here).

(*): As a note, this is not exclusive to IPv6, I have recently worked with CISCO's top 1M websites dataset and checked their DNS and have found... interesting things like A records in the 127.0.0.0/8 subnet or AAAA records with ::1 or fe80::.
Another way to reach this could be with a network split (e.g. country-wide bans to certain servers, ...)

With this in mind, I have come across an issue where:
A sufficient number of unreachable matrix servers will degrade performance critically. I believe this is because the server will see events from those "unreachable" servers through other servers sharing a room and start trying to reach out to them (fetch events, send events, ...).

Steps to reproduce

Setup an synapse in an IPv6-only network (yes, it's 2020!)
Use DNS64 (**), so that the v4 world is reachable form synapse
Use a dual-stack server to proxy incoming v4 connections to synapse
Join e.g. #synapse-admins:matrix.org (modulo typos)
Check the logs and see increasingly federationclient messages about timeouts and unreachable networks
Wait for a few hours and see performance degrade significantly (sending of events --> 10-15 seconds)
Leave the afore mentioned room and see performance be restored over a few minutes.

This by the way describes real-life set ups, maybe not numerous, but I know different people that have reached a similar setup independently.

(**): DNS64 works just like regular DNS, except, it adds AAAA records for sites that only have A records. This works by embedding the IPv4 addresses in the remaining bits of a /64 subnet, then doing stateful NAT-ing.
This becomes an issue, in the scenario described above, because there are AAAA records, but are not properly set up; which in turn means that the problematic host is effectively unreachable.

Version information

Confirmed in multiple servers with multiple synapse versions including 1.10.0rc5

Possible (short-term) solution

While there are amazing experiments running Matrix on different environments, right now, most real deployment is Synapse using standard DNS and TCP/IP and synapse itself is allowing faulty configuration.

So, maybe not allowing for this faulty config would be good? A two-stage process with a boolean in the configuration to disable this behaviour would be ideal:
1st stage: "health-check" self on start and warn when a similar situation to that described occurs
2nd stage (few versions down the road): "health-check" self on start and refuse to run if not overridden or corresponding boolean is set in config.

This would take care of the very obvious offenders, while being relatively nonintrusive (no changes in spec, ...).

The text was updated successfully, but these errors were encountered:

evilham · 2020-02-12T00:36:58Z

Addendum: #5113 doesn't fix these particular cases as incoming connections are possible, so instead the number of misconfigured servers keeps growing and the number of requests too, so at some point there are basically busy loops with all threads waiting on a connection to be dropped/time out.

aaronraimist · 2020-02-12T00:40:13Z

See also: #5406

richvdh · 2020-02-12T11:00:13Z

Yeah, the fact that incoming connections cause the backoff to be reset is less than ideal in this situation.

A self-check on startup could be helpful, though note that there are legitimate configurations (eg, non-federating instances) where a server might not be able to reach itself over federation. Also: it's super-annoying when a service won't restart when you're having a temporary network outage.

evilham · 2020-02-12T11:14:29Z

Also: it's super-annoying when a service won't restart when you're having a temporary network outage.

That's a great point, maybe if proposed 1st stage warns in an alarming enough fashion, 2nd stage is not needed? :-D

In any case: that proposal is more of a patch for this instance of the problem and it's worth thinking about other scenarios that can lead to something similar.

kescherCode · 2021-01-05T14:07:28Z

In my head, a fix for this issue would be "remembering" a connection failure for certain DNS records for a certain time (so, stored in the database, usually postgres) and trying the A record instead. If both fail, force a backoff independent of incoming connections by these homeservers.

And yes, a self-check that enforces valid DNS settings would be a good idea, but homeservers might use an internal, static DNS configuration for their own domain names, which doesn't make the same mistakes as their public DNS zone.

fuomag9 · 2022-02-28T22:05:52Z

In my case removing the networkmanager-based /etc/resolv.conf and specifying my dns directly seem to have solved my issue

jae1911 · 2023-03-24T13:39:28Z

I've got this issue on my server since I switched to V6 only.
DNS64 works and I can reach most servers but then Synapse will slow down to the point of not receiving any federation and not being able to send anything.

richvdh changed the title ~~Bogus DNS data leads to poor performance~~ Unreachable servers lead to poor performance Feb 12, 2020

richvdh added z-feature (Deprecated Label) z-p2 (Deprecated Label) labels Feb 14, 2020

ShadowJonathan mentioned this issue Aug 30, 2021

Fix resolution of domains with only AAAA records (#10694) #10717

Closed

reivilibre added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Mar 2, 2022

matrixbot mentioned this issue Dec 21, 2023

Unreachable servers lead to poor performance element-hq/synapse#6895

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unreachable servers lead to poor performance #6895

Unreachable servers lead to poor performance #6895

evilham commented Feb 12, 2020 •

edited

Loading

evilham commented Feb 12, 2020

aaronraimist commented Feb 12, 2020

richvdh commented Feb 12, 2020

evilham commented Feb 12, 2020 •

edited

Loading

kescherCode commented Jan 5, 2021

fuomag9 commented Feb 28, 2022 •

edited

Loading

jae1911 commented Mar 24, 2023

Unreachable servers lead to poor performance #6895

Unreachable servers lead to poor performance #6895

Comments

evilham commented Feb 12, 2020 • edited Loading

Preamble

Description

Steps to reproduce

Version information

Possible (short-term) solution

evilham commented Feb 12, 2020

aaronraimist commented Feb 12, 2020

richvdh commented Feb 12, 2020

evilham commented Feb 12, 2020 • edited Loading

kescherCode commented Jan 5, 2021

fuomag9 commented Feb 28, 2022 • edited Loading

jae1911 commented Mar 24, 2023

evilham commented Feb 12, 2020 •

edited

Loading

evilham commented Feb 12, 2020 •

edited

Loading

fuomag9 commented Feb 28, 2022 •

edited

Loading