Add tuning options for federation client backoff #5556
hrm. the objective here is to handle brief outages such as those caused by restarts or networking blips. "Connection Refused" and "No route to host" sound like exactly the sort of error we might expect to see in those cases.
@@ -181,6 +181,7 @@ def send_transaction(self, transaction, json_data_callback=None):
             long_retries=True,
             backoff_on_404=True,  # If we get a 404 the other side has gone
             try_trailing_slash_on_400=True,
+            retry_on_dns_fail=False,  # If we get DNS errors, the other side has gone
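To make the trade-off behind a flag like `retry_on_dns_fail` concrete, here is a minimal sketch of a retry loop that treats DNS failures differently from other transient errors. It is not Synapse's implementation: `DNSLookupError`, `TransientNetworkError`, `send_with_retries` and its parameters are all hypothetical names invented for illustration.

```python
import time


class DNSLookupError(Exception):
    """Hypothetical stand-in for a failed DNS lookup (e.g. SERVFAIL)."""


class TransientNetworkError(Exception):
    """Hypothetical stand-in for errors such as "Connection refused"."""


def send_with_retries(do_request, retry_on_dns_fail=True, max_attempts=3,
                      base_delay=0.5, sleep=time.sleep):
    """Call do_request() until it succeeds or the attempts run out."""
    for attempt in range(max_attempts):
        last_attempt = attempt == max_attempts - 1
        try:
            return do_request()
        except DNSLookupError:
            # With retry_on_dns_fail=False a DNS failure is treated as
            # "the other side has gone" and surfaces immediately.
            if not retry_on_dns_fail or last_attempt:
                raise
        except TransientNetworkError:
            # Brief outages (restarts, networking blips) are retried.
            if last_attempt:
                raise
        # Exponential delay between attempts: base, 2*base, 4*base, ...
        sleep(base_delay * (2 ** attempt))
```

With `retry_on_dns_fail=False` the first DNS error ends the request, which is cheap for genuinely dead servers but gives up early if the local resolver is the flaky part.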
For what it's worth, this was to fix a bug we had where our local DNS server would SERVFAIL fairly frequently.
@erikjohnston: which was? the addition of this line?
@richvdh Hmm. That's true, but also, when joining
Codecov Report
@@             Coverage Diff             @@
##           develop    #5556      +/-   ##
===========================================
- Coverage    63.25%   63.14%   -0.12%
===========================================
  Files          328      328
  Lines        35877    35968      +91
  Branches      5915     5922       +7
===========================================
+ Hits         22695    22712      +17
- Misses       11555    11623      +68
- Partials      1627     1633       +6
(ooh, this looks really really nice. synapse is finally growing up!)
Before everyone gets excited about this: I'm not convinced it's the right solution.

If transient outages turn into missed transactions, then we'll end up with more out-of-order messages, backfill, extremities, and generally more instability in the protocol. The entire protocol is built around the idea that we retransmit on a 500, which allows you to restart your server, or to have a couple of dropped packets.

I'm also not a fan of having a million options that need to work correctly (and hence be tested) in all combinations, and that people need to fiddle with to find the least bad option for them. Synapse already has far too many options imho.

I think we should be considering other solutions here, starting with #5113, and probably also kicking dead servers out of rooms.

Basically: please could this be discussed within the Synapse team?
This needs a bit more of a bottom-up approach.
This adds configuration to be more aggressive about backing off. With the options enabled, this led to a massive increase in performance on my constrained server when in rooms with many dead servers.
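As a rough illustration of what tunable backoff means here, the sketch below tracks a retry interval per destination and grows it on each failure. It is an assumption-laden toy, not the code in this pull request: the class `DestinationBackoff` and the parameter names `min_interval`, `multiplier` and `max_interval` are hypothetical.

```python
class DestinationBackoff:
    """Tracks a retry interval per remote server (illustrative only)."""

    def __init__(self, min_interval=60.0, multiplier=2.0, max_interval=3600.0):
        self.min_interval = min_interval
        self.multiplier = multiplier
        self.max_interval = max_interval
        # destination -> (current retry interval, time we may retry after)
        self._state = {}

    def should_attempt(self, destination, now):
        """Return True if we are outside the backoff window."""
        _, retry_after = self._state.get(destination, (0.0, 0.0))
        return now >= retry_after

    def record_failure(self, destination, now):
        """Grow the interval for this destination and return the new value."""
        interval, _ = self._state.get(destination, (0.0, 0.0))
        if interval == 0:
            interval = self.min_interval
        else:
            interval = min(interval * self.multiplier, self.max_interval)
        self._state[destination] = (interval, now + interval)
        return interval

    def record_success(self, destination):
        """A successful request clears any backoff for the destination."""
        self._state.pop(destination, None)
```

A larger `multiplier` or `min_interval` is the "more aggressive" direction: dead servers are skipped for longer, at the cost of reacting more slowly when a server comes back.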
This pull request also adds some ancillary refactorings to make testing easier:
- hs.reload_config(), which will reload the config portions of supporting modules. This could theoretically be hooked up to SIGHUP or something, in the future.
- HomeserverTestCase.amend_config(), which will take a config dict and turn it into a proper config object, calling hs.reload_config() to propagate the changes. This means we're always going through the config parser, even when changing small options.
- A change to how MatrixFederationHttpClient is built. It will now ask the HS for the client TLS factory instead of requiring it as an argument, as the client TLS factory also needs to be reloaded on config change.
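The reload flow described above can be sketched roughly as follows. This is a toy model under stated assumptions, not Synapse's classes: `FakeConfig`, `FakeFederationClient`, `FakeHomeServer`, `register_reloadable` and the option name `federation_backoff_multiplier` are all hypothetical, and only the names `reload_config` and `amend_config` come from the description.

```python
class FakeConfig:
    """Hypothetical parsed config: parsing validates and fills defaults."""

    DEFAULTS = {"federation_backoff_multiplier": 2}

    def __init__(self, config_dict):
        unknown = set(config_dict) - set(self.DEFAULTS)
        if unknown:
            raise ValueError("unknown options: %s" % sorted(unknown))
        self.values = dict(self.DEFAULTS, **config_dict)


class FakeFederationClient:
    """Caches a config value, so it must be told when config changes."""

    def __init__(self, hs):
        self.hs = hs
        self.reload_config()

    def reload_config(self):
        self.backoff_multiplier = self.hs.config.values[
            "federation_backoff_multiplier"
        ]


class FakeHomeServer:
    def __init__(self, config):
        self.config = config
        self._reloadable = []

    def register_reloadable(self, module):
        self._reloadable.append(module)

    def reload_config(self):
        # Ask every supporting module to re-read its part of the config.
        for module in self._reloadable:
            module.reload_config()


def amend_config(hs, overrides):
    """Test helper: re-parse the merged dict, then propagate the change."""
    merged = dict(hs.config.values, **overrides)
    hs.config = FakeConfig(merged)  # always goes through the "parser"
    hs.reload_config()
```

Because `amend_config` rebuilds the config object rather than poking attributes directly, a typo in an option name fails in the parser instead of silently doing nothing.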