This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Add tuning options for federation client backoff #5556

Closed
wants to merge 20 commits into develop from hawkowl/more-aggressive-no-retry

Conversation

@hawkowl hawkowl commented Jun 26, 2019

This adds configuration options to be more aggressive about backing off from unreachable hosts. With the options enabled, this led to a massive increase in performance on my constrained server in rooms with many dead servers.

This pull request also adds some ancillary refactorings to make testing easier:

  • hs.reload_config(), which will reload the config portions of supporting modules. This could theoretically be hooked up to SIGHUP or similar in the future.
  • HomeserverTestCase.amend_config(), which will take a config dict and turn it into a proper config object, calling hs.reload_config() to propagate the changes. This means we're always going through the config parser, even when changing small options (see the sketch after this list).
  • A refactoring of how MatrixFederationHttpClient is built. It will now ask the HS for the client TLS factory instead of requiring it as an argument, since the client TLS factory also needs to be reloaded on config change.
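
A minimal, self-contained sketch of the reload-hook pattern described above; everything here apart from the reload_config() name is illustrative, not the actual Synapse implementation:

from typing import Any, Callable, Dict, List

def parse_and_validate(raw: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the real config parser; the real one would
    # validate values and raise on bad input.
    if not isinstance(raw, dict):
        raise TypeError("config must be a mapping")
    return raw

class Homeserver:
    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = parse_and_validate(config)
        self._reload_hooks: List[Callable[[Dict[str, Any]], None]] = []

    def register_reload_hook(self, hook: Callable[[Dict[str, Any]], None]) -> None:
        # Modules register a callback to re-read the options they care
        # about, e.g. the federation client's backoff settings or a
        # rebuilt client TLS factory.
        self._reload_hooks.append(hook)

    def reload_config(self, new_config: Dict[str, Any]) -> None:
        # Always go through the parser, even for small changes, then
        # notify every registered module. A test helper in the spirit
        # of amend_config() would merge its overrides and call this.
        self.config = parse_and_validate(new_config)
        for hook in self._reload_hooks:
            hook(self.config)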

@hawkowl hawkowl requested a review from a team June 26, 2019 05:26

@richvdh richvdh left a comment

hrm. the objective here is to handle brief outages such as those caused by restarts or networking blips. "Connection Refused" and "No route to host" sound like exactly the sort of error we might expect to see in those cases.

@@ -181,6 +181,7 @@ def send_transaction(self, transaction, json_data_callback=None):
long_retries=True,
backoff_on_404=True, # If we get a 404 the other side has gone
try_trailing_slash_on_400=True,
retry_on_dns_fail=False, # If we get DNS errors, the other side has gone
@erikjohnston erikjohnston left a comment

For what it's worth, this was to fix a bug we had where our local DNS server would SERVFAIL fairly frequently.

@erikjohnston: which was? the addition of this line?
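
As an illustration of the flag being discussed, here is a hypothetical retry loop gated on retry_on_dns_fail; a sketch only, not the actual MatrixFederationHttpClient code:

import socket
import time
from typing import Callable

def request_with_retries(host: str,
                         attempt: Callable[[str], bytes],
                         retry_on_dns_fail: bool,
                         max_retries: int = 3) -> bytes:
    delay = 0.5
    for i in range(max_retries + 1):
        try:
            return attempt(host)
        except socket.gaierror:
            # DNS resolution failed: either the name really is gone
            # (NXDOMAIN), or a flaky local resolver SERVFAILed, as
            # described above.
            if not retry_on_dns_fail or i == max_retries:
                raise  # treat the destination as gone
            time.sleep(delay)
            delay *= 2  # back off between attempts
    raise RuntimeError("unreachable")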

@hawkowl hawkowl commented Jun 26, 2019

@richvdh Hmm. That's true, but also: when joining #matrix:matrix.org, some 200 more servers instantly got culled from the retry list, which cut the CPU thrashing considerably, as Synapse was no longer constantly retrying 800 servers...
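
For context, a sketch of the kind of per-destination backoff tuning being described, with a "fatal" fast path that pushes dead hosts out of the hot retry list; all names and thresholds are hypothetical, not the PR's actual code:

import time
from dataclasses import dataclass
from typing import Dict

@dataclass
class RetryState:
    retry_interval: float = 0.0  # current backoff, in seconds
    next_retry_at: float = 0.0   # wall-clock time of next allowed try

class RetrySchedule:
    def __init__(self, multiplier: float = 5.0, cap: float = 86400.0,
                 fatal_floor: float = 3600.0) -> None:
        self.multiplier = multiplier
        self.cap = cap
        self.fatal_floor = fatal_floor
        self.state: Dict[str, RetryState] = {}

    def can_try(self, destination: str) -> bool:
        st = self.state.get(destination)
        return st is None or time.time() >= st.next_retry_at

    def record_failure(self, destination: str, fatal: bool = False) -> None:
        st = self.state.setdefault(destination, RetryState())
        # Exponential backoff, capped at a maximum interval.
        st.retry_interval = min((st.retry_interval or 60.0) * self.multiplier,
                                self.cap)
        if fatal:
            # e.g. Connection Refused / No route to host, when configured
            # to mean "the other side has gone": jump straight to a long
            # wait instead of climbing up the schedule.
            st.retry_interval = max(st.retry_interval, self.fatal_floor)
        st.next_retry_at = time.time() + st.retry_interval

    def record_success(self, destination: str) -> None:
        self.state.pop(destination, None)  # reset backoff entirely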

@codecov codecov bot commented Jun 27, 2019

Codecov Report

Merging #5556 into develop will decrease coverage by 0.11%.
The diff coverage is 52.17%.

@@             Coverage Diff             @@
##           develop    #5556      +/-   ##
===========================================
- Coverage    63.25%   63.14%   -0.12%     
===========================================
  Files          328      328              
  Lines        35877    35968      +91     
  Branches      5915     5922       +7     
===========================================
+ Hits         22695    22712      +17     
- Misses       11555    11623      +68     
- Partials      1627     1633       +6

@hawkowl hawkowl changed the title from "Be more aggressive with deciding what hosts are fatally down" to "Add tuning options for federation client backoff" on Jul 8, 2019
@ara4n ara4n commented Jul 8, 2019

(ooh, this looks really really nice. synapse is finally growing up!)

@hawkowl hawkowl requested a review from a team July 8, 2019 15:26
@richvdh richvdh commented Jul 8, 2019

Before everyone gets excited about this: I'm not convinced it's the right solution. If transient outages turn into missed transactions, then we'll end up with more out-of-order messages, backfill, extremities, and generally more instability in the protocol. The entire protocol is built around the idea that we retransmit on a 500, which allows you to restart your server, or to have a couple of dropped packets.
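
A minimal sketch of the retransmit-on-5xx pattern described above; illustrative only, not Synapse's actual transaction queue:

import random
import time
from typing import Callable, Deque

def send_until_acked(queue: Deque[dict],
                     send: Callable[[dict], int],
                     max_interval: float = 60.0) -> None:
    interval = 1.0
    while queue:
        txn = queue[0]
        if send(txn) == 200:
            queue.popleft()  # acked: ordering preserved, nothing lost
            interval = 1.0
        else:
            # 5xx (or timeout): back off and retransmit the *same*
            # transaction, rather than dropping it and creating gaps
            # that would later need backfill.
            time.sleep(interval + random.random())
            interval = min(interval * 2, max_interval)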

I'm also not a fan of having a million options that need to work correctly (and hence be tested) in all combinations, and that people need to fiddle with to find the least bad option for them. Synapse already has far too many options imho.

I think we should be considering other solutions here, starting with #5113, and probably also kicking dead servers out of rooms.

Basically: please could this be discussed within the Synapse team?

@richvdh richvdh removed the request for review from a team July 9, 2019 16:39
@hawkowl hawkowl added the z-outbound-federation-meltdown label ("Synapse melting down by trying to talk to too many servers"; now deprecated) on Jul 11, 2019
@hawkowl hawkowl commented Jul 19, 2019

This needs a bit more of a bottom-up approach.

@hawkowl hawkowl closed this Jul 19, 2019
@richvdh richvdh deleted the hawkowl/more-aggressive-no-retry branch April 6, 2022 12:45