This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Add tuning options for federation client backoff #5556

Closed
wants to merge 20 commits into develop from hawkowl/more-aggressive-no-retry

Conversation

@hawkowl hawkowl commented Jun 26, 2019

This adds configuration options to be more aggressive about backing off from unreachable hosts. With the options enabled, this led to a massive increase in performance on my constrained server in rooms with many dead servers.

This pull request also adds some ancillary refactorings to make testing easier:

  • hs.reload_config(), which will reload the config portions of supporting modules. This could theoretically be hooked up to SIGHUP or similar in the future.
  • HomeserverTestCase.amend_config(), which will take a config dict and turn it into a proper config object, calling hs.reload_config() to propagate the changes. This means we're always going through the config parser, even when changing small options (see the sketch after this list).
  • A refactoring of how MatrixFederationHttpClient is built. It will now ask the HS for the client TLS factory instead of requiring it as an argument, since the client TLS factory also needs to be reloaded on config change.
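
A minimal, self-contained sketch of the reload-hook pattern described above; everything here apart from the reload_config() name is illustrative, not the actual Synapse implementation:

from typing import Any, Callable, Dict, List

def parse_and_validate(raw: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the real config parser; the real one would
    # validate values and raise on bad input.
    if not isinstance(raw, dict):
        raise TypeError("config must be a mapping")
    return raw

class Homeserver:
    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = parse_and_validate(config)
        self._reload_hooks: List[Callable[[Dict[str, Any]], None]] = []

    def register_reload_hook(self, hook: Callable[[Dict[str, Any]], None]) -> None:
        # Modules register a callback to re-read the options they care
        # about, e.g. the federation client's backoff settings or a
        # rebuilt client TLS factory.
        self._reload_hooks.append(hook)

    def reload_config(self, new_config: Dict[str, Any]) -> None:
        # Always go through the parser, even for small changes, then
        # notify every registered module. A test helper in the spirit
        # of amend_config() would merge its overrides and call this.
        self.config = parse_and_validate(new_config)
        for hook in self._reload_hooks:
            hook(self.config)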

@hawkowl hawkowl requested a review from a team June 26, 2019 05:26

@richvdh richvdh left a comment

hrm. the objective here is to handle brief outages such as those caused by restarts or networking blips. "Connection Refused" and "No route to host" sound like exactly the sort of error we might expect to see in those cases.

@@ -181,6 +181,7 @@ def send_transaction(self, transaction, json_data_callback=None):
long_retries=True,
backoff_on_404=True, # If we get a 404 the other side has gone
try_trailing_slash_on_400=True,
retry_on_dns_fail=False, # If we get DNS errors, the other side has gone
@erikjohnston erikjohnston left a comment

For what it's worth, this was to fix a bug we had where our local DNS server would SERVFAIL fairly frequently.

@erikjohnston: which was? the addition of this line?
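
As an illustration of the flag being discussed, here is a hypothetical retry loop gated on retry_on_dns_fail; a sketch only, not the actual MatrixFederationHttpClient code:

import socket
import time
from typing import Callable

def request_with_retries(host: str,
                         attempt: Callable[[str], bytes],
                         retry_on_dns_fail: bool,
                         max_retries: int = 3) -> bytes:
    delay = 0.5
    for i in range(max_retries + 1):
        try:
            return attempt(host)
        except socket.gaierror:
            # DNS resolution failed: either the name really is gone
            # (NXDOMAIN), or a flaky local resolver SERVFAILed, as
            # described above.
            if not retry_on_dns_fail or i == max_retries:
                raise  # treat the destination as gone
            time.sleep(delay)
            delay *= 2  # back off between attempts
    raise RuntimeError("unreachable")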

@hawkowl hawkowl commented Jun 26, 2019

@richvdh Hmm. That's true, but also: when joining #matrix:matrix.org, some 200 more servers instantly got culled from the retry list, which cut the CPU thrashing considerably, as Synapse was no longer constantly retrying 800 servers...
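
For context, a sketch of the kind of per-destination backoff tuning being described, with a "fatal" fast path that pushes dead hosts out of the hot retry list; all names and thresholds are hypothetical, not the PR's actual code:

import time
from dataclasses import dataclass
from typing import Dict

@dataclass
class RetryState:
    retry_interval: float = 0.0  # current backoff, in seconds
    next_retry_at: float = 0.0   # wall-clock time of next allowed try

class RetrySchedule:
    def __init__(self, multiplier: float = 5.0, cap: float = 86400.0,
                 fatal_floor: float = 3600.0) -> None:
        self.multiplier = multiplier
        self.cap = cap
        self.fatal_floor = fatal_floor
        self.state: Dict[str, RetryState] = {}

    def can_try(self, destination: str) -> bool:
        st = self.state.get(destination)
        return st is None or time.time() >= st.next_retry_at

    def record_failure(self, destination: str, fatal: bool = False) -> None:
        st = self.state.setdefault(destination, RetryState())
        # Exponential backoff, capped at a maximum interval.
        st.retry_interval = min((st.retry_interval or 60.0) * self.multiplier,
                                self.cap)
        if fatal:
            # e.g. Connection Refused / No route to host, when configured
            # to mean "the other side has gone": jump straight to a long
            # wait instead of climbing up the schedule.
            st.retry_interval = max(st.retry_interval, self.fatal_floor)
        st.next_retry_at = time.time() + st.retry_interval

    def record_success(self, destination: str) -> None:
        self.state.pop(destination, None)  # reset backoff entirely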

@codecov codecov bot commented Jun 27, 2019

Codecov Report

Merging #5556 into develop will decrease coverage by 0.11%.
The diff coverage is 52.17%.

@@             Coverage Diff             @@
##           develop    #5556      +/-   ##
===========================================
- Coverage    63.25%   63.14%   -0.12%     
===========================================
  Files          328      328              
  Lines        35877    35968      +91     
  Branches      5915     5922       +7     
===========================================
+ Hits         22695    22712      +17     
- Misses       11555    11623      +68     
- Partials      1627     1633       +6

@hawkowl hawkowl changed the title from "Be more aggressive with deciding what hosts are fatally down" to "Add tuning options for federation client backoff" on Jul 8, 2019
@ara4n ara4n commented Jul 8, 2019

(ooh, this looks really really nice. synapse is finally growing up!)

@hawkowl hawkowl requested a review from a team July 8, 2019 15:26
@richvdh richvdh commented Jul 8, 2019

Before everyone gets excited about this: I'm not convinced it's the right solution. If transient outages turn into missed transactions, then we'll end up with more out-of-order messages, backfill, extremities, and generally more instability in the protocol. The entire protocol is built around the idea that we retransmit on a 500, which allows you to restart your server, or to have a couple of dropped packets.
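
A minimal sketch of the retransmit-on-5xx pattern described above; illustrative only, not Synapse's actual transaction queue:

import random
import time
from typing import Callable, Deque

def send_until_acked(queue: Deque[dict],
                     send: Callable[[dict], int],
                     max_interval: float = 60.0) -> None:
    interval = 1.0
    while queue:
        txn = queue[0]
        if send(txn) == 200:
            queue.popleft()  # acked: ordering preserved, nothing lost
            interval = 1.0
        else:
            # 5xx (or timeout): back off and retransmit the *same*
            # transaction, rather than dropping it and creating gaps
            # that would later need backfill.
            time.sleep(interval + random.random())
            interval = min(interval * 2, max_interval)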

I'm also not a fan of having a million options that need to work correctly (and hence be tested) in all combinations, and that people need to fiddle with to find the least bad option for them. Synapse already has far too many options imho.

I think we should be considering other solutions here, starting with #5113, and probably also kicking dead servers out of rooms.

Basically: please could this be discussed within the Synapse team?

@richvdh richvdh removed the request for review from a team July 9, 2019 16:39
@hawkowl hawkowl added the z-outbound-federation-meltdown label ("Synapse melting down by trying to talk to too many servers"; now deprecated) on Jul 11, 2019
@hawkowl hawkowl commented Jul 19, 2019

This needs a bit more of a bottom-up approach.

@hawkowl hawkowl closed this Jul 19, 2019
@richvdh richvdh deleted the hawkowl/more-aggressive-no-retry branch April 6, 2022 12:45