
Improve handling of SRV records for federation connections #3016

Merged: 1 commit into matrix-org:develop on Apr 9, 2018

Conversation

@silkeh (Contributor) commented Mar 20, 2018

This is an updated version of #1934 (which cannot be reopened, since this is a rework rather than a continuation).

SRV records are currently resolved not only to hosts, but also to the IP addresses belonging to those hosts. This is in contrast to hosts without an SRV record, for which the hostname is used directly. Removing the manual address resolution allows Twisted to do the heavy lifting.

The problem with the current SRV resolution behaviour is that while the code resolves both A and AAAA records, the server's native capabilities are not taken into account.
This breaks connections from IPv6-only servers to dual-stack servers, including setups using IPv6 transition mechanisms like DNS64/NAT64.
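For illustration, here is a minimal sketch of the approach (not the actual Synapse code): resolve the SRV record only to a hostname and port, then hand the hostname to Twisted's `HostnameEndpoint`, which performs the A/AAAA resolution and connection attempts itself. The `_matrix._tcp` service name and the 8448 fallback port are the standard federation defaults; priority/weight handling is omitted for brevity.

```python
from twisted.internet import reactor
from twisted.internet.endpoints import HostnameEndpoint
from twisted.names import client


def connect_to_matrix_server(domain, factory):
    """Look up the _matrix._tcp SRV record for `domain` and connect."""
    d = client.lookupService("_matrix._tcp.%s" % (domain,))

    def _connect(result):
        answers, _authority, _additional = result
        if answers:
            # Take the first SRV target; priority/weight sorting is
            # omitted in this sketch.
            srv = answers[0].payload
            host, port = srv.target.name.decode("ascii"), srv.port
        else:
            # No SRV record: fall back to the domain itself on the
            # default federation port.
            host, port = domain, 8448
        # HostnameEndpoint resolves both A and AAAA records and attempts
        # the resulting addresses itself, so no manual IP resolution is
        # needed here.
        return HostnameEndpoint(reactor, host, port).connect(factory)

    return d.addCallback(_connect)
```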

The concerns of the last PR were:

@14mRh4X0r:

This patch breaks connections to many hosts. The default config causes servers to only listen on IPv4 (see #1886), but Synapse will try to connect over IPv6 for domains with an AAAA-record, yielding many "Connection refused" messages.

#1886 has been fixed now. The configuration of the currently misconfigured servers (~100) will need to be updated; until then, "connection refused" messages will occur. This is technically correct behaviour.
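For context, a hedged illustration of that listening-side issue (not Synapse's actual listener code): a Twisted server bound only to the IPv4 wildcard address refuses IPv6 connections, which is exactly what a peer following an AAAA record runs into.

```python
from twisted.internet import reactor
from twisted.internet.protocol import Factory, Protocol

factory = Factory.forProtocol(Protocol)

# IPv4 wildcard only: clients connecting over IPv6 get "connection refused".
# reactor.listenTCP(8448, factory, interface="0.0.0.0")

# IPv6 wildcard: on most dual-stack hosts this accepts both address families.
reactor.listenTCP(8448, factory, interface="::")

reactor.run()
```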

@erikjohnston:

This is going to require us to properly retry different IP addresses when we get TCP-level failures on connection, otherwise it's (at the very least) going to cause delays as we back off the reconnects.

I think Happy Eyeballs should suffice. Twisted's HostnameEndpoint supports this, and it is used in the current implementation.

Actually performing a fallback on a TCP "connection refused" seems to me to be outside the scope of this PR (and the related code), as it can be caused by many other things.
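As a sketch of what that looks like in practice (an example hostname, not Synapse code): HostnameEndpoint resolves every A and AAAA record for the name and staggers connection attempts across the resulting addresses, so a single unreachable address family degrades to a short delay rather than a hard failure.

```python
from twisted.internet import reactor
from twisted.internet.endpoints import HostnameEndpoint
from twisted.internet.protocol import Factory, Protocol


def on_connect(proto):
    # Fires with whichever address family won the connection race.
    print("connected via", proto.transport.getPeer())
    proto.transport.loseConnection()


def on_error(failure):
    # Fires only once every resolved address has been tried and failed.
    print("all connection attempts failed:", failure.getErrorMessage())


endpoint = HostnameEndpoint(reactor, "example.org", 8448)
d = endpoint.connect(Factory.forProtocol(Protocol))
d.addCallbacks(on_connect, on_error)
d.addBoth(lambda _: reactor.stop())
reactor.run()
```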

Edit: this may also resolve the behaviour seen in #2850

@matrixbot (Member) commented:

Can one of the admins verify this patch?


@richvdh (Member) commented Apr 4, 2018

@matrixbot: test this please

@richvdh (Member) commented Apr 4, 2018

@silkeh: can you merge latest develop into your branch? I think the test failures may have been fixed.

@richvdh (Member) commented Apr 4, 2018

So I have no real idea why this was so complicated in the first place, but this looks correct now. I suspect it was so that IPv4 was preferred when both were available, though as you note, the reasons for that have largely gone away and our attempts to work around it are causing bigger problems.

I'd like to see the tests passing, but otherwise unless @erikjohnston objects I'm going to land this.

@silkeh force-pushed the improve-service-lookups branch from 9a314fd to 9e4160d on April 4, 2018 at 10:14
@silkeh (Contributor, Author) commented Apr 4, 2018

Can you merge latest develop into your branch? I think the test failures may have been fixed.

Done. All green :)

@richvdh (Member) commented Apr 4, 2018

Signed-off-by: Silke Hofstra <silke@slxh.eu>
@silkeh force-pushed the improve-service-lookups branch from 9e4160d to 72251d1 on April 4, 2018 at 10:26
@silkeh (Contributor, Author) commented Apr 4, 2018

@richvdh sure.

@richvdh (Member) commented Apr 9, 2018

thank you!

@richvdh richvdh changed the title Remove address resolution of hosts in SRV records Improve handling of SRV records for federation connections Apr 9, 2018
@richvdh richvdh merged commit 664adb4 into matrix-org:develop Apr 9, 2018
@ara4n (Member) commented Apr 10, 2018

@silkeh ftr this fixed my personal homeserver (which was trying to talk ipv6 most of the time despite not having an ipv6 stack...) so: thanks! :)

neilisfragile added a commit that referenced this pull request Apr 27, 2018
Changes in synapse v0.28.0 (2018-04-26)
===========================================

Bug Fixes:

* Fix quarantine media admin API and search reindex (PR #3130)
* Fix media admin APIs (PR #3134)

Changes in synapse v0.28.0-rc1 (2018-04-24)
===========================================

Minor performance improvement to federation sending and bug fixes.

(Note: This release does not include the state resolution changes discussed in Matrix Live.)

Features:

* Add metrics for event processing lag (PR #3090)
* Add metrics for ResponseCache (PR #3092)

Changes:

* Synapse on PyPy (PR #2760) Thanks to @Valodim!
* move handling of auto_join_rooms to RegisterHandler (PR #2996) Thanks to @krombel!
* Improve handling of SRV records for federation connections (PR #3016) Thanks to @silkeh!
* Document the behaviour of ResponseCache (PR #3059)
* Preparation for py3 (PR #3061, #3073, #3074, #3075, #3103, #3104, #3106, #3107, #3109, #3110) Thanks to @NotAFile!
* update prometheus dashboard to use new metric names (PR #3069) Thanks to @krombel!
* use python3-compatible prints (PR #3074) Thanks to @NotAFile!
* Send federation events concurrently (PR #3078)
* Limit concurrent event sends for a room (PR #3079)
* Improve R30 stat definition (PR #3086)
* Send events to ASes concurrently (PR #3088)
* Refactor ResponseCache usage (PR #3093)
* Clarify that SRV may not point to a CNAME (PR #3100) Thanks to @silkeh!
* Use str(e) instead of e.message (PR #3103) Thanks to @NotAFile!
* Use six.itervalues in some places (PR #3106) Thanks to @NotAFile!
* Refactor store.have_events (PR #3117)

Bug Fixes:

* Return 401 for invalid access_token on logout (PR #2938) Thanks to @dklug!
* Return a 404 rather than a 500 on rejoining empty rooms (PR #3080)
* fix federation_domain_whitelist (PR #3099)
* Avoid creating events with huge numbers of prev_events (PR #3113)
* Reject events which have lots of prev_events (PR #3118)