wireguard: non-invasive fix for permanent disconnects on unstable network (e.g. laptops) from dyndns endpoints #140890

seb314 · 2021-10-07T20:30:37Z

Motivation for this change

When dns resolution fails with a permanent error ("Name or service not
known" instead of "Temporary failure in name resolution"), in the current
setup for dynamic dns with dynamicEndpointRefreshSeconds, wireguard
won't retry despite WG_ENDPOINT_RESOLUTION_RETRIES=infinity.

Ideally, dns would probably never report a permanent error for an
existing name, but unfortunately this does happen (maybe especially
with dynamic dns?) and cannot easily be fixed by the wireguard setup's
admin.

I can't think of a scenario where it is essential to not retry after a
negative dns response (given that the endpoint has been configured, the
dns name quite certainly exists), right?. On the other hand, a machine
that drops out of the vpn can be very annoying...

-> This change should improve reliability/connectivity.

somewhat related thread: #63869

Things done

jian-lin · 2021-10-28T17:59:26Z

Thanks for your work.

I also have this issue. Say wireguard is set on your laptop, wireguard-{interface}-peer-{publickey}-refresh.service will become dead if your laptop lose network connection (e.g. switch to a new wifi) or your laptop doesn't connect to the network before wireguard-{interface}-peer-{publickey}-refresh.service starts. The latter case is possible if you don't set a connection available for all users in NetworkManager, in which case network-online.target will be reached before a connection is established.

wireguard won't retry despite WG_ENDPOINT_RESOLUTION_RETRIES=infinity. Ideally, dns would probably never report a permanent error for an existing name, but unfortunately this does happen (maybe especially with dynamic dns?)

According to the code, wireguard won't retry if it gets the EAI_NONAME error ( the corresponding error message is "Name or service not known" ), which will appear when the network connection is lose and is considered as a permanent error instead of a transient one.

Adding retry in the systemd unit does fix this issue. I do not known if there is a better way, like only retrying when the network is back online.

seb314 · 2021-10-28T18:46:37Z

Thanks for you further analysis @jian-lin !

I guess the basic periodic retry shouldn't cause significant system load unless there is a really large number of peers with dynamic IPs, but something smarter might be nicer (e.g. exponential backoff or when network connectivity is restored, as you suggested). We should be sure that such "smartness" doesn't introduce a new edge case where we accidentally fail to retry, though.

nixos/modules/services/networking/wireguard.nix

stale · 2022-05-02T07:24:06Z

I marked this as stale due to inactivity. → More info

zarelit · 2022-05-02T16:59:51Z

I'm having the same problem, exactly on a laptop with no permanent connection. If time allows I should be able to review soon.

@jian-lin I think it's okay to keep the same interval, after all the invariant is "I want this peer to be refreshed every X seconds" and when the connection is brought back up you can expected it to have the right IP after X seconds

jian-lin · 2022-05-02T17:28:18Z

@zarelit I use a better workaround, patching wireguard-tools. By better, I mean wireguard peer services are not restarted. That patch is as follows.

Don't know if there is a way to distinguish between no network connection and EAI_NONAME.

From 1b33a0d5eb3b578767162ef465b5994afeb15dac Mon Sep 17 00:00:00 2001
From: linj <linj.dev@outlook.com>
Date: Wed, 24 Nov 2021 21:48:43 +0000
Subject: [PATCH 1/1] make EAI_NONAME transient

When a notebook loses its network connection, EAI_NONAME will appear if
you look up DNS. If dynamicEndpointRefreshSeconds is enabled, this error
will cause the peer systemd service to fail (but with 0 as exit code due
to the while loop) and clean all your wireguard peers due to the
ExecStopPost command. So your wireguard network will not work even after
your network is connected again. You have to restart your main wireguard
service to make it work again.

By applying this patch, this program will not exit if it get EAI_NONAME
error, whose error message is "Name or service not known". Instead, it
will treat this error like other transient errors and retry after a few
seconds. So your peer systemd unit will not exit and your wireguard peer
will not be cleaned. You wireguard network will be useable when you
connect to the network again.
---
 src/config.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/config.c b/config.c
index 211e887..51b90d9 100644
--- a/config.c
+++ b/config.c
@@ -252,7 +252,7 @@ static inline bool parse_endpoint(struct sockaddr *endpoint, const char *value)
 		 *
 		 * So this is what we do, except FreeBSD removed EAI_NODATA some time ago, so that's conditional.
 		 */
-		if (ret == EAI_NONAME || ret == EAI_FAIL ||
+		if (ret == EAI_FAIL ||
 			#ifdef EAI_NODATA
 				ret == EAI_NODATA ||
 			#endif
-- 
2.33.1

zarelit

Tested on a intermittently-connected laptop and works.

The script that refreshes the endpoints quits with error when it's not possible to resolve the name of the peer and so the unit dies. This will happen 100% on a laptop with no automatic connection at boot.

The PR does not change the script and just guarantees that the refresh operation is tried every dynamicEndpointRefreshSeconds even when the script fails so it's a safe merge unless I'm missing a usecase where having the peers not refreshed is somehow expected.

Somewhat related to #165474 i.e. lifecycle of wireguard-related units.

Also thanks @seb314

zarelit · 2022-05-12T08:40:37Z

Also wrt/ #63869 (comment) This means that the restart of the unit is to handle the DNS not available case (e.g. because of lack of connection) and other transient failure are still handled by wg itself

nixos-discourse · 2022-05-12T08:44:58Z

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-already-reviewed/2617/507

seb314 · 2022-05-26T15:40:33Z

@zarelit thanks for the review!

Is there anything that can be done or improved about this PR that would improve the chances of it getting merged?

zarelit · 2022-05-26T17:38:26Z

Both @nh2 and @WilliButz worked on dynamically refreshed peers and systemd units that didn't restart in this module so maybe they can help us here

nixos-discourse · 2022-08-08T10:49:56Z

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-already-reviewed/2617/574

seb314 · 2022-08-08T11:13:43Z

I changed the title to something more descriptive, in case the previous title was part of the reason why the PR has been stuck. (If there is something wrong with the new title, please let me know).

nixos/modules/services/networking/wireguard.nix

seb314 · 2022-09-19T17:22:21Z

@jian-lin @Majiir I added an (optional) option dynamicEndpointRefreshRestartSeconds to set the retry delay directly.
If the option is not set, the value dynamicEndpointRefreshSeconds will be used, so the default behaviour is unchanged vs the previous version of this PR.

seb314 · 2022-09-26T20:06:39Z

@jian-lin @majir @zarelit could you maybe check if the updated version (with 2821cd9) works for you too?

If it works for you too, and you can mark your review as approved, we might have a better chance to get someone with merge permissions to look at the PR and maybe merge it :-)

zarelit

Tried out latest version (2022-09-27) with peers with mixed settings and works well

jian-lin

Suggest some small changes.

nixos/modules/services/networking/wireguard.nix

jian-lin

LGTM

zarelit

LGTM

RaitoBezarius

Looks sane to me (untested as I do not have any dynamic endpoint of this type).

pwaller

LGTM, untested but I have this problem and this looks to me like a suitable fix.

Artturin · 2022-10-22T18:30:05Z

squash the commits

the pr can be tested with https://nixos.wiki/wiki/Nixpkgs/Reviewing_changes#Modules

Make the dynamic-dns refresh systemd service (controlled via the preexisting option dynamicEndpointRefreshSecond) robust to e.g. dns failures that happen on intermittent network connections. Background: When dns resolution fails with a 'permanent' error ("Name or service not known" instead of "Temporary failure in name resolution"), wireguard won't retry despite WG_ENDPOINT_RESOLUTION_RETRIES=infinity. -> This change should improve reliability/connectivity. somewhat related thread: NixOS#63869

seb314 · 2022-10-28T17:19:53Z

squash the commits

done

pwaller

Friendly ping, would love to see this land.

pwaller · 2022-12-02T15:49:04Z

🎉 thanks @Artturin !

github-actions · 2022-12-02T15:55:18Z

Successfully created backport PR #204134 for release-22.11.

seb314 · 2022-12-02T21:42:06Z

Thanks @Artturin and thank you all for review, testing & support!

github-actions bot added 6.topic: nixos 8.has: module (update) labels Oct 7, 2021

ofborg bot added 10.rebuild-darwin: 0 10.rebuild-linux: 0 labels Oct 7, 2021

seb314 requested a review from nh2 October 8, 2021 15:57

jian-lin requested changes Oct 29, 2021

View reviewed changes

nixos/modules/services/networking/wireguard.nix Outdated Show resolved Hide resolved

nixos/modules/services/networking/wireguard.nix Outdated Show resolved Hide resolved

stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label May 2, 2022

stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label May 2, 2022

zarelit approved these changes May 11, 2022

View reviewed changes

bobby285271 added the 12.approvals: 1 label Jun 10, 2022

zarelit mentioned this pull request Jul 5, 2022

Wireguard doesn't bring up peers #63869

Open

seb314 changed the title ~~wireguard: "Restart=on-failure" for dynamicEndpointRefreshSeconds~~ wireguard: non-invasive fix for permanent disconnects on unstable network (e.g. laptops) from dyndns endpoints Aug 8, 2022

SuperSandro2000 reviewed Aug 8, 2022

View reviewed changes

nixos/modules/services/networking/wireguard.nix Outdated Show resolved Hide resolved

ofborg bot added 10.rebuild-linux: 1-10 and removed 10.rebuild-linux: 0 labels Aug 10, 2022

bobby285271 removed the 12.approvals: 1 label Aug 12, 2022

jian-lin reviewed Aug 13, 2022

View reviewed changes

nixos/modules/services/networking/wireguard.nix Show resolved Hide resolved

seb314 force-pushed the wireguard/dyndns-restart-on-failure branch from cfefff7 to 2821cd9 Compare September 19, 2022 17:06

seb314 requested review from Majiir and jian-lin September 26, 2022 20:04

zarelit approved these changes Sep 27, 2022

View reviewed changes

jian-lin reviewed Sep 27, 2022

View reviewed changes

nixos/modules/services/networking/wireguard.nix Outdated Show resolved Hide resolved

nixos/modules/services/networking/wireguard.nix Show resolved Hide resolved

nixos/modules/services/networking/wireguard.nix Outdated Show resolved Hide resolved

seb314 requested a review from jian-lin September 27, 2022 21:22

jian-lin approved these changes Sep 28, 2022

View reviewed changes

seb314 requested a review from SuperSandro2000 September 28, 2022 07:10

bobby285271 added the 12.approvals: 1 label Sep 28, 2022

zarelit self-requested a review September 28, 2022 20:50

zarelit approved these changes Sep 28, 2022

View reviewed changes

bobby285271 added 12.approvals: 2 and removed 12.approvals: 1 labels Sep 28, 2022

RaitoBezarius approved these changes Oct 2, 2022

View reviewed changes

bobby285271 added 12.approvals: 3+ and removed 12.approvals: 2 labels Oct 2, 2022

pwaller approved these changes Oct 10, 2022

View reviewed changes

seb314 force-pushed the wireguard/dyndns-restart-on-failure branch from f4f9839 to 82c5c3c Compare October 28, 2022 17:16

pwaller approved these changes Dec 2, 2022

View reviewed changes

Artturin added the backport release-22.11 label Dec 2, 2022

Artturin merged commit dadca5c into NixOS:master Dec 2, 2022

github-actions bot mentioned this pull request Dec 2, 2022

[Backport release-22.11] wireguard: non-invasive fix for permanent disconnects on unstable network (e.g. laptops) from dyndns endpoints #204134

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wireguard: non-invasive fix for permanent disconnects on unstable network (e.g. laptops) from dyndns endpoints #140890

wireguard: non-invasive fix for permanent disconnects on unstable network (e.g. laptops) from dyndns endpoints #140890

seb314 commented Oct 7, 2021 •

edited

Loading

jian-lin commented Oct 28, 2021

seb314 commented Oct 28, 2021

stale bot commented May 2, 2022

zarelit commented May 2, 2022

jian-lin commented May 2, 2022 •

edited

Loading

zarelit left a comment

zarelit commented May 12, 2022

nixos-discourse commented May 12, 2022

seb314 commented May 26, 2022

zarelit commented May 26, 2022

nixos-discourse commented Aug 8, 2022

seb314 commented Aug 8, 2022

seb314 commented Sep 19, 2022

seb314 commented Sep 26, 2022

zarelit left a comment

jian-lin left a comment

jian-lin left a comment

zarelit left a comment

RaitoBezarius left a comment

pwaller left a comment

Artturin commented Oct 22, 2022 •

edited

Loading

seb314 commented Oct 28, 2022

pwaller left a comment

pwaller commented Dec 2, 2022

github-actions bot commented Dec 2, 2022

seb314 commented Dec 2, 2022

wireguard: non-invasive fix for permanent disconnects on unstable network (e.g. laptops) from dyndns endpoints #140890

wireguard: non-invasive fix for permanent disconnects on unstable network (e.g. laptops) from dyndns endpoints #140890

Conversation

seb314 commented Oct 7, 2021 • edited Loading

Motivation for this change

Things done

jian-lin commented Oct 28, 2021

seb314 commented Oct 28, 2021

stale bot commented May 2, 2022

zarelit commented May 2, 2022

jian-lin commented May 2, 2022 • edited Loading

zarelit left a comment

Choose a reason for hiding this comment

zarelit commented May 12, 2022

nixos-discourse commented May 12, 2022

seb314 commented May 26, 2022

zarelit commented May 26, 2022

nixos-discourse commented Aug 8, 2022

seb314 commented Aug 8, 2022

seb314 commented Sep 19, 2022

seb314 commented Sep 26, 2022

zarelit left a comment

Choose a reason for hiding this comment

jian-lin left a comment

Choose a reason for hiding this comment

jian-lin left a comment

Choose a reason for hiding this comment

zarelit left a comment

Choose a reason for hiding this comment

RaitoBezarius left a comment

Choose a reason for hiding this comment

pwaller left a comment

Choose a reason for hiding this comment

Artturin commented Oct 22, 2022 • edited Loading

seb314 commented Oct 28, 2022

pwaller left a comment

Choose a reason for hiding this comment

pwaller commented Dec 2, 2022

github-actions bot commented Dec 2, 2022

seb314 commented Dec 2, 2022

seb314 commented Oct 7, 2021 •

edited

Loading

jian-lin commented May 2, 2022 •

edited

Loading

Artturin commented Oct 22, 2022 •

edited

Loading