Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wireguard: non-invasive fix for permanent disconnects on unstable network (e.g. laptops) from dyndns endpoints #140890

Merged
merged 1 commit into from
Dec 2, 2022

Conversation

seb314
Copy link
Contributor

@seb314 seb314 commented Oct 7, 2021

Motivation for this change

When dns resolution fails with a permanent error ("Name or service not
known" instead of "Temporary failure in name resolution"), in the current
setup for dynamic dns with dynamicEndpointRefreshSeconds, wireguard
won't retry despite WG_ENDPOINT_RESOLUTION_RETRIES=infinity.

Ideally, dns would probably never report a permanent error for an
existing name, but unfortunately this does happen (maybe especially
with dynamic dns?) and cannot easily be fixed by the wireguard setup's
admin.

I can't think of a scenario where it is essential to not retry after a
negative dns response (given that the endpoint has been configured, the
dns name quite certainly exists), right?. On the other hand, a machine
that drops out of the vpn can be very annoying...

-> This change should improve reliability/connectivity.

somewhat related thread: #63869

Things done
  • Built on platform(s)
    • x86_64-linux (built & deployed a cherry-picked version on top of roughly nixos-unstable)
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • 21.11 Release Notes (or backporting 21.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@jian-lin
Copy link
Contributor

Thanks for your work.

I also have this issue. Say wireguard is set on your laptop, wireguard-{interface}-peer-{publickey}-refresh.service will become dead if your laptop lose network connection (e.g. switch to a new wifi) or your laptop doesn't connect to the network before wireguard-{interface}-peer-{publickey}-refresh.service starts. The latter case is possible if you don't set a connection available for all users in NetworkManager, in which case network-online.target will be reached before a connection is established.

wireguard won't retry despite WG_ENDPOINT_RESOLUTION_RETRIES=infinity. Ideally, dns would probably never report a permanent error for an existing name, but unfortunately this does happen (maybe especially with dynamic dns?)

According to the code, wireguard won't retry if it gets the EAI_NONAME error ( the corresponding error message is "Name or service not known" ), which will appear when the network connection is lose and is considered as a permanent error instead of a transient one.

Adding retry in the systemd unit does fix this issue. I do not known if there is a better way, like only retrying when the network is back online.

@seb314
Copy link
Contributor Author

seb314 commented Oct 28, 2021

Thanks for you further analysis @jian-lin !

I guess the basic periodic retry shouldn't cause significant system load unless there is a really large number of peers with dynamic IPs, but something smarter might be nicer (e.g. exponential backoff or when network connectivity is restored, as you suggested). We should be sure that such "smartness" doesn't introduce a new edge case where we accidentally fail to retry, though.

nixos/modules/services/networking/wireguard.nix Outdated Show resolved Hide resolved
nixos/modules/services/networking/wireguard.nix Outdated Show resolved Hide resolved
@stale
Copy link

stale bot commented May 2, 2022

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label May 2, 2022
@zarelit
Copy link
Member

zarelit commented May 2, 2022

I'm having the same problem, exactly on a laptop with no permanent connection. If time allows I should be able to review soon.

@jian-lin I think it's okay to keep the same interval, after all the invariant is "I want this peer to be refreshed every X seconds" and when the connection is brought back up you can expected it to have the right IP after X seconds

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label May 2, 2022
@jian-lin
Copy link
Contributor

jian-lin commented May 2, 2022

@zarelit I use a better workaround, patching wireguard-tools. By better, I mean wireguard peer services are not restarted. That patch is as follows.

Don't know if there is a way to distinguish between no network connection and EAI_NONAME.

From 1b33a0d5eb3b578767162ef465b5994afeb15dac Mon Sep 17 00:00:00 2001
From: linj <linj.dev@outlook.com>
Date: Wed, 24 Nov 2021 21:48:43 +0000
Subject: [PATCH 1/1] make EAI_NONAME transient

When a notebook loses its network connection, EAI_NONAME will appear if
you look up DNS. If dynamicEndpointRefreshSeconds is enabled, this error
will cause the peer systemd service to fail (but with 0 as exit code due
to the while loop) and clean all your wireguard peers due to the
ExecStopPost command. So your wireguard network will not work even after
your network is connected again. You have to restart your main wireguard
service to make it work again.

By applying this patch, this program will not exit if it get EAI_NONAME
error, whose error message is "Name or service not known". Instead, it
will treat this error like other transient errors and retry after a few
seconds. So your peer systemd unit will not exit and your wireguard peer
will not be cleaned. You wireguard network will be useable when you
connect to the network again.
---
 src/config.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/config.c b/config.c
index 211e887..51b90d9 100644
--- a/config.c
+++ b/config.c
@@ -252,7 +252,7 @@ static inline bool parse_endpoint(struct sockaddr *endpoint, const char *value)
 		 *
 		 * So this is what we do, except FreeBSD removed EAI_NODATA some time ago, so that's conditional.
 		 */
-		if (ret == EAI_NONAME || ret == EAI_FAIL ||
+		if (ret == EAI_FAIL ||
 			#ifdef EAI_NODATA
 				ret == EAI_NODATA ||
 			#endif
-- 
2.33.1

Copy link
Member

@zarelit zarelit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on a intermittently-connected laptop and works.

The script that refreshes the endpoints quits with error when it's not possible to resolve the name of the peer and so the unit dies. This will happen 100% on a laptop with no automatic connection at boot.

The PR does not change the script and just guarantees that the refresh operation is tried every dynamicEndpointRefreshSeconds even when the script fails so it's a safe merge unless I'm missing a usecase where having the peers not refreshed is somehow expected.

Somewhat related to #165474 i.e. lifecycle of wireguard-related units.

Also thanks @seb314

@zarelit
Copy link
Member

zarelit commented May 12, 2022

Also wrt/ #63869 (comment) This means that the restart of the unit is to handle the DNS not available case (e.g. because of lack of connection) and other transient failure are still handled by wg itself

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-already-reviewed/2617/507

@seb314
Copy link
Contributor Author

seb314 commented May 26, 2022

@zarelit thanks for the review!

Is there anything that can be done or improved about this PR that would improve the chances of it getting merged?

@zarelit
Copy link
Member

zarelit commented May 26, 2022

Both @nh2 and @WilliButz worked on dynamically refreshed peers and systemd units that didn't restart in this module so maybe they can help us here

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-already-reviewed/2617/574

@seb314 seb314 changed the title wireguard: "Restart=on-failure" for dynamicEndpointRefreshSeconds wireguard: non-invasive fix for permanent disconnects on unstable network (e.g. laptops) from dyndns endpoints Aug 8, 2022
@seb314
Copy link
Contributor Author

seb314 commented Aug 8, 2022

I changed the title to something more descriptive, in case the previous title was part of the reason why the PR has been stuck. (If there is something wrong with the new title, please let me know).

@seb314 seb314 force-pushed the wireguard/dyndns-restart-on-failure branch from cfefff7 to 2821cd9 Compare September 19, 2022 17:06
@seb314
Copy link
Contributor Author

seb314 commented Sep 19, 2022

@jian-lin @Majiir I added an (optional) option dynamicEndpointRefreshRestartSeconds to set the retry delay directly.
If the option is not set, the value dynamicEndpointRefreshSeconds will be used, so the default behaviour is unchanged vs the previous version of this PR.

@seb314
Copy link
Contributor Author

seb314 commented Sep 26, 2022

@jian-lin @majir @zarelit could you maybe check if the updated version (with 2821cd9) works for you too?

If it works for you too, and you can mark your review as approved, we might have a better chance to get someone with merge permissions to look at the PR and maybe merge it :-)

Copy link
Member

@zarelit zarelit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried out latest version (2022-09-27) with peers with mixed settings and works well

Copy link
Contributor

@jian-lin jian-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest some small changes.

nixos/modules/services/networking/wireguard.nix Outdated Show resolved Hide resolved
nixos/modules/services/networking/wireguard.nix Outdated Show resolved Hide resolved
Copy link
Contributor

@jian-lin jian-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@zarelit zarelit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@RaitoBezarius RaitoBezarius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sane to me (untested as I do not have any dynamic endpoint of this type).

Copy link
Contributor

@pwaller pwaller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, untested but I have this problem and this looks to me like a suitable fix.

@Artturin
Copy link
Member

Artturin commented Oct 22, 2022

squash the commits

the pr can be tested with https://nixos.wiki/wiki/Nixpkgs/Reviewing_changes#Modules

Make the dynamic-dns refresh systemd service (controlled via the
preexisting option dynamicEndpointRefreshSecond) robust to e.g. dns
failures that happen on intermittent network connections.

Background:

When dns resolution fails with a 'permanent' error ("Name or service not
known" instead of "Temporary failure in name resolution"), wireguard
won't retry despite WG_ENDPOINT_RESOLUTION_RETRIES=infinity.

-> This change should improve reliability/connectivity.

somewhat related thread: NixOS#63869
@seb314 seb314 force-pushed the wireguard/dyndns-restart-on-failure branch from f4f9839 to 82c5c3c Compare October 28, 2022 17:16
@seb314
Copy link
Contributor Author

seb314 commented Oct 28, 2022

squash the commits

done

Copy link
Contributor

@pwaller pwaller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Friendly ping, would love to see this land.

@Artturin Artturin merged commit dadca5c into NixOS:master Dec 2, 2022
@pwaller
Copy link
Contributor

pwaller commented Dec 2, 2022

🎉 thanks @Artturin !

@github-actions
Copy link
Contributor

github-actions bot commented Dec 2, 2022

Successfully created backport PR #204134 for release-22.11.

@seb314
Copy link
Contributor Author

seb314 commented Dec 2, 2022

Thanks @Artturin and thank you all for review, testing & support!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants