
Add wireguard backend #1230

Merged: 1 commit merged into flannel-io:master on Dec 2, 2021
Conversation

@andreek (Contributor) commented Dec 8, 2019

Description

Adds wireguard support to flannel, based on the extension backend example.

Adds the wgctrl-go dependency to manage the wireguard device.

A dependency of wgctrl-go forced me to update the Go version. I migrated from glide to Go modules according to this guide.
I marked this as WIP because I disabled the ipsec tests. Read more about the problem in #1211. I will rebase and change the title when we have a solution for this issue. Fixed by 5dc1ff6 / 117c102.

Todos

  • Tests
  • Documentation
  • Release note

Release Note

- add wireguard backend

@andreek force-pushed the add-wireguard branch 8 times, most recently from 666b780 to 8bc0c3b on December 13, 2019 00:54
@andreek mentioned this pull request Feb 7, 2020
@andreek (Contributor, author) commented Mar 14, 2020

Looks like removing the vendor directory wasn't a reliable decision. I reverted this and rebased onto the latest changes.

@andreek force-pushed the add-wireguard branch 2 times, most recently from e806d62 to f8b6c12 on September 5, 2020 00:17
@andreek marked this pull request as draft on September 5, 2020 11:29
@andreek force-pushed the add-wireguard branch 2 times, most recently from ec75ceb to 2aeba33 on September 13, 2020 00:18
@andreek force-pushed the add-wireguard branch 2 times, most recently from 25a77cc to 95bee33 on September 23, 2020 21:15
@andreek changed the title from "WIP: Add wireguard backend" to "Add wireguard backend" on Sep 23, 2020
@andreek (Contributor, author) commented Sep 23, 2020

It's ready to be reviewed. Besides the provided e2e tests, I did some manual tests with a Kubernetes cluster.

@andreek marked this pull request as ready for review on September 23, 2020 21:26
@manuelbuil (Collaborator) commented:

Hey @andreek, could you please rebase? I will be able to review this PR in the next few days.

@andreek force-pushed the add-wireguard branch 2 times, most recently from 3057696 to 9463414 on October 21, 2021 19:36
@andreek (Contributor, author) commented Oct 21, 2021

Hey @manuelbuil, thank you for your time. The rebase is done.

The e2e tests are not working anymore. If I run make e2e-test on my local setup I get the following error message:

Running tests in dist/functional-test.sh
Running test_hostgw_perf... Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.244.2.1:2379 exceeded header timeout

error #0: client: endpoint http://10.244.2.1:2379 exceeded header timeout

I can reproduce this error message even with the current master branch, and I guess this is also the reason why the pipeline on this PR did not pass. But I'm still waiting to see some logs from Actions. Can you maybe give some feedback on this issue?
Update: Solved by disabling the firewall 😱

After fixing the pipeline I want to add another change. Currently the private key is generated every time flannel spawns. I think it would be better to store the key in the flannel config dir and only generate it if the file does not exist. I plan to implement this change over the weekend and will ping you when it has been added to the PR.
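
For illustration, a minimal sketch of that idea using wgctrl-go's wgtypes package (the function name and file path are made up for this example and are not taken from the PR):

import (
    "os"
    "strings"

    "golang.zx2c4.com/wireguard/wgctrl/wgtypes"
)

// loadOrGeneratePrivateKey (hypothetical name) returns the key stored at path,
// generating and persisting a new one only if the file does not exist yet.
func loadOrGeneratePrivateKey(path string) (wgtypes.Key, error) {
    if data, err := os.ReadFile(path); err == nil {
        // Reuse the persisted key so the node's public key stays stable across restarts.
        return wgtypes.ParseKey(strings.TrimSpace(string(data)))
    } else if !os.IsNotExist(err) {
        return wgtypes.Key{}, err
    }
    key, err := wgtypes.GeneratePrivateKey()
    if err != nil {
        return wgtypes.Key{}, err
    }
    // 0600: the private key must not be readable by other users.
    if err := os.WriteFile(path, []byte(key.String()), 0600); err != nil {
        return wgtypes.Key{}, err
    }
    return key, nil
}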

@andreek force-pushed the add-wireguard branch 3 times, most recently from 4c5aa7c to 765b029 on October 23, 2021 09:44
@manuelbuil (Collaborator) commented:

Another difference in the interface. When it works I see qlen 1000 in the wg interface. Not sure if this might be the problem:
NOT WORKING (using this PR):

75668: flannel-wg: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default 
    link/none 
    inet 10.42.0.0/32 brd 10.42.0.0 scope global flannel-wg
       valid_lft forever preferred_lft forever

WORKING (using wireguard via extension):

75659: flannel-wg: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet 10.42.0.0/32 scope global flannel-wg
       valid_lft forever preferred_lft forever

Sorry for such a bad experience. I'll take a look tomorrow and will try to reproduce this. Actually I've tested the latest changes only against the e2e test.

Does flannel log any error?

No worries! Thanks for your work. I don't see anything in the flannel logs. It'd be great if you could try to reproduce it; I hope it is not something related to my env.

@andreek (Contributor, author) commented Nov 3, 2021

Another difference in the interface. When it works I see qlen 1000 in the wg interface. Not sure if this might be the problem:
NOT WORKING (using this PR):

75668: flannel-wg: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default 
    link/none 
    inet 10.42.0.0/32 brd 10.42.0.0 scope global flannel-wg
       valid_lft forever preferred_lft forever

WORKING (using wireguard via extension):

75659: flannel-wg: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet 10.42.0.0/32 scope global flannel-wg
       valid_lft forever preferred_lft forever

Sorry for such a bad experience. I'll take a look tomorrow and will try to reproduce this. Actually I've tested the latest changes only against the e2e test.
Does flannel log any error?

No worries! Thanks for your work. I don't see anything in the flannel logs. It'd be great if you could try to reproduce it; I hope it is not something related to my env.

I was not able to reproduce this problem. I deployed a small Kubernetes cluster (1 control plane / 2 nodes) and applied kube-flannel.yml (with the backend type changed to wireguard and the container images changed to versions built from this branch). I deployed some nginx pods and ran curl containers to test routing between nodes. Everything works fine. But I didn't use an AWS environment.
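
For reference, the backend change in kube-flannel.yml's net-conf.json amounts to roughly this (the 10.244.0.0/16 network is the stock example value, not specific to this test):

{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "wireguard"
  }
}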

Can you describe in more detail how you deployed flannel in your cluster?

@manuelbuil (Collaborator) commented:

I tried again with the latest code and it worked :)

@manuelbuil (Collaborator) commented:

Could you please add a comment at the top of dist/extension-wireguard stating that it is deprecated in favour of this backend?

@manuelbuil (Collaborator) commented:

Tested dual-stack and it works too :)

@andreek (Contributor, author) commented Nov 11, 2021

@manuelbuil squashed

@sjoerdsimons (Contributor) commented:

To put code where my mouth is w.r.t. #1230 (comment):

Flannel implementation of a single wg dev: https://github.com/sjoerdsimons/flannel/tree/wip/sjoerd/wireguard-single-dev
Tested with k3s: https://github.com/sjoerdsimons/k3s/tree/flannel-wireguard-backend

As an example of how the wireguard interface can end up looking:

interface: flannel-wg
  public key: vBf2uv+CSiA+ZWNdIOslNNjONtJnqP9YmOe/u7oKHWc=
  private key: (hidden)
  listening port: 51820

peer: pQeO/TBYUPxChLaG1yjXQCrRl7EwnNiyl7ETJsMr9j8=
  endpoint: [2a00:bba0:114f:8c02:5054:ff:fe49:1271]:51820
  allowed ips: 10.42.1.0/24, fd00:42:0:1::/64
  latest handshake: 1 minute, 35 seconds ago
  transfer: 25.39 KiB received, 15.84 KiB sent
  persistent keepalive: every 25 seconds

peer: 6f/8MgMhs3BOGwoARQdhixcXVis6T2SBz6ARj+vugkg=
  endpoint: 84.247.45.71:33421
  allowed ips: 10.42.2.0/24, fd00:42:0:2::/64
  latest handshake: 2 minutes, 7 seconds ago
  transfer: 520 B received, 2.54 KiB sent
  persistent keepalive: every 25 seconds

The main idea for the endpoint is to use a heuristic to pick the address family that is most likely to end up with a working connection. Obviously a next step might be to make this configurable so it can be user-overridden.

And as mentioned before, as long as one side of the tunnel can connect to the other, everything is happy.
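
A rough sketch of what such a heuristic could look like (function and parameter names are invented for illustration; this is not the code from the branch above):

import "net"

// pickEndpoint (hypothetical) prefers IPv6 when both sides can use it,
// otherwise falls back to IPv4, and returns nil if no usable address exists.
// A user-facing override could later replace this default.
func pickEndpoint(peerV4, peerV6 net.IP, localHasV6 bool, port int) *net.UDPAddr {
    if localHasV6 && peerV6 != nil {
        return &net.UDPAddr{IP: peerV6, Port: port}
    }
    if peerV4 != nil {
        return &net.UDPAddr{IP: peerV4, Port: port}
    }
    return nil
}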

@vadorovsky (Contributor) left a comment:

In general looks good, thanks for working on this! Just some minor comments about:

  • error wrapping
  • using defer, especially if there are error checks before client.Close() (see the sketch after this list)
  • one piece of code duplication that would probably be nice to break into smaller pieces, if possible
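
A minimal sketch of the first two points, assuming the backend opens a wgctrl client (illustrative, not the exact code under review):

import (
    "fmt"

    "golang.zx2c4.com/wireguard/wgctrl"
)

func configureDevice() error {
    client, err := wgctrl.New()
    if err != nil {
        // Wrap with %w so callers can unwrap and inspect the underlying error.
        return fmt.Errorf("failed to open wgctrl client: %w", err)
    }
    // Defer the close right after the error check so every later return path
    // releases the client, instead of calling Close() manually before each return.
    defer client.Close()

    // ... device configuration would go here ...
    return nil
}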

@andreek (Contributor, author) commented Nov 21, 2021

In general looks good, thanks for working on this! Just some minor comments about:

  • error wrapping
  • using defer, especially if there are error checks before client.Close()
  • one piece of code duplication that would probably be nice to break into smaller pieces, if possible

Thanks for the review. Applied your suggestions and reduced code duplication.

@vadorovsky (Contributor) left a comment:

Thanks! LGTM

@flokli commented Nov 22, 2021

@andreek just asking back, did you take a look at the proposed implementation from #1230 (comment), which handles both v4 and v6 in the same wg dev?

I don't think we should default to requiring public IPv4 and IPv6 addresses to work; I'd rather prefer the above behaviour by default.

If you disagree, this implementation should still have some docs explaining the current behaviour, and the differences. It might be seen as a regression, at least for people coming from the k3s wireguard implementation.

@andreek (Contributor, author) commented Nov 23, 2021

@andreek just asking back, did you take a look at the proposed implementation from #1230 (comment), which handles both v4 and v6 in the same wg dev?

I don't think we should default to requiring public IPv4 and IPv6 addresses to work; I'd rather prefer the above behaviour by default.

I noticed it, but I gave a detailed response on this decision in the other thread. Believe me, in the first prototype implementation of dual-stack I did it the same way. But if you look deeper into flannel's dual-stack implementation, you'll notice that it is a rude decision to completely ignore either the v4 or the v6 public address. So I still think this is the best way to integrate wireguard into flannel's current dual-stack implementation.

My suggestion is still to add configuration for the wireguard backend to opt in to v4-only or v6-only inter-host communication in dual-stack mode. But as I mentioned, I want to address this in another PR, because it takes some refactoring to do this in a readable fashion and it requires more testing.

If you disagree, this implementation should still have some docs explaining the current behaviour, and the differences. It might be seen as a regression, at least for people coming from the k3s wireguard implementation.

Will add a comment to wireguard.go.

Can you explain the regression to me? I don't see that there is dual-stack support for wireguard right now. And k3s requires a public v6 address as well for dual-stack support:
https://github.com/k3s-io/k3s/blob/3da1bb3af2ed0a3ef06cc69bef8aaed42112ea7f/pkg/agent/flannel/flannel.go#L106-L113

@andreek (Contributor, author) commented Nov 23, 2021

Added the mentioned comment, rebased, and squashed.

@flokli commented Nov 23, 2021

Can you explain the regression to me? I don't see that there is dual-stack support for wireguard right now. And k3s requires a public v6 address as well for dual-stack support:
https://github.com/k3s-io/k3s/blob/3da1bb3af2ed0a3ef06cc69bef8aaed42112ea7f/pkg/agent/flannel/flannel.go#L106-L113

Eww, you're right. The whole extension framework is IPv4 only - I thought $PUBLIC_IP could be PublicIP and PublicIPv6, and was "called twice".

My suggestion is still to add configuration for the wireguard backend to opt in to v4-only or v6-only inter-host communication in dual-stack mode.

Alright, then let's tackle this in a follow-up PR.

@andreek (Contributor, author) commented Nov 24, 2021

Just did a minor update because I found some error logs that weren't using error wrapping.

@sjoerdsimons (Contributor) commented:

@andreek just asking back, did you take a look at the proposed implementation from [#1230 (comment)]
Will add a comment to wireguard.go.

Can you explain the regression to me? I don't see that there is dual-stack support for wireguard right now. And k3s requires a public v6 address as well for dual-stack support: https://github.com/k3s-io/k3s/blob/3da1bb3af2ed0a3ef06cc69bef8aaed42112ea7f/pkg/agent/flannel/flannel.go#L106-L113

No, it requires an IPv6 address on the interface; a link-local address is enough, there is no strict requirement for a public IPv6 address.

@sjoerdsimons (Contributor) commented:

@andreek just asking back, did you take a look at the proposed implementation from #1230 (comment), which handles both v4 and v6 in the same wg dev?
I don't think we should default to requiring public IPv4 and IPv6 addresses to work; I'd rather prefer the above behaviour by default.

I noticed it, but I gave a detailed response on this decision in the other thread. Believe me, in the first prototype implementation of dual-stack I did it the same way. But if you look deeper into flannel's dual-stack implementation, you'll notice that it is a rude decision to completely ignore either the v4 or the v6 public address. So I still think this is the best way to integrate wireguard into flannel's current dual-stack implementation.

Why is it rude to ignore it? They're just potential tunnel endpoints (it's unfortunate you can only provide one endpoint to the kernel). I still fail to find a good reason for this particular, and somewhat surprising, approach...

My suggestion is still to add configuration for the wireguard backend to opt in to v4-only or v6-only inter-host communication in dual-stack mode. But as I mentioned, I want to address this in another PR, because it takes some refactoring to do this in a readable fashion and it requires more testing.

... but please don't block this on my behalf. Adding configuration later for a single wireguard interface, rather than one per address family, seems reasonable.

@manuelbuil (Collaborator) commented:

I think we all agree that the extra config can be added in a later patch. Let's merge this one then! Thanks @andreek, great work!

@manuelbuil merged commit 7d0cbdf into flannel-io:master on Dec 2, 2021
@andreek deleted the add-wireguard branch on December 2, 2021 23:42