Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making IPIP/tunnel and override-nexthop independent #1025

Merged

Conversation

yydzhou
Copy link
Contributor

@yydzhou yydzhou commented Jan 21, 2021

Since both IPIP/tunnel and overrride-nexthop have their own parameters to enable/disable, it's better to make the switch independent each other. This PR is needed to support the case where both cross subnet iBGP peering (via tunnel) and multiple upstream routers peering(via override-nexthop) feature are needed.

@aauren
Copy link
Collaborator

aauren commented Jan 21, 2021

This is a fix for the issue described in #1022

@aauren
Copy link
Collaborator

aauren commented Jan 21, 2021

@murali-reddy I think that the concern that you had in #518 is still valid. When it creates the tunnel interface it is going to connect it to what kube-router thinks of in terms of nodeIP, however, the traffic might be ingressing on a different interface because of --override-nexthop. This could potentially lead to some traffic that cannot route.

I can see how some users who understand their network well, may desire this functionality and I think that kube-router should be flexible enough to provide it. However, I'm thinking that this PR should also include some sort of documentation warning users about combining these options and the potential risks involved.

What are your thoughts?

@murali-reddy
Copy link
Member

@murali-reddy I think that the concern that you had in #518 is still valid. When it creates the tunnel interface it is going to connect it to what kube-router thinks of in terms of nodeIP, however, the traffic might be ingressing on a different interface because of --override-nexthop. This could potentially lead to some traffic that cannot route.

Tunnels are established across the nodes in different subnets with only Kubernetes node IP's as source and destinations. I only thought off between nodes (in the same or different subnets) doing iBGP peering they should see "next hop" as only Kubernetes node IP's. So in @yydzhou use-case they dont want to advertise pod CIDR's to external BGP peer's so they will not have any issues. But you are right we need to understand the impact on network toplogies where a node is multi-homed and connected to two upstream routers for redundency and they advertise the pod CIDR to the external BGP peers.

However, I'm thinking that this PR should also include some sort of documentation warning users about combining these options and the potential risks involved. What are your thoughts?

In general we need to document how to use these knobs (iBGP peering, external BGP peering, route reflectors etc) for different topolgies and bit of prescriptive configuration of these knobs for various topolgies.

@yydzhou
Copy link
Contributor Author

yydzhou commented Jan 21, 2021

@murali-reddy I think that the concern that you had in #518 is still valid. When it creates the tunnel interface it is going to connect it to what kube-router thinks of in terms of nodeIP, however, the traffic might be ingressing on a different interface because of --override-nexthop. This could potentially lead to some traffic that cannot route.

Tunnels are established across the nodes in different subnets with only Kubernetes node IP's as source and destinations. I only thought off between nodes (in the same or different subnets) doing iBGP peering they should see "next hop" as only Kubernetes node IP's. So in @yydzhou use-case they dont want to advertise pod CIDR's to external BGP peer's so they will not have any issues. But you are right we need to understand the impact on network toplogies where a node is multi-homed and connected to two upstream routers for redundency and they advertise the pod CIDR to the external BGP peers.

However, I'm thinking that this PR should also include some sort of documentation warning users about combining these options and the potential risks involved. What are your thoughts?

In general we need to document how to use these knobs (iBGP peering, external BGP peering, route reflectors etc) for different topolgies and bit of prescriptive configuration of these knobs for various topolgies.

I had some experiment and found that there is still an issue, even if I didn't advertise the pod CIDR to external BGP.
When both override-nexthop and tunnel are enabled, the kube-router would unexpectedly change my default route.

@murali-reddy @aauren Is this behavior expected? or should be fixed as a bug in this case?

I0121 17:50:51.733721 2591628 network_routes_controller.go:515] injectRoute Path Looks Like: nlri:<type_url:"type.googleapis.com/gobgpapi.IPAddressPrefix" value:"\022\0070.0.0.0" > pattrs:<type_url:"type.googleapis.com/gobgpapi.OriginAttribute" > pattrs:<type_url:"type.googleapis.com/gobgpapi.AsPathAttribute" value:"\n\r\010\002\022\t\374\377\003\331\371\003\275\372\003" > pattrs:<type_url:"type.googleapis.com/gobgpapi.NextHopAttribute" value:"\n\r139.177.225.1" > pattrs:<type_url:"type.googleapis.com/gobgpapi.ExtendedCommunitiesAttribute" value:"\n<\n,type.googleapis.com/gobgpapi.UnknownExtended\022\014\010\317\001\022\007\341\000\000\000\000\375=" > age:<seconds:1611251451 > validation:<> family:<afi:AFI_IP safi:SAFI_UNICAST > source_asn:65532 source_id:"10.234.15.5" neighbor_ip:"139.177.225.1" local_identifier:1
I0121 17:50:51.734534 2591628 network_routes_controller.go:554] yydzhou: in injectRoute, before cleanup route and tennel:
 default via 139.177.225.1 dev eth1
10.0.0.0/8 via 10.234.107.1 dev eth0
10.234.107.0/26 dev eth0 proto kernel scope link src 10.234.107.19
139.177.225.0/26 dev eth1 proto kernel scope link src 139.177.225.9
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.24.0.0/25 via 10.234.107.18 dev eth0 proto 17
172.24.0.128/25 dev tun-10234107137 proto 17 src 10.234.107.19

I0121 17:50:51.735288 2591628 network_routes_controller.go:581] yydzhou: in injectRoute. after clean up, before IPIP setup:
 default via 139.177.225.1 dev eth1
10.0.0.0/8 via 10.234.107.1 dev eth0
10.234.107.0/26 dev eth0 proto kernel scope link src 10.234.107.19
139.177.225.0/26 dev eth1 proto kernel scope link src 139.177.225.9
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.24.0.0/25 via 10.234.107.18 dev eth0 proto 17
172.24.0.128/25 dev tun-10234107137 proto 17 src 10.234.107.19

I0121 17:50:51.735303 2591628 network_routes_controller.go:584] yydzhou: in injectRoute, during IPIP setup
I0121 17:50:51.735441 2591628 network_routes_controller.go:610] Tunnel interface: tun-1391772251 for the node 139.177.225.1 already exists.
I0121 17:50:51.736782 2591628 network_routes_controller.go:639] yydzhou: after IPIP setup:
 default via 139.177.225.1 dev eth1
10.0.0.0/8 via 10.234.107.1 dev eth0
10.234.107.0/26 dev eth0 proto kernel scope link src 10.234.107.19
139.177.225.0/26 dev eth1 proto kernel scope link src 139.177.225.9
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.24.0.0/25 via 10.234.107.18 dev eth0 proto 17
172.24.0.128/25 dev tun-10234107137 proto 17 src 10.234.107.19

I0121 17:50:51.736796 2591628 network_routes_controller.go:644] Inject route: '0.0.0.0/0 via 139.177.225.1' from peer to routing table
I0121 17:50:51.737589 2591628 network_routes_controller.go:647] yydzhou: in injectRoute., before RouteReplace:
 default via 139.177.225.1 dev eth1                 <---------------------- the correct default route
10.0.0.0/8 via 10.234.107.1 dev eth0
10.234.107.0/26 dev eth0 proto kernel scope link src 10.234.107.19
139.177.225.0/26 dev eth1 proto kernel scope link src 139.177.225.9
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.24.0.0/25 via 10.234.107.18 dev eth0 proto 17
172.24.0.128/25 dev tun-10234107137 proto 17 src 10.234.107.19

I0121 17:50:51.738309 2591628 network_routes_controller.go:650] yydzhou: in injectRoute., after RouteReplace:
 default dev tun-1391772251 proto 17 src 10.234.107.19 <---------------------- the correct default route
10.0.0.0/8 via 10.234.107.1 dev eth0
10.234.107.0/26 dev eth0 proto kernel scope link src 10.234.107.19
139.177.225.0/26 dev eth1 proto kernel scope link src 139.177.225.9
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.24.0.0/25 via 10.234.107.18 dev eth0 proto 17
172.24.0.128/25 dev tun-10234107137 proto 17 src 10.234.107.19

@murali-reddy
Copy link
Member

@yydzhou Looks like a bug

I0121 17:50:51.736796 2591628 network_routes_controller.go:644] Inject route: '0.0.0.0/0 via 139.177.225.1' from peer to routing table

your external BGP peer advertised a default route to the node.

I0121 17:50:51.738309 2591628 network_routes_controller.go:650] yydzhou: in injectRoute., after RouteReplace:
default dev tun-1391772251 proto 17 src 10.234.107.19 <---------------------- the correct default route

tunnel tun-1391772251 which is basically tunnel to the external BGP peer 139.177.225.1 should not have been created.

Ideally this check

https://github.com/cloudnativelabs/kube-router/blob/master/pkg/controllers/routing/network_routes_controller.go#L536

should be a check if next hop can be reached directly by L2 that would solve the problem.

@yydzhou
Copy link
Contributor Author

yydzhou commented Jan 27, 2021

@murali-reddy Thanks for the info. Yes, the default route was advertised by our public network router, so the tunnel was created based on that. However, there is still a problem as you pointed out. The check https://github.com/cloudnativelabs/kube-router/blob/master/pkg/controllers/routing/network_routes_controller.go#L536
can't handle the case that the node has two independent networkings. It only checks the whether the target ip is in same subnet as of node ip (obtained via k8s node object). In our case, the node ip is set to internal ip. So the default gateway, which is an external ip, fail to be recognized, even if it has a L2 connection with node.

I think we should have a separate PR to handle that? As I can imagine there would be some specific discussion regarding how to handle it.

@yydzhou yydzhou force-pushed the tunnel-override-nexthop branch from 4eec4f2 to 5893bc1 Compare January 27, 2021 17:29
@yydzhou yydzhou marked this pull request as ready for review January 27, 2021 21:09
@yydzhou
Copy link
Contributor Author

yydzhou commented Jan 27, 2021

@murali-reddy I think that the concern that you had in #518 is still valid. When it creates the tunnel interface it is going to connect it to what kube-router thinks of in terms of nodeIP, however, the traffic might be ingressing on a different interface because of --override-nexthop. This could potentially lead to some traffic that cannot route.

I can see how some users who understand their network well, may desire this functionality and I think that kube-router should be flexible enough to provide it. However, I'm thinking that this PR should also include some sort of documentation warning users about combining these options and the potential risks involved.

What are your thoughts?

@aauren could you please point out what's documentation contents is needed for this PR? I am happy to enhance the documentation but I am not sure if I can have an accurate description about the potential issue.

@yydzhou yydzhou force-pushed the tunnel-override-nexthop branch from 13fce0b to 5893bc1 Compare January 29, 2021 23:56
@aauren
Copy link
Collaborator

aauren commented Feb 1, 2021

@aauren could you please point out what's documentation contents is needed for this PR? I am happy to enhance the documentation but I am not sure if I can have an accurate description about the potential issue.

Sorry its taken me a while to respond. The logic here is complex and the possible ways that people might combine options can be nuanced.

I think that at a bare minimum we need to warn people that are doing all of the following together:

The warning here is that when they use --override-nexthop in the above scenario, it may cause kube-router to advertise an IP address other than the node IP which is what kube-router connects the tunnel to when the --enable-overlay option is given. If this happens it may cause some network flows to become un-routable.

Specifically, people need to take care when combining --override-nexthop and --enable-overlay and make sure that they understand their network, the flows they desire, how the kube-router logic works, and the possible side-effects that are created from their configuration.

Unfortunately, at this point, the number of resulting combinations and flows and possibilities is pretty vast, and probably the best thing to do would be to warn them and then add the code references to the documentation, specifically that injectRoute() function that you modify in this PR. Unfortunately, at this point, I don't have enough time to go digging through the code and identifying the possible consequences of this action, if you're willing to put in the leg work and describe how the flows would be affected or the possible combinations that would be great.

So anyway, some sort of warning about that added to the BGP doc: https://github.com/cloudnativelabs/kube-router/blob/master/docs/bgp.md

Additionally, adding your use-case to some sort of BGP use-case documentation that we could add to, would be helpful for the project and for others.

@yydzhou yydzhou force-pushed the tunnel-override-nexthop branch from 6837ed6 to 78b90b9 Compare February 3, 2021 21:07
@yydzhou yydzhou force-pushed the tunnel-override-nexthop branch from 78b90b9 to def5302 Compare February 3, 2021 21:08
@yydzhou
Copy link
Contributor Author

yydzhou commented Feb 3, 2021

@aauren could you please point out what's documentation contents is needed for this PR? I am happy to enhance the documentation but I am not sure if I can have an accurate description about the potential issue.

Sorry its taken me a while to respond. The logic here is complex and the possible ways that people might combine options can be nuanced.

I think that at a bare minimum we need to warn people that are doing all of the following together:

The warning here is that when they use --override-nexthop in the above scenario, it may cause kube-router to advertise an IP address other than the node IP which is what kube-router connects the tunnel to when the --enable-overlay option is given. If this happens it may cause some network flows to become un-routable.

Specifically, people need to take care when combining --override-nexthop and --enable-overlay and make sure that they understand their network, the flows they desire, how the kube-router logic works, and the possible side-effects that are created from their configuration.

Unfortunately, at this point, the number of resulting combinations and flows and possibilities is pretty vast, and probably the best thing to do would be to warn them and then add the code references to the documentation, specifically that injectRoute() function that you modify in this PR. Unfortunately, at this point, I don't have enough time to go digging through the code and identifying the possible consequences of this action, if you're willing to put in the leg work and describe how the flows would be affected or the possible combinations that would be great.

So anyway, some sort of warning about that added to the BGP doc: https://github.com/cloudnativelabs/kube-router/blob/master/docs/bgp.md

Additionally, adding your use-case to some sort of BGP use-case documentation that we could add to, would be helpful for the project and for others.

@aauren The documentation is added, mostly copied your comments with a little editing. I didn't find a seperate doc to add use case so I just put the example use case within the description.

docs/bgp.md Outdated
Comment on lines 163 to 165
In a scenario there are multiple groups of nodes in different subnets and user wants to peer mulitple upstream routers for each of their node (.e.g., a cluster has seperate public and private networking, and the nodes in this cluster are located into two different racks, which has their own routers. So there would be two upstream routers for each node, and there are two different subnets in this case), The override-nexthop and tunnel cross subnet features need to be used together to achive the goal.

to support the above case, user need to set `--enable-overlay` and `--override-nexthop` to true together. This configuration would have the following effect.
Copy link
Collaborator

@aauren aauren Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In a scenario there are multiple groups of nodes in different subnets and user wants to peer mulitple upstream routers for each of their node (.e.g., a cluster has seperate public and private networking, and the nodes in this cluster are located into two different racks, which has their own routers. So there would be two upstream routers for each node, and there are two different subnets in this case), The override-nexthop and tunnel cross subnet features need to be used together to achive the goal.
to support the above case, user need to set `--enable-overlay` and `--override-nexthop` to true together. This configuration would have the following effect.
A common scenario exists where each node in the cluster is connected to two upstream routers that are in two different subnets. For example, one router is connected to a public network subnet and the other router is connected to a private network subnet. Additionally, nodes may be split across different subnets (e.g. different racks) each of which has their own routers.
In this scenario, `--override-nexthop` can be used to correctly peer with each upstream router, ensuring that the BGP next-hop attribute is correctly set to the node's IP address that faces the upstream router. The `--enable-overlay` option can be set to allow overlay/underlay tunneling across the different subnets to achieve an interconnected pod network.
This configuration would have the following effects:

I took a second pass at the wording to hopefully make it a little more concise. However, before changing anything I would recommend getting feedback from @murali-reddy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aauren Thank you very much for the feedback. I have integrated your comments in new commit.

@murali-reddy Would you please take a look and let me know if there is anything else I need to add?

@murali-reddy
Copy link
Member

-	if (!sameSubnet || nrc.overlayType == "full") && !nrc.overrideNextHop && nrc.enableOverlays {
+	if (!sameSubnet || nrc.overlayType == "full") && nrc.enableOverlays {

This is the only code change. Which any way make sense, since tunnels should be established only with peer kubernetes nodes and not with external peers, check for nrc.overrideNextHop is not needed.

I think we should have a separate PR to handle that? As I can imagine there would be some specific discussion regarding how to handle it.

Yes. We still need to check tunnels are established or rather overlay network is confined to kubernetes nodes only. Will open a seperate tracking issue. Thanks for the PR @yydzhou

LGTM

@murali-reddy murali-reddy merged commit 49b9add into cloudnativelabs:master Feb 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants