
Proposal: enhance cluster networking capabilities. #637

Merged
merged 1 commit into from
Dec 22, 2021

Conversation

DrmagicE
Member

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespace from that line:
/kind bug
/kind documentation
/kind enhancement
/kind good-first-issue
/kind feature
/kind question
/kind design
/sig ai
/sig iot
/sig network
/sig storage

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


other Note

@openyurt-bot
Collaborator

@DrmagicE: GitHub didn't allow me to assign the following users: your_reviewer.

Note that only openyurtio members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this: [the PR description quoted above]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openyurt-bot added the kind/feature label Nov 29, 2021
@openyurt-bot added the size/L (100-499) label Nov 29, 2021
@DrmagicE changed the title from "add cluster networking enhancement proposal" to "Proposal: enhance cluster networking capabilities." Nov 29, 2021
@Congrool
Member

Congrool commented Nov 30, 2021

Hi, it's really an exciting feature, but I still have some questions here.

  1. How do nodepools get their podCIDR? Is it the responsibility of this network solution, or does it depend on other components, such as yurt-app-manager?
  2. We know that flannel also allocates a podCIDR for each node. How do we ensure that the nodepool podCIDR contains the podCIDRs of all its member nodes? (It may be one of the CNI compatibility problems that we will encounter later.)
  3. If we select some nodes to form a new nodepool and want to join it into the network, do the original pods need to restart to get new podIPs according to the nodepool podCIDR? Or, on the contrary, is it the original podIPs that determine the nodepool podCIDR? (In the latter case, we can also solve problem 2 if there is no conflict.)
  4. Host network subnet conflict is a really difficult problem to solve, because we cannot determine which nodepool to send the packet to. I think, in such a scenario, the network cannot be absolutely transparent to the application. Maybe we can support it in the application layer and let the application itself make the determination.

@DrmagicE
Member Author

@Congrool Hi, thanks for your feedback.

  1. How do nodepools get their podCIDR? Is it the responsibility of this network solution, or does it depend on other components, such as yurt-app-manager?
  2. We know that flannel also allocates a podCIDR for each node. How do we ensure that the nodepool podCIDR contains the podCIDRs of all its member nodes? (It may be one of the CNI compatibility problems that we will encounter later.)

podCIDR is a field of the node resource. We can track all podCIDRs belonging to a nodepool via the list/watch mechanism.

[root@master-1 /]# kubectl get nodes master-1 -oyaml | grep podCIDR
        f:podCIDR: {}
        f:podCIDRs:
  podCIDR: 10.244.0.0/24
  podCIDRs:

But we should notice that not all CNIs respect podCIDR. If we use flannel, we will be fine, because flannel respects podCIDR.
However, some CNIs, like calico, do not respect podCIDR. We would have to figure out another way to get the podCIDR of a node for such CNIs. In other words, how to get the podCIDR of a node may vary from CNI to CNI.

I suggest we start with the flannel CNI, which is simple and widely used in OpenYurt.

  3. If we select some nodes to form a new nodepool and want to join it into the network, do the original pods need to restart to get new podIPs according to the nodepool podCIDR? Or, on the contrary, is it the original podIPs that determine the nodepool podCIDR? (In the latter case, we can also solve problem 2 if there is no conflict.)

No, they don't need a restart. The solution introduced in this proposal will not change the podCIDR on the node and will not affect the IP allocation of pods; IP allocation is still managed by the CNI.

  4. Host network subnet conflict is a really difficult problem to solve, because we cannot determine which nodepool to send the packet to. I think, in such a scenario, the network cannot be absolutely transparent to the application. Maybe we can support it in the application layer and let the application itself make the determination.

Yes, I am still trying to figure out a way to solve that problem. It is an inevitable issue if we want this solution to replace YurtTunnel.

@rambohe-ch
Member

@DrmagicE @Congrool podCIDR is allocated to every node by the rangeAllocator in kube-controller-manager, and a NodePool contains the names of the nodes that reside in it, so YurtGateway can calculate all podCIDRs by list/watching NodePools and nodes, and does not need to consider whether the CNI solution is flannel or calico.
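
For example (a minimal sketch; it assumes member nodes carry a nodepool label such as apps.openyurt.io/nodepool, which is an assumption here rather than part of this proposal), the podCIDRs of a pool could be collected with:

# List the podCIDR of every node in nodepool "poolA" (label key and pool name are illustrative).
kubectl get nodes -l apps.openyurt.io/nodepool=poolA \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'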

@DrmagicE
Member Author

@rambohe-ch Thanks for your reply.
I am not very experienced with calico, but I found some information here:
projectcalico/calico#2592 (comment)
This comment indicates that calico's IPAM plugin doesn't respect the value given in Node.Spec.PodCIDR, which means a pod IP allocated by calico's IPAM may not belong to Node.Spec.PodCIDR. In such a case, Node.Spec.PodCIDR does not cover all pod IPs of that node, and some of the pod IPs will not be included in the VPN tunnel, thus losing connectivity.
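
A quick way to check how a particular CNI behaves (a sketch; <node-name> is a placeholder) is to compare the pod IPs on a node with that node's podCIDR:

# Show a node's podCIDR and the IPs of the pods scheduled on it.
kubectl get node <node-name> -o jsonpath='{.spec.podCIDR}'
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>
# If some pod IPs fall outside the podCIDR, that CNI's IPAM does not respect Node.Spec.PodCIDR.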

@rambohe-ch
Member

@DrmagicE Thanks for your feedback. If calico's IPAM does not respect node.Spec.PodCIDR, maybe we need to consider another solution instead of PodCIDR. At first, we can start with flannel.

A new component that is responsible for configuring route tables, route policies and other network-related configurations on nodes.
YurtRouter is a daemonset and is deployed in all nodepools that participate in the VPN tunnel.

### Network Reachability Requirement
Member

This requirement is pretty challenging on the edge side. Have you considered running a service on the cloud to help establish the tunnel between different nodepools?

Member

@yixingjia edge yurt-gateways can connect with each other through the cloud yurt-gateway, and the detailed info will be added after the next community meeting discussion.

Member

Ok, then we can consider renaming the yurt-gateway on the cloud to something like yurt-cloud-gateway, and calling the gateway on the edge yurt-edge-gateway. SDN has similar implementations for these kinds of requirements.

Member

@yixingjia we will discuss the component names in another proposal.

Thus, YurtGateway is not aware of a failover on the other side, and when a failover occurs, the VPN tunnel is broken.

To fix that:
1. YurtGateway should be able to detect the VPN status. Once it detects a failover on the other side, it will try to connect to the backup on the other side.
Member

@vincent-pli Dec 2, 2021

How does the YurtGateway learn the new active gateway in the target nodepool when a failover occurs there? I remember we have leader election to handle the SPOF, but how do the others know exactly who won the election?

Member Author

@vincent-pli Hi, sorry for the late reply. Based on discussions at our latest community meeting, we may not support H/A in our first release. We need to think more carefully about how to achieve H/A, especially under node autonomy circumstances.

Welcome to get involved and share your ideas.

@adamzhoul
Member

hi @DrmagicE

can you add some details about how to configure vxlan to redirect traffic to the gateway node, and more importantly, how the vxlan device routes packets to IPsec on the gateway node?

I'm a little confused.

thanks

@DrmagicE
Member Author


Hi, "redirect vxlan traffic to gateway node" is based on IP packet forwarding. We can configure the IP route table of non-gateway nodes via the ip r add command. I think from the perspective of the Linux routing table, there is not much difference between routing vxlan packets and normal IP packets.

For vxlan mode, the routing rules are as same as host-gw mode. Here is an example shown in the proposal:

$ ip rule
0:	from all lookup local
# Set up route policy for pod CIDR of nodepoolB to use cross_edge_table.
32764:	from all to 10.244.2.0/24 lookup cross_edge_table
32765:	from all to 10.244.3.0/24 lookup cross_edge_table
32766:	from all lookup main
32767:	from all lookup default
$ ip r list table cross_edge_table
# 10.0.20.13 is the private IP of the local gateway node.
# We need to set a smaller MTU for IPsec traffic, the concrete number is yet to be determined.
default via 10.0.20.13 dev eth0 mtu 1400
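
For reference, a minimal sketch of how such entries might be created on a non-gateway node (the table number, addresses and MTU are illustrative values taken from the example above, not a confirmed part of the design):

# Register the custom routing table (the number 100 is arbitrary).
echo "100 cross_edge_table" >> /etc/iproute2/rt_tables
# Route policy: traffic to nodepoolB's pod CIDRs looks up cross_edge_table.
ip rule add to 10.244.2.0/24 lookup cross_edge_table
ip rule add to 10.244.3.0/24 lookup cross_edge_table
# Default route in that table points at the local gateway node, with a smaller MTU for IPsec overhead.
ip route add default via 10.0.20.13 dev eth0 mtu 1400 table cross_edge_table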

@rambohe-ch
Member

@DrmagicE I will merge this pull request, and the detailed design, such as the API, will be discussed in the raven repo (https://github.com/openyurtio/raven).

@DrmagicE
Member Author

@rambohe-ch Ok.

@rambohe-ch
Member

/lgtm
/approve

@adamzhoul
Member


thanks for the reply @DrmagicE, and sorry for the late response.

I tried to direct traffic from nodeA -> nodeB using a vxlan device myself, and encountered problems.

  1. On both nodes, add a vxlan device:
ip link add vxlan type vxlan \
    id 1 \
    dstport 4789 \
    local ${localMachineIp} \
    nolearning \
    dev eth0
  2. Add a test IP on nodeA and configure it to go through the vxlan device:
# manually add a route to test traffic
ip route add ${githubIp} dev vxlan

bridge fdb append ${vxlanMacNodeB} dev vxlan dst ${nodeBIP}
ip neigh add ${githubIp} lladdr ${vxlanMacNodeB} dev vxlan
  3. ping $githubIp on nodeA.

Running tcpdump -i vxlan on nodeB captures the ICMP requests but no ICMP replies, and nothing is captured on eth0.

In short, simply adding the vxlan device and the route table entries is not enough to redirect the traffic.

What I did to make it work (sketched after this list):

  1. give both vxlan devices an IP, so that packets can then be captured on eth0
  2. add iptables -t nat -A POSTROUTING -s {vxlanIP}/16 -j MASQUERADE to do SNAT when traffic goes out through eth0
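
A minimal sketch of those extra steps (the vxlan subnet 172.16.0.0/16 and the interface names are placeholders chosen for illustration, not values from the proposal):

# On nodeA (nodeB would get e.g. 172.16.0.2/16).
ip addr add 172.16.0.1/16 dev vxlan
ip link set vxlan up
# Make sure the kernel forwards packets between interfaces.
sysctl -w net.ipv4.ip_forward=1
# SNAT traffic from the vxlan subnet when it leaves via eth0.
iptables -t nat -A POSTROUTING -s 172.16.0.0/16 -o eth0 -j MASQUERADE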

Maybe I am making things too complicated.
If I misunderstood your point, or if you have a better solution, please let me know, thanks.

Oh, by the way, I can't simply use the route table to redirect traffic from nodeA -> nodeB, because in a cloud environment ARP is answered by the cloud gateway device, so a packet that leaves nodeA and arrives at that gateway will never reach nodeB (the dst IP is a pod IP, not nodeB's IP). This is why we have to rely on vxlan.

@rambohe-ch
Member

@adamzhoul we can discuss the details of raven at https://github.com/openyurtio/raven, and I will merge this pull request first.

@rambohe-ch
Member

/lgtm
/approve

@openyurt-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DrmagicE, rambohe-ch

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openyurt-bot added the approved label Dec 22, 2021
@openyurt-bot merged commit d37ce4f into openyurtio:master Dec 22, 2021
MrGirl pushed a commit to MrGirl/openyurt that referenced this pull request Mar 29, 2022
Co-authored-by: zhanglifang@chinatelecom.cn <zhanglifang@chinatelecom.cn>