HA Egress for AWS EKS - Include functionality to assign Egress IPs as secondary IP addresses to the appropriate node instances #5210

Closed
lukasmrtvy opened this issue Jul 5, 2023 · 8 comments
Labels
area/transit/egress Issues or PRs related to Egress (SNAT for traffic egressing the cluster). kind/feature Categorizes issue or PR as related to a new feature. reported-by/end-user Issues reported by end users.

Comments

@lukasmrtvy

lukasmrtvy commented Jul 5, 2023

Describe the problem/challenge you have

HA Egress does not work for AWS EKS

Describe the solution you'd like

Include functionality to assign Egress IPs as secondary IP addresses to the appropriate node instances

Anything else you would like to add?

@lukasmrtvy lukasmrtvy added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 5, 2023
@antoninbas
Contributor

@lukasmrtvy based on discussions we had on Slack, it seems that you are not really asking for Egress HA support across multiple AZs (see #4385), but are first asking for Egress (with ExternalIPPool) to work out of the box in a single AZ?

cc @tnqn

@lukasmrtvy
Author

lukasmrtvy commented Jul 5, 2023

Yes, by HA I mean Egress across multiple Nodes rather than a single one, and yes, within a single AZ.

@antoninbas antoninbas added the area/transit/egress Issues or PRs related to Egress (SNAT for traffic egressing the cluster). label Jul 5, 2023
@robbo10

robbo10 commented Sep 5, 2023

@antoninbas - Are there any plans to have an Egress assigned to more than one node for H/A purposes?

The current limitation is that each time we rotate workers, or a node crashes, there is an intermittent blip while the controller reassigns the Egress to a healthy node.

@antoninbas
Contributor

@robbo10 sorry for the delay, I just came back from vacation today

I think a "blip" is unavoidable. Even if we had an "active-active" implementation, in case of an Egress Node failure, we would still need to fail over connections to the remaining Node. Additionally, it's not clear how we would handle return traffic, if the Egress IP (which is the destination IP for return traffic) is "assigned" to multiple Egress Nodes.

While some earlier Antrea versions (most notably v1.11.0 and v1.12.0) had a bug causing longer-than-normal failover times for Egress IPs, the bug was patched in Antrea v1.13, v1.12.1 and v1.11.3. Look for the following in the release notes: "Ensure the Egress IP is always correctly advertised to the network, including when the userspace ARP responder is not running or when the Egress IP is temporarily claimed by multiple Nodes." In the absence of the bug, my experience is that failover is very fast (under 1s).

Adding @tnqn in case he has further comments.
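One way to put a number on the "blip" discussed above is to probe an external endpoint from a Pod that uses the Egress and compute the longest interruption. The sketch below is only an illustration: the probe format (a list of `(timestamp_seconds, succeeded)` tuples) and the function name `longest_outage` are assumptions, not part of Antrea, and collecting the probes is left to whatever tooling you already use.

```python
from typing import List, Tuple


def longest_outage(probes: List[Tuple[float, bool]]) -> float:
    """Return the longest gap, in seconds, between two successful probes
    that has at least one failed probe in between.

    `probes` must be sorted by timestamp. Failures seen before the first
    success are ignored, since there is no earlier success to measure from.
    """
    longest = 0.0
    last_ok = None            # timestamp of the most recent successful probe
    failed_since_ok = False   # whether a failure occurred after last_ok
    for timestamp, succeeded in probes:
        if succeeded:
            if last_ok is not None and failed_since_ok:
                longest = max(longest, timestamp - last_ok)
            last_ok = timestamp
            failed_since_ok = False
        else:
            failed_since_ok = True
    return longest
```

With probes sent every 250 ms, for example, a result under 1.0 would be consistent with the sub-second failover described above. The measurement resolution is bounded by the probe interval.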

@robbo10

robbo10 commented Sep 19, 2023

@antoninbas - no problem at all, this makes sense. I had a conversation with @tnqn, who confirmed the same.

We were able to verify that the 1.11.3 release greatly improves Egress failover performance :)

Thanks for all the work in releasing this.

@antoninbas
Contributor

@tnqn has created a documentation PR to describe this limitation

@tnqn
Member

tnqn commented Dec 7, 2023

> @tnqn has created a documentation PR to describe this limitation

https://github.com/antrea-io/antrea/blob/main/docs/egress.md#egress-on-cloud describes how Egress works on cloud platforms today and what is missing to make it work on AWS.
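For readers wondering what the requested step would look like outside Antrea, here is a minimal sketch of assigning an Egress IP as a secondary private IP address on a Node's network interface. It is not Antrea code and not a description of any planned implementation. It assumes a boto3-style EC2 client (`boto3.client("ec2")`), whose `assign_private_ip_addresses` call accepts `NetworkInterfaceId`, `PrivateIpAddresses` and `AllowReassignment`. The helper names and the ENI ID and IP address in the usage note are hypothetical.

```python
from typing import Any, Dict


def build_assign_request(eni_id: str, egress_ip: str) -> Dict[str, Any]:
    """Build keyword arguments for the EC2 AssignPrivateIpAddresses call."""
    return {
        "NetworkInterfaceId": eni_id,
        "PrivateIpAddresses": [egress_ip],
        # Allow the address to be taken over from the interface of the
        # previous Egress Node, which is what a failover needs.
        "AllowReassignment": True,
    }


def move_egress_ip(ec2_client: Any, eni_id: str, egress_ip: str) -> Dict[str, Any]:
    """Assign egress_ip as a secondary private IP on the given interface.

    `ec2_client` is assumed to expose `assign_private_ip_addresses`,
    as a boto3 EC2 client does.
    """
    request = build_assign_request(eni_id, egress_ip)
    ec2_client.assign_private_ip_addresses(**request)
    return request
```

A call such as `move_egress_ip(boto3.client("ec2"), "eni-0123456789abcdef0", "10.0.1.50")` would run whenever the Egress moves to a new Node. A secondary private IP must belong to the subnet of the interface, and an AWS subnet lives in a single AZ, which is why this approach only covers the single-AZ case discussed earlier in the thread.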

@tnqn tnqn added the reported-by/end-user Issues reported by end users. label Dec 11, 2023
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 11, 2024
@luolanzone luolanzone removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 11, 2024
5 participants