Improve Egress API visibility #4614

Closed
3 tasks done
tnqn opened this issue Feb 8, 2023 · 5 comments · Fixed by #5765
Labels
area/transit/egress Issues or PRs related to Egress (SNAT for traffic egressing the cluster). priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@tnqn
Member

tnqn commented Feb 8, 2023

> > > Do you know if we already publish an error condition in the Egress Status if we fail to select a Node for the Egress (e.g., because the ExternalIPPool doesn't select any Node or, with this change, because we hit the limit)?

> > We don't publish an error condition in the status yet. Currently, an empty Egress Node field means that no Node is eligible for the Egress IP. But an error condition would be a good user-experience improvement; let me see if we should do it in another PR.

> > The challenge is who should be responsible for reporting the error. If no Node is selected by the pool, how do the agents agree on which one of them should do it? Any good ideas?

> If no Node is selected by the pool, I guess it should still be possible to identify one Node responsible for updating the Status: we can still select one Node consistently using memberlist. But it adds complexity, as all Nodes now need to watch all external IP pools and all Nodes in order to check for external IP pools that do not select any Node.

> An alternative is to handle this case in the Antrea Controller.

> I am not sure either of these is a good idea... although being able to let the user know about these issues is certainly valuable.

Originally posted by @antoninbas in #4593 (comment)
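
The "select one Node consistently" idea from the quoted discussion can be illustrated with a small sketch. It uses rendezvous (highest-random-weight) hashing over a snapshot of cluster members, so every agent that sees the same member set picks the same reporter without extra coordination. This is an illustration only, not Antrea's actual selection code: `score`, `selectReporter`, and the Node and pool names are hypothetical, and the sketch assumes all agents share the same view of the alive members (which memberlist only provides eventually).

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// score hashes a (key, node) pair. Hypothetical helper for this sketch.
func score(key, node string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	h.Write([]byte(node))
	return h.Sum64()
}

// selectReporter returns the member with the highest score for the key,
// breaking ties by name. The result does not depend on the order of members.
// It returns "" when there are no members.
func selectReporter(key string, members []string) string {
	best := ""
	var bestScore uint64
	for i, m := range members {
		s := score(key, m)
		if i == 0 || s > bestScore || (s == bestScore && m < best) {
			best, bestScore = m, s
		}
	}
	return best
}

func main() {
	// Assumption: these names come from the agent's membership snapshot.
	members := []string{"kind-worker", "kind-worker2", "kind-control-plane"}
	fmt.Println("reporter for pool-a:", selectReporter("pool-a", members))
}
```

As the quoted comment notes, the selection itself is the easy part; the cost is that every agent would have to watch all pools and all Nodes to know when a pool selects no Node.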


The lifecycle of an Egress object (with ExternalIPPool set) is as follows:

  1. The user creates an Egress object.
  2. antrea-controller allocates an Egress IP from the ExternalIPPool if one is not already set.
  3. The antrea-agents elect an Egress Node, which assigns the Egress IP to its own network interface.

There are two cases in which the system fails to realize the Egress:

  • In step 2, if the ExternalIPPool has no available IP, antrea-controller does nothing, so users get no indication of what the problem is. To improve this, antrea-controller should set a condition (e.g. IPAllocated) in the Egress status to indicate the situation.
  • In step 3, if no Node is available to be elected as the Egress Node, antrea-agent does nothing either. To improve this, one of the antrea-agents should set a condition (e.g. IPAssigned) in the Egress status to indicate the situation.

With these two conditions, most failure cases would be covered, and users should be able to understand why an Egress is not working and how to fix it.

  • In addition, an Egress may be migrated from one Node to another for various reasons. We could generate a K8s event like the ones below to record each migration.
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  IPAssigned 8s    antrea-agent       Assigned default/egress-dev to kind-worker2
  Normal  IPAssigned 50s   antrea-agent       Assigned default/egress-dev to kind-worker
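
The events above could be produced by logic along the following lines. This is a rough sketch using only the standard library: `Event`, `migrationRecorder`, and `observe` are hypothetical names, and a real agent would hand the event to the Kubernetes event machinery rather than print it. The point is simply that an event is emitted only when the assigned Node actually changes.

```go
package main

import "fmt"

// Event is a hypothetical, simplified stand-in for a K8s event.
type Event struct {
	Type, Reason, From, Message string
}

// migrationRecorder remembers the last known Node of each Egress.
type migrationRecorder struct {
	lastNode map[string]string
}

func newMigrationRecorder() *migrationRecorder {
	return &migrationRecorder{lastNode: map[string]string{}}
}

// observe returns an event when the Egress is assigned to a different Node
// than last time. An empty node clears the remembered assignment and
// produces no event.
func (r *migrationRecorder) observe(egress, node string) (Event, bool) {
	if node == "" {
		delete(r.lastNode, egress)
		return Event{}, false
	}
	if r.lastNode[egress] == node {
		return Event{}, false
	}
	r.lastNode[egress] = node
	return Event{
		Type:    "Normal",
		Reason:  "IPAssigned",
		From:    "antrea-agent",
		Message: fmt.Sprintf("Assigned %s to %s", egress, node),
	}, true
}

func main() {
	r := newMigrationRecorder()
	for _, node := range []string{"kind-worker", "kind-worker", "kind-worker2"} {
		if ev, ok := r.observe("default/egress-dev", node); ok {
			fmt.Printf("%s  %s  %s  %s\n", ev.Type, ev.Reason, ev.From, ev.Message)
		}
	}
}
```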
@tnqn added the priority/important-longterm and area/transit/egress labels Feb 8, 2023
@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 10, 2023
@tnqn tnqn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 10, 2023
@tnqn tnqn added this to the Antrea v1.13 release milestone May 10, 2023
@tnqn tnqn changed the title from "Publish an error condition in the Egress Status if we fail to select an Node" to "Improve visibility" Jun 26, 2023
@tnqn tnqn changed the title from "Improve visibility" to "Improve Egress API visibility" Jun 26, 2023
@luolanzone
Contributor

This issue will be taken by a new contributor. Once the ID is ready, I will update the assignee.

@luolanzone
Contributor

@AJPL88

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 13, 2023
@tnqn tnqn removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 13, 2023
@luolanzone
Contributor

Hi @tnqn, I moved this issue to v1.15, since #5282 will only resolve part of it.


3 participants