On Azure Ubuntu VM, upon antrea-agent restart the agent cannot reach the API server #5221
Comments
It was observed previously that the agent fails to connect to kube-apiserver because DNS does not work on the OVS internal interface after a restart. That is because the netplan configuration only enables the DHCP client on the interface with driver type "hv_netvsc" and MAC address "60:45:bd:04:30:0e", but the OVS internal interface does not match the configured driver type.

Today I got a new observation: VM connectivity may be lost after an agent restart. Via the web console, I found that packets entering the VM from the uplink are not forwarded to the internal interface correctly, even after we manually added OpenFlow entries. The setup runs the OVS userspace processes inside containers, and a restart of the antrea-agent service also restarts the antrea-ovs container, so the flows added in the previous round are lost; instead, OVS userspace installs a default normal flow. The strange finding is that no packets hit the OpenFlow entries. Personally, I suspect this connectivity-loss observation is related to running the OVS processes inside containers.

A workaround I have in mind is to restore the original configuration when the agent is stopping: remove the uplink from OVS, delete the internal interface, rename the uplink back, and configure the IP/routes back onto the uplink interface (see the sketch below). In this way, everything is recovered while the agent is not running, and the agent starts from a fresh environment when it comes up again. This is similar to what the agent does in the container scenario with FlexibleIPAM enabled. @Anandkumar26 @reachjainrahul @tnqn @antoninbas Any ideas about this?
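For illustration, a minimal shell sketch of that restore sequence, assuming the usual ExternalNode layout where the uplink is renamed and an OVS internal port takes over its name and IP; the bridge name, interface names, and addresses below are placeholders, not values taken from this setup:

```bash
#!/usr/bin/env bash
# Hypothetical cleanup sketch: restore host networking when antrea-agent stops.
# All names and addresses below are illustrative placeholders.
BRIDGE="br-int"          # OVS bridge used by antrea-agent
INTERNAL_IF="eth0"       # OVS internal port that took over the host IP
UPLINK="eth0~"           # renamed physical uplink attached to OVS
ORIG_NAME="eth0"         # original uplink name to restore
IP_CIDR="10.0.0.4/24"    # host IP to move back to the uplink
GW="10.0.0.1"            # default gateway

# Detach the uplink and delete the internal port from the OVS bridge.
ovs-vsctl --if-exists del-port "$BRIDGE" "$UPLINK"
ovs-vsctl --if-exists del-port "$BRIDGE" "$INTERNAL_IF"

# Rename the uplink back to its original name and bring it up.
ip link set "$UPLINK" down
ip link set "$UPLINK" name "$ORIG_NAME"
ip link set "$ORIG_NAME" up

# Move the IP address and the default route back onto the uplink.
ip addr replace "$IP_CIDR" dev "$ORIG_NAME"
ip route replace default via "$GW" dev "$ORIG_NAME"
```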
Another tested solution is to modify the netplan file for the candidate OVS interface under its configuration path. Then we need to apply netplan and reload the configuration with networkctl. After this, a new netplan-generated rules file is added. Then we can start the antrea-agent service on this VM. In this way, networkctl manages the OVS internal interface created by antrea-agent. In my test, this resolves the issue.
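For illustration, one way such a netplan change could look, assuming the networkd renderer; the file name, stanza key, driver match, and dhcp4 setting are assumptions rather than the exact configuration used here (the MAC is the VM value mentioned above):

```bash
# Hypothetical sketch only: add a netplan stanza that also matches the OVS
# internal interface (driver "openvswitch") by MAC, then re-apply and reload.
cat <<'EOF' | sudo tee /etc/netplan/90-antrea-ovs.yaml
network:
  version: 2
  ethernets:
    ovs-eth0:
      match:
        driver: "openvswitch"
        macaddress: "60:45:bd:04:30:0e"
      dhcp4: true
EOF

sudo netplan apply        # regenerate the systemd-networkd config and udev rules
sudo networkctl reload    # ask systemd-networkd to pick up the new configuration
```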
It sounds like the issue is caused specifically by the antrea-agent restarting?
But we can't assume that the agent will always exit gracefully; it could be killed and not get a chance to do clean-up. It sounds like the two solutions are not mutually exclusive: we could implement agent cleanup while also providing netplan configuration(s) for specific cloud providers / distributions.
I think the root cause is that the OVS internal interface created by antrea-agent is not managed by networkctl (because of the driver-type limitation on Azure). Although IP/routes are migrated statically, the DNS configuration (managed by networkctl) is lost. Since antrea-agent is already connected to the apiserver/antrea-controller at runtime, this does not block the running agent, but after a restart the agent fails to resolve the domain name without the DNS configuration. My intent in restoring the configuration after the agent stops is to ensure DNS works when the agent is restarted. So the best solution is to make networkctl able to manage the interface created by antrea-agent.
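As a quick way to confirm this on the VM (assuming systemd-networkd and systemd-resolved; the interface name is illustrative):

```bash
# Check whether systemd-networkd manages the OVS internal interface and
# whether any per-link DNS servers are configured for it.
networkctl status eth0      # "unmanaged" means networkd provides no DHCP/DNS here
resolvectl status eth0      # shows the DNS servers currently set on this link
```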
Update on the part about providing a rule to match the openvswitch driver: we don't need to modify the existing netplan configurations; instead, we can provide a udev rule and force networkctl to reload it, so that networkd can identify the eth0 created by antrea-agent (a rough sketch of the idea follows after this comment).
We can provide such a template for Ubuntu when running on Azure, and the bootstrap script can replace the MAC address in the template with the real value from the VM, copy it to the correct path, and then force networkctl to reload the configuration. In this way, the OVS-type interface can be identified and managed by networkd after antrea-agent renames the uplink and moves it to OVS.
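A rough sketch of what that bootstrap step could look like, under the assumption that the rule matches the openvswitch driver and the VM's MAC; the template path, the __MAC__ token, the rule content, and the uplink name are all illustrative assumptions:

```bash
# Hypothetical bootstrap sketch: fill the VM's MAC into a rule template that
# matches the antrea-created OVS interface, install it, and reload networkd.
UPLINK="eth0"
MAC="$(cat /sys/class/net/${UPLINK}/address)"

# Example template (one line) shipped with the bootstrap bundle:
#   SUBSYSTEM=="net", ACTION=="add|change|move", ENV{ID_NET_DRIVER}=="openvswitch", ATTR{address}=="__MAC__", NAME="eth0"
sed "s/__MAC__/${MAC}/" 90-antrea-ovs.rules.tmpl | sudo tee /etc/udev/rules.d/90-antrea-ovs.rules

sudo udevadm control --reload                        # reload udev rules
sudo udevadm trigger --subsystem-match=net --action=add
sudo networkctl reload                               # have networkd re-evaluate the links
```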
@antoninbas About the cleanup, I think you mean a separate script or binary outside of antrea-agent. The latest VM agent solution supports running OVS userspace inside containers, so there is a risk that the OVS commands cannot be reached by the cleanup module if the OVS processes are not running. Would that introduce failures or leave unexpected legacy configurations on the VMs when the cleanup script runs?
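To make that concern concrete, a standalone cleanup script would probably need a guard along these lines (hypothetical; the bridge and port names are placeholders), falling back to plain iproute2 restore when OVS is unreachable:

```bash
# Hypothetical guard for a standalone cleanup script: only issue OVS commands
# when the (possibly containerized) OVS userspace is actually reachable.
if command -v ovs-vsctl >/dev/null 2>&1 && ovs-vsctl --timeout=5 show >/dev/null 2>&1; then
    ovs-vsctl --if-exists del-port br-int eth0~
    ovs-vsctl --if-exists del-port br-int eth0
else
    echo "OVS is not reachable; skipping OVS cleanup, only restoring the uplink" >&2
fi
# The iproute2 restore steps (rename uplink, re-add IP/routes) can still run here.
```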
@wenyingd I was just replying to your earlier comment and suggestion:
Describe the bug
Upon restarting antrea-agent on the Ubuntu VM, antrea-agent is unable to receive any events for the ExternalNode.
To Reproduce
Expected
Actual behavior
Versions:
Antrea version v1.11.0
Additional context
Antrea-agent logs before restart
Antrea-agent logs after restart
OVS configuration after restart
Netplan configuration file