Use reserved OVS controller ports for default Antrea ports #6202

Contributor

@antoninbas antoninbas commented Apr 9, 2024

Use reserved OVS controller ports for default Antrea ports

This applies to the tunnel port, the gateway port and the uplink port.
These reserved port numbers (starting from 32,768) will never be
auto-assigned by OVS. By using them, we can avoid surprising ofport
changes in some edge cases.
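
As a rough illustration of the reserved range, the sketch below derives three ofport values from `FirstControllerOVSPort` (the constant name and value come from this PR). The names `tunnelOFPort`, `gatewayOFPort`, `uplinkOFPort` and `inControllerRange`, as well as the ordering of the three ports, are assumptions made for the example and may not match the merged code.

```go
package main

import "fmt"

// FirstControllerOVSPort is the first ofport that OVS will not auto-assign.
// The name and value are taken from this PR.
const FirstControllerOVSPort uint32 = 32768

// Hypothetical names and ordering, for illustration only.
const (
	tunnelOFPort  = FirstControllerOVSPort + iota // 32768
	gatewayOFPort                                 // 32769
	uplinkOFPort                                  // 32770
)

// inControllerRange reports whether ofport is at or above the first
// controller-reserved value. Simplification: it ignores the OpenFlow
// reserved ports at the very top of the number space.
func inControllerRange(ofport uint32) bool {
	return ofport >= FirstControllerOVSPort
}

func main() {
	fmt.Println(tunnelOFPort, gatewayOFPort, uplinkOFPort)
	// 3 is the legacy uplink ofport mentioned later in this thread.
	fmt.Println(inControllerRange(3), inControllerRange(uplinkOFPort))
}
```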

Note that if these ports already exist when the Agent starts, we will
not re-create them or mutate their ofport. The new ofport values will
only be used when creating the OVS ports for the first time (e.g., on a
new K8s Node, or after a Node restart). That is the simplest solution
and is consistent with existing behavior. It should also be sufficient
for avoiding #6192, which was typically triggered by creating a new
tunnel port when updating the Antrea configuration in an existing
cluster.
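
The "keep what exists, request the new value only on creation" behavior described above can be sketched as follows. This is a minimal model under stated assumptions, not the Agent code: `ovsPort`, `ensurePort`, the port names and the legacy ofport value `1` are all hypothetical.

```go
package main

import "fmt"

// ovsPort is a hypothetical, simplified view of a port stored in OVSDB.
type ovsPort struct {
	Name   string
	OFPort uint32
}

// ensurePort returns the ofport to use for the named port. An existing port
// is left untouched; only a missing port is created with the requested ofport.
func ensurePort(existing map[string]ovsPort, name string, requested uint32) (uint32, bool) {
	if p, ok := existing[name]; ok {
		return p.OFPort, false // keep the legacy ofport, do not mutate it
	}
	existing[name] = ovsPort{Name: name, OFPort: requested}
	return requested, true
}

func main() {
	// A Node upgraded in place: the tunnel port already exists with an
	// example legacy ofport.
	ports := map[string]ovsPort{"tun0": {Name: "tun0", OFPort: 1}}

	ofport, created := ensurePort(ports, "tun0", 32768)
	fmt.Println("tun0:", ofport, created)

	// The gateway port is missing, so it is created with the reserved value.
	ofport, created = ensurePort(ports, "gw0", 32769)
	fmt.Println("gw0:", ofport, created)
}
```

The caller then has to use the returned ofport, rather than the compile-time default, when installing flows.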

After this change, we will also assume that br-int ports stored in OVSDB
always have the "antrea-type" external ID. This external ID became
required in Antrea v1.5, so this should be an acceptable change,
especially considering that this change is destined for the Antrea v2
release, which is a major version bump. Users will not be able to
upgrade from Antrea v1.5 to Antrea v2.0, but this is not a
supported upgrade path anyway (for other reasons).
We also add some logic to prevent creating a port on the OVS bridge if a
required external ID is missing.

Fixes #6192

@antoninbas antoninbas requested review from wenyingd and tnqn April 9, 2024 02:15
// that once a port number has been assigned, it will stay the same throughout the lifetime
// of the port, but when using ofport_request it is not guaranteed.
// See https://github.com/antrea-io/antrea/issues/6192 for more details.
FirstControllerOVSPort = 32768
Contributor Author

It feels like this one should be defined somewhere else, in pkg/ovs. But then, the same can be said of BridgeOFPort and AutoAssignedOFPort. I can move the 3 of them. What do you think @tnqn?

Member

Yes, moving the non-business-specific ofports to it makes sense to me.

UplinkNetConfig: &config.AdapterNetConfig{
MAC: fakeUplinkMAC,
OFPort: uint32(4),
Contributor Author

I assume this was a typo. Before this change, the uplink port was using 3, but these unit tests were using 4, which is confusing.

Member

@tnqn tnqn left a comment

cc @wenyingd to check impact on Windows

Contributor

@wenyingd wenyingd left a comment

If we use these non-conflicting port numbers when requesting the ofport, do we still need the logic that allocates a port first and then uses it in these functions? e.g., https://github.com/antrea-io/antrea/blob/main/pkg/agent/agent_linux.go#L117

BTW, do we need to define a new reserved port for the host interface, which is now used in AntreaIPAM and may be used on Windows in the future to replace the bridge interface?

Another question concerns the Windows upgrade case, in which OVS runs as a native service. If Antrea has already configured the OVS ports using the original values, then after an upgrade the ofport numbers in memory change, and the OpenFlow entries are expected to use the new port numbers. But this PR has no logic to reset the ofport of existing OVS ports, so the OVS ports and the OpenFlow entries may conflict.

@antoninbas
Contributor Author

antoninbas commented Apr 9, 2024

@wenyingd these are good points. I need to improve the code so that we handle all ports like the tunnel port: if it already exists, keep using the existing port with no changes. If the port needs to be created, use the new ofport_request values.

> BTW, do we need to define a new reserved port for the host interface, which is now used in AntreaIPAM and may be used on Windows in the future to replace the bridge interface?

Do you mean HostInterfaceOFPort? We use the UplinkOFPort (3) reserved value for this.
Is there another one I am missing?

I see what you mean now. But can you remind me why this is not BridgeOFPort (OFPP_LOCAL / 0xfffffffe)?

@antoninbas antoninbas force-pushed the use-reserved-controller-ports-for-default-antrea-ports branch 2 times, most recently from cbe67e1 to b35d85d Compare April 9, 2024 23:35
Comment on lines -346 to -354
} else {
// Antrea Interface type is not saved in OVS port external_ids in earlier Antrea versions, so we use
// the old way to decide the interface type for the upgrade case.
uplinkIfName := i.nodeConfig.UplinkNetConfig.Name
var antreaIFType string
switch {
case port.OFPort == config.HostGatewayOFPort:
intf = parseGatewayInterfaceFunc(port, ovsPort)
antreaIFType = interfacestore.AntreaGateway
Contributor Author

With this change and the v2 release, I think it makes sense to drop support for this legacy scenario. The interface type externalID was introduced by @wenyingd starting with Antrea v1.5 (#3027).

With the change to the default value of HostGatewayOFPort, this switch case (port.OFPort == config.HostGatewayOFPort) could no longer be supported.
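
A sketch of what relying solely on the external ID could look like is shown below. It is an illustration under stated assumptions: `interfaceType` and `errMissingType` are hypothetical names, and `"gateway"` is an example value that may not be the exact string Antrea stores.

```go
package main

import (
	"errors"
	"fmt"
)

// antreaTypeKey is the external ID key named in this PR.
const antreaTypeKey = "antrea-type"

// errMissingType is a hypothetical sentinel error for this example.
var errMissingType = errors.New("port has no antrea-type external ID")

// interfaceType returns the interface type recorded in a port's external IDs.
// There is no fallback based on the ofport value: a port without the
// external ID is reported as an error.
func interfaceType(externalIDs map[string]string) (string, error) {
	t, ok := externalIDs[antreaTypeKey]
	if !ok || t == "" {
		return "", errMissingType
	}
	return t, nil
}

func main() {
	t, err := interfaceType(map[string]string{antreaTypeKey: "gateway"})
	fmt.Println(t, err)

	// A port created by a pre-v1.5 Agent would land here.
	_, err = interfaceType(map[string]string{})
	fmt.Println(err)
}
```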

cc @tnqn

Member

Sounds good to me

Comment on lines +159 to +162
// WithRequiredPortExternalIDs will ensure that whenever we create a port, the required
// external ID (interface type) is provided. This is a sanity check to ensure code
// correctness.
ovsBridgeClient := ovsconfig.NewOVSBridge(o.config.OVSBridge, ovsDatapathType, ovsdbConnection, ovsconfig.WithRequiredPortExternalIDs(interfacestore.AntreaInterfaceTypeKey))
Contributor Author

Bringing reviewers' attention to this. I added WithRequiredPortExternalIDs to guarantee that moving forward we will indeed always have "antrea-type" as an ExternalID for all ports. Right now we don't have any guarantee about this in the code, so we need to rely on code reviews. This could make it easier to catch errors in e2e tests.
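
For readers unfamiliar with the pattern, here is a minimal sketch of how a functional option can enforce required external IDs at port creation time. The `bridge` type, `newBridge`, `createPort` and the lowercase `withRequiredPortExternalIDs` are hypothetical stand-ins, not the `ovsconfig` API. The error text is modeled on the agent log quoted later in this thread, and `"uplink"` is an example value.

```go
package main

import "fmt"

// bridge is a hypothetical stand-in for the OVS bridge client.
type bridge struct {
	requiredExternalIDs []string
	ports               map[string]map[string]string
}

// option configures a bridge at construction time (functional option pattern).
type option func(*bridge)

// withRequiredPortExternalIDs mirrors the idea of WithRequiredPortExternalIDs.
func withRequiredPortExternalIDs(keys ...string) option {
	return func(b *bridge) {
		b.requiredExternalIDs = append(b.requiredExternalIDs, keys...)
	}
}

func newBridge(opts ...option) *bridge {
	b := &bridge{ports: map[string]map[string]string{}}
	for _, o := range opts {
		o(b)
	}
	return b
}

// createPort refuses to create a port that lacks a required external ID.
func (b *bridge) createPort(name string, externalIDs map[string]string) error {
	for _, k := range b.requiredExternalIDs {
		if _, ok := externalIDs[k]; !ok {
			return fmt.Errorf("missing required externalID '%s' for port '%s'", k, name)
		}
	}
	b.ports[name] = externalIDs
	return nil
}

func main() {
	b := newBridge(withRequiredPortExternalIDs("antrea-type"))
	fmt.Println(b.createPort("Ethernet 3", nil))
	fmt.Println(b.createPort("Ethernet 3", map[string]string{"antrea-type": "uplink"}))
}
```

Putting the check in the client, rather than at each call site, means a forgotten external ID fails loudly the first time the code path runs.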

@antoninbas antoninbas added the action/release-note Indicates a PR that should be included in release notes. label Apr 9, 2024
tnqn previously approved these changes Apr 10, 2024
Member

@tnqn tnqn left a comment

LGTM. Maybe add a note to the PR description and commit message that the new ofports will only be used when creating ports for the first time; an existing cluster should be expected to keep the legacy ofports.

@antoninbas
Contributor Author

> LGTM. Maybe add a note to the PR description and commit message that the new ofports will only be used when creating ports for the first time; an existing cluster should be expected to keep the legacy ofports.

Thanks @tnqn. I will squash the commits and update the message after I get a review from @wenyingd. I want to make sure there is nothing I have missed. In the meantime, I will run some Jenkins e2e tests.

@antoninbas
Contributor Author

/test-all
/test-flexible-ipam-e2e

@wenyingd
Contributor

> @wenyingd these are good points. I need to improve the code so that we handle all ports like the tunnel port: if it already exists, keep using the existing port with no changes. If the port needs to be created, use the new ofport_request values.

> BTW, do we need to define a new reserved port for the host interface, which is now used in AntreaIPAM and may be used on Windows in the future to replace the bridge interface?

> I see what you mean now. But can you remind me why this is not BridgeOFPort (OFPP_LOCAL / 0xfffffffe)?

With the AntreaIPAM and VM Agent scenarios, we now create a different OVS port to take over the role of the uplink, so the latest implementation renames the uplink to ${originalName}~ and names the new host interface ${originalName}. In this way, 1) the NIC configured with the Node's IP looks the same as it did before running Antrea, which reduces the impact on other processes running on the Node/VM that may depend on the NIC's name; 2) we keep the OVS bridge as "br-int", so the new host interface is a different port, and we can't use the reserved LOCAL port (0xfffffffe).

This mode is also planned for the Windows container case; @XinShuYang is working on a patch for it.
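
The renaming scheme described above can be sketched as follows. This is a small illustration, not Antrea's implementation: `takeOverNames` and `ofppLocal` are hypothetical names, and `eth0` is an example NIC name.

```go
package main

import "fmt"

// ofppLocal is the OpenFlow reserved LOCAL port, which refers to the bridge's
// own internal port (br-int in this thread).
const ofppLocal uint32 = 0xfffffffe

// takeOverNames returns the names used once the host interface takes over the
// uplink's identity: the physical uplink gets a "~" suffix and the new host
// interface reuses the original name.
func takeOverNames(original string) (uplink, hostIf string) {
	return original + "~", original
}

func main() {
	uplink, hostIf := takeOverNames("eth0")
	fmt.Printf("uplink=%s hostInterface=%s\n", uplink, hostIf)

	// Because the bridge keeps the name br-int, the host interface is a
	// separate port and needs its own ofport instead of OFPP_LOCAL.
	fmt.Printf("OFPP_LOCAL=%#x\n", ofppLocal)
}
```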

  i.nodeConfig.UplinkNetConfig.OFPort = uint32(ofport)
- i.nodeConfig.HostInterfaceOFPort = config.BridgeOFPort
+ i.nodeConfig.HostInterfaceOFPort = ovsconfig.BridgeOFPort
Contributor

@wenyingd wenyingd Apr 11, 2024

Ditto: it would be better to compare the existing ofport with the desired static value for the uplink port. I would also prefer to add this check for the uplink on Linux, but that is not a strong preference.

@wenyingd
Contributor

/test-vm-e2e

@antoninbas antoninbas force-pushed the use-reserved-controller-ports-for-default-antrea-ports branch 2 times, most recently from 4edd12c to be0ebab Compare April 11, 2024 17:29
@antoninbas
Contributor Author

After #6216 is merged, I will rebase, squash commits, and run tests again.

@antoninbas
Contributor Author

/test-flexible-ipam-e2e

intf.InterfaceName != i.nodeConfig.DefaultTunName {
klog.Infof("The discovered default tunnel interface name %s is different from the default value: %s",
intf.InterfaceName, i.nodeConfig.DefaultTunName)
if intf != nil && intf.InterfaceName != i.nodeConfig.DefaultTunName {
Contributor

This is incorrect because we also call this function on the IPsec tunnel port; comparing only the interface name may affect the IPsec case.

Contributor Author

I see, the problem is that the function was using the port number as the condition for that, which is misleading and even potentially invalid. I will update that to use the interface type instead.
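
One way to key the check on the interface type rather than on the port number is sketched below. The types and the function `defaultTunnelNameMismatch` are hypothetical and only illustrate the idea; the real types live in Antrea's `interfacestore` package, and the interface names used in `main` are example values.

```go
package main

import "fmt"

// interfaceType is a hypothetical enum for this example.
type interfaceType int

const (
	tunnelInterface interfaceType = iota
	ipsecTunnelInterface
)

type interfaceConfig struct {
	Name string
	Type interfaceType
}

// defaultTunnelNameMismatch reports whether intf is the default tunnel
// interface but carries a name other than the configured default. IPsec
// tunnel ports go through the same code path and are skipped, because they
// are not expected to match the default tunnel name.
func defaultTunnelNameMismatch(intf *interfaceConfig, defaultName string) bool {
	return intf != nil && intf.Type == tunnelInterface && intf.Name != defaultName
}

func main() {
	def := "antrea-tun0" // example default tunnel name
	fmt.Println(defaultTunnelNameMismatch(&interfaceConfig{Name: "tun-old", Type: tunnelInterface}, def))
	fmt.Println(defaultTunnelNameMismatch(&interfaceConfig{Name: "node1-abc", Type: ipsecTunnelInterface}, def))
	fmt.Println(defaultTunnelNameMismatch(nil, def))
}
```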

Contributor Author

Thanks for pointing it out, I fixed the logic

@antoninbas antoninbas force-pushed the use-reserved-controller-ports-for-default-antrea-ports branch from be0ebab to fdfc3d3 Compare April 12, 2024 02:36
wenyingd previously approved these changes Apr 12, 2024
Contributor

@wenyingd wenyingd left a comment

LGTM

@antoninbas antoninbas force-pushed the use-reserved-controller-ports-for-default-antrea-ports branch from fdfc3d3 to 2ffad56 Compare April 15, 2024 20:16
@antoninbas
Contributor Author

/test-all
/test-flexible-ipam-e2e
/test-vm-e2e
/test-windows-all

@antoninbas antoninbas force-pushed the use-reserved-controller-ports-for-default-antrea-ports branch from 2ffad56 to 0733d0c Compare April 15, 2024 20:32
@antoninbas
Contributor Author

/test-all
/test-flexible-ipam-e2e
/test-vm-e2e
/test-windows-all

@antoninbas
Contributor Author

/test-windows-all

@antoninbas antoninbas added this to the Antrea v2.0 release milestone Apr 17, 2024
tnqn previously approved these changes Apr 18, 2024
Member

@tnqn tnqn left a comment

LGTM

@XinShuYang
Contributor

/test-windows-all

@XinShuYang
Contributor

XinShuYang commented Apr 18, 2024

Windows CI failed due to a Get-NetIPInterface failure; it seems br-int was not found after the Antrea Agent started.
Error log:

===== Check Interface DHCP Status =====
Get-NetIPInterface : No matching MSFT_NetIPInterface objects found by CIM query for instances of the 
ROOT/StandardCimv2/MSFT_NetIPInterface class on the  CIM server: SELECT * FROM MSFT_NetIPInterface  WHERE 
((InterfaceAlias LIKE 'br-int')) AND ((AddressFamily = 2)). Verify query parameters and retry.
At line:1 char:2
+ (Get-NetIPInterface -InterfaceAlias br-int -AddressFamily IPv4).Dhcp
+  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (MSFT_NetIPInterface:String) [Get-NetIPInterface], CimJobException
    + FullyQualifiedErrorId : CmdletizationQuery_NotFound,Get-NetIPInterface
 
Newly created uplink DHCP status is different from the original adapter. Original DHCP: Enabled
, br-int DHCP: 

@antoninbas
Contributor Author

Thanks @XinShuYang, I was able to reproduce locally:

PS C:\var\log\antrea> cat '.\antrea-agent.exe.EC2AMAZ-EHTEEQ1.WORKGROUP_EC2AMAZ-EHTEEQ1$.log.INFO.20240418-232214.3356'
Log file created at: 2024/04/18 23:22:14
Running on machine: EC2AMAZ-EHTEEQ1
Binary: Built with gc go1.21.5 for windows/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0418 23:22:14.296934    3356 log_file.go:93] Set log file max size to 104857600
W0418 23:22:14.354355    3356 options_windows.go:68] AntreaProxy is not enabled. NetworkPolicies might not be enforced correctly for Service traffic!
I0418 23:22:14.356588    3356 agent.go:105] "Starting Antrea agent" version="v2.0.0-dev-b46ba51"
I0418 23:22:14.359131    3356 client.go:89] No kubeconfig file was specified. Falling back to in-cluster config
I0418 23:22:14.421688    3356 prometheus.go:189] Initializing prometheus metrics
I0418 23:22:14.432940    3356 ovs_client.go:70] Connecting to OVSDB at address \\.\pipe\C:openvswitchvarrunopenvswitchdb.sock
I0418 23:22:15.443925    3356 ovs_client.go:89] Not connected yet, will try again in 2s
I0418 23:22:17.447747    3356 ovs_client.go:89] Not connected yet, will try again in 4s
I0418 23:22:21.459490    3356 discoverer.go:82] Starting ServiceCIDRDiscoverer
I0418 23:22:21.470700    3356 agent.go:377] Setting up node network
I0418 23:22:21.586204    3356 agent.go:991] "Got Interface MTU" MTU=1450
I0418 23:22:27.391941    3356 net_windows.go:456] "Creating HNSNetwork" name="antrea-hnsnetwork" subnet="20.0.2.0/24" nodeIP="10.0.0.206/24" adapter={"Index":2,"MTU":1500,"Name":"Ethernet 3","HardwareAddr":"BrhWpt87","Flags":51}
I0418 23:22:40.530606    3356 net_windows.go:605] Enabled Receive Segment Coalescing (RSC) for vSwitch antrea-hnsnetwork
I0418 23:22:40.530606    3356 net_windows.go:544] "Created HNSNetwork" name="antrea-hnsnetwork" id="F5951EC0-156C-423A-8E4F-124CD6E441F1"
I0418 23:22:40.533079    3356 ovs_client.go:137] Created bridge: 533bd55c-2283-4078-a96c-a235ab0b6c28
E0418 23:22:40.645905    3356 agent_windows.go:260] Failed to add uplink port Ethernet 3: missing required externalID 'antrea-type' for port 'Ethernet 3'
F0418 23:22:44.717311    3356 main.go:53] Error running agent: error initializing agent: missing required externalID 'antrea-type' for port 'Ethernet 3'

I will push a new commit to address this.

Do you know if we upload the agent logs as artifacts for the Windows CI jobs? I couldn't find them.

Tunnel port, gateway port and uplink port.
These reserved port numbers (starting from 32,768) will never be
auto-assigned by OVS. By using them, we can avoid surprising ofport
changes in some edge cases.

Note that if these ports already exist when the Agent starts, we will
not re-create them or mutate their ofport. The new ofport values will
only be used when creating the OVS ports for the first time (e.g., on a
new K8s Node, or after a Node restart). That is the simplest solution
and is consistent with existing behavior. It should also be sufficient
for avoiding antrea-io#6192, which was typically triggered by creating a new
tunnel port when updating the Antrea configuration in an existing
cluster.

After this change, we will also assume that br-int ports stored in OVSDB
always have the "antrea-type" external ID. This external ID became
required in Antrea v1.5, so this should be an acceptable change,
especially considering that this change is destined for the Antrea v2
release, which is a major version bump. Users will not be able to
upgrade from Antrea v1.5 to Antrea v2.0, but this is not a
supported upgrade path anyway (for other reasons).
We also add some logic to prevent creating a port on the OVS bridge if a
required external ID is missing.

Fixes antrea-io#6192

Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
@antoninbas antoninbas force-pushed the use-reserved-controller-ports-for-default-antrea-ports branch from 0733d0c to c583b81 Compare April 19, 2024 00:46
@antoninbas
Contributor Author

/test-windows-all

@antoninbas
Contributor Author

@wenyingd @tnqn please take a look at the new commit. There was a test failure on Windows because the uplink port was being created without the antrea-type external ID

@antoninbas
Contributor Author

/test-all

@tnqn
Member

tnqn commented Apr 19, 2024

> @wenyingd @tnqn please take a look at the new commit. There was a test failure on Windows because the uplink port was being created without the antrea-type external ID

Good to see the new check catches it.

Member

@tnqn tnqn left a comment

LGTM

@antoninbas
Contributor Author

/test-flexible-ipam-e2e

@antoninbas
Contributor Author

TestAntreaIPAMAntreaPolicy/TestGroupNoK8sNP/Case=ACNPEgressDropSCTP is failing for test-flexible-ipam-e2e. It is unrelated to this PR. The TestAntreaIPAMAntreaPolicy/TestGroupNoK8sNP test group has been flaky, and this should be tracked by #6237.

@antoninbas antoninbas merged commit 3344bc7 into antrea-io:main Apr 22, 2024
53 of 60 checks passed
@antoninbas antoninbas deleted the use-reserved-controller-ports-for-default-antrea-ports branch April 22, 2024 19:51
@antoninbas
Contributor Author

This is a pretty significant change and it will not be back-ported for several reasons.
Historically, production users have not run into #6192.

@antoninbas antoninbas mentioned this pull request Apr 24, 2024
Labels
action/release-note Indicates a PR that should be included in release notes.
Development

Successfully merging this pull request may close these issues.

Tunnel ofport mismatch
4 participants