EgressSeparateSubnet feature doesn't work on OCP v4.15 #6546

Open
wenyingd opened this issue Jul 25, 2024 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@wenyingd
Contributor

Describe the bug

Hi,

I deployed an OCP 4.15 testbed and enabled the "EgressSeparateSubnet" feature. After I created the Egress IPPool and Egress CRs, Egress traffic did not work. Packet captures showed that the request reaches the Egress Node successfully but does not enter the OVS pipeline from the tunnel port.

After checking the sysctl configuration, we found that rp_filter is "1" on the NIC antrea-ext.$vlan_id instead of the expected value "2". The antrea-agent log contains no related errors, so the logic that sets this value appears to have run successfully.
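For anyone reproducing the check above, here is a minimal sketch of reading the per-interface value straight from procfs. The helper name `read_rp_filter` and the `proc_root` parameter are hypothetical (they are not part of Antrea), and the interface name antrea-ext.10 is only the example used later in this thread.

```python
from pathlib import Path


def read_rp_filter(interface: str, proc_root: str = "/proc/sys") -> int:
    """Return the rp_filter value configured for one interface.

    Hypothetical helper for illustration; not part of Antrea.
    """
    # The interface name is a single path component here, so a dot in the
    # name (e.g. "antrea-ext.10") needs no special handling.
    path = Path(proc_root) / "net" / "ipv4" / "conf" / interface / "rp_filter"
    return int(path.read_text().strip())


if __name__ == "__main__":
    # Assumes a Linux node where this interface exists.
    print(read_rp_filter("antrea-ext.10"))
```

On the Egress Node, a result of 1 for the antrea-ext.$vlan_id interface would match the behavior described in this report.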

To Reproduce

Expected

Actual behavior

Versions:

  • Antrea version: v2.0+

Additional context

@wenyingd wenyingd added the kind/bug Categorizes issue or PR as related to a bug. label Jul 25, 2024
@luolanzone
Contributor

The EgressSeparateSubnet feature was added in v1.15.0, so the affected Antrea versions start from v1.15.0.

@luolanzone luolanzone self-assigned this Aug 5, 2024
@luolanzone
Contributor

I did some troubleshooting in an OCP 4.16 environment and found that the value set by Antrea appears to be reset by the OpenShift cluster Node Tuning Operator. According to the official OCP docs https://docs.openshift.com/container-platform/4.13/nodes/containers/nodes-containers-sysctls.html#namespaced-and-node-level-sysctls and https://docs.openshift.com/container-platform/4.16/scalability_and_performance/using-node-tuning-operator.html#advanced-node-tuning-hosted-cluster_node-tuning-operator, users can update node-level sysctls via the Node Tuning Operator, but unfortunately it doesn't work well when the interface name includes a dot (e.g. antrea-ext.10). I have created an issue in the operator repo, openshift/cluster-node-tuning-operator#1128, to track this problem.
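To illustrate why a dot in the interface name is awkward for sysctl keys in general, here is a small sketch of the separator convention described in the sysctl.d(5) man page: if the first separator in a key is a slash, dots are kept as-is; if it is a dot, dots and slashes are swapped. The function name `sysctl_key_to_path` is hypothetical, and this only models that documented convention. Whether the Node Tuning Operator follows the same rule is not confirmed anywhere in this thread.

```python
def sysctl_key_to_path(key: str, proc_root: str = "/proc/sys") -> str:
    """Map a sysctl key to its procfs path using the sysctl.d(5) convention.

    Hypothetical helper for illustration only.
    """
    first_dot = key.find(".")
    first_slash = key.find("/")
    slash_first = first_slash != -1 and (first_dot == -1 or first_slash < first_dot)
    if slash_first:
        # Slash-separated key: dots inside components are left intact.
        rel = key
    else:
        # Dot-separated key: dots become slashes and slashes become dots.
        rel = key.translate(str.maketrans("./", "/."))
    return f"{proc_root}/{rel}"


# A naive dotted key splits the interface name into two path components.
print(sysctl_key_to_path("net.ipv4.conf.antrea-ext.10.rp_filter"))
# Both of these forms keep "antrea-ext.10" together as one component.
print(sysctl_key_to_path("net/ipv4/conf/antrea-ext.10/rp_filter"))
print(sysctl_key_to_path("net.ipv4.conf.antrea-ext/10.rp_filter"))
```

Under this convention the plain dotted key resolves to a path under conf/antrea-ext/10/, which does not exist, so any tool that accepts only the dotted form has no unambiguous way to address such an interface.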

For now, I think we could add a known-issue section for EgressSeparateSubnet on OCP until the issue is fixed.
We can also provide a manual workaround for users who want this feature on OCP. @tnqn @wenyingd what are your thoughts?

@tnqn
Member

tnqn commented Aug 7, 2024

@luolanzone could it work if you use the operator to set /all/rp_filter to 2?

@luolanzone
Contributor

luolanzone commented Aug 7, 2024

Yes, I tried it and all.rp_filter can be updated. I guess we can offer this as an alternative solution and let users change the default rp_filter to 2 in OCP? But I think it needs to be done before Antrea is installed. When antrea-ext.10 already exists, its value won't be affected by the new all.rp_filter. I can verify whether there is a way to update the default one to 2.

@tnqn
Member

tnqn commented Aug 7, 2024

But I think it needs to be done before Antrea is installed. When antrea-ext.10 already exists, its value won't be affected by the new all.rp_filter.

It doesn't need to. See https://sysctl-explorer.net/net/ipv4/rp_filter/

The max value from conf/{all,interface}/rp_filter is used when doing source validation on the {interface}.

I think we can document the workaround.
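The rule quoted above can be sketched in a couple of lines. The function name `effective_rp_filter` is hypothetical; it only restates the "max of conf/all and conf/{interface}" rule and is not kernel or Antrea code.

```python
def effective_rp_filter(all_value: int, interface_value: int) -> int:
    """Return the rp_filter mode used for source validation on an interface.

    Hypothetical helper: models the rule that the larger of the "all" value
    and the per-interface value is the one that takes effect.
    """
    return max(all_value, interface_value)


# The per-interface value stays at 1 (as reported in this issue), but
# setting all/rp_filter to 2 still yields loose mode on that interface.
print(effective_rp_filter(all_value=2, interface_value=1))
```

This is why setting all/rp_filter to 2 takes effect even when antrea-ext.10 already exists with its own value of 1.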

@luolanzone
Contributor

Ah, got it. I will check and document a workaround for this. Thanks for the info.

@luolanzone
Contributor

A document with a workaround has been merged: #6622
We can check again later, once the bug is fixed on the OCP operator side.

Contributor

github-actions bot commented Dec 4, 2024

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days. You can add a label "lifecycle/frozen" to skip stale checking.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 4, 2024
@luolanzone luolanzone added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 5, 2024