Community Meetings
Agenda, Minutes & Recordings for Antrea Community Meetings
- CI improvement: running multiple concurrent CI jobs using Kind clusters, in the same VM.
- Required support for custom build tags to avoid conflicts between concurrent jobs.
- Some e2e tests use a static "external" IP, to test features such as Egress & BGP. This may create issues when trying to run concurrent jobs on the same machine. We could parametrize the external IP(s), in order to avoid such issues.
- We talked about the possibility of using Prow (Docker-in-Docker) instead in the past. Maybe this should still be investigated?
- We opened multiple issues for improvements to the PacketCapture feature; most of them have no assignee and can be picked up by new contributors.
- We may use some of them for the next LFX mentorship term.
- There will be no meeting in 2 weeks (last week of December).
Antrea Community Meeting 12/16/2024
- Hemant Kumar (@hkiiita) presented the work he did as part of the LFX mentorship program
- Hemant contributed 2 significant improvements to the AntreaPolicy FQDN support:
- Maintain a TTL value for individual response IPs in the FQDN cache, instead of only one TTL value per FQDN, in order to make the implementation more correct - #6732
- Add a new `fqdnCacheMinTTL` configuration parameter to force the Agent to cache resolved IPs for a minimum duration, in order to account for applications which cache them for longer than the response TTL - #6808 (see the configuration sketch below)
- Check out the meeting recording for a detailed presentation and a demo of `fqdnCacheMinTTL`.
- Currently we have no upper bound on the value of `fqdnCacheMinTTL`, and no upper bound on the FQDN cache size.
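For reference, a minimal sketch of how this parameter is set in the antrea-agent configuration (antrea-agent.conf, embedded in the antrea-config ConfigMap); the value is illustrative:

```yaml
# antrea-agent.conf excerpt: keep resolved IPs in the FQDN cache for at least
# 600 seconds, even if the DNS response TTL is shorter (value is an example).
fqdnCacheMinTTL: 600
```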
Antrea Community Meeting 12/02/2024
- We welcome Lan (@luolanzone) to the maintainers team!
- Presentation and demo of the PacketCapture feature introduced in Antrea v2.2
- See slides
- At the moment we only capture packets in one direction (source to destination) and we only support one enforcement point (at the source Pod if present, else at the destination Pod). These are things we want to address in the near future.
- By capturing at the source and destination Pod (and possibly at the gateway or "uplink" as well), we can support more advanced troubleshooting. For example, we could quickly identify checksum issues.
- Capturing traffic at multiple capture points will probably require creating multiple PacketCapture CRs (one for each capture point), which can be orchestrated by antctl.
- antctl support is also something we should add in the near future; antctl can retrieve the pcap file(s) automatically and copy them locally.
- The "1 of N Sampling" method may not be super useful for troubleshooting, as one usually wants to see the full history of a conversation, and not all packets are equal in importance (e.g., TCP handshake is what we usually look for).
Antrea Community Meeting 11/18/2024
- Proposal for a new "proxy" mode for the FlowAggregator
- See issue #6773.
- Goal is to simplify / improve integration with advanced external IPFIX collectors, which are already capable of doing flow aggregation / correlation of records exported by different networking devices in the network path.
- In proxy mode, FlowAggregator is stateless, which greatly reduces memory usage.
- Both modes of operation are somewhat "standard"; there are (unofficial?) terms to describe each of them.
- The plan is to support both modes moving forward, as they serve different use cases; the code differences between the two are small.
- In proxy mode, the FlowAggregator could generate more data (up to 2x); unlikely to be a problem, but we could evaluate the increase in bandwidth usage using some realistic deployment (cluster size, connection rate).
- Resolving the invalid module name for `antrea.io/antrea`
- See issue #6774
- Importing version v2 of `antrea.io/antrea` is currently broken: Go requires the module name to end with `/v2`, so it should be `antrea.io/antrea/v2`.
- Typically, projects which import `antrea.io/antrea` only care about Antrea APIs (so `antrea.io/antrea/pkg/apis`, along with `antrea.io/antrea/pkg/client` for the auto-generated code). However, we "force" them to import all of `antrea.io/antrea`, which has a lot of dependencies.
- Two proposed approaches to solve the reported issue and facilitate importing Antrea API definitions: 1) move API definitions (and generated code) to a separate repository, 2) leverage Go workspace support to have multiple modules in the main Antrea repository, which can be imported independently.
- There seems to be a small preference for the second approach.
- We need a better understanding of the impact on import paths within the Antrea codebase and for projects which import `antrea.io/antrea`. AI (@antoninbas).
- Update on the upcoming v2.2 release.
- The release is late. We have a few open PRs that are still going through review.
- Release is currently planned for end of the week (11/08).
- The release notes have gone through review and are ready. See #6766.
Antrea Community Meeting 11/04/2024
- There is an issue with the current Zoom meeting link in the README / calendar; we will update it to the correct link.
- Presentation / demo of new `antctl` commands for BGP
- Commands to query effective policy, peers, routes
- Commands must be executed from the antrea-agent Pod
- The `antctl get bgproutes` command returns a list of routes, but does not identify the "origin" of the routes (Pod / Service / Egress). We will add this as a follow-up.
- `antctl get bgppolicy` (which retrieves the effective BGPPolicy applied to a Node) currently returns an error (non-zero exit code and error message) when there is no applied BGPPolicy
- this behavior may be surprising to some automation tools; it is hard to distinguish between a request that fails and the absence of an effective BGPPolicy
- the API returns a 404 (NotFound) if there is no effective BGPPolicy
- it was implemented this way because the API request is more similar to a GET ("get the effective policy") than a LIST ("get matching policies")
- we could keep the current API behavior but improve the antctl implementation?
- Migrating Antrea ARM builds from private repository to antrea-io/antrea
- See slides
- See PR #6486
- Thanks to some arm64 runners available through the CNCF GitHub Enterprise account, we can start relying on self-hosted arm64 runners, and we can migrate the workflows which build Antrea ARM images and create the Antrea multi-platform manifests to the public antrea-io/antrea repository.
- The CNCF provides other compute resources to projects; can we make use of them?
- As a sandbox project, we probably have to be "reasonable" in our asks and expectations
- We already use some other resources; for example, we have an AWS account where we host files and run some CI tests
Antrea Community Meeting 10/21/2024
- Overview of Antrea v2.2 release, planned for end of October
- Invalid implementation of `enableLogging` for L7 NetworkPolicies - see #6636
- Current proposal is to always include a packet dump in alert events which are logged by the default reject rules. This can be achieved with the "global" Suricata configuration (so it will not be per policy rule); see the suricata.yaml sketch below. We will evaluate the impact on performance (requests per second that can be handled by the engine), and if there is a noticeable impact, we will introduce an antrea-agent configuration parameter to control this instead of enabling it in Suricata unconditionally.
- We can still keep `enableLogging` for L7 NetworkPolicies: the (L4) connections are logged to `/var/log/antrea/networkpolicy/np.log`, with the `Redirect` action (no application-layer metadata).
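For background, the kind of "global" Suricata setting discussed looks like this (a suricata.yaml sketch; Antrea's actual integration may configure it differently):

```yaml
# suricata.yaml excerpt: attach the triggering packet to every alert event in
# the eve-log output, independently of any per-rule (per-policy) setting.
outputs:
  - eve-log:
      enabled: yes
      filetype: regular
      filename: eve.json
      types:
        - alert:
            packet: yes     # include the base64-encoded packet in each alert
```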
Antrea Community Meeting 09/09/2024
- Results of "Antrea longevity test"
- Antrea is deployed to a cluster with "realistic" workloads, and various metrics are monitored over a long period of time
- We would like to run this test prior to every release, but it is unclear how realistic that is (time-consuming)
Antrea Community Meeting 08/12/2024
- Review of open feature requests from end users
- #6544: Add `--random-fully` to SNAT iptables rules
- Consensus is that this should be done and can help with some issues; unsure what is the best way to expose it in the configuration (one global flag vs. one flag per feature vs. one flag for Node SNAT + Egress CRD annotation / field)
- K8s itself uses `--random-fully` by default for SNAT rules
- #6567: Support BGP confederation
- Useful to avoid having to build a full mesh of iBGP routers
- Complements route reflectors?
- Passthrough configuration to gobgp; should be straightforward to implement (but maybe harder to test)
- Ability to load-balance egress traffic for a given workload between multiple Egress IPs (ideally assigned to different Nodes)
- Issue not created yet
- Could be related to #5252
- Multiple Egress IPs per Egress resource would be better than multiple Egresses (each with one Egress IP) applied to the same workload
- How would we handle traffic rebalancing? We should avoid breaking existing connections.
- #6562: Windows installation script doesn't handle non-English system locales
- We should handle this case, but we have yet to find a good way of doing it
- We depend on `pnputil.exe /enum-drivers` (no suitable alternative), for which the output is not structured and depends on the system locale
- Fair to assume that the problem doesn't exist when the script is run from the antrea-agent-windows container?
- No reliable way to change the system locale temporarily for the PowerShell session?
Antrea Community Meeting 07/29/2024
- Antrea v2.1 release update
- Release delayed by one week because some BGP PRs are still undergoing review and a NodePortLocal bug has to be addressed.
Antrea Community Meeting 07/15/2024
- Egress support for Antrea networkPolicyOnly mode.
- The feature request is for networkPolicyOnly mode, but the proposal could apply to other modes such as encap and hybrid.
- When using networkPolicyOnly mode, Antrea is chained with another "arbitrary" primary CNI responsible for IPAM / routing; in this experiment, Calico was used as the primary CNI.
- Calico SNAT rules are always enforced first (the Calico agent periodically enforces this), preventing Egress SNAT rules installed by Antrea from taking effect; we have to find a way to disable this behavior. No solution has been found yet.
- We need to ensure a symmetric path for return Egress traffic, which is not straightforward but possible using policy routing and OVS learned flows.
- The experiment used the iptables Calico datapath, not clear if Egress is even possible when using the eBPF datapath.
- Should we add an `except` field for `ipBlock` in Antrea-native policies?
- See issue #6428
- Implementing this requires "expanding" the CIDRs manually: the `ipBlock` CIDR needs to be broken up into individual non-contiguous CIDRs, and we need to install OVS policy flows for each one (see the rule sketch below).
- We already do this today for K8s NetworkPolicy, as `except` is supported by that API. So from an implementation perspective, it should not be difficult.
- Consensus is that we should accept this feature request and provide this functionality.
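A sketch of what such a rule could look like, assuming the field mirrors the K8s NetworkPolicy `ipBlock.except` API (the exact shape was not final at the time):

```yaml
# Hypothetical Antrea-native policy egress rule: allow traffic to 10.0.0.0/8
# except 10.0.1.0/24; the controller would expand this into the set of
# non-contiguous CIDRs covering the difference.
egress:
  - action: Allow
    to:
      - ipBlock:
          cidr: 10.0.0.0/8
          except:
            - 10.0.1.0/24
```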
- Should we add support for `NotSelf` / `NotSameLabel` matching for Antrea-native policies?
- See issue #6424
- While we acknowledge that this is a valid use case and that it would be convenient for users, there is risk in implementing this and it would be costly to implement with our current system.
- We may come up with a different, more convenient way to efficiently provide namespace / org isolation in the future.
Antrea Community Meeting 06/17/2024
- Replacing nanoserver (Windows Server 2019) with hpc (HostProcess Container) as the base container image for antrea-agent on Windows.
- See slides
- Many benefits including broader compatibility with the Windows Node OS, image size, build time, build simplicity (image can even be built on Linux).
- The image can only be run as HostProcess Containers; this is not really an issue now that we only support containerd on Windows Nodes.
Antrea Community Meeting 06/03/2024
- End-of-term presentation from all 3 mentees for this iteration of the LFX mentorship program
- Replace deprecated bincover with golang built-in coverage profiling tool - #4962 - @shikharish
- see slides
- Node latency monitoring tool - #5514 - IRONICBo
- see slides and watch demo included in meeting recording
- we send ICMP echo request messages to other Nodes and process their replies, but we are not in charge of replying to requests (the OS takes care of it)
- it may be a good idea to spread out the ICMP echo request messages over the configured interval, to avoid bursts
- we will remove "same Node" latency measurement (Node pings itself) to avoid confusion
- Pre and Post-installation checks with antctl - #6061 #6153 - kanha-gupta
- watch demo included in meeting recording
- refer to Kanha's blog post
- could we run some more advanced tests for specific features (e.g., AntreaIPAM)?
- at the moment we run tests that should always work, regardless of the configuration in which Antrea is deployed
- we could check the configuration, and run some feature-specific tests; one suggestion was to validate encryption when Wireguard is enabled
- we want tests to run quickly (no more than a couple of minutes), and we don't want to duplicate existing e2e tests
- in the future, we want to enhance pre-installation checks so that they are consistent with the intended user configuration
- A big thanks to all mentees and mentors!
Antrea Community Meeting 05/20/2024
- Technical discussion around possible implementation for proxyAll when kube-proxy is still present.
- See #6232
- LFX mentorship term is coming to an end.
- We plan to have mentees do a quick presentation of their work at the next community meeting.
- All 3 mentees may not be able to make it, in which case we will ask them to prepare a short video recording and we will play it during the meetings, and mentors will field questions.
No meeting recording available for this week.
- A proposal for a composable scale-testing framework - see slides
- The framework should be extensible so that we can easily scale test new features in the future; there is ongoing work to use this framework to perform scale testing of the Egress feature.
- The current PR (https://github.com/antrea-io/antrea/pull/5772) includes test scenarios for NetworkPolicy realization and Service realization.
- Framework supports a mix of real worker Nodes and antrea-agent simulators (cluster needs to be created and CNI needs to be installed ahead of time).
- The framework itself is not specific to the Antrea CNI: K8s NetworkPolicy scale testing can be performed on all CNIs which implement the NetworkPolicy API. However, some test cases / steps are Antrea-specific and need to be skipped / disabled for other CNIs.
- The framework can deploy kube-state-metrics + Prometheus + Grafana for metrics collection / visualization.
- Graphs used to show results should use a linear scale, otherwise results are confusing / misleading.
Antrea Community Meeting 04/22/2024
- Update on some recent CI enhancements - see slides
- CI testbeds have been moved from VMC (VMware on AWS) to native AWS, new Jenkins URL is jenkins.antrea.io
- Ongoing work to migrate additional jobs to CAPA / AWS and run more jobs in Kind
- Windows Docker jobs are being removed (starting with Antrea v2.0, we will officially only support containerd for Windows Nodes)
- updated trigger phrases for Windows CI jobs
- All test images are being transitioned from Harbor to DockerHub (transition has already been completed for user-facing images)
- Update on Antrea releases
- We have recently released 1.15.1, 1.14.3, 1.13.4 (last patch release for 1.13.x)
- Upcoming v2.0 release at the end of the month, more code reviews needed!
Antrea Community Meeting 04/08/2024
- Running concurrent CI jobs on the same VM using different Kind clusters - see slides
- We had a discussion about running Kind-in-Kubernetes instead (similar to what K8s does for CI, with Prow)
- The 2 approaches aim to solve the same issue (better utilization of CI resources), so we don't really need both capabilities
- The current approach (concurrent CI Kind jobs on the same VM) is almost ready; we will investigate the other approach (Kind-in-Kubernetes) after rolling that one out, and see if it provides additional value
- The ability to run multiple Kind test jobs on one machine may be useful for local development as well
- Initial concern that this would be complex to achieve was expressed in the issue, but we may have overestimated the complexity
- Update on BGPPolicy API design - see design doc
- Is there an actual use case for BGPFilter and explicitly excluding IPs, as opposed to just using selectors to select resources for which we want to advertise IPs?
- We are not currently looking into learning and installing routes through BGP, only advertising them (the default routing policies will be used).
- Do we really have a use case for different BGP configurations / BGP processes / BGP "virtual routers"?
- Supported by Cilium, but not by Calico
- Most common case is probably single BGP configuration with single local ASN, even single BGP peer to which we advertise all the desired IPs
- If we do need multiple BGP configurations in the future, maybe we should just allow multiple BGPPolicy CRs (with different local ASNs) to select the same Node(s). That would just require changing our validation strategy.
- If we omit BGP filters and support a single BGP configuration per BGPPolicy instance, we are essentially falling back to "proposal 1".
- In the first rollout phase, should we skip selectors and advertise either all IPs of a given type or none of them, based on a boolean toggle?
- We had a user request to support selecting specific Service IPs and even Pod IPs (at least at the Namespace level)
- If we choose not to add selectors in "phase 1", we should make sure that we design the API to accommodate for selectors in the future (without breaking API backwards-compatibility). For example, an absence of selector would mean "all IPs" (if the boolean is set), while a selector would mean "only IPs which belong to selected resources". If a user does not set the selector, the behavior stays the same.
- We plan on using gobgp, but we will make sure that we abstract away the implementation with an interface in case we want to support an alternative implementation in the future.
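To make the design discussion concrete, here is a sketch of what a BGPPolicy CR might look like under this proposal (the API was still in flux at the time, so all field names and values are illustrative):

```yaml
apiVersion: crd.antrea.io/v1alpha1
kind: BGPPolicy
metadata:
  name: advertise-lb-and-egress-ips
spec:
  nodeSelector:                 # Nodes on which a BGP process runs for this policy
    matchLabels:
      bgp: enabled
  localASN: 64512
  advertisements:
    service:
      ipTypes: [LoadBalancerIP, ExternalIP]   # per-IP-type toggles, no selectors yet
    egress: {}                  # advertise Egress IPs from their assigned Nodes
  bgpPeers:
    - address: 192.168.77.200
      asn: 65001
```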
Antrea Community Meeting 03/11/2024
- E2e testing for the Flexible IPAM feature using Kind - see slides
- Is a separate Docker bridge really required for Flexible IPAM testing? Probably not, it was done this way because the e2e test cases assume that the K8s Nodes are all in a certain subnet, but that can be changed. Should be similar to Egress VLAN testing, which uses the default Kind docker bridge.
- Static routes installed manually on the test machine are for return traffic to the Pod primary network interface; typical requirement in noEncap mode. These routes are not specific to the Flexible IPAM feature.
- Proposal for BGP support in Antrea - see slides
- We could simplify the API by removing Service & Namespace selectors, unless there is a use case for them. We could have a boolean toggle for each IP type to advertise, with no granular filtering.
- Service IPs (LoadBalancerIPs, ExternalIPs) can be advertised from all Nodes. kube-proxy or AntreaProxy (proxyAll) can then load-balance the traffic to a Service endpoint. With ECMP (BGP multi-path), ingress Service traffic can be load-balanced across a set of Nodes.
- What's the use case for applying multiple BGP policies to the same Node, with different (local) ASNs? Advertise a different set of IP addresses to different peers.
- Need to confirm that gobgp is the right choice for us (performance / feature set).
- There should be no extra config required on our side to enable ECMP for Services.
- For Services with Local ExternalTrafficPolicy, we will only advertise Service IPs from Nodes with at least one local Service endpoint.
- How quick is BGP convergence when an Egress IP is re-assigned to another Node?
- Users will need to use consistent Node selectors for BGPPolicies and Egress ExternalIPPools (same for Services with Local ExternalTrafficPolicy).
Antrea Community Meeting 02/26/2024
- Splitting up Agent and Controller container images to reduce their size - see slides
- The claimed size reduction for the antrea-agent image (300MB) seems a bit too extreme, given that only the antrea-controller binary (~100MB) was removed.
- Need to double check and compare size of unified image vs sizes of dedicated / split images.
- Could we have a shared layer for antctl binary across both images (antrea-agent & antrea-controller)?
- Our antctl binary seems to be excessively large (100MB). Binaries from other projects tend to be smaller even when they provide more features. For example, kubectl is only around 50MB. Maybe we can investigate how to reduce the size of binaries.
- Increase minimum version requirement for K8s - see #5879
- Currently we require K8s 1.16, and we were planning to start requiring K8s 1.19.
- If we decide to require a recent K8s version (more recent than K8s 1.19), we probably need to check with users and check which K8s versions are still supported by cloud-managed K8s services.
- Current plan is to be conservative and increase it to K8s 1.19 for the next Antrea release (post v1.15).
- Antrea v2.0 release - see #4832
- There is no strong reason to bump up the major version number to v2: no significant breaking change, no massive architectural change.
- There is also no strong reason not to do it; Antrea is not a library.
- We have graduated some features to GA, and we have deprecated some APIs; that could be reason enough.
- There are some remaining API changes we may want to do before 2.0 (e.g., change subnet definition in IPPool CRD), as well as some configuration changes?
- So the current plan is for Antrea v1.15 to be the last 1.X minor release; after that we will move to 2.X.
Antrea Community Meeting 01/16/2024
- Antrea v1.15 release status update
Antrea Community Meeting 01/02/2024
- Presentation of updated Antrea ROADMAP - see #5807
- The CNCF gave us some feedback during our annual review that we should update the ROADMAP.
- Multiple items were either already completed or no longer planned; some new items were missing.
- PR will be merged in early Jan, please review / leave comments before then.
- VLAN tagging for Egress - see slides
- How many VLAN sub-interfaces can be created for a given parent interface? Not sure, but this is unlikely to be an actual limitation compared to the limit on the number of route tables (~250).
- we only need one sub-interface per subnet, and we currently limit the number of subnets to 20
- 1-1 mapping between subnet and route table.
- Users must take care of configuring the "physical" Node network with their chosen VLAN IDs.
- Does the current design support subnet overlap between different VLANs?
- the uplink router(s) may support this (same IP in 2 different VLANs would map to a different SNAT IP)
- not covered by the current design, maybe not a very realistic use case (maybe we can assume that if one wants to use Egress, there is no other SNAT happening for the traffic, or at least no need to share Egress IPs across VLANs)
- The packet mark logic has not changed compared to the current Egress implementation; a specific packet mark maps to a specific Egress IP, and now also to a specific route table.
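A sketch of how VLAN support could surface in the ExternalIPPool API, with one pool per subnet / VLAN (field names and values are illustrative):

```yaml
apiVersion: crd.antrea.io/v1beta1
kind: ExternalIPPool
metadata:
  name: egress-pool-vlan100
spec:
  ipRanges:
    - start: 10.10.0.2
      end: 10.10.0.20
  subnetInfo:                  # ties the pool to a single subnet and VLAN
    gateway: 10.10.0.1
    prefixLength: 24
    vlan: 100                  # Egress traffic using these IPs is tagged with VLAN 100
  nodeSelector:
    matchLabels:
      network-role: egress
```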
- This was our last meeting of the year! Thanks for a great year 2023, and we are looking forward to 2024.
Antrea Community Meeting 12/19/2023
- Review of open issues for secondary networks support
- #5047: Antrea as secondary CNI with Multus
- definitely requires code changes to the Antrea Agent, not a use case we had in mind for Antrea
- still waiting for a compelling use case from users requesting the feature
- can we reuse the SecondaryNetwork work (in particular, VLAN support for secondary network interfaces) to achieve this? Unclear at the moment. The SecondaryNetwork feature uses a controller-based approach (K8s controller watching for Pod annotation updates) and network provisioning is done asynchronously from the CNI Add call. This is not what "secondary CNIs" used with Multus (e.g. macvlan) usually do.
- secondary VLAN networks / secondary overlay networks
- do users expect additional K8s features for secondary networks (NetworkPolicy enforcement, Service load-balancing)? If yes, this starts looking more and more like the upstream effort to support multi-networks for Pods as part of K8s itself.
- #5693: Configuring multiple RX / TX queues for Pod veths
- feature request is for secondary networks only
- can achieve better throughput for multiple concurrent connections between Pods (tested with Pods on the same Node) when Pods have access to multiple CPUs
- no objection to supporting this, but we may not have the cycles to work on this
- #5693: Ability to provision secondary network interfaces (VLAN networks) without an IP address
- issue already addressed, a patch has been merged
- #5735: Use a Node's primary NIC as the secondary OVS bridge physical interface
- Jianjun thinks it is doable, similar to the Antrea "bridging mode" which already exists for the primary network interface
- no one is actively working on this
Antrea Community Meeting 12/04/2023
- Proposal to drop support for the Docker CE (Moby) container runtime on Windows
- See slides.
- Drop support for rancher/wins installation method == Drop support for Docker?
- Key point is that we no longer test Docker CE support in Windows CI, so we are not really in a position to claim support.
- Proposed action items for Antrea v1.15: deprecate Docker support (documentation change) + remove unused CI scripts.
- No plan at the moment to deprecate running OVS daemons + Antrea Agent as Windows Services.
- In the future, we may only offer the HostProcess containers method.
- Proposal to support Egress on Windows
- See slides.
- Proposal wants to add the following:
- ability to assign Egress IPs to Windows Nodes
- Linux Pods can egress through Linux / Windows Nodes
- Windows Pods can egress through Linux / Windows Nodes
- More discussion required for implementation of SNAT on Windows Egress Nodes.
- differentiate between Egress reply traffic which we need to un-SNAT, and traffic from source Node that requires SNAT (dest IP is the same -> Egress IP)
- is "learn" flow really needed?
- Demo of PoC implementation.
- One suggestion is to only support Linux Nodes as Egress Nodes.
- simplified datapath implementation, no functional difference for users
- K8s clusters will always include Linux Nodes - no such thing as a "Windows-only" cluster
- unlikely to have an availability zone with only Windows Nodes available for Egress
- Egressing traffic from Linux Pods through Windows Nodes doesn't seem like a very good idea (potential stability issues)
- No time to discuss "secondary network" feature requests, postponed to next meeting.
Antrea Community Meeting 11/20/2023
- Proposal for Node NetworkPolicies support
- See issue #5671
- See design slides
- It's a popular user request
- Similar to NetworkPolicy enforcement for the ExternalNode case.
- ExternalNode can have stability issues because the physical interface is moved to the OVS bridge
- We propose to find an alternative solution that will apply to both ExternalNode and Node NetworkPolicies.
- The simplest solution is iptables-based, with the only drawback being increased connection latency with a large number of rules.
- we expect the number of rules to be on the smaller side for this use case (Node NetworkPolicies)
- it's not clear that other solutions would not suffer from this issue
- API change for ClusterNetworkPolicy only (new nodeSelector in appliedTo field).
- Node NetworkPolicies for Windows?
- different approach required for Windows
- compatibility issues between OVS and Windows Firewall rules
- for Windows, we always need to move the uplink to OVS anyway (like for ExternalNode)
- not planned at the moment, but we could enforce Node NetworkPolicies in OVS (single uplink interface only)
- Using ClusterNetworkPolicy vs introducing a new API / CRD?
- there are use cases for selecting Pods as peers for Node NetworkPolicies
- Dataplane implementation deep-dive and demo.
- see slides
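A sketch of the proposed API change (a ClusterNetworkPolicy applied to Nodes through a new `nodeSelector` in `appliedTo`; values are illustrative):

```yaml
apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: restrict-node-ssh
spec:
  priority: 5
  tier: securityops
  appliedTo:
    - nodeSelector:            # proposed addition: select Nodes instead of Pods
        matchLabels:
          kubernetes.io/os: linux
  ingress:
    - action: Drop
      from:
        - ipBlock:
            cidr: 10.0.0.0/8
      ports:
        - protocol: TCP
          port: 22             # drop SSH traffic to the selected Nodes
```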
Antrea Community Meeting 11/06/2023
- Antrea E2E tests flakiness
- A script collects test failures of Kind E2E tests for the main branch.
- The script runs every week automatically and generates reports for 30-day failures and 90-day failures.
- See this repo for script and generated reports.
- Some flaky tests have already been fixed since this project was started.
- Most test failures come from the `TestAntreaPolicy/TestGroupNoK8sNP` test group.
- So far we have not been able to root-cause the failures. Quan has been experimenting with different PRs to try and troubleshoot the issue. With PR #5507, which was meant for troubleshooting, the failures don't seem to happen anymore.
- Helping users with migrating from Calico to Antrea - see issue #5578 and slides
- The presenter had some microphone issues, so we apologize for the poor audio quality in the first part of the presentation.
- The idea is to provide tooling to enable CNI migration with minimal workload downtime, and the ability to convert NetworkPolicy CRs when possible.
- Did we have requests from users to provide this?
- We need to have existing Pods switch over from the Calico network to the Antrea network. The current proposal is to kill the sandbox container to force a new CNI ADD invocation. We need to be more granular about which containers we kill, i.e., exclude hostNetwork Pods.
- NetworkPolicy CR migrator:
- The 2 NetworkPolicy APIs are quite different, so converting correctly may be quite challenging, and there are some resources that we may not be able to convert at all.
- We need to introduce a "dry-run" mode so that cluster admins can check ahead of time which policies can be converted and which cannot, and make an informed decision about how to proceed.
- How will the migrator be packaged? The plan is to have an antctl subcommand.
- Apart from NetworkPolicy conversion and cleanup of stale Calico resources, the process is generic, so we could reuse it for other CNIs if needed (e.g., Flannel).
- The migrator should be ready for the Antrea v1.15 release.
Antrea Community Meeting 10/23/2023
- Very short meeting; we discussed pending items for the next Antrea release - Antrea v1.14.
- Several folks are representing Antrea at KubeCon in China
- CI improvements using CAPA (Cluster API AWS) - see slides
- New CI test matrix for Antrea e2e tests: 4 most recent K8s versions x 2 most recent Ubuntu LTS versions (for Nodes)
- Run using Jenkins weekly and on-demand
- Some improvements are needed for workload cluster resource deletion (AWS compute instances and LB)
Antrea Community Meeting 09/25/2023
- Proposal for new PacketSampling CRD - see #5443
- When capturing traffic, will we capture traffic in both directions (request + reply) for a "connection"?
- Capturing in both directions is useful to troubleshoot latency issues, retransmissions, etc.
- CRD definition: parameters which are specific to a sampling method should be grouped. Best practice is to use a "oneOf", with a specific CRD field for each case (i.e., each sampling method).
- The format for captured packets is PcapNG. Can tcpdump read these files?
- Probably, need to double check.
- It's tedious to provide file server connection and authentication details in every CR. Is there a better way?
- We could consider introducing a new CRD for users to provide connection information; SupportBundle CRs and PacketSampling CRs could then refer to that object, to avoid redundancy for users.
- Let's wait for user feedback?
- Antctl support for VM Agent case - see slides
- Currently not supported; when running antctl on a VM managed by Antrea, it will default to "controller" mode (out-of-cluster), which is incorrect.
- Compared to regular agent mode, we can only support a subset of commands (as many commands are designed for K8s Nodes running Pods); hence we will focus on NetworkPolicy commands.
- Proposal is to introduce a flag to specify the antctl "mode" (optional flag, default behavior stays the same); the flag will be required for the VM case.
- New command (`antctl get entityinterface`)
- Proposal to rename the command from `entityinterface` to `vminterface` for consistency with the mode name
- Can the mode be read from the environment?
- Maybe there should be a way to persist the mode to a local config file (in the home directory?), to avoid having to repeat the mode every time, e.g., `antctl set mode vm`. It would also persist across shell sessions and reboots.
Antrea Community Meeting 09/11/2023
- Very short meeting, nothing on the agenda
- Release cadence for new Antrea minor versions is changing from 8 weeks to 12 weeks
- New PacketSampling CRD proposal will be discussed at the next meeting - see #5443
- Feature Gate promotions for Antrea v2: see #5068
- Consensus on promoting `AntreaProxy` and `EndpointSlice` from Beta to GA
- enabled by default for a while and used widely
- new features which are added to AntreaProxy (e.g., DSR) typically get their own Feature Gate
- for AntreaProxy, we will add a boolean toggle to the `antreaProxy` configuration section, for users who still want to disable it (e.g., because they prefer using kube-proxy IPVS mode)
- Not enough confidence for other Feature Gates
- `ServiceExternalIP` needs more testing / verification in production scenarios; more user requirements (e.g., VIP sharing) are still pending implementation
- We still have known issues for L7 NetworkPolicies: users need to disable checksum offload, which may impact datapath performance
- We also want to be conservative for the `FlowExporter` feature: not enough user feedback at scale, recent modifications to the config, new key functionality added recently (e.g., TLS support for the Flow Aggregator), improvements to the implementation being investigated (e.g., using conntrack events in addition to polling), ...
- `ExternalNode`: still working through issues in the public cloud, feature is still new
- `SupportBundleCollection` (for ExternalNode): only supports SFTP for now; maybe we want to add support for cloud storage (e.g., S3)?
- `IPsecCertAuth`: maybe we should consider enabling it by default if it doesn't have any scale implications and is reasonably self-contained; let's do some more manual verification and evaluate the risk of promoting it to Beta
- We could create separate issues to track requirements for Feature Gate promotions
- A couple of general updates:
- Go v1.21 has been released recently; we are working on migrating all Antrea projects from 1.19 (no longer maintained) to 1.21
- Hashicorp has changed the license for its most popular projects (Vagrant, Terraform, ...) to a "source available, non-open source" license (MPL -> BUSL)
- The Hashicorp Go libraries that we use (e.g., `memberlist`) are not affected
- We do have some Terraform scripts, as well as some Vagrant usage, that could be impacted
- We are waiting for some guidance from the CNCF - see #617
Antrea Community Meeting 08/14/2023
- Traceflow "FirstN" sampling
- See slides
- Implementation: PacketIn messages (same as Traceflow) vs OVS native sampling support (IPFIX)?
- How can users retrieve the results of the sampling Traceflow?
- Cannot use the CRD Status as the results are too large
- We will use the same methodology as for support bundle: API endpoint to download the result data (maybe in pcap format)
- User-friendly consumption when using Antctl
- We should have a reasonable upper limit for N, to limit disk usage
- Does the hard timeout of 300s still apply for Traceflow sampling?
- Should we define a new CRD for this functionality?
- Differs significantly from the existing "liveTraffic" Traceflow, which only captures packet headers
- Proposal for several CI improvements
- See slides
- Instead of a manual command to kill "stale" jobs, should this be done automatically when a PR is updated?
- Need to ensure that stale test clusters are deleted properly to reclaim resources; maybe we need a separate cleanup job for this (like we do for public cloud tests: EKS / AKS / GKE)?
- For a few PRs (e.g., release PRs), it is better to run the job to completion to avoid repeating tests every time the PR is updated (e.g., change in release notes)
- Instead, should typing another `/test-X` command be the trigger to cancel previous jobs for the same PR?
- Normalize Jenkins job names / commands for ipv6: "ipv6-ds" vs "ipv6"
Antrea Community Meeting 07/31/2023
- Tech talk about XMasq: Bring Cache to the Container Overlay Network
- See slides; see paper
- Using an overlay network adds flexibility but has significant impact on network performance.
- The XMasq solution is not specific to any CNI in particular, but at the moment it cannot work with CNIs which rely on an eBPF dataplane (Cilium, Calico). This is why it has been tested with Antrea.
- XMasq can co-exist with the OVS dataplane; when there is a cache hit in XMasq, the OVS bridge is bypassed.
- Using XMasq means that some datapath features will not be available.
- The cache is only updated for encapsulated packets, so Pod-to-External traffic is not impacted.
- NetworkPolicy implementation is still a work-in-progress.
- stale entries are not removed from the cache
- current implementation is "stateless", and does not track individual connections
- The eBPF program does not have a significant impact on latency of the first packet (cache-miss path).
- The current implementation does not support Pod-to-Service traffic (destination IP is ClusterIP), which seems like a big limitation.
- Value of XMasq for primary Pod network vs secondary Pod networks (specialized use cases which don't require as many features but require higher throughput)?
- Antrea OVS containerization on Windows (with containerd only)
- Ability to run OVS userspace processes in a container (as Pod)
- symmetry with Linux
- easier OVS upgrades
- Dependency on Windows hostprocess feature, so only available with containerd.
- New container image is based on servercore vs nanoserver (big size difference).
Antrea Community Meeting 08/17/2023
- Short meeting where we discussed API and Feature Gate promotions for upcoming Antrea releases.
Antrea Community Meeting 07/03/2023
- Design proposal for Antrea Controller High-Availability (HA)
- See slides
- If the Node running the Antrea Controller goes down, it can take more than 5 minutes for the Controller Pod to be rescheduled to another healthy Node; we can get this down to 40s easily with the appropriate "tolerations" for the Antrea Controller Pod, but that still may be too much for some users.
- Active-active vs Active-standby: Active-active more complex and not really needed in our case (we have stress tested the Controller to very large clusters, with a single replica).
- Leader election can be tuned to failover to the standby in around 15s.
- No "perfect" solution to route the antrea Service traffic to the active replica; 3 possible solutions with different drawbacks.
- currently, the preferred solution is "Service without selector" (refer to slides)
- Before implementing Active-standby HA, we should confirm whether a 40s delay is good enough for our users.
- No state synchronization needed across replicas: all the state is persisted to K8s / etcd.
- Different Service definitions for "HA mode" and "single-replica mode" (so different YAML manifests) at first to avoid disruption to users.
- we want to evaluate the "Service without selector" solution and if it works well, we can use the same approach even for the single-replica case
- Are we aware of other projects using a similar approach for HA? Not really. The K8s apiserver uses a similar approach, but for different reasons.
- We have a somewhat unique architecture in Antrea where the Antrea Controller is used both as API server (serving in-memory data) and controller (running computations, processing state). If the 2 functionalities were split, the API server would use Active-active mode (multiple replicas could serve APIs from a distributed store) and the controller would use Active-standby (no need for traffic routing, just leader election).
- Using the "Pod readiness" approach does not help use do failover faster than 40s, so it's not better than using a single replica and setting tolerationSeconds to 0.
Antrea Community Meeting 06/20/2023
- New implementation for NetworkPolicy Logging - see slides
- We recently became aware of issue #5018: enabling logging can cause a massive drop of user traffic (for `Allow` NP rules).
- New proposal is to use separate OVS flows for the `SendToController` action, so that the `meter` action only applies to these flows.
- Use `group(ALL)` to make a separate copy of the packet, which will be used for logging purposes.
- If the only action for the packet is sending it to the controller for logging, we probably do not need to define a group with a single bucket.
- Copy should only happen for the first packet of the connection; subsequent packets bypass that part of the pipeline thanks to the conntrack lookup.
- No impact for NetworkPolicy rules that do not have logging enabled.
- `SendToController` is moved to the end of the pipeline. Will it be a problem for the following scenario: a packet is applied an egress policy rule with Allow action and logging enabled, then is applied an ingress policy rule with Drop action? Will the logging happen correctly in this case?
- It should work correctly, given that the copy created for logging will go directly to the Output table and will not be processed by the Ingress NetworkPolicy tables
- Is creating packet copies an issue for the following scenario: a large UDP flow hitting a policy rule with Drop action and logging enabled? Even though the meter will prevent the Agent from having to process too many PacketIn messages, we will still create a copy in the OVS datapath for each packet in the flow.
- This fix should be part of Antrea v1.13
Antrea Community Meeting 06/05/2023
- Antrea scale testing for agentless VMs
- Scale testing in the context of Nephe: VMs are onboarded into Antrea using the ExternalEntity API, and are selected by Antrea-native policies
- 1 Namespace, 10K ExternalEntities, 10K policies
- Results: 17 seconds to recompute policies, 1000m CPU, 300MB
- There was a question about whether the methodology used to measure resource consumption (metrics exposed by kubelet) was accurate enough
- DSR support for LoadBalancers in Antrea
- user issue: #4956
- slides
- design issue: #5025
- connection will be invalid in conntrack on the ingress Node (no return traffic is observed); but OVS doesn't expose ct_mark and ct_label for invalid connections, which are needed to store connection state (which backend Node was selected)
- one possible solution is to leverage the OVS learn action; but there is a latency between flow learning and datapath modification, which would create issues if the next packet in the connection is received before the datapath is ready
- other solutions will be investigated
- network performance metrics could be better with DSR if implemented well
- DSR mode can have an impact on NetworkPolicy enforcement, given that the source client IP is preserved
- Enforcing NetworkPolicy logging can have a huge impact on performance
- see #5018
- this was not the original intent when adding an OVS meter to rate-limit PacketIn messages: PacketIn messages should be rate-limited but it should not impact user traffic
- we should address it in the v1.13 release time frame
- Node policy support in Antrea
- see #4213
- this is a legitimate use case, but we are not sure what's the best way to implement it
- we could leverage the work done for ExternalNode support and move the Node's transport interface to the OVS bridge
- one risk is that managing the physical network interface is complicated and can depend on Node OS and hardware
- a bug in Antrea could impact Node connectivity
Antrea Community Meeting 05/22/2023
- Using Antrea ProxyAll on Windows to replace kube-proxy user-space
- Kube-proxy userspace mode has been dropped in K8s v1.26; we now need to rely on AntreaProxy
- Antrea cannot co-exist with kube-proxy Windows kernel mode (HNSNetwork Extension conflict)
- ProxyAll will be enabled by default for Windows starting with v1.12
- Antrea cannot use the ClusterIP to access the Kubernetes Service (we need the kubeAPIServerOverride option to be set)
- In Antrea v1.10.0, a Windows crash (BSoD) was observed when ProxyAll was enabled (it has been fixed and backported)
- Working theory so far is that there was a traffic loop caused by some missing flows -> high CPU usage and eventually system crash
- May be applicable to Linux (with less dramatic consequences)
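For reference, the relevant antrea-agent.conf settings on Windows would look something like this sketch (the API server address is illustrative):

```yaml
# antrea-agent.conf excerpt: enable proxyAll, and point the Agent directly at
# the kube-apiserver, since the ClusterIP of the kubernetes Service cannot be
# relied on before AntreaProxy is running.
kubeAPIServerOverride: "https://192.168.77.100:6443"
antreaProxy:
  proxyAll: true
```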
- Antrea v1.12 release status update
- See in-flight PRs
Antrea Community Meeting 05/08/2023
- Antrea v1.11.1 has been released
- Includes some important bug fixes for AntreaProxy
- Quan is in the process of backporting them to v1.10
- Update on Antrea UI - see slide
- First release (v0.1.0) will be this week
- Proposal for a new exporter for the FlowAggregator to log all connections to a local file - see slides
- Motivated by user issue #3794
- Blocked connections are included in audit logs, is it also the case for the Flow Aggregator?
- Yes, and they both rely on the same mechanism in the Antrea Agent (OVS PacketIn messages)
- We have rate-limiting in place, so in both cases, if there are too many PacketIn messages, some will be dropped
- For connections committed to conntrack on Linux, the Flow Exporter polls conntrack every 5s (see the config excerpt below).
- The poll interval should be small enough not to miss connections
- If there is a delay between IP to Pod name translation, could the information be stale / invalid?
- It's an issue for audit logging if information is missing / incomplete / invalid
- Yes it's possible, we should see what we can do to avoid that
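For context, the poll interval is exposed as an antrea-agent.conf option; a short excerpt (the value shown is the default at the time):

```yaml
# antrea-agent.conf excerpt: how often the Flow Exporter dumps connections
# from the conntrack module.
flowPollInterval: "5s"
```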
Antrea Community Meeting 04/24/2023
- Antrea CI enhancements over the last couple of months - see slides
- Windows testbeds: we currently support both Docker and containerd; we may drop Docker in the future to reduce the number of jobs
- Antrea v2.0 plans - see issue #4832
- Serving our production-ready APIs using an Alpha version doesn't send the right message to our users; some of these APIs haven't changed in years
- If you have enhancements to suggest for existing APIs (may or may not be backward compatible), now is the right time to suggest them
- We will follow our documented best practices for API deprecation & removal, and provide tooling for users to easily migrate the stored version of their existing CRs
Antrea Community Meeting 04/10/2023
- Status update for upcoming release (Antrea v1.11 / Theia v0.5)
- We want to enforce new rules for running Windows CI jobs, in order to avoid breaking Windows support in the future
- as before, if a patch modifies / adds any Windows-specific source file, all Windows tests should pass
- for every patch, the job that runs e2e tests on a Windows containerd testbed should pass
- we will add a periodic Jenkins job to run all Windows tests on the main branch
- we want to improve speed and robustness of Windows CI jobs
- The new mandatory Windows job should run automatically as part of `/test-all`, and we should have a corresponding `/skip-*` command
Antrea Community Meeting 03/14/2023
- Proposal for a custom Antrea UI to replace the Antrea Octant plugin - issue #4640
- React-based UI, using the Clarity design system
- Demo of the UI prototype, which supports Traceflow
- Suggestion to develop using a Lens plugin instead, as Lens has become the de facto replacement for Octant and has a better plugin ecosystem
- Antonin to look into Lens
- Plugin-based mechanisms are by nature not as flexible / extensible as a custom UI
- Ideally we would not require users to deploy any other piece of software to access our UI
- Built-in authentication mechanism: password-based login and JWT token for accessing the backend APIs
- Is this also meant to replace our Grafana dashboard for Flow Visibility?
- It would be good to have a unified UI for everything, but porting all dashboards may be quite a bit of work
- Team would need to get familiar with React & Javascript libraries for rendering dashboards
- CI pipeline to test Antrea with Rancher - slides
- We hope that this new CI pipeline will help make Antrea an officially-supported CNI plugin for Rancher
- AntreaProxy enhancements plan - slides
- Some enhancements are needed in AntreaProxy to catch up with latest upstream API changes for Services (e.g., ProxyTerminatingEndpoints)
- We need to pass more upstream conformance tests when proxyAll is enabled and kube-proxy is removed
Antrea Community Meeting 02/27/2023
- Nothing on the agenda, so very quick meeting.
- We briefly discussed the implications of running Antrea on a cluster where SELinux is enabled on the Nodes.
- Throughput Anomaly Detection in Theia - slides
- 3 different algorithms supported to detect anomalies
- Ongoing work to make the results easier to consume
- Plan is to support running TAD in the background on "real-time" data
- Test network data came from an actual Antrea cluster, synthetic anomalies were injected into the data manually
- It should be possible to tune the algorithm(s) to make detection less sensitive
- With default Flow Exporter / Flow Aggregator settings, we do not have many data points (one data point per connection per minute)
Antrea Community Meeting 01/30/2023
- Secure Wireguard tunnels for traffic between clusters in multi-cluster deployments - slides
- Intra-cluster traffic does not change
- For inter-cluster traffic, Geneve traffic will be encapsulated with Wireguard
- Why not replace Geneve with Wireguard instead? We want to avoid too many changes to the datapath; the Geneve VNI field is required for Stretched NetworkPolicies
- Traffic that needs to be routed to Wireguard is marked with a special mark in OVS pipeline (when the dest IP matches the Service CIDR for a remote cluster).
- AI: check if the selected packet mark value is consistent with other mark(s) used in Antrea and update it if necessary to prevent conflicts
- At the moment, we are not considering Wireguard for both intra-cluster encryption and inter-cluster encryption
- Wireguard vs IPsec: an Antrea user requested Wireguard support so this is what we are supporting now; some users may want IPsec instead (for FIPS compliance)
- Need to check if rp_filter needs to be changed for the Wireguard interface
- Only one packet mark is needed, no matter how many other clusters (Wireguard peers) are in the cluster set; Wireguard handles the routing
- We need one OVS flow for each remote cluster (same as when Wireguard is enabled, we just add one action to each flow to set the packet mark)
Antrea Community Meeting 01/17/2023
- Ofnet enhancement plan
- Priority is the deadlock bug, logging improvements, and switching to a buffered channel
- Review of user issues that need to be triaged / addressed (18 open issues for feature requests)
- #4309: Allow multiple Services to share the same LB IP [ServiceExternalIP]
- #4246: Default ExternalIPPool for ServiceExternalIP, so that the feature can be used with controllers which automatically create Services
- instead of having a global default IPPool, we could support Namespace-level annotations to let users specify a different IPPool for each Namespace
- #4385: Ability to fail over Egress across multiple subnets (e.g., AZs in the public cloud)
- for each Egress resource we would have 2 static EgressIPs or 2 ExternalIPPools (primary / backup)
- #3805: In public cloud, the control plane is typically not part of the cluster itself, so it can be difficult to define policies which select traffic from the control plane (the `nodeSelector` cannot be used).
- user proposal is to have an `endpointSelector`, but this seems convoluted and applicable only to this very specific use case
- the user proposal may not work if the kubernetes control plane service resolves to a load-balancer IP
- GKE uses apiserver-network-proxy; in that case the source IP for control plane traffic would be the IP of the proxy agent Pod
- this is not Antrea-specific, but should affect all CNIs
- Yang will follow up on the issue
- #3794: NetworkPolicy audit logs are missing source and dest Pod namespace and name
- Valid request, but this information is not all available in the Antrea agent and cannot be included in the local log files
- We could have centralized logging, but we want to avoid duplication with Flow Aggregator and Theia
- Some users can be reluctant to deploy Theia just for this information
- A solution could be to add this functionality to the Flow Aggregator: it already has all the required information, and could generate a centralized log file
- Antonin will follow up on this issue
- #4213: NetworkPolicy support for Node traffic
- Not possible in Antrea today, as the Node traffic is not managed by Antrea / OVS (except when FlexibleIPAM is enabled, or for ExternalNode)
- We still believe that it is an important feature to have; in Calico, this is implemented using iptables
- #3540: upstream NetworkPolicy Status support
- at the moment, the feature can only be used by CNIs to report whether `endPort` (port range) is supported
- Yang will look into this
- It has come to our attention that Octant is no longer actively maintained, with the last commit dating to 1 year ago
- We want to find an alternative, but we don't know yet what it will be
Antrea Community Meeting 01/03/2023
- Short status update for Antrea v1.10
- Major new features for this release are L7 NetworkPolicy support and changes to support bundle collection (new `SupportBundleCollection` CRD and support for ExternalNode)
- Short status update for Theia v0.4
- Next meeting is postponed by 24 hours because of EOY holidays
Antrea Community Meeting 12/19/2022
- L7 NetworkPolicy demo
- Open question on how users can specify that only a specific peer can access the application using L7 NetworkPolicies
Antrea Community Meeting 12/05/2022
- Update to L7 NetworkPolicy API
- Main change: L7 NetworkPolicy capabilities will be added to existing L3/L4 NetworkPolicy API.
- Rules can be port-agnostic (all traffic is sent to the L7 engine) or port-specific (traffic which doesn't match the specified port will not go through the L7 engine at all).
- The L4 ports field is used to scope the traffic that needs to be sent to the L7 engine.
- Any policy rule after a L7 policy rule will be ignored (applied L7 policy rules are terminal).
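A sketch of the direction discussed: an HTTP rule embedded in the existing Antrea-native NetworkPolicy API (API version and field names are illustrative of the proposal, not final):

```yaml
apiVersion: crd.antrea.io/v1alpha1
kind: NetworkPolicy
metadata:
  name: allow-http-get-only
  namespace: default
spec:
  priority: 5
  appliedTo:
    - podSelector:
        matchLabels:
          app: web
  ingress:
    - action: Allow
      ports:                   # L4 scoping: only port 80 traffic is sent to the L7 engine
        - protocol: TCP
          port: 80
      l7Protocols:
        - http:
            method: GET        # only HTTP GET requests are allowed
```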
- Update on networkPolicyOnly mode with multi-cluster
- A tunnel interface needs to be created on each Node for cross-cluster traffic.
- Change of plan for handling reply traffic from gateway to general mode: use a L3Forwarding rule for each cluster Pod instead of relying on CT label. These rules are installed in the gateway's OVS bridge. Assumption is that the number of Pods is not that large in networkPolicyOnly mode. We have chosen this option because it is simpler, with no significant performance difference (performance is better with small number of rules, not measured with large number of rules).
- Some open questions for stretched NetworkPolicy support
Antrea Community Meeting 11/21/2022
- Review open questions for L7 NetworkPolicy API & implementation
- see slides
- a follow-up discussion is needed to determine the API behavior for "unsupported" protocols
- Benchmark results for L7 NetworkPolicy implementation with Suricata
- see slides
- Proposal for supporting networkPolicyOnly mode with multi-cluster
- see slides
Antrea Community Meeting 11/07/2022
- Finer-grained datapath updates in AntreaProxy
- Motivations:
- Avoid unnecessary datapath updates (OVS flows, Linux routes & ipsets) for some Service changes
- Better organized code
- Proposal should be revised to account for the fact that some Service Spec fields are immutable (e.g. `ClusterIP`, unless some specific conditions are met)
- Supporting incremental endpoint updates with OpenFlow (incremental bucket updates) seems more important than optimizing for very infrequent scenarios (e.g., changing the Service Type or the NodePort).
- For OVS flows, we have an in-memory cache, so the benefits of this new approach may be small (could be different for routes / ipsets).
Antrea Community Meeting 10/24/2022
- Antrea L7NetworkPolicy API - see slides
- In the first release, only HTTP will be supported as the protocol.
- At the moment, isolated behavior is per direction AND per protocol (e.g., if there is an HTTP rule and no DNS rule, all DNS traffic is allowed as well as all other non-HTTP traffic).
- If Host is empty in an HTTP rule, any Host name is allowed.
- Applying L7 NetworkPolicies on External Nodes (VMs) should be possible (with few additional requirements), but not planned for the first release.
- Host doesn’t include the port number (open question).
- No support for HTTPS at the moment, which requires decrypting traffic and certificate injection. In theory, we can still support Host-only rules with SNI (host name is in clear text).
- No support for policies such as "drop all L7 traffic that is not HTTP" (implementation doesn't support wildcard rules and a protocol has to be explicitly specified for all rules). Quan to investigate if we can match on "ip" as the protocol for default drop rules.
- Every time Suricata signatures are modified, Suricata has to reload the full set of signatures (one file).
- For a couple hundred rules, the reload should take < 1s (Quan to verify)
- No third-party library to generate Suricata rules programmatically, rules are plain text and not binary.
- Antrea TrafficControl CRs will be generated for L7NetworkPolicy CRs to ensure that the right traffic gets sent to Suricata.
- Only drop & pass actions in first release.
- Integrating with Suricata requires disabling TX checksum offload in Pods; impact on performance not measured yet
- What's the effect if the NetworkPolicyPeer (for egress rules) uses FQDN? TBD
- First release with support for L7 NetworkPolicies will be Antrea 1.10 (HTTP only)
- Engine selection: Suricata vs other options?
- Which protocols are the most requested by users? Suricata supports HTTP, FTP, SSH, ... (similar to other engines like Snort)
- Other solutions based on Envoy have support for gRPC and rich set of matching features for HTTP traffic; Envoy can also easily be extended to support new protocols
- It should also be possible to support new protocols in Suricata using signatures, even if Suricata doesn't know how to parse the protocol natively
- The issue with Envoy is that it is actually a proxy which intercepts connections with L4 sockets. Our traffic control implementation works at L2, so it works very well with IDS / IPS engines, but not with Envoy. Other issues with Envoy: not a transparent proxy, performance, ...
Antrea Community Meeting 10/10/2022
- Stretched NetworkPolicies - see slides
- Why? ACNP replication doesn't support the enforcement of ingress policy rules (only toServices egress rules supported)
- We introduce the notion of label identity for Pods (more scalable than distributing IP addresses), which is generated from the Pod labels
- We use the VNI to store the label identity
- Even in noEncap mode (in a member cluster), cross-cluster traffic would always be encapsulated (including from the origin Node to the gateway Node), so there is always a VNI field
- No flow changes required on gateway Nodes: tunnel id / VNI will stay with packet when forwarding to the new tunnel port
- Should be part of Antrea v1.9
- How long does it take for a stretched policy to take effect? It takes a few seconds for 100 label identity changes
- If the identity tag is not known to the member cluster yet, traffic will be dropped for security reasons
- Project Nephe - see slides
- Only "Allow" action supported for policies (no "Drop" / "Deny" rules); this is because some clouds don't support other actions (Azure?). Need to double-check and decide whether to expose these actions for clouds that do support them.
- Cloud VM tags are added as labels to the ExternalEntity CRs, allowing users to select workloads by tags when defining Antrea policies (see the sketch after these notes)
- Instead of copying VM tags as VirtualMachine CR annotations, maybe we could use labels? There is ongoing work to change how cloud tags are propagated to ExternalEntity labels. The VirtualMachine CRD will also be removed eventually, superseded by the VirtualMachinePolicy CRD.
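The following hypothetical sketch shows how VM tags propagated as ExternalEntity labels could be consumed in an Antrea ClusterNetworkPolicy; the label key and all names are invented for illustration, and `externalEntitySelector` reflects our best understanding of the appliedTo/peer field for selecting ExternalEntities.

```yaml
apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: vm-allow-ssh              # hypothetical name
spec:
  priority: 10
  tier: application
  appliedTo:
    - externalEntitySelector:     # selects ExternalEntities (cloud VMs) instead of Pods
        matchLabels:
          environment: prod       # hypothetical label derived from a cloud VM tag
  ingress:
    - action: Allow
      from:
        - ipBlock:
            cidr: 10.0.0.0/24     # hypothetical admin subnet
      ports:
        - protocol: TCP
          port: 22
```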
Antrea Community Meeting 09/26/2022
- Update on Theia manager
- will be in charge of interacting with Spark operator for NetworkPolicy recommendation
- more generally an entrypoint for controlling all observability features
- boilerplate code merged, now adding NetworkPolicy recommendation CR with corresponding controller code
- Grafana new home page work completed for Theia
- Multiple PRs have been opened & merged to increase unit test code coverage - #4142
- Need to investigate code coverage measurement for Theia (which uses both Golang and Python)
- L7 NetworkPolicy
- feature targeted for Antrea v1.10 (ongoing design and PoC)
- license for Suricata is GPL 2.0; not an issue as we just plan to distribute it as a binary as part of our Docker images
- new CRDs will be introduced for L7 NetworkPolicies; there will be no support for Tiers and priorities (not possible with the chosen implementation), just like for K8s NetworkPolicies
Antrea Community Meeting 09/12/2022
- Nothing on agenda
- Update on support for stretched Antrea network policies for multi-cluster
- several PRs in progress
- upcoming demo at a later community meeting
- PR for Theia manager is under review
- looking for feature parity with current Theia CLI first - then more features will be introduced to leverage the Theia manager capabilities
Antrea Community Meeting 08/29/2022
- Customized Grafana homepage
- Motivations: brand it for Antrea / Theia, provide quick insights about the cluster network, provide quick access to dashboards
- What are stopped connections (widget)? Connections that were active during the selected time window but that are no longer active now. "Terminated" may be a better name.
- The amount of data transmitted out of the cluster may be an interesting metric to have as a widget.
- It seems that "Data Transmitted" doesn’t include traffic sent from the server side back to the client, which is confusing. It should include bi-directional traffic.
- Antrea v1.8 release: some pending issues
- Multi-cluster Gateway HA support: #3754
- At the moment, user needs to annotate Nodes manually to select gateways: only the most recently selected Node is used as gateway, with no fallback mechanism in case of failure
- We want to make Gateways more robust: active-standby at first (with active-active in the future)
- When active Gateway changes, existing connections may be reset
Antrea Community Meeting 08/15/2022
- Update on Theia changes for upcoming release (Antrea v1.8)
- extending Theia CLI capabilities
- improving unit test coverage for theia repository (Python Spark jobs, TypeScript Grafana plugins)
- improving e2e test coverage
- clustering support for ClickHouse DB (for horizontal scaling and HA / replication), dropping support for non-clustered deployment
- support for seamless schema upgrades (no data loss)
- minor UI improvements for Grafana dashboards
- is there a plan to drop support for the IPFIX exporter in the Flow Aggregator given that Theia only works with the ClickHouse exporter? Not at the moment, we know of at least one user for the IPFIX exporter (vRNI, a VMware product).
- Update on upcoming Antrea v1.8 release (merged & pending PRs)
- support Topology Aware Hints in AntreaProxy (already supported in kube-proxy) - merged; see the Service sketch after these notes
- support for Helm chart installation method - merged
- multicast encap mode support - merged
- need more reviewers for pending PRs: https://github.com/antrea-io/antrea/milestone/21
- planning to freeze code around Wednesday August 10th
- limitations of audit logging support for K8s NetworkPolicies:
- when traffic is accepted, it can be because of any number of NetworkPolicies
- when traffic is dropped, it is not because of any specific NetworkPolicy, it is because the Pod is "isolated" (a Pod becomes isolated as soon as at least one NetworkPolicy applies to it) - we just display "dropped by K8s NetworkPolicies".
- The recording ended abruptly because of a technical error on our side: we did continue the meeting after that for an extra few minutes, but nothing important was said, and there is no recording available.
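For reference, enabling Topology Aware Hints on a Service is done upstream with an annotation, as in this minimal sketch (the Service name and ports are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service     # hypothetical
  annotations:
    # upstream annotation asking the control plane to populate EndpointSlice hints
    service.kubernetes.io/topology-aware-hints: "Auto"
spec:
  selector:
    app: my-app        # hypothetical selector
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
```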
Antrea Community Meeting 08/01/2022
- Support bundle for External Nodes (VMs) running the Antrea Agent
- Proposal slides
- We plan to change the API to request support bundles from Agents: it will use a CRD. We will have a unified API for Agents running on K8s Nodes and Agents running on External Nodes / VMs.
- We should be able to request support bundles from both External Nodes and K8s Nodes with one single API request (i.e., one single CR).
- Currently we only plan to support HTTPS for file upload, maybe other protocols in the future (e.g. FTP).
- Secret references in the CRD should not include the uid.
- An internal channel between Agent and Controller can be used to distribute the Secrets to the Agents. In that case, the Agent doesn't need to be granted RBAC permissions to read the Secret.
- It is unclear whether we need an internal channel between Agent and Controller at all for this feature. Primary motivation was to hide Node information (Node name) from other External Nodes, including in the same Namespace. However, this does not seem to be a reasonable motivation, since at the moment External Nodes already have access to this information based on existing RBAC. If we remove the internal channel, how do we distribute Secrets?
- Some support bundle requests which select the same Node(s) cannot be processed concurrently. Rather than have complicated admission logic and a dedicated error code for this case, we could simply have a single worker processing requests one-by-one sequentially. Support bundle is not a very frequent or time-sensitive operation.
- Internal channel for communications between Agent and Controller: can we have a unified solution across multiple features (support bundle, policy stats, traceflow, ...)?
- Helm charts for Antrea available starting with v1.8
- See https://github.com/antrea-io/antrea/blob/main/docs/helm.md
- Antrea Helm charts listed on https://artifacthub.io/
Antrea Community Meeting 07/18/2022
- Stretched Antrea network policies for multi-cluster
- Proposal slides
- Why single ResourceExport for all label identities in a cluster, but one ResourceImport object for each label identity from the leader cluster?
- ResourceExport object may grow too big or may need to be exchanged too often. Could there be a performance issue?
- Decision to have a single object was motivated by a potential race condition when different clusters make changes to the same normalized labels.
- Maybe there is another more efficient way to avoid that potential race condition. Yang and Grayson will look into this.
- Two choices for carrying label identity information across clusters: Geneve header TLV (up to 32 bits) or tunnel ID / VNI (up to 24 bits)
- Decision is to use VNI (simpler & no impact on MTU) even though it is limited to 24 bits
- Why is the MTU change an issue? Concern is that an existing cluster joins a ClusterSet and that we enable the StretchedNetworkPolicy feature. In that case, the MTU would need to be reconfigured for all Pods in the cluster.
- How to map workloads to label identities when using VNI? 2 choices: 12-bit label identity for Namespace and 12-bit label identity for Pod OR one 24-bit label identity which encodes both the Namespace and the Pod labels. Both have pros and cons when it comes to data exchange and number of supported identities.
- Decision to be made offline
Antrea Community Meeting 07/05/2022
- @tnqn is proposing to refactor the Kind e2e tests
- See https://github.com/antrea-io/antrea/pull/3922
- Plan is to have one job with all the Alpha+Beta features enabled and one job with all the Alpha+Beta features disabled (in addition to the jobs for different encap modes)
- Reduce redundancy in CI jobs
- Reduce side effects and unpredictability of tests by removing ConfigMap mutations by individual test cases (test cases which depend on a specific feature will only run if the feature is currently enabled in the ConfigMap)
- In the future, we can run additional tests in Kind (e.g., for Multicast) to reduce dependency on Jenkins jobs
- Releases of Antrea v1.7.0 and Theia v0.1.0
- Legacy network flow visibility will be removed from the Antrea repository as part of the Antrea v1.8.0 release
- Multi-cluster support in Theia
- 2 possible solutions: have a unique flow data store (ClickHouse DB) in each cluster, or a centralized flow data store in one of the clusters
- In both cases, we want a centralized Grafana instance, with the ability to select a specific cluster in the UI (from a list of "connected" clusters)
- Next meeting will be on Tuesday July 5th, because of US holiday
- We will discuss support for network policies with multi-cluster
Antrea Community Meeting 06/21/2022
- Antrea Jenkins CI - current situation and migration plans
- Any security risk associated with using smee to propagate Github webhook payloads to the private Jenkins instance?
- Smee is developed and run by Github, we don't think it's a big security risk and no plan to replace it at the moment
- Force pushing to a branch and running `/test-all` does not kill previous ongoing jobs, which creates delays for the new jobs and for other PRs
- We can investigate a command-based solution to enable users to kill previous jobs, but we have to be careful about doing clean-up properly
- Budget for running Antrea CI jobs in AWS?
- Currently $200, but we can increase it
- AWS makes more sense for daily jobs, not for pre-merge jobs
- We plan to open-source the code we use to run CI jobs in the private Jenkins deployment (e.g., IPv6 & Windows CI jobs)
Antrea Community Meeting 06/06/2022
- Virtual Machine (VM) support in Antrea using `ExternalNode` (see the CR sketch after these notes)
- Proposal slides
- An ExternalNode is a kind of ExternalEntity - for each ExternalNode, we create a corresponding ExternalEntity, with the same labels.
- There will be support for ExternalEntity when we introduce namespaced Groups (as a way to easily select ExternalEntities and group them together, for the sake of defining network policies).
- AntreaAgentInfo will now be created / deleted by the Antrea Controller, but still updated by the Antrea Agent. This reduces RBAC permissions for the Antrea Agent. This also means that if we create a dedicated ServiceAccount per ExternalNode, we just need to grant the ServiceAccount permission to `update` its own AntreaAgentInfo resource (unlike the `create` verb, the `update` verb can be restricted to a specific resourceName).
- 2 "pipelines" for ExternalNodes: IP pipeline focused on policy enforcement (no forwarding) and nonIP pipeline for non-IP packets
- When interfaces are moved to the OVS bridge, the internal ports get the original name of the pNIC to avoid disruptions to routing and other processes.
- No forwarding functions: the OS is in charge of everything; when a packet enters through ethX (internal port), it is immediately marked with the correct egress port, i.e. pnicX (the matching physical port).
- Are OVS restarts handled gracefully (no connectivity loss)? We know that NSX uses veth pairs to connect pNICs to the bridge for this reason, so something worth looking into.
- DHCP should keep working after moving the interface to the bridge: the DHCP client will keep renewing the lease for the IP.
- All ExternalNodes are ExternalEntities but not all ExternalEntities are ExternalNodes. In the cloud, we can support security policies for VMs using cloud-native constructs (e.g. security groups for AWS VPCs), without running the Antrea Agent on the VM. In this case, VM network interfaces map to an ExternalEntity, but there is no ExternalNode.
- An ExternalNode really means "a compute node on which we run the Antrea Agent and OVS"
- Note that we actually generate different ExternalEntities for the different network interfaces of an ExternalNode
- Last week, there was an Antrea office hour session at Kubecon EU (hosted by Salvatore and Quan)
- Short introductory presentation of the project, recent features, current state of the project
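A minimal sketch of an `ExternalNode` CR for a VM, as discussed above; the VM name, Namespace, label, and IP are hypothetical.

```yaml
apiVersion: crd.antrea.io/v1alpha1
kind: ExternalNode
metadata:
  name: vm1            # hypothetical VM name
  namespace: vm-ns     # hypothetical Namespace
  labels:
    role: db           # copied to the corresponding ExternalEntity, for policy selection
spec:
  interfaces:
    - ips: ["172.16.100.5"]     # hypothetical VM interface IP
      name: ""
```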
Antrea Community Meeting 05/23/2022
- Transition from Kustomize to Helm to generate Antrea YAML manifests
- Currently we only use Helm for its templating capabilities, we don't have a chart repo and we are not releasing the chart yet
- Kustomize had too many limitations for our use case (not a templating engine), and we ended up using `sed` a lot
- No significant difference for the end user; the generated manifests are the same, with the exception of the name of the Antrea ConfigMap no longer including a hash-generated suffix
- GKE supporting NetworkPolicies for Windows Pods with Antrea
- Antrea Network Flow Visibility solution is moving to its own Github repository under the name "Theia"
- The visibility solution has very different dependencies from Antrea and uses different technologies (e.g., Python + Spark for network policy recommendations)
- The Flow Aggregator is staying in the main Antrea repo, ELK integration is being removed starting with Antrea v1.7 (replaced by ClickHouse + Grafana)
- Upcoming Theia v0.1 release with new features: visualization for denied flows (flows denied by NetworkPolicies), network policy recommendation CLI
- Theia adds support for Helm to deploy all the different components
- Theia repo getting its own CI
- There is some duplicate documentation between the Antrea repo and the Theia repo, which could create some user confusion
- Antrea live show on flow visibility
Antrea Community Meeting 05/09/2022
- Certificate-based authentication for IPsec tunnels
- Proposal slides
- The Antrea Agents need access to the root CA certificate to configure the IPsec daemon; rather than add this certificate to a new ConfigMap, it can be added to the PEM-encoded certificate chain in the CSR Status field, after the CSR has been approved and signed (by the Antrea Controller).
- For certificate rotation, adding an extra OVS config option with a unique hash value seems like the best approach to trigger certificate reloading by the ovs-ipsec-monitor.
- When using unique file names, we need to handle cleaning up stale files.
- Updating any config option should be sufficient to trigger certificate reloading.
- Consensus is that the functionality for managing certificates (CSR creation, etc) should be located in the antrea-agent container, not in the antrea-ipsec container.
- The CN / SAN name in the certificate must match the remote_name in the IPsec config.
- With the current proposal for RBAC configuration, Agents can create any CSR they want (for any signing authority), including creating CSRs using another Node's name as the CN.
- Possibility of escalation if a Node is compromised (and the Agent's serviceAccount token is compromised).
- We could consider having an option so that Agent CSRs are not auto-approved; an admin would need to approve CSRs manually, including in case of Node reboot; the Antrea Agent can update the Node Status as "NotReady" until the CSR is approved (e.g., in case of reboot) to prevent Pods from being scheduled on the Node
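A sketch of what enabling certificate-based IPsec authentication could look like in the antrea-agent configuration; the option names are assumptions based on our understanding of the proposal and may differ from the final implementation.

```yaml
# antrea-agent.conf (excerpt)
trafficEncryptionMode: ipsec
ipsec:
  # assumed option: switch from the default pre-shared key mode to
  # certificate-based authentication (CSRs signed by the Antrea Controller)
  authenticationMode: cert
```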
Antrea Community Meeting 04/25/2022
- Proposal for introducing the traffic control capability
- Slides
- Design issue #3324
- Traffic can be mirrored or redirected through an OVS port.
- Users can either create the OVS port themselves, or let Antrea create it (can add an existing device, or create a new tunnel port); this is for added convenience, as creating an OVS port requires access to the ovsdb-server socket.
- Can multiple `TrafficDestination` CRs use the same port, and let Antrea create the port? Yes, the same port will be used if the `name` fields are the same. However, the configuration properties have to be the same in all cases, or there will be a conflict / error. This is similar to multiple Pods using the same `hostPath` volume on a K8s Node.
- Another approach that was considered was having a ConfigMap (could be the antrea-config ConfigMap) to create all required ports upfront. The API-driven approach is probably more user-friendly.
- Initial version is planned for Antrea v1.7, with additional features (e.g., filtering) in Antrea v1.8.
- The `TrafficControl` feature name may be too general, as `tc` in Linux supports many additional use cases.
- Adding e2e tests for this feature should not be an issue.
- Open question: stateful vs stateless implementation. Quan believes performance could be better with the stateless implementation, without compromising on supported use cases. Can be revisited when we start implementing filtering support.
- `TunnelDestination` could be a separate CRD, and multiple `TrafficDestination` CRs could refer to the same `TunnelDestination` CR if they want to use the same tunnel.
- Using OVS flow-based tunneling for mirroring is not an option: traffic needs to go out on the overlay (normal Pod traffic forwarding) but also be mirrored to another tunnel. We cannot use flow-based tunneling in both cases, as each packet only has one set of tunnel metadata in OVS.
- Some possible improvements for the API, will discuss them on the Github issue and offline.
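A minimal sketch of what a traffic mirroring CR could look like under this proposal (using the `TrafficControl` name discussed above); all names and the collector IP are hypothetical, and the field layout is our best understanding of the API that eventually shipped, not a definitive reference.

```yaml
apiVersion: crd.antrea.io/v1alpha2
kind: TrafficControl
metadata:
  name: mirror-web-to-collector     # hypothetical name
spec:
  appliedTo:
    podSelector:
      matchLabels:
        app: web                    # hypothetical selector
  direction: Both
  action: Mirror                    # mirror (rather than redirect) the matched traffic
  targetPort:
    geneve:                         # let Antrea create the tunnel port to the collector
      remoteIP: 10.0.10.2           # hypothetical collector IP
```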
Antrea Community Meeting 04/11/2022
- ICMP support in Antrea-native policies
- Feature request issue #3263
- Proposal slides
- CRD webhook conversion applies to served API resources, but stored resources are not modified automatically
- Consensus is to keep the current `Ports` field and introduce a new `Protocols` field. If both fields are present, they will be merged. If they conflict, we can fail early. This means we do not need to introduce a new API version. (see the sketch after these notes)
- We could support arbitrary IP protocol numbers (as integers) as well.
- Demo video
- Implementation is in good shape, but there is an issue when using the `Reject` action, which needs more investigation.
- Multi-cluster datapath connectivity support
- Issue #3502
- Design doc
- Currently the gateway Node (for cross-cluster communications) is chosen manually and there is no failover mechanism.
- Routing to other clusters is based on the destination IP (each cluster advertises its Cluster CIDRs and Service CIDRs)
- Active-standby failover may not be enough, we may need active-active support (multiple active gateway Nodes), as a single gateway Node may become the bottleneck.
- We may want to support overlapping Service CIDRs across clusters, by allocating virtual IPs (in Antrea) for multi-cluster Services.
- Might need to support noEncap mode (in cluster members) too to cover cloud-managed K8s services; at the moment we only support encap mode for cluster members.
- K8s upstream multi-cluster DNS specification recently merged: https://github.com/kubernetes/enhancements/pull/2577
- Using Node private IP or public IP (as reported by K8s API) to create tunnel endpoints? We probably should use the public IP if available, or fall back to private IP if not available (there isn't always a public IP). Private IP may not be routable across clusters (e.g. if member clusters are in different VPCs).
- Will try to include this in Antrea v1.7
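To illustrate the `Protocols` field discussed in the ICMP item above, a minimal sketch of an Antrea-native policy matching ICMP echo requests; the names and selectors are hypothetical.

```yaml
apiVersion: crd.antrea.io/v1alpha1
kind: NetworkPolicy
metadata:
  name: allow-ping-only     # hypothetical
  namespace: default
spec:
  priority: 5
  appliedTo:
    - podSelector:
        matchLabels:
          app: client       # hypothetical selector
  egress:
    - action: Allow
      to:
        - podSelector:
            matchLabels:
              app: server
      protocols:            # new field, merged with ports if both are set
        - icmp:
            icmpType: 8     # echo request
            icmpCode: 0
```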
Antrea Community Meeting 03/28/2022
- "Live Traffic Tracing for Antrea" proposal
- Github issue #3428
- Design doc
- Antrea Traceflow feature, 2 modes:
- packet is injected by Antrea and traced through the cluster network, or
- the first "live" (real traffic) matching packet is traced and captured
- This design wants to add more advanced tracing and sampling capabilities compared to Traceflow (e.g. capture multiple packets, ...)
- Packets "marked" for capture will be matched in the OVS pipeline and sent to the Agent. To mark the packets, the best solution seems to be eBPF with TC hook.
- Risk of using / setting the IP ID field to uniquely identify sampled packets?
- Packet dumps (with metadata) will be stored at source and destination Nodes; HTTP API can be used to retrieve sampled packets.
- Live tracing for Service traffic (which goes through DNAT)? Need to check how this will work.
- Traceflow is already overloaded, maybe a new CRD should be used to configure traffic sampling and avoid user confusion
- Do we need such a complicated implementation (capture at both the source and destination Nodes) or can we go with something simpler for sampling (e.g. sample traffic at a specific Node)?
- How much complexity will eBPF be adding to the Antrea codebase?
Antrea Community Meeting 03/15/2022
- Multicast API design
- Reuse the same IP block field for multicast IP addresses (no need to introduce a new field)
- Need more discussion for how to select IGMP messages: we should probably consider a generic solution that can also work for other protocols we want to support, such as ICMP
- Multicast stats demo
- No need for a dedicated API for multicast NetworkPolicy stats (follow-up from last meeting's discussion)
- New APIs / antctl commands:
- For Pod multicast traffic stats: only locally available in "agent-mode" (from the antrea-agent Pod), no aggregation
- To query multicast group membership: membership information is aggregated in the Antrea Controller (there can be members across many different Nodes) and API available with APIService
- Proposed API name (`multicastgrouppodsmembers`) seems redundant; suggestion is to use `multicastgroupmembers` instead
- Antrea NetworkPolicy NodeSelector (see the sketch after these notes)
- https://github.com/antrea-io/antrea/issues/3023
- https://github.com/antrea-io/antrea/pull/3038
- Design doc
- Which Node IPs are used to enforce NetworkPolicies?
- For intra-cluster communications, Nodes use the gateway IP assigned by the Antrea Agent
- Node IPs must include uplink IP and gateway IP
- The Controller can determine the gateway IP for each Node based on that Node's PodCIDR
- Other cases we need to cover?
- Antrea v1.6 release
- Largest ongoing PR is the "flexible pipeline" one, which changes how we manage OVS flows for the different Antrea features; should be merged very soon, which will unblock other PRs (multicast, flexible IPAM, ...)
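A hypothetical sketch of the NodeSelector capability discussed above, used as an egress peer in a ClusterNetworkPolicy; all names and labels are invented for illustration.

```yaml
apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: drop-egress-to-control-plane     # hypothetical
spec:
  priority: 5
  tier: securityops
  appliedTo:
    - podSelector:
        matchLabels:
          app: untrusted                 # hypothetical selector
  egress:
    - action: Drop
      to:
        - nodeSelector:                  # matches traffic to the selected Nodes' IPs
            matchLabels:
              node-role.kubernetes.io/control-plane: ""
```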
Antrea Community Meeting 02/28/2022
- NetworkPolicy support for multicast traffic
- Ability to target multicast traffic but also IGMP messages (query / report)
- See https://github.com/antrea-io/antrea/issues/3323
- There are several possible options we should consider for the API design: dedicated types for multicast policies, dedicated types for multicast rules, same types but dedicated fields, ...
- In order to define rules for IGMP messages, should we consider introducing a generic mechanism to target arbitrary IP protocols?
- Support for Antrea multicast stats API
- See https://github.com/antrea-io/antrea/issues/3294
- The proposed stats API may need to evolve based on the final design we agree upon for the multicast NetworkPolicy support
- The legacy `*.antrea.tanzu.vmware.com` APIs are being removed in Antrea v1.6 (they were deprecated in favor of `*.antrea.io` APIs in v1.0)
Antrea Community Meeting 02/14/2022
- Proposal for Antrea IPAM multi-VLAN support post Antrea v1.5 - slides
- Status of Flexible IPAM:
- Antrea v1.4: decouple Pod IP allocation from Node assignment (Pod can keep same IP when evicted to another Node), Linux + IPv4, ability to provide IPPool per Namespace
- Antrea v1.5: ability to provide IPPool per Deployment / StatefulSet, ability to provide IP per Pod
- The plan is to avoid introducing a new feature gate for multi-VLAN support; for Pods using Node IPAM and for the antrea gateway we will keep using trunk ports so no change for those
- With the current design, OVS will route packets across VLANs locally: a Pod in VLAN 100 and a Pod in VLAN 101 can talk to each other locally without the traffic going through an underlay router
- this may be surprising to users as this is not the typical VLAN isolation behavior
- an alternative would be to always forward the traffic to the uplink and let the underlay network handle it (whether it's local or remote Pod traffic)
- this could be configuration-based as well; macvlan offers a similar configuration option?
- Same VLAN ID can be used in multiple IPPools (see the IPPool sketch after these notes)
- The number of flows in the new VLAN table will be proportional to the number of local Pods
- Field names for NetworkPolicy API
- In the current K8s NetworkPolicy API (but also in the Antrea-native API), workloads can be selected in many different ways. For example, you can select workloads with a podSelector, a namespaceSelector or by providing a combination of both selectors. This becomes messy as we add new selectors to the API (e.g. serviceAccount selector). Some selectors are compatible (e.g., podSelector & namespaceSelector) but some are not (e.g., podSelector & serviceAccountSelector).
- At the moment, we use a validating webhook to check that the provided selectors make sense together.
- This is a bad design when updating the API to support new selectors. Users can create policies using the new API version. These policies can be invalid. If an older version of Antrea is still running, the policies will not be rejected and will be stored by the apiserver. The unsupported fields will be ignored by Antrea (silently) leading to an implementation which is not expected by the user. When the Antrea Controller Pod is updated, it will complain that the policies are invalid.
- That is why the upstream NetworkPolicy API cannot be updated with new features in its current version.
- This was a mistake in the original design of the API. In retrospect, there should be more fields in the API spec, each with a verbose name and each corresponding to a specific combination of selectors.
- Should we incorporate these learnings as we evolve our Antrea-native NetworkPolicy APIs? What about existing selectors?
- There could be other API designs achieving the same goals.
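A sketch of what a multi-VLAN IPPool and its consumption could look like; the names, ranges, and VLAN ID are hypothetical, and the field names follow our best understanding of the `IPPool` API.

```yaml
apiVersion: crd.antrea.io/v1alpha2
kind: IPPool
metadata:
  name: pool-vlan100     # hypothetical
spec:
  ipVersion: 4
  ipRanges:
    - start: "10.10.100.10"
      end: "10.10.100.100"
      gateway: "10.10.100.1"
      prefixLength: 24
      vlan: 100          # Pod traffic from this pool is tagged with VLAN 100
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-a           # hypothetical Namespace opting into the pool
  annotations:
    ipam.antrea.io/ippools: pool-vlan100
```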
Antrea Community Meeting 01/18/2022
The meeting was cancelled.
- Presentation (with demo) on a solution for heterogeneous multi-cluster communications built with Antrea by Transwarp - slides
- heterogeneous: some individual clusters are running Antrea, others (legacy) are running Flannel
- full-mesh communications: direct Pod-to-Pod communications across all clusters, no gateway (different from Submariner)
- In the Transwarp solution, clusters are added / removed from the full-mesh by providing a kubeconfig file. Each cluster watches resources from other clusters (in particular Nodes) and each Node in the cluster programs OVS flows / routes accordingly.
- No change to existing Antrea code; a new component running on each Node programs some extra flows. This component uses a different flow cookie number so that it does not conflict with Antrea flows, which ensures that the flows are not deleted by the Antrea Agent.
- Not tested at scale yet, may be deployed in production next year.
- A gateway-based solution such as Submariner may not be suited performance-wise for big data applications like the ones Transwarp is supporting.
- May look at global Service discovery across clusters in the future.
Antrea Community Meeting 12/20/2021
- ServiceAccount selector in ACNP - issue #2927; slides
- A reserved label is added to each Pod internally in the Antrea Controller, where the label value is set to the name of the ServiceAccount used for that Pod.
- The ServiceAccount selector will be exclusive with other peer types (fqdn, group, etc).
- It should be possible for the user to only select ServiceAccounts in specific Namespaces, so it should be possible to simultaneously set the ServiceAccount selector and the Namespace selector (see the sketch after these notes).
- Grayson tried using an invalid K8s label to avoid potential (unlikely) conflicts with user-defined labels, but with our current implementation, the internal label still goes through apimachinery format validation.
- It is not possible for a malicious user to override the internal label and bypass correct NetworkPolicy enforcement.
- Refactor of the Antrea OVS pipeline: "Flexible Pipeline" - issue #2693; slides
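A minimal sketch of the ServiceAccount selector as a policy peer; all names are hypothetical.

```yaml
apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: allow-from-backend-sa     # hypothetical
spec:
  priority: 8
  tier: application
  appliedTo:
    - podSelector:
        matchLabels:
          app: db                 # hypothetical selector
  ingress:
    - action: Allow
      from:
        - serviceAccount:         # matches Pods running as this ServiceAccount
            name: backend-sa      # hypothetical ServiceAccount
            namespace: prod
```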
Antrea Community Meeting 12/06/2021
- Strategy for secondary Pod network interfaces
- Originally, we were considering supporting secondary network interfaces "natively" in Antrea, with no need for Multus (Antrea as a Multus replacement).
- Currently, thanks to contributions from Intel team, Antrea is able to manage secondary interfaces of type SR-IOV. In the future we can consider supporting different interface types, such as VLAN, OVS, OVS DPDK.
- Feedback from users is that they are familiar with existing CNI plugins (macvlan, ipvlan, etc.) and want to keep using them.
- We see 2 options:
- duplicate Multus functionality, with support for invoking third-party CNI plugin binaries
- support working with Multus, with 2 modes: in one mode Antrea has limited support for secondary network interfaces and Multus is not required, in another mode Antrea can be invoked by Multus for both the primary network interfaces and to create some secondary network interfaces (e.g. OVS DPDK).
- The Antrea user and developer community have previously expressed concerns around Multus overhead. It is however not clear whether the reference was to CPU/memory overhead, management overhead, or additional complexity due to managing the plugin chain.
- Developing a capability equivalent to Multus within Antrea:
- Antrea would probably need to maintain compatibility with Multus over time
- might benefit users that are familiar with Antrea but not with Multus
- to understand how such a solution would benefit the user community we probably need to better understand concerns about Multus overhead, and then figure out if they're better handled in an Antrea-specific implementation
- Also, it might be interesting to figure out which Antrea features can be supported when providing a secondary interface.
- currently, we do not support running Antrea as a secondary Multus plugin
- there are concerns about the viability of some Antrea features on secondary interfaces. AntreaIPAM for instance; also Antrea Network Policies will not work on interfaces of type SR-IOV.
- There is definite interest in providing IPAM support for secondary network interfaces (e.g. as an alternative to Whereabouts)
- Finally, it might be worth looking at industry and community adoption of Multus. Currently it seems that Multus is not supported in VMware distributions, is supported in Rancher, and is actually the default in OpenShift. This is something worth keeping in mind when planning our strategy.
- No other topic was discussed. We agreed to follow-up on concerns related to Multus overhead in the next community meeting.
Antrea Community Meeting 11/22/2021
- Proposal to add QoS feature to Egress - issue #2766; slides
- Current PoC enforces bandwidth limits at the source Nodes / Pods, not the Egress Nodes
- this is not what users may expect as the actual cumulative bandwidth for a given Egress could be much higher than the configured bandwidth
- dropping the traffic at the Egress Nodes instead could have an impact on application communications, notably UDP? Unclear why, need to investigate more
- dropping the traffic at the Egress Nodes could cause bandwidth at the source Nodes to be wasted
Antrea Community Meeting 11/08/2021
- Antrea v1.4 release status update
- ProxyAll feature included (Linux & Windows)
- AntreaIPAM ("flexible" IPAM): review still ongoing
- should be on track for code freeze on Wednesday
- AntreaIPAM:
- no validation yet for IPPool CRD (validating webhook); validation may not be ready in time for v1.4
- Quan added similar validation in the past for ExternalIPPool (used for Egress); maybe the code can be shared or copied over
- Alpha feature means we could still include it in the release, even without proper validation and potential issues with IPPool Update / Delete operations
- should we unify Antrea NodeIPAM and FlexibleIPAM under one feature gate and group their configuration parameters?
- all combinations are possible use cases: NodeIPAM only, FlexibleIPAM only, NodeIPAM + FlexibleIPAM, and of course none of them (when K8s NodeIPAM is used)
- need to be able to enable / disable NodeIPAM through a dedicated configuration parameter in addition to the feature gate (to avoid conflicts with the K8s NodeIPAMController)
- seems to make more sense to keep configuration parameters separate
- FlexibleIPAM is not a very good name, it's quite generic
- the FlexibleIPAM feature gate actually covers 2 "different" things: IPAM logic but also datapath modifications to enable proper routing of Pod traffic
- we may want to reuse the same FlexibleIPAM logic for other things (e.g. secondary networks), in which case splitting up IPAM functionality and datapath modifications would make sense
- Antrea virtual project office hours at KubeCon NA 2021
- good attendance (65 people total, about 25 people at any given time)
- good engagement with questions from the audience
Antrea Community Meeting 10/25/2021
- Jay Vyas is proposing to use the bi-weekly Antrea Office Hours time slot to have a livestream. The livestream would focus on Antrea usage, new features, K8s networking updates, etc. Jay has some experience presenting the TGIK livestream. Consensus is that this could be a good idea, given the low attendance of Office Hours. Jay would be driving the livestream and different guests (maintainers, contributors, user) could join him on camera.
- Antrea Virtual Project Office Hours at Kubecon: https://sched.co/nBua
- Daylight Saving Time will take effect soon, not planning to change the meeting time for now (5AM GMT+1)
- New version of the antrea.io website is now live
- source code for the website is in its own repo: https://github.com/antrea-io/website
- documentation still lives in the main Antrea repo (under docs/).
- website is automatically updated when documentation is updated in the `main` Antrea branch, or when a new version is released
- anyone can now open a PR to modify the website source code: fix formatting issues, add a blog post, etc.
Antrea Community Meeting 10/11/2021
- Wenying presented the "flexible OVS pipeline design" proposal - see issue #2693
- Intentions are to simplify feature development, and facilitate extensions / code sharing for other use cases (e.g. VM support)
- New code review manager was added to the antrea-io/antrea repo - see issue #2752
- Code reviews will be requested automatically based on PR "area" labels
- PR authors should label their PRs with the correct area labels, and label each other’s PRs
- Antrea CNCF metrics dashboard: https://antrea.devstats.cncf.io/d/8/dashboards?orgId=1&refresh=15m
- Next Community Meeting falls during KubeCon week: not planning to reschedule the meeting
Antrea Community Meeting 09/27/2021
- Multi-cluster support proposal - see issue #2770 and design doc
- We are using the MCS upstream APIs for `ServiceImport` and `ServiceExport` (KEP 1645), but the implementation will be Antrea-specific (see the sketch after these notes)
- We want to support export / import of other types of resources (e.g. NetworkPolicies) - hence we are introducing the `ResourceExport` and `ResourceImport` CRDs
- For stretched NetworkPolicies, will the leader cluster need to be aware of the inventory (e.g. Namespaces, Pods) of all member clusters?
- Will the leader cluster aggregate realization status and stats information?
- Progress on the ClusterNetworkPolicy proposal upstream.
- Upstream discussions to introduce a NetworkPolicy Status; main motivation at the moment is to be able to report which features are supported or not by the CNI (e.g. `EndPort`).
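For reference, exporting a Service with the upstream MCS API (KEP 1645) only requires creating a `ServiceExport` whose name and Namespace match the Service; the Service name below is hypothetical.

```yaml
# created in the member cluster that owns the Service
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: nginx          # must match the exported Service's name
  namespace: default
```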
Antrea Community Meeting 09/13/2021
- Release v1.3.0 is this week, we are on track to release on Wednesday
- The CNCF has worked with Docker to lift the rate-limiting restrictions for Antrea Docker images; since we have had recurring issues with the VMware distribution Harbor registry, we plan to switch back to Dockerhub as the default registry in Antrea YAML manifests
- for the foreseeable future, we will keep mirroring images to Harbor
- Abhishek and others will present the Antrea multi-cluster work at the next Community Meeting
- Update on upstream sig-network-policy-api work:
- there was an alternate proposal from RedHat for cluster-scope NetworkPolicies
- Abhishek is working on a document to compare the different proposals and will be looking to get feedback from the wider sig-network audience; everyone is welcome to take a look and comment
Antrea Community Meeting 08/30/2021
- Multicast design - see updated design doc
- Multicast group discovery implemented by sending IGMP membership report packets to the Antrea Agent (using the controller action)
- IGMPv3 is not fully supported as the client can subscribe to specific multicast sources, which adds complexity to the implementation; full support for IGMPv3 is a long-term goal
- Target release for initial support is Antrea v1.4
- AntreaProxyFull / kube-proxy removal
- Hongliang ran benchmarks to compare ipset-based implementation with TC-based implementation; baseline is "regular" AntreaProxy with kube-proxy
- Decision (informed by benchmarks) has been made to use ipset instead of TC
- simpler design and similar performance
- some edge cases (e.g. multi-interface) are more complicated with TC, along with reply traffic routing
- we see more variance in benchmark results with TC (not sure if this is relevant)
- CPU usage wasn’t measured for these benchmark tests (there should not be a significant difference), memory usage improves after removing kube-proxy
- Jianjun believes it may still be worth investigating an eBPF implementation, for some cases
- eBPF / XDP to bypass host network stack in some cases
- in case of multiple interfaces (which has been an issue with TC), it is possible to enable eBPF acceleration on single interface (if there is one with higher throughput requirements)
- New PR #2599 was created with ipset-based implementation, needs code review
- Target release is still 1.3
Antrea Community Meeting 08/16/2021
- Presentation of updated design for multicast support in Antrea - see slides
- A Pod annotation may be a better mechanism than a Pod selector in the proposed Multicast CRD to let Pods join multicast groups; this information is needed to support multicast in Encap mode, at least until we support "auto-discovery" by monitoring IGMP messages sent by the Pod
- Suggestion that we start with a more limited scope to avoid throw-away code, limiting ourselves to what we know for sure has been requested by users
- start with noEncap only; potentially no need for a CRD at all, or the CRD can at least be simpler
- Should external interfaces (to send / receive multicast traffic to / from the underlay) be provided by the CRD?
- maybe we can have a more static configuration; does it really need to be defined for each multicast group or could it be part of the antrea-agent config (similar to how we handle the "transport interface")
- NetworkPolicies should apply to multicast traffic as well; more investigation required for use cases and implementation
- cgo is used to program multicast routes in Linux
Antrea Community Meeting 08/02/2021
- Status update on kube-proxy removal on Linux with latest design - see slides
- there are 2 parts to the design:
- redirecting traffic from the host to OVS / AntreaProxy: OS-specific, uses TC on Linux
- the new OVS pipeline: can apply to both Linux and Windows to some extent
- some edge cases are not currently handled by the design: when externalTrafficPolicy is set to Local and the endpoint is in the host network
- we need to decide if we want to support these cases, and how
- one possibility is to set the pkt mark in OVS, and then use it to set the connection mark in iptables; reply packets will then have the connection mark
- design should work for IPv6 as well; but currently there are some problems because of upstream issues
- we need to change how the antrea-agent connects to the K8s apiserver (to retrieve Services & Endpoints) to handle the case where kube-proxy has been removed (we cannot use the ClusterIP anymore)
Antrea Community Meeting 07/19/2021
- Very short meeting because of US holiday
- Short discussion around issue #2344
- Can we use a network interface for the datapath (overlay & external traffic) which is not the one to which the "K8s Node IP" is assigned?
- May be relevant to Linux as well as Windows
- Short discussion around OpenShift Antrea support: the dependency on the K8s NodeIPAM controller requires a cumbersome hack during OpenShift installation
- Merging Kobi's PR will remove the need for this hack
Antrea Community Meeting 07/05/2021
- Multicast support in Antrea - proposal slides
- Even for control-plane based solution, multicast traffic may be broadcast to all Nodes when using noEncap mode (could depend on exact implementation)
- Most common-use case seems to be having a multicast source outside of the cluster with Pods as clients, or a multicast source inside the cluster with external clients
- we are also aware of use cases where source and client are Pods (encap mode)
- Consider the IGMPproxy daemon as a simpler alternative to other multicast routers such as mrouted
- IGMPproxy uses purely IGMP and acts as a proxy between the local Antrea OVS bridge and the underlay network
- In encap mode, with the control-plane based solution, we can avoid broadcasting multicast traffic to all Nodes, but this requires that the Pods which will join a multicast group be annotated by the user
- In encap mode, we may also be able to use mrouted in the overlay, by binding to antrea-gw0
- As an alternative to IGMPproxy: possible to use the socket API from the antrea-agent to join a multicast group in the underlay on behalf of Pod(s)? No dependency on third-party code in this case
- See Github issue #2251 for follow-up discussions
Antrea Community Meeting 06/21/2021
- Starting with Antrea v1.3, there will be a feature freeze period of one week before each release
- only bug fixes can be merged into the release branch, no new features
- let's see if this can help facilitate the release process (avoid having too many code reviews and merges at the last minute) and ensure that we can release on time
- see #2223
- Progress on Egress failover feature
- API change to support Egress IP pools and support binding each IP pool to specific Nodes (or all Nodes by default); see #2236
- discussion about making the NodeSelector field mandatory (empty selector means "select all") or optional (meaning of null?)
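A sketch of the proposed API shape: an `ExternalIPPool` restricted to a set of Nodes, referenced by an `Egress`; all names, ranges, and labels are hypothetical.

```yaml
apiVersion: crd.antrea.io/v1alpha2
kind: ExternalIPPool
metadata:
  name: egress-pool             # hypothetical
spec:
  ipRanges:
    - start: "10.10.0.50"
      end: "10.10.0.60"
  nodeSelector:                 # restricts which Nodes can be assigned these IPs
    matchLabels:
      network-role: egress      # hypothetical Node label
---
apiVersion: crd.antrea.io/v1alpha2
kind: Egress
metadata:
  name: egress-prod             # hypothetical
spec:
  appliedTo:
    podSelector:
      matchLabels:
        env: prod               # hypothetical selector
  externalIPPool: egress-pool   # an Egress IP is allocated from the pool and assigned to a selected Node
```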
Antrea Community Meeting 06/07/2021
- Design proposal for WireGuard support in Antrea, by Xu Liu
- Github issue #2204
- Design doc
- Better throughput than with IPsec, but higher latency
- With or without overlay?
- WireGuard already creates a UDP tunnel to route encrypted traffic between Nodes, which decreases the MTU. If we use WireGuard in addition to a Geneve / VXLAN tunnel, it will decrease the MTU further and potentially impact performance. Based on benchmarks it seems that performance is slightly impacted (what's the actual number?).
- What would be the advantage of using an overlay? Some features (Traceflow / Egress / ?) may be impacted when using a WireGuard tunnel. Using a Geneve overlay may help with supporting these features. For example, if using a WireGuard tunnel to send traffic to an Egress Node, it may be hard to map the tunnel to the Egress IP (whereas it is trivial with OVS, which lets us retrieve the tunnel destination IP and use it as the SNAT Egress IP). This requires further, per-feature investigation.
- We need feedback from enterprise users about WireGuard. This is a fairly recent technology (requires a recent Kernel version) and there is a latency cost.
Antrea Community Meeting 05/24/2021
- Antrea was accepted as a CNCF Sandbox project!!!
- repository needs to be transferred from the vmware-tanzu Github organization to the antrea-io Github organization
- VMware CLA will be removed from repository and replaced with the CNCF CLA or DCO
- see onboarding issue: https://github.com/cncf/toc/issues/650
- Transferring repository to the antrea-io Github organization; see #2154
- most disruptive item on the onboarding TODO list
- Github will forward HTTP requests; nothing should break in contributors' local dev environments (they may get a notification from `git` to update their remote)
- Use a vanity URL / import path for the Antrea Go module? see #2154
- many other cloud-native projects use them; no other CNI project?
- decision is to use a vanity import path: `antrea.io/antrea`
- Automating website generation
- quite a bit of manual work to do for each Antrea release at the moment
- not possible (or at least not straightforward) for contributors to submit PRs to modify the website source (source is hosted on a separate branch in the Antrea repository)
- 2 options:
- have the website source in the "main" branch of the Antrea Github repository: this can cause considerable growth in repository size over time, especially if binary artifacts need to be checked in
- now that we are going to have our own Github organization, we can have a dedicated repository for the website like many other projects: should we move the documentation to the new repo?
- AI(antoninbas): open Github issue with a proposal
Antrea Community Meeting 05/10/2021
- Presentation of the Egress automatic failover proposal by Quan: Github issue #2128, slides
- 2 approaches: K8s control plane or Hashicorp memberlist library (used by MetalLB)
- neither approach has a dependency on the Antrea Controller
- the approach based on the K8s control plane is likely to have much slower failure detection
- memberlist provides eventual consistency; it is possible to have an Egress IP assigned to 2 different Nodes for a small window of time, or traffic sent to the wrong Node, but all Nodes should converge pretty rapidly (convergence speed can be tuned using different config knobs, with a trade-off on the amount of traffic generated)
- what about having an active and a standby Node for each Egress?
- still need to detect failure and perform failover to the standby
- in a way with memberlist & consistent hashing, every Node is a possible standby
- IP assignment to the new Egress Node after a failure is pretty cheap / quick
- how to limit an Egress to a subset of Nodes?
- global configuration, per-Egress selector or both?
- restricting the pool of Nodes globally can limit the overhead of memberlist
- possible use case: Nodes are spread across different subnets, we want to tie an Egress IP to a specific subnet as traffic to that Egress IP will be routed to that subnet by the underlay
- introduce additional CRDs for Node pools / Egress IP pools?
- continue discussion on Github issue #2128
Antrea Community Meeting 04/26/2021
- Antrea v1.0 is out, yeah!
- At the next meeting, we can discuss the planned improvements to the Egress feature (SNAT IP selection)
- Some progress on the design of flexible IPAM, which can be discussed at the next Community meeting
- Antrea-native policies were broken for Windows Nodes pre v1.0 (see #2050); patch will be back-ported to v0.13
- Antrea operator has been certified for OpenShift, currently in the process of certifying Antrea itself
- one requirement for OpenShift is to support clusters running KubeVirt; according to Salvatore there should be no change required in Antrea but we need to pass the tests with KubeVirt
- still need to add the Antrea operator to the OperatorHub, for use in OpenShift or other K8s distributions; ongoing work
- connectivity issue between Agents and Controller at the moment, maybe an update issue with the antrea-ca ConfigMap
- Jianjun and Quan presented the Egress feature (Alpha) introduced in Antrea v1.0
- see current documentation
- why is egress traffic tunneled using the egress IP and not the K8s Node IP (which is used for the regular Pod overlay)?
- using the K8s Node IP would require more information to be disseminated to every Node / Agent
- also considered adding the egress IP as Geneve tunnel metadata and using the Node IP, but a) it is specific to Geneve, and b) it would require reducing the MTU again
- current solution works well to implement failover in the future (similar to what MetalLB does with VIPs): no centralized controller required, we can have distributed leader election for each egress IP
- more details about future improvements at the next meeting
- Progress on replacing iptables with another mechanism (e.g., TC)?
- Hongliang is evaluating TC as an alternative, and measuring the performance impact
- for ClusterIP traffic redirection from host to OVS (kube-proxy replacement), one idea is to compute the min CIDR for all known ClusterIPs and maintain a single route on the host to redirect traffic matching that CIDR; for NodePort we still need ipset or TC
- more information at the next meeting
- need to evaluate eBPF as well
Antrea Community Meeting 04/12/2021
- Presentation of the `reject` action for Antrea-native Network Policies - slides
- Could we implement the reject action in OVS directly instead of using the `controller` action and letting the Antrea Agent craft the reply packet?
- May already be possible to transform a TCP SYN packet into a TCP RST packet and send it back to the source Pod
- We could follow-up with the OVS team on whether this is possible / what changes would be needed
- "DDOS risk" if we do not implement packet-in rate limiting in OVS (using OpenFlow meters objects)
- We will document this in the release notes for v1.0
- Rate-limiting should be implemented by v1.1
- Not specific to the reject action, also affects Network Policy logging
- We could mitigate the issue in v1.0 by implementing rate-limiting directly in the Antrea Agent
- Could we implement the reject action in OVS directly instead of using the
Antrea Community Meeting 03/29/2021
- Review open-items regarding kube-proxy removal on Linux - slides
- Masquerading is necessary because the source IP is based on the routing decision before iptables mangle (and destination IP rewrite)
- ClusterIP with external endpoints (not part of Pod network): when load-balancing at the source OVS, the source IP is rewritten to ensure that the return traffic also goes through OVS and AntreaProxy
- Controller can rely on AntreaProxy provided by Agent to access K8s API, no need for new config parameter
- It seems that this design will not work in networkPolicyOnly mode, as it assumes that the host gateway interface is assigned an IP address; Hongliang to check and update the design as needed
- Kube-proxy removal on Windows
- parent Github issue: #1935
- need to install a route for each ClusterIP for now: is there an equivalent to Linux ipset on Windows to make this more efficient?
- could we assume a single "public" interface / IP for the Node to simplify the design?
- routes would no longer be needed, as ClusterIP traffic would follow the default route and enter the bridge through the br-int bridge port
- the current design is similar to the Linux one, so there is confidence that it "works"
- current implementation probably already makes this assumption (SNAT traffic always goes through uplink port)
- GKE is investigating using Antrea for Windows support - we should check with them whether it is a valid assumption for their use-case
- NodePort support in AntreaProxy on Windows Nodes requires a "userspace Antrea proxy" (different from userspace kube-proxy) for traffic originating from the local host or from a local Pod - see #1934
- the purpose is to steer packets with a local destination to the OVS bridge, but it seems there is no easy way to do that unlike on Linux; no WFP support?
- not a very common use case, should we just drop it to simplify the initial implementation?
- the userspace proxy would also support NodePort traffic coming through a different host interface (different from the OVS uplink)
Antrea Community Meeting 03/15/2021
- Kube-proxy removal on Linux - design slides
- currently the Antrea components depend on ClusterIP functionality to access the K8s API
- antrea-agent: needs to connect to the K8s apiserver directly (`kubeAPIServerOverride` configuration parameter); then AntreaProxy can install the necessary flows; then antrea-agent can connect to the Antrea Service (antrea-controller) using ClusterIP
- antrea-controller: 2 options - 1) introduce a new configuration parameter (like for antrea-agent) and connect to the K8s apiserver directly, or 2) wait for antrea-agent to finish initializing on the Node (including AntreaProxy), then rely on AntreaProxy to access the K8s API with the ClusterIP
- traffic from local host to ClusterIP: is explicitly masquerading the traffic necessary? the source IP determined by the route table should be sufficient - TBD
- is stopping kube-proxy a requirement, or will AntreaProxy take precedence?
- AntreaProxy rules should take precedence (we use `insert` to add iptables rules); a sketch follows this list
- need to confirm that this assumption holds even when kube-proxy is restarted (does kube-proxy use `insert` as well?)
- concern that we are relying too much on iptables rules to steer packets to OVS
- conflicts with our messaging (OVS as the only datapath)
- more discrepancies between Linux and Windows: can we do more in OVS?
- see detailed issue with open questions - #1931
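A minimal sketch of the `insert` behavior discussed above (the chain name is illustrative, not necessarily the one Antrea uses):

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Insert the jump to Antrea's chain at position 1 of the nat PREROUTING
	// chain (rather than appending it), so it is evaluated before kube-proxy's
	// rules. Chain name is illustrative; this is not the actual Antrea code.
	cmd := exec.Command("iptables", "-t", "nat", "-I", "PREROUTING", "1",
		"-j", "ANTREA-PREROUTING")
	if out, err := cmd.CombinedOutput(); err != nil {
		fmt.Printf("iptables failed: %v: %s\n", err, out)
	}
}
```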
- FlowAggregator presentation and demo - slides
- still working on FlowExporter / FlowAggregator support for dual-stack clusters
- there is support for both K8s NetworkPolicies and Antrea-native policies
- still working on support for deny policy actions in FlowExporter / FlowAggregator; what will the Kibana graphs look like in this case?
Antrea Community Meeting 03/01/2021
- Very quick meeting because of the Lunar New Year holiday and no agenda items
- Following releases went out last week: v0.13.0, v0.12.1, v0.11.2
- AI: add link to Slack channel at the end of troubleshooting document
- AI: investigate whether ELK license change affects Antrea flow visualization
Antrea Community Meeting 02/16/2021
- Antrea ARM support
- Arm64 hardware provided by OSU OSL; used for building and testing Antrea images for arm64 and arm/v7
- `antrea/antrea-ubuntu-march` is a Docker manifest with support for amd64, arm64, arm/v7; updated every 6 hours
- documentation PR under review
- plan is to replace the current "standard" image (`antrea/antrea-ubuntu`) with the march image - v0.14?
- currently scripts for building and testing ARM images are hosted in a private Github repo (because of a security issue when using self-hosted CI runners with public Github repos); long-term plan is to move them to the main Antrea repo
- all features are available on ARM
- ARM testbed uses k3s and all Nodes (including control-plane) are ARM machines
- Antrea generic operator (not just for OpenShift 4) - see doc
- interest in defining some common "library" methods in the Antrea repo, that can be leveraged by the operator (e.g. Antrea ConfigMap updates)
- could we leverage the operator on Windows to prepare the network environment?
- maybe, but still same problem that hostNetwork not supported on Windows
- other idea: environment could be configured through the antctl CLI using SSH
- Antrea cleanup by the operator not supported yet, maybe could also be a common function hosted in Antrea repo
- Re-schedule next community meeting because of US Holiday (President’s day) and Chinese New Year
- Salvatore will start a conversation on Slack
Antrea Community Meeting 02/01/2021
- Matt Fenwick working on a fuzzer to test NetworkPolicies, there seem to be some disparities across different CNI plugins
- Starting office hours every other Tuesday at 2pm PST (for 1 hour)
- Salvatore will update README with this information
- Salvatore will start a calendar so people can keep track of meetings across different time zones
- Service Function Chaining demo from Intel
- see earlier slides for detailed proposal
- default route is changed in each container to route user traffic to the next function; additional routes to all the "left" networks to route traffic back (see the sketch after this list)
- current plan is to handle network / chain CRs at the Antrea Controller (not implemented yet) and generate a Pod annotation with the configuration of secondary network interfaces, which will be consumed by the Agent; may be possible to do everything at the Agent as well
- for IPAM, considering several open-source options like whereabouts
- for the PoC / demo, everything is hardcoded (no CRD controller, no IPAM) and there is no tunneling support
- currently programming flows for secondary networks in the same bridge (br-int) as for the primary interface; defining secondary bridges is a more likely long-term solution (separation of concerns, DPDK support)
- if SFC CRs are handled by the Antrea Controller, it can take care of 1) creating virtual networks, 2) annotating Pods with the necessary information (for consumption by Agents)
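A sketch of the per-container route manipulation mentioned above, using the vishvananda/netlink library (the next-hop address is hypothetical; this is not the PoC code):

```go
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

func main() {
	// Point the container's default route at the next service function in the
	// chain. Must run inside the container's network namespace, with the
	// required privileges. The next-hop IP is hypothetical.
	nextHop := net.ParseIP("10.10.1.1")
	defaultRoute := &netlink.Route{
		Dst: nil, // a nil Dst means 0.0.0.0/0, i.e. the default route
		Gw:  nextHop,
	}
	if err := netlink.RouteReplace(defaultRoute); err != nil {
		log.Fatalf("failed to replace default route: %v", err)
	}
}
```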
- New issue on Windows (Pods cannot access Services backed by Node endpoints - in particular the `kubernetes` Service)
- https://github.com/vmware-tanzu/antrea/issues/1759
- Rui is working on it but may take some time
- Jay planning to improve upstream test coverage for Services in Windows
Antrea Community Meeting 01/19/2021
- Happy new year everyone!
- Next release (0.13) scheduled for mid-February: NodePort support in AntreaProxy, EndpointSlice support in AntreaProxy, NetworkPolicy improvements, IPv6 support for Traceflow, basic IPAM.
- Additional Windows enhancements:
- noEncap support (we have a temporary solution, long-term solution requires a fix in OVS)
- containerd support (workaround for 0.13)
- remove Hyper-V dependency: support dependent on upstream OVS changes (won't be available until H2 2021)
- Google doing a PoC of Antrea in GKE
- We should consider releasing Antrea v1.0 in the first half of this year (0.13 or 0.14 can be candidates)
- we are planning to rename APIs from `*.antrea.tanzu.vmware.com` to `*.antrea.io`: this should be done before a v1.0 release
- Need to open Github issues to discuss API renaming and the potential v1.0 release
- Additional office hours meeting: idea is to have a few Antrea "experts" attend each session to answer user questions and potentially discuss ideas early before taking them to the community meeting
- this idea seems more popular than having a rotating meeting time, which may create confusion
Antrea Community Meeting 01/04/2021
- Antrea retrospective for H2 2020, led by Steven Wong
- Still discussing alternating meeting times vs. adding an "office hours" meeting
Antrea Community Meeting 12/21/2020
- CNF and SFC (Service Function Chaining) support with Antrea - slides
- main components: multi-interface support, dynamic network and service chain definition using CRDs, SR-IOV
- dependency on global IPAM mechanism to allocate IP addresses to CNFs (Antrea Controller or third-party?)
- when defining a network CR, subnet is provided as part of the spec; subnet allocation could eventually be automated in the future
- Pod annotations are used to request dynamic interfaces; will look into using the same API as Multus
- overlay network is used across Nodes; any protocol (Geneve / VXLAN) will be supported; when using Geneve additional features can be supported (e.g. for telemetry)
- as traffic goes through the service chain, the IP header does not go through any rewrite
- initially, single SFC support for a given left provider network; in later phase multiple SFCs using a traffic classifier to steer flows to the correct SFC
- can a "nested" service function have multiple instances? how does load-balancing happen? MAC address based; design is WIP
- will the implementation use the same bridge (br-int) or a dedicated bridge? current POC adds flows to the same bridge; need to investigate potential benefits of using a separate bridge for dynamic networks
- how to reconcile with Glasnostic use case (all traffic on the default / primary Pod network is captured by a CNF)?
- not exactly CNF / SFC - "service insertion" for filtering, traffic shaping
- was originally using service mesh / Istio but Glasnostic thinks that should be addressed at L3 / CNI
- we could still define a service chain (with single service): classifier sends all Pod traffic to the chain, provider network is the default Pod network
- Next meeting (12/21/2020) agenda: 30 minute Antrea release retrospective driven by Steven Wong + 30 minute demo of CNF / SFC support
Antrea Community Meeting 12/07/2020
- Intel folks could not attend; CNF follow-up presentation deferred to the next meeting.
- Kobi provided intro on IPAM proposal to remove dependency on the k8s NodeIPAMController & potentially support more "advanced" use cases
- example use cases: support smaller CIDRs, per-Namespace CIDR allocation
- use case of per-Namespace CIDR allocation set aside for now as it would be a significant departure from the current design
- current focus: start from a smaller CIDR, and allocate new CIDRs to the Node as needed (see the sketch after this list)
- impact on the Antrea Agent design: current design assumes that the gateway is assigned a single IP; routing between local Pods (same Node) within different subnets instead of just L2 switching
- always have some buffer (available IPs) in each Agent in case of transient connectivity issues to the Controller (which will manage IPAM)
- Agents request / release CIDRs through the Controller; need to handle Agent / Node failures to reclaim IPs
- the per-Namespace subnet use case probably makes more sense with routable Pod support (for policies enforced by the underlying network)
- AI: Kobi to publish the proposal on Github
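A toy sketch of the incremental CIDR allocation idea (the /16 cluster CIDR and /24 per-Node subnet size are assumptions for illustration, not part of the proposal):

```go
package main

import (
	"fmt"
	"net"
)

// nthSubnet returns the i-th /24 inside the given /16 cluster CIDR. A toy
// illustration of handing out smaller per-Node CIDRs on demand.
func nthSubnet(clusterCIDR *net.IPNet, i int) *net.IPNet {
	base := clusterCIDR.IP.To4()
	ip := net.IPv4(base[0], base[1], byte(i), 0)
	return &net.IPNet{IP: ip, Mask: net.CIDRMask(24, 32)}
}

func main() {
	_, cluster, _ := net.ParseCIDR("10.10.0.0/16")
	for i := 0; i < 3; i++ {
		fmt.Println(nthSubnet(cluster, i)) // 10.10.0.0/24, 10.10.1.0/24, 10.10.2.0/24
	}
}
```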
- Marcus Schiesser (Glasnostic) provided an overview of how they’re leveraging Antrea
- they have a working CNF on raw socket with Antrea 0.11: a CNF Pod intercepts all traffic using OF flows and controls all traffic (policy enforcement, QoS)
- problem is performance: they’re interested in DPDK and will therefore seek how to contribute to the current proposal on Github
- they would like to see this running Kubernetes 1.15; Antrea now only supports 1.16 onwards (code should support 1.15, but YAML manifests target >= 1.16 to avoid deprecation warnings about older APIs in recent clusters)
- AI: Marcus to follow-up on Github
- Jay had questions about AntreaProxy & kube-proxy on Windows
- both need to run simultaneously
- rules are implemented on both but will never be hit twice by any traffic flow
- is there interest in replacing rancher/wins with something similar to the upstream csi-proxy, to improve security by reducing potential attack surface (only gives access to specific Windows functions on the host)?
- kube-proxy will no longer be required (on both Windows & Linux) starting with Antrea v0.13 or v0.14
- Some requests to consider having meetings at a more friendly time for the US east coast
- will have discussion on Slack
Antrea Community Meeting 11/23/2020
- NodePort implementation in AntreaProxy (kube-proxy replacement) - design doc
- targeted for v0.12
- 2 successive DNAT operations (on the host, to a link-local address; then in OVS, to the Service backend); any negative impact on latency?
- not tested with IPv6 yet, but should work fine
- can kube-proxy be removed / does it need to be removed? so far it has been tested with kube-proxy running simultaneously, need to test without kube-proxy; to remove kube-proxy we may need additional iptables rules for Node -> ClusterIP traffic
- should be easy to have the same support on Windows (maybe easier, since on Windows the physical interface is moved to the OVS bridge)
- should we install flows to handle Pod -> NodePort traffic (local Node's IP) in OVS directly?
- demo
- Docker rate limiting on pull is starting to impact CI jobs - VMware has provisioned a public Harbor Docker registry for VMware projects that we can use for Antrea CI
- not planning to update the Antrea manifests yet to use the new registry, something we should consider
- Steven Wong offering to lead a release retrospective at the 12/07 meeting
Antrea Community Meeting 11/09/2020
- Proposal for a baseline policy Tier to support enforcing Antrea-native policies with a lower priority than K8s policies - design doc
- If no policy in baseline tier, no change in behavior
- Does it only make sense to have "deny" rules in the baseline tier? It seems that the only reason to include "allow" rules in the baseline tier is if someone wants to allow traffic that is implicitly denied by a K8s NetworkPolicy ("isolated" Pod behavior), but that would essentially "break" K8s NetworkPolicies for app developers.
- Do we need to support more than one baseline tier? Probably not if we only have deny rules...
- A separate issue is we want to be able to express: "allow traffic within a namespace, deny cross-namespace traffic" as a baseline behavior
- An alternative is to have a cluster-wide configuration parameter (default-deny / default-allow / enable within namespace only); it seems that it may cover most use cases that we know of today, but the semantics are more limited than with a baseline tier
- Users will still not be able to create a Tier with a priority value > default tier, but they can create policies in the "static" baseline Tier
- Sorting NetworkPolicies based on Tier + Priority - Issue #1388
- We are planning to have a mutating webhook to auto-generate policy rule names, so the same webhook could also compute an "effective priority" numerical value that could be used for sorting policy resources with kubectl (see the sketch after this list)
- Maybe this is something that we should provide in antctl instead of going out of our way to provide this support in kubectl
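A sketch of what sorting by "effective priority" could look like (types and values are hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// Order policies by tier priority first, then by policy priority within the
// tier. Illustrative types only, not the actual Antrea API.
type policy struct {
	Name         string
	TierPriority int32   // lower value == higher precedence
	Priority     float64 // priority within the tier
}

func main() {
	policies := []policy{
		{"app-allow", 250, 5},
		{"secops-deny", 50, 1},
		{"secops-allow", 50, 0.5},
	}
	sort.Slice(policies, func(i, j int) bool {
		if policies[i].TierPriority != policies[j].TierPriority {
			return policies[i].TierPriority < policies[j].TierPriority
		}
		return policies[i].Priority < policies[j].Priority
	})
	fmt.Println(policies) // secops-allow, secops-deny, app-allow
}
```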
- Antrea operator demo - slides
- Do we have a way to clean-up Antrea: remove Antrea K8s resources but also undo all configuration changes made on the Nodes (e.g. iptables)? Maybe not the responsibility of an Openshift operator
- Each version of the operator supports a specific version of Antrea
- Antrea Openshift operator will be hosted in a separate Github repository, but are there some generic functions / building blocks that can be provided as part of the Antrea repo?
- We can clean-up the code and ensure the operator can be used on other platforms besides Openshift
Antrea Community Meeting 10/26/2020
- Requirements for NFV/CNF and how they could be addressed in Antrea
- requirements: multiple interfaces per Pod, dynamic virtual network creation, traffic steering / service function chaining (same Node or across Nodes)
- multi-interface support, with or without Multus? some claims were made that 1) Multus may consume too many resources for edge deployments and 2) some specific use cases may not be easy with Multus (we need more concrete data to assess this)
- VNF uses cases require multiple isolated networks to provide in-cluster connectivity (similar to Network Service Mesh?), with low-latency
- Multus is already "supported" by Antrea: we don't do anything special; Multus supports Antrea like any other CNI plugin; we have tested & documented Antrea + Multus
- single OVS bridge or different bridges for the different networks? No right or wrong way, multiple bridges may be easier to work with
- need to be able to enforce K8s NetworkPolicies on secondary network interfaces; open-question: how to apply NetworkPolicies to specific Pod interfaces (as opposed to only the primary interface or all interfaces)? maybe this limits us to Antrea-native policies
- do we need to consider integrating some subset of OVN functions?
- multi-interface support: start with Multus to accommodate CNF use case, CRD alternative in case of resource limitations
- AF_XDP / DPDK to provide high-performance datapath for CNFs; which one to prioritize? (better performance with DPDK, better portability with AF_XDP)
- AI: more information to be provided in the Github issue, with pros/cons for alternate solutions
- IPAM support in Antrea for EKS integration (Antrea as primary CNI for EKS)
- currently we use CNI chaining and Antrea just does NetworkPolicy enforcement; AWS VPC CNI limits the number of Pod IPs per Node, so there is a push for us to work as primary CNI in EKS
- EKS does not enable the NodeIPAM controller so we need an alternative solution
- current idea is to port NodeIPAM controller to Antrea (run it in the antrea-controller Pod) since one subnet per Node will work fine for EKS
- using NodeIPAM controller as library (https://github.com/vmware-tanzu/antrea/pull/1342) => 2 issues: 30% increase in binary size (60MB -> 90MB), messy Go dependency management (wasn't designed to be used as a library)
- consensus seems to be to replicate NodeIPAM controller functionality in Antrea Controller (we can copy the code) and we can evolve the code in the future to accommodate for advanced use cases
- can run as a separate binary in a separate container
- no persistent state required: when the controller starts, iterate through Nodes to see which subnets have been allocated (see the sketch after this list)
- we can test this functionality in EKS or in a regular "kubeadm" cluster (by disabling NodeIPAM controller)
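A sketch of the "no persistent state" startup step described above, using client-go (illustrative; not the actual implementation):

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rebuildAllocations lists Nodes and records which Pod CIDR each one already
// holds, so the allocator state can be reconstructed on startup without any
// persistent storage. Sketch only.
func rebuildAllocations(ctx context.Context, client kubernetes.Interface) (map[string]string, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	allocated := make(map[string]string)
	for _, node := range nodes.Items {
		if node.Spec.PodCIDR != "" { // a subnet was already assigned to this Node
			allocated[node.Name] = node.Spec.PodCIDR
		}
	}
	return allocated, nil
}
```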
- Update on upstream NetworkPolicy activities
- 3 user stories are promoted to KEPs: port ranges, namespace selection by name (as an alternative to label selection), cluster-scope network policies
- some push for FQDN-based NetworkPolicies (supported in Cilium) - we need to have an implementation plan for this in Antrea
Antrea Community Meeting 10/12/2020
- Support for `antctl query endpoint` "out-of-cluster"?
- see https://github.com/vmware-tanzu/antrea/issues/1137
- currently we use an HTTP handler in the Controller apiserver; using the command requires exec'ing into the Controller Pod; we want to be able to run the command anywhere by providing the correct kubeconfig
- requirements: handle HTTPS, use a versioned API
- multiple solutions: 1) use existing "controlplane" API and implement the functionality client-side, 2) introduce new API group (cannot use "ops" as it is used by CRDs), 3) introduce new API in "controlplane" API group
- do we need this to be in a new public API group (as opposed to "controlplane" which we usually consider internal) in case the API could be consumed by clients besides antctl (maybe third-party clients)
- is "controlplane" really internal, we now realize it may be used by other projects to integrate with the Antrea Controller?
- solution 3) seems like a good compromise; we can move the API to another group in the future if needed
- We need to auto-generate documentation for Antrea APIs to obtain a passing CII badge
- see https://github.com/vmware-tanzu/antrea/issues/989 and https://github.com/vmware-tanzu/antrea/issues/1287
- why is the CII badge important? general indicator of the health of an OSS project, something that the CNCF may look at, may be important to some users looking for a CNI
- want to generate docs for both CRDs and APIServices
- if anyone wants to recommend a tool, please comment in the issue
Antrea Community Meeting 09/28/2020
- Support for audit logging for Antrea Network Policies - presentation by Qiyue
- Does the rate limiter used in the agent drop packets or queue them if the rate limit is exceeded? the rate limiter should sit in front of the queue and drop packets (see the sketch after this list)
- Suggestions to improve logs:
- convert IP addresses to Pod reference (namespace/name); what would be the feasibility (maybe only for local Pods) & cost of doing this conversion?
- change log header: remove source information, use a more accurate timestamp (with ms value) & include time zone
- include more NetworkPolicy attributes: tier name
- include more packet attributes, e.g. packet size
- include more packet header information, e.g. transport port (may require additional parsing of the controller packet in the agent)
- Is there an estimate of the overhead of logging?
- Log format: should we aim for more compact logs, or include the name of each field (e.g. "SRC", "DEST")? Impact on performance?
- Should the logs be included in the support bundle? Not really useful for diagnosing issues with Antrea, plan is to not include them by default
- Plan to support syslog? Could be convenient for users, let's see what would be required exactly on the Antrea side
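A sketch of the rate-limit-then-queue ordering discussed above, using golang.org/x/time/rate (numbers are illustrative, not Antrea defaults):

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Drop packet-in events beyond 100/s (burst 200) *before* they reach the
	// logging queue, so the queue cannot grow unboundedly under load.
	limiter := rate.NewLimiter(rate.Limit(100), 200)
	queue := make(chan string, 1000)
	for i := 0; i < 5; i++ {
		if limiter.Allow() {
			queue <- fmt.Sprintf("packet %d", i) // enqueue for logging
		} // else: the event is dropped
		time.Sleep(time.Millisecond)
	}
	fmt.Println("queued:", len(queue))
}
```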
- API group changes
- https://github.com/vmware-tanzu/antrea/issues/1250
- Issue with Antrea v0.9.3: if cluster is large and agents are updated using a RollingUpdate, update from previous version can take a long time, during which NetworkPolicy enforcement will be disrupted
- Proposal is to serve both API versions in the next release to avoid this issue: as long as Controller is upgraded first, it will support both "old" and "new" Agents; when using the default YAML, Controller should be updated at the beginning (along with the first Agent)
- We need to document our API deprecation policy
- We have some very basic upgrade tests from N-2 (v0.7.2) and N-1 (v0.8.2) using Kind, but we may want to have more comprehensive tests using Jenkins (run the full conformance test suite)
- Moving forward, when resources need to be deleted during update (e.g. APIService), the Antrea Controller will take care of it
- Revisit short names for Antrea Network Policies for consistency
- Abhishek to create an issue to summarize the discussion ---> https://github.com/vmware-tanzu/antrea/issues/1253
Antrea Community Meeting 09/14/2020
- Enabling AntreaProxy by default in v0.10.0?
- Weiqiang ran some benchmarks with a large number of Services: with 2,000 services, the increase in memory consumption in the Agent is 50MB when enabling AntreaProxy
- Comparison with kube-proxy?
- AntreaProxy is required for all traffic modes other than encap, but still disabled by default for encap mode, which is a bit confusing
- We are still gathering data about performance at scale
- In v0.9, we support 5 static tiers for Antrea NetworkPolicies. In v0.10, we will introduce support for user-defined tiers. How do we handle the upgrade?
- Do we try to provide backward-compatibility by having the Antrea Controller create 5 CRDs corresponding to the existing 5 static tiers on startup? Do we create a single CRD for the default tier (currently called `Application`)? Or do we choose not to create any tier automatically on startup?
- Analogy with PriorityClass: K8s ships with 2 existing PriorityClasses (`system-cluster-critical` and `system-node-critical`). They cannot be deleted / modified.
- Negative performance implications? Even if we have 5 tiers created by default, that does not mean the OVS pipeline will have 5 extra tables (and thus extra lookups for each SYN packet): we use one table for the `Default` / `Application` tier, and another table for other tiers. Dedicated table for the `Default` tier because that's where we expect most policies to be created.
- As the community converges towards a standard multi-tier network policy model, it would be good to align the Antrea tiers to that model
- Having default tiers is good for documentation / examples, and should be helpful for most users
- Decision is to create the 5 tier CRDs by default: they all can be deleted / modified by the user except for the `Default` / `Application` tier CRD, which will be read-only.
- The default (lowest priority) tier is currently called `Application`. Do we want to change the name? Extra work if we want to rename it to something like `Default` while preserving backwards-compatibility.
- Tier priorities will be integer values (in the 0-255 range). "Static" tier CRDs will be assigned values in that range, with some gap for users to insert new tiers.
- Support for `EndpointSlices` in AntreaProxy
- `EndpointSlices` are enabled by default in K8s 1.19, will be GA in K8s 1.20
- Weiqiang will look into supporting them in AntreaProxy for v0.11
- https://github.com/vmware-tanzu/antrea/issues/1164
- What's the best way to let the community know about the next meeting (disrupted schedule because of Labor Day)? Salvatore to check if a calendar invite can be sent
Antrea Community Meeting 08/31/2020
- Next meeting will be on Monday 31st (US) to avoid a large gap because of Labor Day
- Update about upstream work around NetworkPolicy v2 (update by Cody)
- Effort started by Jay Vyas (VMware) about 2 months ago
- Temporary repository until we move to sig-network: https://github.com/jayunit100/network-policy-subproject
- Subproject agenda: https://docs.google.com/document/d/1AtWQy2fNa4qXRag9cCp5_HsefD7bxKe3ea2RPn8jnSs/edit#
- Meeting schedule: Mondays at 13:00 PST — https://zoom.us/j/96264742248
- Focus of the NetworkPolicy v2 discussions
- extend the NetworkPolicy API beyond just a developer-centric API localized to a single Namespace
- cluster administrator persona: ability to apply cluster-wide policies
- other personas: network administrator, compliance, ...
- Many user stories submitted, which have been weighted / prioritized based on multiple criteria (is it API-related / in-scope, does it improve usability, etc.)
- User stories and "KEP"-like document (with goals, non-goals, deliverables) can be found here: https://github.com/jayunit100/network-policy-subproject
- Cody is in the process of organizing user stories in that repository
- Questions:
- Is there a plan to facilitate migration from NetworkPolicy v1 to v2? New resources (e.g. CNP) do not have an equivalent in the v1 API. Whether a new, non-backwards compatible, API is needed for existing developer-centric policies has not been decided yet. Abhishek and others are reviewing user stories to determine which ones can be satisfied by extending the current API, and which ones require a v2 API. If a new v2 API is introduced for namespaced / developer-centric policies, plan is to include a tool in deliverables that would migrate from old API to new API.
- Can new user stories be submitted? Yes, open a new issue in the repository. A user story is strongly recommended for new proposals, or it's unlikely to be taken into consideration.
- API group renaming
- As new features / APIs are introduced, we would like to revisit existing API groups to make them "future-proof" and ensure logically-related APIs are grouped together
- Cannot use "internal" for Controller <-> Agent API because of Go restrictions
- top contenders are "impl" and "controlplane"; consensus seems to be for "controlplane"
- Questions:
- "ops", "clusterinformation", "system", "metrics" seem related: what's the rationale for keeping them separate? "metrics" is usually kept separate in K8s, APIService and CRDs cannot share an API group (for routing in the K8s apiserver)
- Antrea Openshift operator - design proposal presented by Danting
- Questions:
- Convention in Openshift is to have a dedicated Namespace for the operator, should we do this for Antrea or is there any reason to keep using `kube-system`? Danting is investigating if a dedicated Namespace can be used (may require code changes in Antrea?).
- Are we planning to use an operator-based installation outside of Openshift? Priority is Openshift right now as it is a requirement for certification; based on this experience we may consider using the operator for other types of clusters, e.g. bare-metal.
- Currently Antrea Pods have to be restarted when configuration changes, see issue #723.
Antrea Community Meeting 08/24/2020
- Antrea cloud support update by Rahul:
- Started with AKS engine support, either in CNI chaining mode (Azure CNI takes care of networking) or in regular encapsulation mode (without Azure CNI). In chaining mode, Pod gets an IP from the VNET: advantages in terms of performance and flow visibility. You can choose one or the other by modifying a JSON configuration file. See documentation.
- AKS support starting with v0.9.0, using CNI chaining. Currently AKS officially supports the Azure CNI, with Calico for NetworkPolicy enforcement. Starting with Antrea v0.9.0, you can deploy an AKS cluster (with Azure CNI) and deploy Antrea for NetworkPolicy enforcement out-of-band, after cluster creation.
- Documentation available in Antrea repo for using Antrea in AKS (managed service). As of now, you need to restart existing Pods after deploying Antrea, or NetworkPolicies won't take effect for these Pods.
- DIY cluster in Azure compute platform: encap mode should work out-of-the-box (no CI job yet); hybrid mode / no-encap mode will not work because Azure is a L3 cloud.
- Antrea has supported EKS with CNI chaining (NetworkPolicy mode) since Antrea v0.5.0.
- A common issue when doing CNI chaining with the cloud provider's CNI is existing Pods, which are deployed before the Antrea CNI is installed on every Node. Not specific to Antrea at all. Starting with Antrea v0.9.0, we added a DaemonSet that can take care of restarting the Pods without the user having to do it (EKS). This would be a problem for every CNI which is not an officially supported option (e.g. Cilium). This is why we are working with cloud providers to make Antrea an official CNI (they will then guarantee that no workload Pods are deployed until Antrea is ready).
- Before v0.9.0, users had to provide the Node MTU and the Service CIDR. We removed this requirement with automatic MTU discovery and by enabling Antrea Proxy for all non-encapsulated modes (EKS, GKE, AKS).
- We have bi-daily CI jobs for all cloud-managed services we support. See https://github.com/vmware-tanzu/antrea/tree/master/ci/jenkins. We run K8s conformance tests for networking, along with NetworkPolicy tests.
- Antrea requires NodeIPAM to be enabled in the K8s control plane. EKS does not enable NodeIPAM, so for Antrea to be a supported option for the "primary" CNI in EKS, we need to support our own IPAM mechanism (part of roadmap, scoped for v0.11.0?).
- DIY cluster in AWS EC2 is supported, including hybrid mode (for a VPC with multiple AZs, traffic would not be encapsulated within an AZ, but it would be across AZs). Note that hybrid mode affects support for Traceflow.
- Amazon is working on their next-gen VPC CNI implementation, which should include a NetworkPolicy controller and the ability to program security groups using CRDs. Need to see how it impacts our roadmap. We think Antrea will still provide some advantages in terms of performance (Antrea Proxy) and possibly federation.
- GKE is only supported when using Ubuntu Nodes; VPC-native can be enabled or disabled. For GKE, we do not use CNI chaining, Antrea operates as the primary CNI, and in no-encap mode. GKE allocates a Pod CIDR per Node, and this information is made available in the Node spec.
- At the next Community Meeting, we will discuss the upstream effort around NetworkPolicy v2.
Antrea Community Meeting 08/10/2020
- "antctl query", a command-line tool to inspect Network Policies, presented by Jake Sokol - #974
- For a given endpoint (at the moment endpoint == Pod), "antctl query" will show the full list of policies that "select" the Pod (either the policy is applied to the Pod or the Pod is selected by ingress / egress rules).
- It is implemented with an HTTP API: antctl queries the controller which computes the response using the state it stores. A new "Pod reference" index was needed for the controller in-memory stores, so that we can perform lookups using the Pod Namespace+Name (a sketch of the index idea follows this list); small memory overhead (20%?) incurred because of the index (see numbers in the #974 issue).
- Can this tool or its design be leveraged by other CNIs?
- The tool leverages state that is already computed and stored by the Antrea Controller, which made the implementation quite simple
- A generic tool could have been built for K8s Network Policies, the tool would be simpler than the Antrea Controller as a whole, but more complex than the work we did on top of the Controller to support this feature
- "antctl query" should also support Antrea-specific policies (CNPs, ANPs)
- In theory, we could have a Controller "analyzer" mode, where one can run the Antrea Controller + antctl, without deploying the Agent DaemonSet. Then "antctl query" could work even in a cluster where Antrea is not the CNI.
- Jake will open the corresponding PR within the next couple of days.
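A toy sketch of the "Pod reference" index idea (types are hypothetical; the real store is more involved):

```go
package main

import "fmt"

// A secondary map keyed by "namespace/name" pointing at the policies that
// select the Pod, so lookups do not require scanning every policy.
type policyStore struct {
	policies map[string][]string // policy name -> selected Pod refs
	byPodRef map[string][]string // "namespace/name" -> policy names
}

func (s *policyStore) policiesForPod(namespace, name string) []string {
	return s.byPodRef[namespace+"/"+name]
}

func main() {
	s := &policyStore{
		policies: map[string][]string{"allow-web": {"default/web-0"}},
		byPodRef: map[string][]string{"default/web-0": {"allow-web"}},
	}
	fmt.Println(s.policiesForPod("default", "web-0")) // [allow-web]
}
```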
- Support for Network Policy metrics, presented by Quan - #985
- Not using the CRD status field (original plan) for scalability reasons.
- Agent reports incremental updates to the stats; if no change we can omit sending an update for this policy.
- With the current proposal (incremental updates), antrea-agent / antrea-ovs container restarts will lead to some gaps when traffic is not taken into account.
- Performance evaluation of Antrea Proxy, presented by Quan - results
- Antrea Proxy performs consistently better than Antrea + kube-proxy (iptables), but the difference is more significant for the intra-Node traffic case
- kube-proxy (iptables) performance degrades with the number of Services (linear search), may not be the case with IPVS
- Enabling Antrea Proxy by default on Linux (v0.9.0?)
- VMware plans on running some scale tests with Antrea Proxy enabled, will update community with results
- If there is any feedback (good or bad) regarding the Antrea Proxy feature, please share it with us so we can make an informed decision about making Antrea Proxy the default in Antrea v0.9.0
Antrea Community Meeting 07/27/2020
- IPv6 dual-stack support - see design doc
- Current proposal only discusses dual-stack case, but it should be possible to use IPv6-only as well
- Current proposal only covers "encap"-mode, routing may be difficult in "no-encap" mode, Wenying will look into this
- Is Node IPAM common among other CNIs for IPv6? Not sure, but Node IPAM will work fine with IPv6. Static addresses still seem to be preferred for IPv6.
- For NetworkPolicy implementation for IPv6, no changes should be required on the Controller side
- Need to make sure that new NetworkPolicy CRDs (e.g. ClusterNetworkPolicy) are not affected, or make the necessary changes if needed
- Some changes required for Service Load Balancing support in OVS, but not too much
- One challenge is no-SNAT support. Is there any value in using no-SNAT along with "encap"-mode? Unlikely, need to look into this
- Impact on encapsulation protocols of IPv6 support? For dual-stack, we need to support encapsulating both v4 and v6 traffic.
- Support for fully routed topologies: is it ever Antrea's responsibility to program the underlay or advertise routes?
- Routing differences between v4 and v6: can SLAAC be leveraged for IPv6? With SLAAC the endpoint auto-configures its address but this is not a way to program routes in the network fabric
- The only changes needed in Antrea with static IPv6 address are support for Neighbor Discovery and link-local address support in spoof-guard; for container solutions it seems that IPAM doesn't leverage SLAAC, static address allocation is used
- A related issue is that in K8s the Node internal IP is either an IPv4 or an IPv6 address
- https://github.com/kubernetes/kubernetes/issues/91940
- dual-stack Nodes is a requirement in K8s for dual stack Pod networking
- having single stack Nodes can break some things (e.g. hostNetwork Pods in Services?)
- Quick discussion about the CI pipeline for IPv6
- Kind supports IPv6
- unsure whether VMC (vSphere on AWS, used for CI) supports IPv6, need to investigate; if no support, may look into using a different public cloud solution
- private lab (fallback solution + Windows) does not support IPv6 in the underlay
- IPFIX flow export demo - watch the recording!
- Does not resolve IP (to Pod name) for Pods scheduled on remote Nodes for now, will be done in a central flow aggregator
- Have to provide configuration file to configure dashboards, they are being checked-in into Antrea repo
- We can add Pod labels in IPFIX records as well
- Can we add extra metadata about network policies (e.g. "unprotected" flow vs explicitly-allowed flow?). We should be able to do that, will have more information when more progress is made with the implementation; plan is to use conntrack label to include network policy information in the IPFIX records (no label means unprotected flow?)
- Selectively exporting flows (e.g. from a specific namespace)? We can filter at the collector, should we support filtering at the agent as well? Could be used to reduce the amount of traffic to the collector.
- could be configured using some CRD API
Antrea Community Meeting 07/13/2020
- Release update
- 3 major features coming in this release
- Default tunnel type update from VXLAN to Geneve: impacts users in case of upgrade, no change in overlay MTU
- Default gw name change, no impact on Nodes in existing clusters
- NodePortLocal proposal presentation by Sudipta Biswas:
- Main goal is to allow external Load Balancers to provide connectivity to Pods, bypassing limitations of NodePort and kube-proxy
- No Pod annotations required, Antrea Agent publishes a CRD that can be consumed by the external Load Balancer (includes Pod to host port mapping; see the sketch after this list)
- How does it relate to externalTrafficPolicy=Local for NodePort Services? That approach still depends on kube-proxy and still has some limitations. Sudipta will get back to us and provide a detailed comparison between the 2.
- Sudipta will create a “proposal” issue in the Antrea repository as a next step.
- A performance comparison with a traditional external Load Balancer using NodePort may be useful, but probably time-consuming
- Session affinity cannot be achieved with NodePort Services, because the external Load Balancer may use a Virtual IP (VIP) as the source IP. The same VIP will be used by many clients, so Session affinity will cause poor load-balancing. With NodePortLocal, the Load Balancer is in charge of controlling Session affinity.
- What’s the plan for testing this as part of the Antrea e2e test suite? Sudipta to come up with a plan.
- Would be good to include diagrams in the design doc to show the different traffic paths for NodePortLocal vs "traditional" use case(s).
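For illustration, a hypothetical shape for the per-Pod mapping the Agent could publish (field names are invented; the proposal was still under discussion at this point):

```go
package main

import "fmt"

// NodePortLocalEntry is a hypothetical record an external load balancer could
// consume: for each backend Pod, the host port to target on its Node.
type NodePortLocalEntry struct {
	PodName  string
	PodIP    string
	PodPort  int
	NodePort int // host port the load balancer targets on this Node
}

func main() {
	e := NodePortLocalEntry{PodName: "web-0", PodIP: "10.10.1.5", PodPort: 8080, NodePort: 40001}
	fmt.Printf("%+v\n", e)
}
```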
- Documentation for Antrea ("Antrea the hard way")
- A detailed document about how Antrea works and how it uses OVS, which can be consumed when trying to troubleshoot Antrea operations.
- Architecture document and OVS pipeline document are good places to start
- https://github.com/vmware-tanzu/antrea/issues/883
- Cody will look at some options, and especially with respect to how Antrea compares to "routed" CNIs like Calico
- NetworkPolicy v2: Cody, Jay, Abhishek will give an update at the next meeting regarding the discussions happening in sig-network; the Antrea community could provide useful feedback
- Zoom protection: maintainers and Cody will review options
The meeting was "Zoom-bombed" and as a result we had to edit-out 2 minutes of the footage to avoid uploading profanities. We apologize to the attendees and the presenter.
Antrea Community Meeting 06/29/2020
- First release retrospective (may not be a permanent link)
- Feedback around testing: flaky tests are making life harder for contributors
- Consider reducing our release cadence (4 weeks -> 6 weeks)
- We will have more retrospectives in the future (for each release?)
- Let’s add the most "popular" items to the agenda for future meetings
- Antrea ClusterNetworkPolicy Agent-side design by Yang
- See slides
- Why float values for priorities? Ensures that the user is always able to insert new rules between 2 existing rules, without updating existing priorities themselves (see the sketch after this list)
- Priority zones can have 130 priorities: thanks to these zones we have a more reasonable boundary on how many rules we have to shuffle when inserting a new one
- Easy to adjust the design in the future without impacting the user (antrea-agent restart when doing update will just re-organize flows)
- Typical scale according to Cody: 4-5 tiers; 65K rules should be sufficient but we may want to be able to balance between tiers / zones - as a reminder the limit applies at the Node level, not at the cluster level, so 65K may be more than enough
- When introducing RBAC, multiple rules spread across multiple namespaces may share the same priority values, so some priority zones may become too packed - size of priority zones may need to be dynamic
- Ability to mix and match Antrea-native Namespaced Network Policies with Cluster Network Policies? exact evaluation order yet to be determined
- Shuffling the flows (changing the priority) will not impact existing connections; we may want to ensure that OVS counters are preserved though, which may not be the case for the current implementation (Yang to verify)
- Until we have a UI for ordering policies, user will need to be aware of all existing priority values
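A toy illustration of the float-priority insertion property discussed above:

```go
package main

import "fmt"

// A rule can always be inserted between two existing rules by taking the
// midpoint of their priorities, without renumbering anything else.
func between(lo, hi float64) float64 {
	return lo + (hi-lo)/2
}

func main() {
	p1, p2 := 1.0, 2.0
	fmt.Println(between(p1, p2))              // 1.5 slots between the two rules
	fmt.Println(between(p1, between(p1, p2))) // 1.25, and so on
}
```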
- Next meeting
- AVI team may talk about their proposal for ingress NodePort policies
Antrea Community Meeting 06/15/2020
- Service Cluster IP access ("kube-proxy" functionality) implemented with OVS: presentation and demo by Weiqiang
- Generating ICMP host / port unreachable messages like kube-proxy with iptables for invalid Service IPs / ports? not implemented at this time but will look into it
- For NodePort support, we will still rely on kube-proxy for now
- Documentation status? there is an available Google doc, will share after the meeting; we should also update the OVS pipeline documentation with new flows and tables as it is a useful document for new contributors
- We use OpenFlow groups for endpoint selection (for now, equal weights for all endpoints but later we can support topology awareness, annotations for weight specification, ...)
- Flow-tracing presentation and demonstration by Ran
- Is it possible to map the rule ID in output (CRD status) to specific K8s NetworkPolicy? work in progress
- Support for Service ClusterIP traffic? it is dependent on ClusterIP implementation in OVS
- Traceflow requests install temporary flows in the OVS bridge.
- What happens when multiple traceflow requests are performed concurrently? We can have up to 15 traceflows running at the same time.
- Need to think about RBAC for traceflow and rate-limiting.
- Cody would like to set aside some time (10-15 minutes) in each community meeting to have an open forum for new users to ask any questions they may have about the project
Antrea Community Meeting 06/01/2020
- Update on ClusterNetworkPolicy proposal
- K8s Network Policies will be considered part of the lower-priority category ("default" category), but the user can also create Antrea Cluster Network Policies within that same category.
- Add "from" field to egress rules and "to" field to ingress rules.
- As a consequence, the ingress and egress section in the ClusterNetworkPolicy CRD definition will use the same struct type
- Not P0 feature
- Not really useful for "in-cluster" policies which are applied to Pods: makes more sense for policies applied to Nodes / external entities
- At first, CRD validation will ensure that the new fields are always empty, and will reject the object otherwise (before it gets to the Antrea controller).
- Ability to set AppliedTo on a rule basis (and override the policy field): probably a nice-to-have, but no specific use case in mind as of now, so could be added incrementally later
- Refer to design doc for implementation roadmap and feature priority
- Discussion on API subgroup naming for Antrea-specific policy CRDs
- All these CRDs could go under a "security" subgroup?
- Could have "monitoring" and "troubleshooting" subgroups for other CRDs
- Let's punt this discussion to a future meeting to give people time to think about this
- Flow exporter design
- Should flow exporter be part of the antrea-agent process / container?
- For simplicity's sake, can be run in a separate container in the future
- Is there a plan to support flow filtering so that we only export flows specified by the user (e.g. flows from certain Pod)?
- At the moment plan is to export information about all the flows going through the OVS bridge, but we can extend that in the future.
- What is the performance impact on the node?
- Plan is to run some benchmarks after we take a first stab at the implementation.
- Flow information access? How do we restrict it in the context of multi tenancy?
- Srikar will think about this angle and RBAC implications
- Why poll the conntrack module for the implementation?
- Provides visibility into reverse traffic (counters)
- Bad performance of OVS IPFix
- We can embed NetworkPolicy information
- Are there other flows that we are interested in that are not committed to conntrack?
- At the moment we commit all connections as part of Network Policy implementation.
- With the current proposal we are missing L7 information, maybe something we want to consider in the future.
- K8s context information will be added to the IPFix records (e.g. Pod information) so that the UI can display in terms of K8s objects.
- To limit the amount of traffic, plan is to poll conntrack flows every 5-10s, export every couple of minutes (see the sketch after this list).
- Flow logging in calico enterprise: see section labeled "Flow Logs with Workload Metadata" here; is there anything there that we wouldn't be able to support with this proposal? Ability to classify workloads?
- Flow information compression / aggregation: may be worth looking into this to avoid generating too much data
- Should it be done at the exporter / aggregator / collector?
- Sensitive to port scanning / SYN flood attacks?
- Only send information about established connections, unanswered SYNs can be exposed as a separate metric
- How does the UI scale with the size of the cluster and the number of connections?
- Needs to be benchmarked
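A sketch of the poll-vs-export cadence described above (intervals and helper functions are hypothetical):

```go
package sketch

import "time"

// run refreshes connection state from conntrack frequently, but sends records
// to the collector far less often. dumpConntrack and export are placeholders.
func run(dumpConntrack func() []string, export func([]string)) {
	pollTicker := time.NewTicker(5 * time.Second)
	exportTicker := time.NewTicker(2 * time.Minute)
	defer pollTicker.Stop()
	defer exportTicker.Stop()
	var conns []string
	for {
		select {
		case <-pollTicker.C: // cheap local refresh of connection stats
			conns = dumpConntrack()
		case <-exportTicker.C: // batched export to the IPFIX collector
			export(conns)
		}
	}
}
```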
Antrea Community Meeting 05/18/2020
- Moshe walked us through his hardware offload proposal
- Topology manager is used to ensure that VF net device and CPU are on the same NUMA
- OVS supports 2 offload mechanisms (DPDK rte_flow / TC flower): the proposal covers Kernel offload (using TC flower offload)
- Full offload model: either all actions can be offloaded, or it will be handled in software
- Recirculation should be ok: connection tracking + encap / decap
- Right now we can only test Pod-to-Pod traffic, since Antrea still relies on kube-proxy for Service traffic
- This approach should be applicable to other vendors as well. Other vendors besides Mellanox may be able to support full offload, including connection tracking.
- Goal on Mellanox side is to be able to offload 1M+ flows to hardware.
- Mellanox willing to help out with the CI by providing and hosting a testbed that we can integrate in our public CI infrastructure.
- Moshe will include documentation about the requirements (OVS, Multus, etc) and the necessary configuration steps as part of his PR
- You choose which Pods need to be accelerated in the Pod spec, you can have a mix of accelerated and non-accelerated Pods.
- Going to take a while to make sure that Pod-to-Pod traffic is fully supported: need to ensure there are no gaps in upstream OVS / Linux Kernel code (a few months needed?).
- Questions on NetworkPolicy proposal from last week
- Ability to sandwich K8s NetworkPolicies between Antrea Cluster/Namespaced NetworkPolicies: common enterprise use case. All K8s NetworkPolicies should be relegated to a Default tier (lowest priority) as one block. Need ability to define relative ordering between K8s NetworkPolicies and Antrea NetworkPolicies in that Default Tier.
- Need to abide by K8s isolated Pod behavior: if a K8s NetworkPolicy selects a namespace, and only allows egress TCP traffic from port 80, all other egress traffic from this namespace should be "denied", and there is no possibility to override that in a lower priority Antrea NetworkPolicy. These lower priority policies can only be used to deny more traffic.
- Can we unify `externalEntitySelector` and `podSelector`?
- Cloud-native metadata can be automatically translated to K8s labels (possibly namespaced to avoid clashes) by code importing inventory.
- The rationale for having both `externalEntitySelector` and `podSelector` was that when Antrea NetworkPolicies are only consumed in the context of K8s, we wanted the fields to be pretty much the same as for K8s NetworkPolicies. We could have a 3rd field `endpointSelector` to select across all endpoints (external and Pods)?
- Proposed Status field for NetworkPolicy CRDs: these are typically used to reflect current status and not to expose time series values. What is the value of including counters? Can’t we achieve the same thing with Prometheus metrics? Quan is still working on this proposal, we can review at a later time.
- AI(@abhiraut): schedule an extra meeting this week for further discussions on Antrea NetworkPolicies.
Antrea Community Meeting 05/04/2020
- Cluster-scope Network Policy proposal by Abhishek: https://docs.google.com/document/d/1l-1P5sNKzUo3Zxf5Qfl6oQCWY8TYPOTdIwe9mfppqLg/edit?usp=sharing
- Motivations:
- K8s Network Policies are namespace-scoped, so having a cluster-wide policy requires replication
- Upstream changes are slow, but eventually we would like to have a standardized API instead of relying on an Antrea-specific CRD
- No notion of policy tiering / priorities for policies that can be created by different roles
- Ability to select other kind of workloads besides Pods (e.g. Nodes, external entities such as VMs)
- Other open-source CNIs have the same kind of CRD, we have experience at VMware for NSX
- Does the idea of supporting service selectors conflict with service mesh policies (e.g. Istio)?
- This is still at layer 4 and is meant to complement K8s Network Policies
- How fast are Network Policies enforced?
- It applies to existing Pods but (at least in the case of Antrea) it only applies to new connections (existing connections are not affected because of how we use conntrack to skip checks for established connections). This is also the case for standard K8s Network Policies: just an API specification and there is no mandate on how they should be implemented, and as far as we know other CNIs (e.g. Calico) implement them in the same way (using conntrack).
- For port lists in rules, we could consider supporting port ranges, for convenience
- The document needs to clarify how rules between different categories (with different priorities) interact
- Add concrete use cases / user stories to document
- More detailed presentation later about Status (plan to expose byte / packet counters as CRD statuses)
- Motivations:
Antrea Community Meeting 04/20/2020
- Reschedule of the community meeting based on poll results:
- single meeting (no rotation), Monday 9PM PST - Tuesday 4AM GMT
- no conflict with K8s contributor calendar
- AI(Salvatore): update meeting time in README
- Using DDlog in the Antrea Controller for NetworkPolicy computation
- see Antonin's slides
- a few things that still need to be figured out:
- is NetworkPolicy computation the bottleneck, or is it actually the distribution to agents? if it's the latter, then some minor differences in computation time between the DDlog and native implementation are insignificant
- how much additional complexity do we expect in the future (e.g. with NetworkPolicy tiers)? - more complexity could justify a move to DDlog
- can DDlog help us support more features, such as connectivity queries?
- DDlog is used internally at VMware for some more complex projects
- additional optimizations can be done (e.g. in the Go <-> DDlog interface), but these represent a large engineering effort that is only justified if we commit to DDlog for Antrea
Antrea Community Meeting 04/08/2020
- v0.5.0 status update:
- Prometheus PRs still in-review, pushing it back to v0.6.0
- Antrea cleanup PR, no progress to report, pushing it back to v0.6.0
- Update to Go 1.14 - not important for release, keep it open for now as it is a good first issue
- v0.6.0
- Windows support: still missing CI pipeline and installation process
- According to Cody, there are some more urgent features (SNAT, IPAM, policy tiering), which are hurting Antrea adoption and should be targeted for the June time frame
- Issues need to be created for these features
- Missing stability features: "support-bundle" and log collection (e.g. with syslog)
- Relying on container logs is not ideal (no log persistence when a container restarts)
- crash-diagnostics is meant to collect information in a cluster for troubleshooting, it's not a log streaming / collection system
- Antrea core code should be agnostic to the log collection system (syslog, fluentd, ...), but we should have a reference integration with some popular open-source stack like EFK (ElasticSearch + Fluentd + Kibana), e.g. in the form of a reference operator, with documentation, configuration, etc
- Support bundle: what the user intended (configuration) + Antrea state snapshot + all logs available
- Action items for v0.6.0? reference integration with EFK for log collection / analysis + prototype support-bundle with antctl (crash-diagnostics may be out of the picture because of SSH access requirement to each Node) - for both items, more detailed PRD is required.
- Antrea support on ARM architectures
- See Antonin's slides
- K8s itself has issues on arm due to lack of testing
- many issues reported on slow arm devices (e.g. https://github.com/kubernetes/kubeadm/issues/1380), means it could be difficult to use emulation (qemu) for CI testing
- use x86 for control-plane node and arm only for workers (cannot use Kind cluster), which would be the typical use case
- According to Cody, this is available in Calico but not widely used
- According to Cody, we should try to tackle this for the end of the year, but not a priority for the summer time frame
- continue the discussion on Github / Slack
- Re-scheduling Antrea community meeting
- conflict with Calico monthly community meeting
- more than half of the Antrea active contributors are based in China and they should be able to participate in the meetings
- => let's take it to the Slack channel
Antrea Community Meeting 03/25/2020
- Review open issues for 0.5.0 release
- #361: this needs to be resolved for 0.5.0, which will be aligned with K8s release 1.18 (March 25th); waiting to hear back from assignee
- Prometheus patches: code reviews are needed for patches #322, #325 and #446
- #494: Namespace deletion issue; Quan has a workaround for upgrading the YAML when API resources have changed
- we need to review apimachinery guidelines for future upgrades
- #312: publish antctl binaries as part of the 0.5.0 release; Antonin will review the available antctl commands and determine whether the binaries should be included in the release (depends on whether some useful commands can be run out of cluster).
- documentation needed for antctl (#337)!
- => all open issues on target for 0.5.0
- Public cloud update: EKS (AWS) support should be ready for 0.5.0
- Windows update: lots of progress in feature branch; able to run some e2e tests; need to setup CI; still using OVS CloudBase (no progress to report for upstream)
- Website update:
- website is all ready to go
- there was a lot of activity recently around VMware Tanzu, so Cody was waiting for the right time to do the launch to maximize impact
- also working on blog posts for performance + internal VMware Antrea deployment
- Antrea community meeting time slot currently conflicts with monthly Calico community meeting
- we are leaning towards changing the time from 9am to 10am PST, but can also consider another day of the week
- need to check for conflicts with other meetings in the K8s space
- no meeting next week
The hosts forgot to start the recording at the beginning of the meeting, so we only have a very short recording for this meeting.
Antrea Community Meeting 03/11/2020
- Jay presented the ongoing "netpol" (new Network Policy testing framework) work
- DSL to quickly and easily define new test cases (Network Policy definition and expected reachability matrix)
- Runs fast, tests are concise and easy to understand
- Long-term goal is to move everything upstream - how far should we go before then (in terms of framework / improvements / test cases)?
- Stop running it as a Job, run it as a Pod and exit 0/1 for success/failure
- Other upstream tests can benefit from this approach (network e2e tests, other e2e tests?)
- Jay will present at k8s sig-network meeting (03/05)
- Upstream feedback for the KEP:
- Similar project called illuminatio
- Make sure that the tests work the same no matter in which order "objects" are created (Pods, Network Policies, Labels, ...) - too many combinations to test them all but maybe we can isolate a few interesting scenarios
- Illuminatio implements some fuzz testing: test Network Policies present in the cluster by generating test cases
- There is a shell script (hack/netpol/test-kind.sh)
- Feature parity with upstream tests? Not yet (missing CIDR test and a few others), but about 80-90% of them; we also keep adding tests upstream which increases the gap :)
- Next steps: keep coming up with ideas and pushing them to hack/, or harden the current code?
- Solve the scale test problem before we harden the current code - we know this is a required case and we don't know how to solve it yet
- See https://github.com/vmware-tanzu/antrea/issues/464
- Add a new `area/` label to Github for netpol issues
- New issue for implementing kube-proxy in OVS: https://github.com/vmware-tanzu/antrea/issues/463
- Salvatore and Kobi have been thinking about this already - they will join forces with Quan and others
- Antrea community meeting will take place as planned next week (03/11)
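As a companion to the netpol DSL discussion above, here is a minimal, self-contained Go sketch of what declaring a test case with an expected reachability matrix could look like. All names (`Pod`, `Reachability`, `Expect`) are hypothetical illustrations, not the actual framework API.

```go
package main

import "fmt"

// Hypothetical, simplified model of a "netpol"-style test case: the
// scenario is described declaratively, together with the full expected
// reachability matrix, and a harness probes every pod pair and diffs
// observed vs expected connectivity.
type Pod string

type Reachability struct {
	pods     []Pod
	expected map[Pod]map[Pod]bool
}

// NewReachability initializes the matrix with a default verdict.
func NewReachability(pods []Pod, defaultAllowed bool) *Reachability {
	r := &Reachability{pods: pods, expected: make(map[Pod]map[Pod]bool)}
	for _, from := range pods {
		r.expected[from] = make(map[Pod]bool)
		for _, to := range pods {
			r.expected[from][to] = defaultAllowed
		}
	}
	return r
}

// Expect overrides the verdict for a single (from, to) pair.
func (r *Reachability) Expect(from, to Pod, allowed bool) {
	r.expected[from][to] = allowed
}

func main() {
	pods := []Pod{"x/a", "x/b", "y/a"}
	// Scenario: a policy on x/a denies ingress from other Namespaces.
	r := NewReachability(pods, true)
	r.Expect("y/a", "x/a", false)
	// A real harness would create the pods and the policy, probe every
	// pair, and report any mismatch against the matrix below.
	for _, from := range pods {
		for _, to := range pods {
			fmt.Printf("%s -> %s allowed=%v\n", from, to, r.expected[from][to])
		}
	}
}
```

The appeal of this style is that a full scenario (policy plus complete truth table) fits in a few concise, readable lines, which is what makes the tests fast to write and easy to review.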
Antrea Community Meeting 03/04/2020
- NoEncap support merged in for 0.4.0, investigating support for managed K8s services of public clouds
- Prometheus patch is still a work in progress; as a community we need to define which metrics are important
- Is it at all possible to display some Prometheus metrics in Octant? Octant is probably not suited to displaying time series.
- IPsec regression in 0.4.0
- We suspect that IPsec broke after moving from port-based to flow-based tunnels
- Not captured in CI because of improper cleanup: the agent does not handle tunnel type changes correctly and traffic goes over the unencrypted tunnel
- Jianjun will work on a fix and we can consider a bug fix release
- Review of open issues for 0.5.0
- Let’s try to start using the `lifecycle/active` label when an issue is actively being worked on
- Windows support unlikely to be ready for 0.5.0
- Cody to create an epic to track individual sub-tasks for Windows support, so that we can have a timeline
- Antrea support on Windows depends on OVS CloudBase changes (https://github.com/cloudbase/ovs)
- Should we push for these changes to be merged upstream?
- Don’t want to hinder our ability to move forward with Windows support
- At Kubecon we will have the opportunity to demonstrate Antrea at the VMware booth: send demo ideas to Cody
- Network Policy upstream testing initiative: Jay has an open PR that needs review, question is where to host it until it gets accepted upstream
- Antrea lightning talk at Rejekts in Amsterdam: https://cfp.cloud-native.rejekts.io/cloud-native-rejekts-eu-2020/talk/QQZY3D/
- There will be a community meeting next week so that Jay can present the upstream Network Policy work
Antrea Community Meeting 02/26/2020
- Review open issues for 0.4.0 milestone
- #253: ongoing effort to move testbeds to VMC (VMware on AWS) so that the Jenkins UI is publicly accessible
- Named port still on track for release (with all community tests passing)
- #323: @weiqiangt has a fix ready, will be included in release
- #355: fix ready but @antoninbas is investigating why new e2e test (to test the feature) is failing in CI (it is passing on local cluster)
- Prometheus: no new progress
- Cody made some progress on license file generation tool - should be able to open a PR this week
- #347: documentation updates for issue / PR workflow and labels almost ready to merge
- Website has been merged into a branch; some final adjustments are ongoing before it makes it into the main branch
- Compatibility version matrix: some ongoing work
- Several small "good first issues" have been opened to try to attract external contributors
- Multus integration: use Antrea only for primary IP or for secondary IPs as well? #368 needs more information from submitter.
- #374: STT kernel module is not part of the upstream kernel (OVS needs to be built from source)
- does STT really provide a performance benefit?
- if not, maybe we should just drop STT "support"; if it does, then we should update the documentation with instructions on how to enable STT
- #379: ongoing process to improve the upstream network policy tests
- 2 open PRs for Windows support
- CI system does not include a Windows K8s testbed at the moment
- let's use a "windows" feature branch and submit patches against it; merge the feature branch into the main branch once it is complete and we have the ability to test the code in CI
- Moving to a bi-weekly cadence for meetings; will plan 0.5.0 at the next meeting
Antrea Community Meeting 02/12/2020
- Review Cody’s draft for request for Antrea to be included as CNCF sandbox project
- Inspired by Harbor’s proposal
- We can also look at Contour’s proposal, which was presented at the last CNCF SIG Network meeting
- Salvatore has a concern that the document uses terminology specific to Antrea without defining it - Cody plans to add a pointer to a more detailed ROADMAP.md
- Review open issues
- #345: Cody will point to an example -> an Open Source License file will be required by some orgs consuming the project and by CNCF
- Antrea cleanup: Antonin will address Jianjun's feedback
- Prometheus: no new progress
- current changes focus on enabling Antrea to report metrics to Prometheus; we don't yet have a comprehensive list of the metrics we want to expose; Cody can provide feedback
- Publicly-accessible log servers for Jenkins CI
- if we get accepted into CNCF, we can request some CI resources and move to public cloud
- VMware IT request still pending to expose a public log server for current Jenkins testbed
- "No-encap" support: high probability that it will be 0.4.0
- Come up with a compatibility matrix that we can publish with each release: K8s versions, OSes, cloud providers, ...; Cody will come up with a proposal and this should be published starting with 0.4.0
- Bug scrub
- Antrea website proof is ready, link will be posted on Antrea Slack channel for feedback
- Cody will open some issues requesting additional documentation for some specific deployment modes; would like performance numbers to be available as well
Antrea Community Meeting 01/29/2020
- Proposed modifications to the development process: https://github.com/McCodeman/antrea/tree/project-management/docs/dev-process
- Lifecycle of issues / PRs, new labels for the Antrea repository, ...
- Motivation:
- fairly young project but we want to grow fast in the upcoming year; there will be a lot of parallel work - this should give us more visibility into how the project is progressing
- more formality => better transparency and higher velocity
- Cody will be responsible for making sure that issues / PRs are correctly triaged / labeled
- Code freeze a week before release? not needed yet, will revisit in the future
- Do we want to formalize how people can submit proposals? We currently use Google Docs but nothing has been formalized - seems like an okay place to start and we can revisit later
- Review open issues for 0.3.0 release
- no useful antctl command can be run out-of-cluster and there is no user-facing documentation available: antctl binaries will not be shipped as part of the 0.3.0 release
- IPsec: Jianjun has an open PR (approved) to limit the tunnel type to GRE, and will open a new PR for documentation, but maybe after the release
- Prometheus: open PR ready for review - pushed out to 0.4.0
- Need to review licenses of Antrea dependencies: Cody will look into it
- Named Port support update: some open PRs, ongoing conformance testing
Antrea Community Meeting 01/22/2020
- Review open issues for 0.3.0 release
- leaning towards postponing Prometheus integration and NoEncap mode
- Kobi to update Prometheus issue with design doc / status update
- we tagged a few other bug fixes for 0.3.0 release
- any features graduated to Beta / GA?
- Octant support was improved by Mengdie since last release, and Tong provided some “third-party” feedback; let’s target Beta status for next release (v0.4.0)
- Salvatore’s proposal to have Antrea be a supported network plugin for OpenShift 4, which some people may ask for
- may want to have a conversation with Red Hat later about officially supporting Antrea in the open source code (like OVN)
- Yasen: what would be the differentiator for Antrea?
- Salvatore to open an issue to track this
- Ongoing work by Salvatore on dual-stack support
- large chunks of the IPAM code need to be updated
- also changes to CNI client, Network Policy code
- kube-proxy has to be in IPVS mode
- ongoing process, but slow
- Antrea vs other CNIs: Cody has some comparison data
- VMware has been running some scale tests for Antrea in the lab, results may be available publicly in the future
- Cody will present his Antrea project boards at the next community meeting (January 22nd)
Antrea Community Meeting 01/15/2020
- Walkthrough of NoEncapsulation proposal by Su (@suwang48404)
- Meeting next week (Jan 15) will take place as planned
- Cody will present project boards for Antrea and his proposal to streamline issue triage
- Review open issues for 0.3.0 release
Antrea Community Meeting 01/08/2020
- Review of 0.2.0 release issues
- network policy fixes have all been merged -> named port support still missing (not v0.2.0), currently ignoring named ports in Network Policy rules (i.e. traffic not allowed)
- CLI moved to next release (only command supported in current PR is "version")
- more tunnel types, stale CRDs removed (fix) -> no tracking issue, mention it in CHANGELOG
- promoting features: monitoring CRDs to Beta, Network Policy support stays in beta until all conformance tests pass, connectivity stays in beta (many small changes were made to the OVS pipeline since last release)
- target Thursday for the release (make sure we have run all conformance tests)
- Plan release 0.3.0
- CLI support with some useful commands
- No-encap mode: priority for some cloud providers (e.g. AKS) - need to talk about this offline (e.g. mailing list)
- IPsec support
- Any Prometheus integration?
- Delete all artifacts created by Antrea when it is deleted from the cluster
- Named port support?
- Tentative release date for 0.3.0 is Jan 22nd
- Prometheus integration
- architecture: have the controller be a central collector, or have each agent report metrics? (see the sketch at the end of these notes)
- try to keep it separate from the core Antrea code as much as possible
- use Prometheus for more "static" data and commit to supporting that for 0.3.0, investigate what to do for more dynamic data (use Prometheus, another collector, ...)
- possible steps: 1) define metrics we want to expose & investigate endpoint discovery, 2) define a framework to export highly dynamic metrics, 3) troubleshooting: how to integrate the rest of the work we are planning with Prometheus / other visibility and open-tracing tools
- Kobi (@ksamoray) to drive this
- Named port support for 0.3.0: when a named port corresponds to different port values for different pods, supporting it requires a significant amount of work in Antrea
- Integrating with cloud providers: maybe some changes required to Antrea core, need some additional work for each cloud provider
- In the process of getting a proposal for the website ready, will share with the team soon
- Postpone having versioned documentation as part of the Github directory structure until 1.0 release
- Next meeting after the holidays
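For context on the per-agent option discussed above, here is a minimal sketch of exposing a metric with Prometheus's Go client (`client_golang`). The metric name is a hypothetical illustration; the actual list of metrics was still an open question at the time of this meeting.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric, for illustration only.
var localPodCount = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "antrea_agent_local_pod_count",
	Help: "Number of Pods on the local Node managed by the Agent.",
})

func main() {
	prometheus.MustRegister(localPodCount)
	localPodCount.Set(12) // a real agent would update this from its Pod cache
	// Each component serves /metrics; Prometheus then discovers and scrapes
	// the endpoints (e.g., via Kubernetes service discovery).
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

In this model there is no central collector in Antrea itself: every component exposes its own endpoint and Prometheus does the aggregation, which keeps the metrics code well separated from the core Antrea code.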
Antrea Community Meeting 12/18/2019
- Walkthrough of the OVS pipeline by Antonin using the contents of PR #206
- some ongoing changes, PR will need to be updated
- PR #200 adds an ARP spoof check for the gw interface - even though we probably have other problems if an attacker is able to do that
- the policy tables (ingress & egress) use conntrack to accept all established connections regardless of current network policies
- this means that established connections cannot be broken by updating network policies. Is this the desired behavior? Policy updates can be "bypassed" by keeping a TCP connection open (see the sketch at the end of these notes).
- other CNIs have this same issue, but maybe we can do better by checking policies for every packet with no loss of performance thanks to OVS
- connection tracking ensures that "reverse" traffic for an authorized connection does not get dropped, regardless of the Pod's ingress policy rules; may be hard to remove the flow for established connections
- we probably do not need to add `ip` to all flows
- Status of v0.2.0 release: https://github.com/vmware-tanzu/antrea/milestone/1
- How do we simplify log collection for bug reports / support requests?
- let's define what we want to collect, then worry about how we can collect it automatically
- Antrea logs, kube-apiserver logs, kubelet logs, kube-proxy logs / config maps
- make sure we do not expose secret / sensitive information
- OVS logs, OVS flows, iptables rules
- which container runtime is used
- we can collaboratively build a list in issue #11 before next meeting
- How to separate user-facing documentation / dev-facing documentation? Which tools are we planning to use to structure the documentation?
- Cody will look into it
- Keep everything on Github, documentation should be versioned
- Read The Docs / Jekyll?
- Last meeting before the holidays will be next Wednesday (12/18) - finalize v0.2.0 release
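To make the conntrack discussion above more concrete, here is a purely illustrative Go sketch of why established connections bypass the policy tables. This is not Antrea's actual datapath (which is implemented as OVS flows, not Go); the types and names are hypothetical.

```go
package main

import "fmt"

// Illustrative model: the policy stage first consults connection-tracking
// state, and packets belonging to an already-established connection are
// accepted without re-evaluating Network Policy rules. This is why
// tightening a policy does not break connections that are already open.
type CtState int

const (
	CtNew CtState = iota
	CtEstablished
)

type Packet struct {
	Src, Dst string
	State    CtState
}

type Rule func(p Packet) bool

func ingressAllowed(p Packet, rules []Rule) bool {
	if p.State == CtEstablished {
		return true // conntrack shortcut: policy rules are skipped
	}
	for _, rule := range rules {
		if rule(p) {
			return true // the first packet must match a policy rule
		}
	}
	return false
}

func main() {
	allowFromB := func(p Packet) bool { return p.Src == "pod-b" }
	fresh := Packet{Src: "pod-c", Dst: "pod-a", State: CtNew}
	old := Packet{Src: "pod-c", Dst: "pod-a", State: CtEstablished}
	fmt.Println(ingressAllowed(fresh, []Rule{allowFromB})) // false
	fmt.Println(ingressAllowed(old, []Rule{allowFromB}))   // true, despite the rule
}
```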
Antrea Community Meeting 12/11/2019
- Objectives of the community meetings
- a mix between a developer meeting and a release management meeting
- discuss issues, review proposals and brainstorm new ideas
- make sure releases are correctly planned and on track (if not, re-assign issues appropriately)
- in the future, may become a "user meeting" as well to discuss users' needs and pain-points
- Architecture walkthrough by Jianjun: Antrea components & traffic walk
- see architecture document
- L2 broadcast traffic never leaves the Node, local OVS switch replies to ARP requests for remote gateways
- Upcoming documentation on OVS pipeline & network policy computation (with detailed examples)
- can do deep dives at the meeting when docs are available
- Release management
- we will use Github milestones to track releases and tag issues appropriately (Jira has no free plan)
- bug fix releases:
- we released v0.1.1 last week to fix Kind support on Linux
- no outstanding bugs urgently require a new bug fix release - network policy patches may be hard to cherry-pick into the release branch (conflicts)
- Release plan for v0.2.0
- we need to be able to run conformance tests (network e2e tests) and network policy tests
- adding support for "named port" for network policies is not trivial (code re-org required), so it should be a stretch goal for v0.2; some network policy tests will fail without it (see the sketch at the end of these notes)
- CLI framework with some basic debugging commands
- target date is December 18th
- Running conformance tests / network policy tests as part of CI
- Run the full suite to qualify releases; ideally should be automated
- Run a smaller subset for every PR - it seems that running the entire network policy test suite takes 1+ hour
- Review of open issues
- #119 AI: update documentation to state that old CNI must be deleted & Pods rescheduled when deploying Antrea
- Kind support is currently broken on macOS, no solution to fix it at the moment
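On the "named port" item above: the difficulty is that a named port must be resolved per Pod, since the same name can map to different numeric ports on different Pods. Here is a rough sketch of the per-Pod resolution step, using upstream Kubernetes API types; the helper itself is hypothetical, not Antrea's actual code.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// resolveNamedPort is a hypothetical helper: given a Pod and a port name
// from a NetworkPolicy rule, return the numeric container port for that
// specific Pod. The same name (e.g. "http") can resolve to different
// numbers on different Pods, so flows must be computed per Pod rather
// than once per rule.
func resolveNamedPort(pod *corev1.Pod, name string) (int32, bool) {
	for _, c := range pod.Spec.Containers {
		for _, p := range c.Ports {
			if p.Name == name {
				return p.ContainerPort, true
			}
		}
	}
	return 0, false
}

func main() {
	pod := &corev1.Pod{
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "web",
				Ports: []corev1.ContainerPort{{Name: "http", ContainerPort: 8080}},
			}},
		},
	}
	port, ok := resolveNamedPort(pod, "http")
	fmt.Println(port, ok) // 8080 true
}
```

This per-Pod fan-out is what makes the feature a significant amount of work: policy rules can no longer be translated to a single set of flows independent of the selected Pods.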