Make NetworkPolicy FQDN rule work with applications that cache DNS resolutions with hard-coded TTL #6229
Comments
@jianjuns @antoninbas @Dyanngg please let me know what you think about the proposal.
The proposal sounds good to me.
We have submitted this issue as a project idea for the LFX mentorship program: cncf/mentoring#1278. See https://docs.linuxfoundation.org/lfx/mentorship for more information on the program. Assuming our proposal is accepted, we will publish instructions for candidates here (as a new issue comment) with a link to a test task to be completed as part of the application. The test task helps us in several different ways: 1) it ensures that applicants have read this issue and are familiar with the goals, 2) it ensures that applicants have shown some interest in the project and have the basic skills required to build Antrea and contribute to it, and 3) it helps us (the mentors) with candidate selection, as we can look at the overall quality of submissions. Note that we do not expect the test task to take more than 1 or 2 hours, as our goal is not to impose a big burden on all applicants.
We hope that there will be a lot of interest in this issue and the mentorship program. However, we ask that candidate mentees do not comment on this issue just to express their interest in or desire to work on the issue. This can create a lot of noise. An upstream issue like this is primarily meant for technical discussion around the issue and the proposed solution. We want to keep the discussion thread relevant and easy to navigate for maintainers and contributors. Please post any questions about the LFX program and how to apply on the mentorship discussion forums.
We are pleased to announce that our project proposal has been accepted: https://github.com/cncf/mentoring/tree/main/programs/lfx-mentorship/2024/03-Sep-Nov#antrea As mentioned above, we have published a test task to help us select the right candidate: #6590 If you have any questions, please comment in the discussion or post publicly in the |
Signed-off-by: Hemant <hkbiet@gmail.com>
) For #6229 Pending the implementation of the minTTL feature, we add an e2e test to validate the behavior of FQDN policy rule enforcement when an application is caching DNS responses beyond the included TTL. As of now, the Antrea implementation only caches the IP obtained from the DNS response for the duration specified by the TTL. This means that if an application keeps using the same IP (because it was cached) beyond that TTL for a whitelisted FQDN, it will eventually fail. After we add support for the minTTL configuration parameter, the test will be updated to validate that the minTTL value is honored, and that until minTTL "expires", the application can keep using the same IP successfully. Signed-off-by: Hemant <hkbiet@gmail.com>
Describe what you are trying to solve
I was working with @Scoobed to debug an issue with a NetworkPolicy FQDN rule in a cluster where a Pod intermittently failed to connect to the FQDN. After realizing the application was Java-based, I found that in many cases the JVM enables a DNS cache that uses a configured TTL, as below, instead of respecting the TTL value in the DNS response.
How the problem typically happened:
@Scoobed also confirmed that the problem went away when using NodeLocal DNS, which is likely due to special handling in the buildpack: it disables the JVM DNS cache when it detects that the DNS server is a link-local address: https://github.com/paketo-buildpacks/libjvm/blob/79182aa17fa3e49424f511dd0070dd66bdc1a3ec/helper/link_local_dns.go#L34-L64
As this may affect many Java-based applications and not all clusters enable NodeLocal DNS, I have been thinking about how to better support this scenario without requiring all application developers to disable their DNS cache or to respect the TTL in DNS responses (which is even harder than the former). One solution I came up with is to provide a configuration like minTTL, which determines the minimum time for which DNS resolutions are cached. If a DNS response's TTL is less than minTTL, the actual TTL in the datapath will be minTTL. Note that the TTL cache is not per Pod, so minTTL will be a global configuration that applies to all Pods (I can't think of any actual drawback caused by this, other than slightly higher memory consumption). Even if different Pods have different hard-coded DNS cache TTLs, minTTL can simply be set to the maximum of them. Typically it could just be set to the default JVM DNS TTL or a larger value.
Note that this still requires the application DNS cache not to cache forever.
Describe how your solution impacts user flows
The cluster admin should configure minTTL to be equal to or larger than the maximum TTL value of the application DNS caches.
Alternative solutions that you considered
Require users to disable application-level DNS cache.
Test plan
e2e: validate applications with DNS cache can stably access the target FQDN while FQDN resolution frequently changes.