Make NetworkPolicy FQDN rule work with applications that cache DNS resolutions with hard-coded TTL #6229

Open
tnqn opened this issue Apr 16, 2024 · 5 comments
Assignees
Labels
kind/design Categorizes issue or PR as related to design. lfx-mentorship Issues which have been proposed for the LFX Mentorship program

Comments

@tnqn
Member

tnqn commented Apr 16, 2024

Describe what you are trying to solve

I was working with @Scoobed to debug an issue with a NetworkPolicy FQDN rule in a cluster where a Pod intermittently failed to connect to the FQDN. After realizing the application was Java-based, I found that in many cases the JVM enables a DNS cache that uses a configured TTL, described below, instead of respecting the TTL value in the DNS response.

networkaddress.cache.ttl

Specified in java.security to indicate the caching policy for successful name lookups from the name service. The value is specified as integer to indicate the number of seconds to cache the successful lookup.
A value of -1 indicates "cache forever". The default behavior is to cache forever when a security manager is installed, and to cache for an implementation specific period of time, when a security manager is not installed.

How the problem typically happened:

  1. The Pod made a DNS request for an FQDN.
  2. Antrea inspected the DNS response and associated the FQDN with the IPs in the response.
  3. The Pod connected to one of the IPs successfully because Antrea was aware of the IP.
  4. Antrea refreshed the FQDN resolution and found that the previous IPs were no longer present in the response, so it removed them once the TTL set in the previous response was reached.
  5. The Pod tried to connect to the FQDN again, but skipped querying the FQDN's IP because of its own cache (with a fixed TTL). The connection failed because the IP was no longer allowed by the datapath.

@Scoobed also confirmed that the problem went away when using NodeLocal DNS, which is likely due to special handling in the buildpack: it disables the JVM DNS cache when it detects that the DNS server is a link-local address: https://github.com/paketo-buildpacks/libjvm/blob/79182aa17fa3e49424f511dd0070dd66bdc1a3ec/helper/link_local_dns.go#L34-L64

As this may affect many Java-based applications and not all clusters enable NodeLocal DNS, I have been thinking about how to better support this scenario without requiring all application developers to disable their DNS cache or to respect the TTL in the DNS response (which is even harder than the former). One solution I came up with is to provide a configuration option like minTTL, which determines the minimum time for which DNS resolutions will be cached. If a DNS response's TTL is less than minTTL, the actual TTL in the datapath will be minTTL. Note that the TTL cache is not per Pod, so minTTL will be a global configuration that applies to all Pods (I can't think of any actual drawback caused by this, except for slightly higher memory consumption). Even if different Pods have different hard-coded DNS cache TTLs, minTTL can simply be set to the maximum of them. Typically it could just be set to the JVM's default DNS TTL or a larger value.

Note that this still requires that the application DNS cache does not cache forever.

Describe how your solution impacts user flows

The cluster admin should configure minTTL to be equal to or larger than the maximum TTL value among the application DNS caches.

Alternative solutions that you considered

Require users to disable application-level DNS cache.

Test plan

e2e: validate that applications with a DNS cache can reliably access the target FQDN while the FQDN's resolution changes frequently.

@tnqn tnqn added the kind/design Categorizes issue or PR as related to design. label Apr 16, 2024
@tnqn
Member Author

tnqn commented Apr 16, 2024

@jianjuns @antoninbas @Dyanngg please let me know what you think about the proposal.

@tnqn tnqn changed the title Make NetworkPolicy FQDN rule work with applications that cache DNS resolutions themselves Make NetworkPolicy FQDN rule work with applications that cache DNS resolutions with hard-coded TTL Apr 16, 2024
@jianjuns
Contributor

The proposal sounds good to me.

@tnqn tnqn added this to the Antrea v2.1 release milestone Apr 19, 2024
@antoninbas antoninbas added the lfx-mentorship Issues which have been proposed for the LFX Mentorship program label Jul 25, 2024
@antoninbas
Contributor

We have submitted this issue as a project idea for the LFX mentorship program: cncf/mentoring#1278.

See https://docs.linuxfoundation.org/lfx/mentorship for more information on the program.

Assuming our proposal is accepted, we will publish instructions for candidates here (as a new issue comment) with a link to a test task to be completed as part of the application. The test task helps us in several different ways: 1) it ensures that applicants have read this issue and become familiar with the goals, 2) it ensures that applicants have shown some interest in the project and have the basic skills required to build Antrea and contribute to it, and 3) it helps us (the mentors) with candidate selection, as we can look at the overall quality of submissions. Note that we do not expect the test task to take more than 1 or 2 hours, as our goal is not to impose a big burden on all applicants.

We hope that there will be a lot of interest in this issue and the mentorship program. However, we ask that candidate mentees do not comment on this issue just to express their interest / desire to work on the issue. This can create a lot of noise. An upstream issue like this is primarily meant for technical discussion around the issue and the proposed solution. We want to keep the discussion thread relevant and easy to navigate for maintainers and contributors. Please post any questions about the LFX program and how to apply on the mentorship discussion forums.

@HitanshuPrasad

This comment has been minimized.

@antoninbas
Contributor

We are pleased to announce that our project proposal has been accepted: https://github.com/cncf/mentoring/tree/main/programs/lfx-mentorship/2024/03-Sep-Nov#antrea

As mentioned above, we have published a test task to help us select the right candidate: #6590

If you have any questions, please comment in the discussion or post publicly in the #antrea channel in the K8s Slack workspace. Do not DM the mentors or comment on this GitHub issue.

hkiiita added a commit to hkiiita/antrea that referenced this issue Sep 26, 2024
hkiiita added a commit to hkiiita/antrea that referenced this issue Sep 26, 2024
Signed-off-by: Hemant <hkbiet@gmail.com>
antoninbas pushed a commit that referenced this issue Nov 14, 2024

For #6229 

Pending the implementation of the minTTL feature, we add an e2e
test to validate the behavior of FQDN policy rule enforcement when
an application is caching DNS responses beyond the included TTL.

As of now, the Antrea implementation only caches the IP obtained from
the DNS response for the duration specified by the TTL. This means that
if an application keeps using the same IP (because it was cached) beyond
that TTL for a whitelisted FQDN, it will eventually fail. After we add
support for the minTTL configuration parameter, the test will be updated
to validate that the minTTL value is honored, and that until minTTL
"expires", the application can keep using the same IP successfully.

Signed-off-by: Hemant <hkbiet@gmail.com>