Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use DNS record TTL and whether a record is in-use to determine expiration #7941

Merged
merged 7 commits into from
Jun 28, 2021

Conversation

brycekahle
Copy link
Member

@brycekahle brycekahle commented Apr 20, 2021

What does this PR do?

The reverse DNS cache inside system-probe was previously caching for 3 minutes, and extending that duration every time a record was used. This uses the actual TTL from the record and keeps track of values in-use.

Motivation

More accurate DNS caching behavior.

Additional Notes

This does not handle the following case:

  1. DNS is resolved for myhost to 1.1.1.1 with a TTL of 30s
  2. A connection is opened to 1.1.1.1 and kept open
  3. >30s elapses
  4. Another connection is opened to 1.1.1.1

The connection from step 4 will use the same domain mapping from step 1, even though the record has expired, and the new connection did not resolve any DNS.

To fix this scenario would require more sophisticated tracking of connections to DNS records.

Multiple clients also disrupt the record expiration by potentially causing an early expiration before the other client has had a chance to use the value. Fixing this would require per-client tracking with some sort of inactivity limit to prevent a single client request (often from debugging) from keeping records around forever.

@brycekahle brycekahle added changelog/no-changelog component/system-probe team/networks [deprecated] qa/skip-qa - use other qa/ labels [DEPRECATED] Please use qa/done or qa/no-code-change to skip creating a QA card labels Apr 20, 2021
@brycekahle brycekahle added this to the 7.29.0 milestone Apr 20, 2021
@brycekahle brycekahle requested a review from a team as a code owner April 20, 2021 19:51
defer c.mux.Unlock()

for _, val := range c.data {
val.inUse = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please add a note perhaps here or at the type definition explaining the lifecycle of inUse. I don't quite follow how it works.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note added.

@brycekahle brycekahle force-pushed the bryce.kahle/sysprobe-dns-ttl branch from afb69a4 to 5e8c772 Compare May 27, 2021 22:00
@brycekahle brycekahle requested a review from a team as a code owner May 27, 2021 22:00
@hush-hush hush-hush modified the milestones: 7.29.0, 7.30.0 May 31, 2021
pkg/network/dns_cache.go Show resolved Hide resolved
pkg/network/dns_cache.go Show resolved Hide resolved
pkg/network/dns_cache.go Show resolved Hide resolved
continue
}

if bytes.Equal(domainQueried, record.Name) ||
(alias != nil && bytes.Equal(alias, record.Name)) {
t.add(util.AddressFromNetIP(record.IP))
t.add(util.AddressFromNetIP(record.IP), time.Duration(record.TTL)*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have an idea what the usual TTL might be for a DNS record? & if it's significantly longer than the current 3 min TTL, do we have an estimate for the impact this change might have on cache size?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

30s - 1hour is the usual range I've seen TTLs be in.

@@ -232,9 +231,10 @@ func (s *SocketFilterSnooper) getCachedTranslation() *translation {

// Recycle buffer if necessary
if t.ips == nil || len(t.ips) > maxIPBufferSize {
t.ips = make([]util.Address, 30)
t.ips = make(map[util.Address]time.Time, 30)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. Are we trying to resize the map and then copy over the entries?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The translation object is allocated once and reused. This function is called to obtain the object for use, so we need to ensure it is "cleaned" up.

@hmahmood
Copy link
Contributor

So DNS TTLs can be arbitrarily large. Should we set an upper limit on the ttl on the entries in the cache?

@brycekahle
Copy link
Member Author

Should we set an upper limit on the ttl on the entries in the cache?

@hmahmood We could, but we would be accepting that once it expired, the same issue this PR is intended to fix, would happen. We have no indication what the DNS client is using as a max TTL or when it purges an entry from its cache.

We do also have a maximum size on this DNS cache.

@brycekahle brycekahle force-pushed the bryce.kahle/sysprobe-dns-ttl branch from 596d563 to e348948 Compare June 28, 2021 20:11
@brycekahle brycekahle merged commit 1c58523 into main Jun 28, 2021
@brycekahle brycekahle deleted the bryce.kahle/sysprobe-dns-ttl branch June 28, 2021 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog/no-changelog component/system-probe [deprecated] qa/skip-qa - use other qa/ labels [DEPRECATED] Please use qa/done or qa/no-code-change to skip creating a QA card team/networks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants