
tracing: cap registry size at 5k #61459

Merged 1 commit into cockroachdb:master on Mar 4, 2021

Conversation

@tbg (Member) commented Mar 4, 2021

This caps the size of the span registry at 5k, evicting "random" (map
order) existing entries when at the limit.
The purpose of this is to serve as a guardrail against leaked spans,
which would otherwise lead to unbounded memory growth.

Touches #59188.

Release justification: low risk, high benefit changes to existing
functionality
Release note: None

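Below is a minimal sketch of the mechanism described above. The type, field, and function names (registry, Span, register) are assumptions for illustration; the actual change operates on t.activeSpans.m and maxSpanRegistrySize, as seen in the review snippet further down. When a new span would push the registry past the cap, arbitrary existing entries, in map iteration order, are deleted first.

package tracing

import "sync"

const maxSpanRegistrySize = 5000

// Span is a stand-in for the tracer's span type.
type Span struct{}

// registry is a stand-in for the tracer's set of active spans.
type registry struct {
	mu sync.Mutex
	m  map[uint64]*Span // span ID -> active span
}

// register adds a span, evicting arbitrary (map-order) entries if the
// registry is at its size limit, as a guardrail against leaked spans.
func (r *registry) register(id uint64, sp *Span) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if toDelete := len(r.m) - maxSpanRegistrySize + 1; toDelete > 0 {
		for k := range r.m {
			delete(r.m, k)
			toDelete--
			if toDelete == 0 {
				break
			}
		}
	}
	r.m[id] = sp
}
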
@cockroach-teamcity (Member): This change is Reviewable

@tbg (Member, Author) commented Mar 4, 2021

With this in, conceivably we can do #59315 early (before the SQL team has fixed the leak). The bypass in that issue unfortunately hides pretty much all of the SQL gateway's traces from the registry, which is very unhelpful. This should be fine, provided the remark in the linked issue that these leaks occur only at server shutdown holds.

// is a leak. When the registry reaches max size, each new span added
// kicks out some old span. We rely on map iteration order here to
// make this cheap.
if toDelete := len(t.activeSpans.m) - maxSpanRegistrySize + 1; toDelete > 0 {
@irfansharif (Contributor) commented Mar 4, 2021

Do you reckon dropping just the one (oldest) span when we've capped out could cause contention when we've gone overboard? I assume we'll want to do more profiling again, but that seems possible to me.

@tbg (Member, Author) commented Mar 4, 2021

What do you mean? Deleting the "first" element from the map (especially since it's bounded-size) should be approximately free. Hmm... The iteration maybe isn't. Let me check.

@tbg (Member, Author)

Seems cheap:

package tracing_test

import (
	"math/rand"
	"testing"
)

// Measures iterating to the first key of a ~5k-entry map. Uncomment the
// delete and the insert below to measure the evict-one-then-add variant.
func BenchmarkMapDeleteAdd1(b *testing.B) {
	const size = 5000
	m := make(map[int]int, size)
	for i := 0; i < size; i++ {
		m[rand.Intn(10000*size)] = 123
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for idx := range m {
			// delete(m, idx)
			break
		}
		// m[i] = 456
	}
}

With the delete and assignment in, it's ~200ns/op; with them commented out (just the one-element iteration, as shown), ~45ns/op. So the deletion+insertion dominates over the one-element iteration.
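For comparison, a minimal companion benchmark (not from the thread; it assumes it lives alongside the benchmark above) with the delete and re-insert enabled, corresponding to the ~200ns/op variant:

func BenchmarkMapDeleteAdd1WithEvict(b *testing.B) {
	const size = 5000
	m := make(map[int]int, size)
	for i := 0; i < size; i++ {
		m[rand.Intn(10000*size)] = 123
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for idx := range m {
			delete(m, idx) // evict one entry, relying on map iteration order
			break
		}
		m[i] = 456 // re-insert to keep the map near its target size
	}
}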

A Contributor commented:

Drive-by comment since I had to rebase on top of this code: the other option is to use a cache.UnorderedCache, which is an LRU cache. Iterations don't count as accesses, and neither do gets if you use StealthyGet, so only Adds will; that means the cache will evict the spans that are most likely to be leaks.
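To make that idea concrete, here is a minimal sketch in the same spirit, using only the standard library rather than the actual cache.UnorderedCache API (whose configuration is not shown in this thread); the names spanRegistry, Add, Get, and Remove are illustrative. Only Add refreshes recency and lookups do not, so eviction at capacity removes the span registered longest ago, which is the entry most likely to be leaked.

package tracing

import "container/list"

// spanRegistry is an LRU-style registry sketch in which only Add counts
// as an access.
type spanRegistry struct {
	limit int
	order *list.List               // front = most recently added
	elems map[uint64]*list.Element // span ID -> list element
}

func newSpanRegistry(limit int) *spanRegistry {
	return &spanRegistry{
		limit: limit,
		order: list.New(),
		elems: make(map[uint64]*list.Element, limit),
	}
}

// Add registers a span ID, evicting the least recently added entry if the
// registry is at its size limit.
func (r *spanRegistry) Add(id uint64) {
	if el, ok := r.elems[id]; ok {
		r.order.MoveToFront(el) // re-adding refreshes recency
		return
	}
	if len(r.elems) >= r.limit {
		if oldest := r.order.Back(); oldest != nil {
			r.order.Remove(oldest)
			delete(r.elems, oldest.Value.(uint64))
		}
	}
	r.elems[id] = r.order.PushFront(id)
}

// Get looks up a span ID without affecting eviction order, analogous to a
// "stealthy" get.
func (r *spanRegistry) Get(id uint64) bool {
	_, ok := r.elems[id]
	return ok
}

// Remove deletes a span ID, e.g. when the span is finished.
func (r *spanRegistry) Remove(id uint64) {
	if el, ok := r.elems[id]; ok {
		r.order.Remove(el)
		delete(r.elems, id)
	}
}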

@tbg (Member, Author) commented Mar 4, 2021

bors r=irfansharif

TFTR!

@craig (bot) commented Mar 4, 2021

Build succeeded:

@craig bot merged commit 7ddc401 into cockroachdb:master on Mar 4, 2021