[RFC] replace sync.Map with lru.Cache #158

Open

crockeo wants to merge 11 commits into master

Conversation

@crockeo (Contributor) commented Jun 16, 2023

I was looking at an application that uses gostats earlier today and found that it was leaking memory because it was misusing the gostats API. At a high level:

  • The application was calling NewTimerWithTags with each request.
  • It was providing a map which contained a unique identifier each time it was called.
  • Because statStore uses an unbounded cache, the timers map grew with each request.
  • That caused a non-obvious memory leak on the service.
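
Roughly, the misuse pattern looked like the following (a hypothetical handler; the function and tag names are illustrative, and it assumes `import stats "github.com/lyft/gostats"`):

// Hypothetical handler illustrating the leak: every request builds a timer
// whose tags include a unique ID, so the store's internal timer cache gains
// a brand-new entry on every call and never shrinks.
func handleRequest(store stats.Store, requestID string, latencyMs float64) {
	t := store.NewTimerWithTags("request_latency", map[string]string{
		"request_id": requestID, // unique per request -> a new cache entry per request
	})
	t.AddValue(latencyMs)
}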

This PR attempts to fix these issues by replacing the existing, unbounded cache with a bounded LRU cache. We can only do so for counters and timers because they are resilient to cache eviction:

  • Counters report the diff between the last sent value and the current value. Because we flush at the time of eviction, there is no difference between value = x -> eviction -> add y and value = x -> add y; in both cases the next report will contain y (see the sketch after this list).

  • Timers don't maintain local state, instead they immediately call Sink.FlushTimer when a value is added. They can be evicted from the cache because any two AddValue calls to the same timer are independent, and so they will behave the same across two instances of the same timer.
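
To make the counter argument concrete, here is a minimal, self-contained sketch of the latch/flush arithmetic (a simplified stand-in, not the library's actual counter type):

package main

import "fmt"

// latchedCounter is a simplified stand-in for the store's internal counter:
// latch() returns the delta accumulated since the previous flush.
type latchedCounter struct {
	curr, prev uint64
}

func (c *latchedCounter) Add(v uint64) { c.curr += v }

func (c *latchedCounter) latch() uint64 {
	d := c.curr - c.prev
	c.prev = c.curr
	return d
}

func main() {
	// Path A: value = x, eviction flushes the latched value, then a fresh
	// counter instance records y.
	a := &latchedCounter{}
	a.Add(5)             // x
	evicted := a.latch() // flushed at eviction time
	a2 := &latchedCounter{}
	a2.Add(7) // y

	// Path B: no eviction; x and y accumulate until the next regular flush.
	b := &latchedCounter{}
	b.Add(5) // x
	b.Add(7) // y

	// The totals reported to the sink are identical either way.
	fmt.Println(evicted+a2.latch(), b.latch()) // 12 12
}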

Notably, we cannot replace the gauge sync.Map with lru.Cache, because a gauge is client-side stateful. Users can still encounter memory issues if they produce gauges with unique tags.

crockeo added a commit that referenced this pull request Jun 21, 2023
As per the [Golang release policy](https://go.dev/doc/devel/release#policy):

> Each major Go release is supported until there are two newer major releases.

`gostats` doesn't need to support `1.17` or `1.18` anymore, so:

- Change unit test suite to target currently supported versions of Go.
- Change the `go.mod` version to be `1.18` to mark that we will need
generics support in #158.

Also going through and fixing lint errors with 325decb so that this PR
can be merged.
tomwans previously approved these changes Jun 21, 2023

@tomwans (Contributor) left a comment:

I'm OK with this change, but I don't love that it pulls a new dependency into the stats lib; I didn't like having envconfig there either. I was hoping we could get to zero.

Real talk: what would it take to copy some of the bits of the LRU cache into one file here?

@crockeo (Contributor, Author) commented Jun 21, 2023

I don't love that it pulls a new dependency into the stats lib; I didn't like having envconfig there either. I was hoping we could get to zero.

Real talk: what would it take to copy some of the bits of the LRU cache into one file here?

Fixed in 1fa1009.

tomwans previously approved these changes Jun 21, 2023
@njo commented Jun 22, 2023

It was providing a map which contained a unique identifier each time it was called
Just on this point @crockeo - this is obviously not a way the library is intended to be used - was this fixed in the service?

@charlievieth (Contributor) commented:

Yeah, this was always a risk with using sync.Map, but I figured it was only possible if there was a cardinality explosion in the metrics, and that observability would catch it / inform the service owners. But maybe that's not true with tags, and TBH it's been a while.

stats.go Outdated
// do not flush counters that are set to zero
if value := v.(*counter).latch(); value != 0 {
s.sink.FlushCounter(key.(string), value)
for _, name := range s.counters.Keys() {
A reviewer (Contributor) commented:

Since we're copying in the LRU code, can we add a method that returns a slice of keys and values (this only increases the slices' memory footprint by 50%)? Otherwise we have to lock/unlock a mutex and search for the value in the LRU map on each iteration.

@crockeo (Contributor, Author) replied:

Yep, seems reasonable.
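
For illustration, such a method could look roughly like the following on a minimal generic LRU (this is a sketch, not the vendored hashicorp/golang-lru code; it assumes container/list and sync). It snapshots keys and values in one pass while holding the lock once, so the flush loop no longer needs a Get, and a second lock acquisition, per key:

// Illustrative LRU, not the vendored implementation.
type lruCache[K comparable, V any] struct {
	mu    sync.Mutex
	ll    *list.List          // front = most recently used
	items map[K]*list.Element // each element holds a *pair[K, V]
}

type pair[K comparable, V any] struct {
	key K
	val V
}

// Items returns the cached keys and values as parallel slices, built in a
// single pass under a single lock acquisition.
func (c *lruCache[K, V]) Items() ([]K, []V) {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]K, 0, c.ll.Len())
	vals := make([]V, 0, c.ll.Len())
	for e := c.ll.Back(); e != nil; e = e.Prev() { // oldest to newest
		p := e.Value.(*pair[K, V])
		keys = append(keys, p.key)
		vals = append(vals, p.val)
	}
	return keys, vals
}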

@crockeo (Contributor, Author) commented Jun 23, 2023

this is obviously not a way the library is intended to be used - was this fixed in the service

Agreed that this isn't the intended use case, and yes, it was fixed in the service. My thought is that we should design the library to be resilient to misuse as long as that doesn't add performance issues or UX friction for the golden path. Closing a PR is easier than merging it, though, so please let me know if you disagree.

@charlievieth (Contributor) commented:

@crockeo I tried running the benchmarks for a comparison, but BenchmarkStoreNewPerInstanceCounter panicked.

$ go test -run XXX -bench BenchmarkStoreNewPerInstanceCounter
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x100f854b8]

goroutine 6 [running]:
github.com/lyft/gostats/internal/vendored/hashicorp/golang-lru.(*Cache[...]).Get(0x10105f160, {0x1400010a000, 0x1f})
	/Users/cvieth/go/src/github.com/lyft/gostats/internal/vendored/hashicorp/golang-lru/lru.go:96 +0x28
github.com/lyft/gostats.(*statStore).newCounter(0x140000f9d88, {0x1400010a000, 0x1f})
	/Users/cvieth/go/src/github.com/lyft/gostats/stats.go:428 +0x40
github.com/lyft/gostats.(*statStore).NewCounterWithTags(0x10101a900?, {0x100f8c517?, 0x100f8c18e?}, 0x2?)
	/Users/cvieth/go/src/github.com/lyft/gostats/stats.go:443 +0x40
github.com/lyft/gostats.(*statStore).NewPerInstanceCounter(0x10101a900?, {0x100f8c517, 0x4}, 0x14000191558)
	/Users/cvieth/go/src/github.com/lyft/gostats/stats.go:457 +0x74
github.com/lyft/gostats.BenchmarkStoreNewPerInstanceCounter.func1(0x140000b9180)
	/Users/cvieth/go/src/github.com/lyft/gostats/stats_test.go:479 +0x1e4
testing.(*B).runN(0x140000b9180, 0x1)
	/opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/benchmark.go:193 +0x12c
testing.(*B).run1.func1()
	/opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/benchmark.go:233 +0x50
created by testing.(*B).run1
	/opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/benchmark.go:226 +0x90
exit status 2
FAIL	github.com/lyft/gostats	0.009s

@crockeo (Contributor, Author) commented Jun 23, 2023

I tried running the benchmarks for a comparison, but BenchmarkStoreNewPerInstanceCounter panicked.

Looks like I missed some test cases initializing statStore with a zero-value. E.g. for the benchmark you mentioned:

var store statStore

I'll make sure to fix those before merging 🙂
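
One common way to guard against that, sketched here with stand-in names (the real statStore fields differ; this assumes sync, uses a plain map as a stand-in for the bounded LRU, and omits concurrency handling for brevity), is to initialize the cache lazily behind a sync.Once so a zero-value store still works:

// Stand-in sketch: a store whose zero value is usable because the cache is
// created on first use rather than only in the constructor.
type store struct {
	once  sync.Once
	cache map[string]*counter // counter is the store's existing counter type
}

func (s *store) init() {
	s.once.Do(func() {
		s.cache = make(map[string]*counter)
	})
}

func (s *store) newCounter(name string) *counter {
	s.init() // safe even for `var s store`, as in the failing benchmark
	c, ok := s.cache[name]
	if !ok {
		c = &counter{}
		s.cache[name] = c
	}
	return c
}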

@charlievieth (Contributor) commented:

Benching this with the below function shows a 3443.27% slowdown (mutex contention + list update), but ~5ns is a low base and the new time of ~171ns shouldn't be an issue (though the mutex contention does worry me a bit). That said, it's been a while since I was at Lyft so I have no idea what's acceptable these days (I also made this thing as stupid fast as possible partly for my own enjoyment).

goos: darwin
goarch: arm64
pkg: github.com/lyft/gostats
                           │  map10.txt  │                lru10.txt                │
                           │   sec/op    │    sec/op      vs base                  │
StoreNewCounterParallel-10   4.819n ± 4%   170.750n ± 2%  +3443.27% (p=0.000 n=10)

New benchmark that just stresses the store:

// New benchmark that exclusively stresses the store
func BenchmarkStoreNewCounterParallel(b *testing.B) {
	s := NewStore(nullSink{}, false)
	t := time.NewTicker(time.Hour) // don't flush
	defer t.Stop()
	go s.Start(t)
	names := new([2048]string)
	for i := 0; i < len(names); i++ {
		names[i] = "counter_" + strconv.Itoa(i)
	}
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for i := 0; pb.Next(); i++ {
			s.NewCounter(names[i%len(names)])
		}
	})
}

@crockeo (Contributor, Author) commented Jun 23, 2023

The range thing looks like it does a good job at bringing down contention:

this PR:     BenchmarkStore_MutexContention-10   4006423   297.9  ns/op
with range:  BenchmarkStore_MutexContention-10  16804858    69.96 ns/op
original:    BenchmarkStore_MutexContention-10  20921458    56.88 ns/op

But I'm afraid it's broken. I'll do more testing.
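
For readers, a guess at what the "range thing" refers to: a Range-style callback in the spirit of sync.Map.Range that visits every entry under a single lock acquisition instead of calling Keys() and then Get() per key. Sketched on the illustrative lruCache from the earlier sketch, not the PR's actual code:

// Range calls fn for every cached entry, oldest first, while holding the
// lock once for the whole traversal. Note that fn runs with the lock held,
// so it must not call back into the cache.
func (c *lruCache[K, V]) Range(fn func(key K, val V) bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for e := c.ll.Back(); e != nil; e = e.Prev() {
		p := e.Value.(*pair[K, V])
		if !fn(p.key, p.val) {
			return
		}
	}
}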

@charlievieth (Contributor) commented:

@crockeo One idea that I had many years ago while working on this library, and actually wrote the code for, was to tag counters and timers as used/unused and remove ones that go unused N times between flushes. The commit is here: 3b65338, and I made a PR so that it's easier to track: #160.

The advantage of this approach is that you eliminate the need for locking around any of the Counter/Timer methods. Also, feel free to steal all the code you want from this commit/PR or just totally ignore it.
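
A rough, simplified sketch of that idea (the linked commit/PR is the authoritative version; trackedCounter and maxIdleFlushes are stand-ins, s.counters is assumed to remain a sync.Map as in the current code, and it uses sync/atomic):

const maxIdleFlushes = 3 // drop an entry after this many flushes with no activity

type trackedCounter struct {
	value uint64 // updated atomically on the hot path
	prev  uint64 // only touched by the flush goroutine
	idle  int32  // consecutive flushes without activity
}

func (c *trackedCounter) Add(v uint64) {
	atomic.AddUint64(&c.value, v)
	atomic.StoreInt32(&c.idle, 0) // mark as used; no lock needed on the hot path
}

func (s *statStore) flushCounters() {
	s.counters.Range(func(k, v interface{}) bool {
		c := v.(*trackedCounter)
		cur := atomic.LoadUint64(&c.value)
		if delta := cur - c.prev; delta != 0 {
			c.prev = cur
			s.sink.FlushCounter(k.(string), delta)
			return true
		}
		if atomic.AddInt32(&c.idle, 1) > maxIdleFlushes {
			s.counters.Delete(k) // went unused for N flushes: remove it
		}
		return true
	})
}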
