[RFC] replace sync.Map with lru.Cache #158

Open

crockeo wants to merge 11 commits into master

Conversation

@crockeo (Contributor) commented Jun 16, 2023

I was looking at an application that uses gostats earlier today and found that it was leaking memory because it was misusing the gostats API. At a high level:

  • The application was calling NewTimerWithTags with each request.
  • It was providing a map which contained a unique identifier each time it was called.
  • Because statStore uses an unbounded cache, the timers map grew with each request.
  • That caused a non-obvious memory leak on the service.
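
Roughly, the misuse pattern looked like the following (a hypothetical handler; the function and tag names are illustrative, and it assumes `import stats "github.com/lyft/gostats"`):

// Hypothetical handler illustrating the leak: every request builds a timer
// whose tags include a unique ID, so the store's internal timer cache gains
// a brand-new entry on every call and never shrinks.
func handleRequest(store stats.Store, requestID string, latencyMs float64) {
	t := store.NewTimerWithTags("request_latency", map[string]string{
		"request_id": requestID, // unique per request -> a new cache entry per request
	})
	t.AddValue(latencyMs)
}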

This PR attempts to fix these issues by replacing the existing, unbounded cache with a bounded LRU cache. We can only do so for counters and timers because they are resilient to cache eviction:

  • Counters report the diff between the last sent value and the current value. Because we flush at the time of eviction, there is no difference between value = x -> eviction -> add y and value = x -> add y; in both cases the next report will contain y (see the sketch after this list).

  • Timers don't maintain local state, instead they immediately call Sink.FlushTimer when a value is added. They can be evicted from the cache because any two AddValue calls to the same timer are independent, and so they will behave the same across two instances of the same timer.
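
To make the counter argument concrete, here is a minimal, self-contained sketch of the latch/flush arithmetic (a simplified stand-in, not the library's actual counter type):

package main

import "fmt"

// latchedCounter is a simplified stand-in for the store's internal counter:
// latch() returns the delta accumulated since the previous flush.
type latchedCounter struct {
	curr, prev uint64
}

func (c *latchedCounter) Add(v uint64) { c.curr += v }

func (c *latchedCounter) latch() uint64 {
	d := c.curr - c.prev
	c.prev = c.curr
	return d
}

func main() {
	// Path A: value = x, eviction flushes the latched value, then a fresh
	// counter instance records y.
	a := &latchedCounter{}
	a.Add(5)             // x
	evicted := a.latch() // flushed at eviction time
	a2 := &latchedCounter{}
	a2.Add(7) // y

	// Path B: no eviction; x and y accumulate until the next regular flush.
	b := &latchedCounter{}
	b.Add(5) // x
	b.Add(7) // y

	// The totals reported to the sink are identical either way.
	fmt.Println(evicted+a2.latch(), b.latch()) // 12 12
}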

Notably, we cannot replace the gauge sync.Map with lru.Cache, because a gauge is client-side stateful. Users can still encounter memory issues if they produce gauges with unique tags.

crockeo added a commit that referenced this pull request Jun 21, 2023
As per the [Golang release policy](https://go.dev/doc/devel/release#policy):

> Each major Go release is supported until there are two newer major releases.

`gostats` doesn't need to support `1.17` or `1.18` anymore, so:

- Change unit test suite to target currently supported versions of Go.
- Change the `go.mod` version to be `1.18` to mark that we will need
generics support in #158.

Also going through and fixing lint errors with 325decb so that this PR
can be merged.
tomwans previously approved these changes Jun 21, 2023

@tomwans (Contributor) left a comment:

I'm OK with this change, but I don't love that it pulls a new dependency into the stats lib; I didn't like having envconfig there either. I was hoping we could get to zero.

Real talk: what would it take to copy some of the bits of the LRU cache into one file here?

@crockeo (Contributor, Author) commented Jun 21, 2023

I don't love that it pulls a new dependency into the stats lib; I didn't like having envconfig there either. I was hoping we could get to zero.

Real talk: what would it take to copy some of the bits of the LRU cache into one file here?

Fixed in 1fa1009.

tomwans previously approved these changes Jun 21, 2023
@njo commented Jun 22, 2023

It was providing a map which contained a unique identifier each time it was called
Just on this point @crockeo - this is obviously not a way the library is intended to be used - was this fixed in the service?

@charlievieth (Contributor) commented:

Yeah, this was always a risk with using sync.Map, but I figured it was only possible if there was a cardinality explosion in the metrics, and that observability would catch it / inform the service owners. But maybe that's not true with tags, and TBH it's been a while.

stats.go Outdated
// do not flush counters that are set to zero
if value := v.(*counter).latch(); value != 0 {
s.sink.FlushCounter(key.(string), value)
for _, name := range s.counters.Keys() {
A reviewer (Contributor) commented:

Since we're copying in the LRU code, can we add a method that returns a slice of keys and values (this only increases the slices' memory footprint by 50%)? Otherwise we have to lock/unlock a mutex and search for the value in the LRU map on each iteration.

@crockeo (Contributor, Author) replied:

Yep, seems reasonable.
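
For illustration, such a method could look roughly like the following on a minimal generic LRU (this is a sketch, not the vendored hashicorp/golang-lru code; it assumes container/list and sync). It snapshots keys and values in one pass while holding the lock once, so the flush loop no longer needs a Get, and a second lock acquisition, per key:

// Illustrative LRU, not the vendored implementation.
type lruCache[K comparable, V any] struct {
	mu    sync.Mutex
	ll    *list.List          // front = most recently used
	items map[K]*list.Element // each element holds a *pair[K, V]
}

type pair[K comparable, V any] struct {
	key K
	val V
}

// Items returns the cached keys and values as parallel slices, built in a
// single pass under a single lock acquisition.
func (c *lruCache[K, V]) Items() ([]K, []V) {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]K, 0, c.ll.Len())
	vals := make([]V, 0, c.ll.Len())
	for e := c.ll.Back(); e != nil; e = e.Prev() { // oldest to newest
		p := e.Value.(*pair[K, V])
		keys = append(keys, p.key)
		vals = append(vals, p.val)
	}
	return keys, vals
}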

@crockeo (Contributor, Author) commented Jun 23, 2023

this is obviously not a way the library is intended to be used - was this fixed in the service

Agreed that this isn't the intended use case, and yes, it was fixed in the service. My thought is that we should design the library to be resilient to misuse as long as that doesn't add performance issues or UX friction for the golden path. Closing a PR is easier than merging it, though, so please let me know if you disagree.

@charlievieth (Contributor) commented:

@crockeo I tried running the benchmarks for a comparison, but BenchmarkStoreNewPerInstanceCounter panicked.

$ go test -run XXX -bench BenchmarkStoreNewPerInstanceCounter
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x100f854b8]

goroutine 6 [running]:
github.com/lyft/gostats/internal/vendored/hashicorp/golang-lru.(*Cache[...]).Get(0x10105f160, {0x1400010a000, 0x1f})
	/Users/cvieth/go/src/github.com/lyft/gostats/internal/vendored/hashicorp/golang-lru/lru.go:96 +0x28
github.com/lyft/gostats.(*statStore).newCounter(0x140000f9d88, {0x1400010a000, 0x1f})
	/Users/cvieth/go/src/github.com/lyft/gostats/stats.go:428 +0x40
github.com/lyft/gostats.(*statStore).NewCounterWithTags(0x10101a900?, {0x100f8c517?, 0x100f8c18e?}, 0x2?)
	/Users/cvieth/go/src/github.com/lyft/gostats/stats.go:443 +0x40
github.com/lyft/gostats.(*statStore).NewPerInstanceCounter(0x10101a900?, {0x100f8c517, 0x4}, 0x14000191558)
	/Users/cvieth/go/src/github.com/lyft/gostats/stats.go:457 +0x74
github.com/lyft/gostats.BenchmarkStoreNewPerInstanceCounter.func1(0x140000b9180)
	/Users/cvieth/go/src/github.com/lyft/gostats/stats_test.go:479 +0x1e4
testing.(*B).runN(0x140000b9180, 0x1)
	/opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/benchmark.go:193 +0x12c
testing.(*B).run1.func1()
	/opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/benchmark.go:233 +0x50
created by testing.(*B).run1
	/opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/benchmark.go:226 +0x90
exit status 2
FAIL	github.com/lyft/gostats	0.009s

@crockeo (Contributor, Author) commented Jun 23, 2023

I tried running the benchmarks for a comparison, but BenchmarkStoreNewPerInstanceCounter panicked.

Looks like I missed some test cases initializing statStore with a zero-value. E.g. for the benchmark you mentioned:

var store statStore

I'll make sure to fix those before merging 🙂
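
One common way to guard against that, sketched here with stand-in names (the real statStore fields differ; this assumes sync, uses a plain map as a stand-in for the bounded LRU, and omits concurrency handling for brevity), is to initialize the cache lazily behind a sync.Once so a zero-value store still works:

// Stand-in sketch: a store whose zero value is usable because the cache is
// created on first use rather than only in the constructor.
type store struct {
	once  sync.Once
	cache map[string]*counter // counter is the store's existing counter type
}

func (s *store) init() {
	s.once.Do(func() {
		s.cache = make(map[string]*counter)
	})
}

func (s *store) newCounter(name string) *counter {
	s.init() // safe even for `var s store`, as in the failing benchmark
	c, ok := s.cache[name]
	if !ok {
		c = &counter{}
		s.cache[name] = c
	}
	return c
}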

@charlievieth (Contributor) commented:

Benching this with the below function shows a 3443.27% slowdown (mutex contention + list update), but ~5ns is a low base and the new time of ~171ns shouldn't be an issue (though the mutex contention does worry me a bit). That said, it's been a while since I was at Lyft so I have no idea what's acceptable these days (I also made this thing as stupid fast as possible partly for my own enjoyment).

goos: darwin
goarch: arm64
pkg: github.com/lyft/gostats
                           │  map10.txt  │                lru10.txt                │
                           │   sec/op    │    sec/op      vs base                  │
StoreNewCounterParallel-10   4.819n ± 4%   170.750n ± 2%  +3443.27% (p=0.000 n=10)

New benchmark that just stresses the store:

// New benchmark that exclusively stresses the store
func BenchmarkStoreNewCounterParallel(b *testing.B) {
	s := NewStore(nullSink{}, false)
	t := time.NewTicker(time.Hour) // don't flush
	defer t.Stop()
	go s.Start(t)
	names := new([2048]string)
	for i := 0; i < len(names); i++ {
		names[i] = "counter_" + strconv.Itoa(i)
	}
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for i := 0; pb.Next(); i++ {
			s.NewCounter(names[i%len(names)])
		}
	})
}

@crockeo (Contributor, Author) commented Jun 23, 2023

The range thing looks like it does a good job at bringing down contention:

this PR:     BenchmarkStore_MutexContention-10   4006423   297.9  ns/op
with range:  BenchmarkStore_MutexContention-10  16804858    69.96 ns/op
original:    BenchmarkStore_MutexContention-10  20921458    56.88 ns/op

But I'm afraid it's broken. I'll do more testing.
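
For readers, a guess at what the "range thing" refers to: a Range-style callback in the spirit of sync.Map.Range that visits every entry under a single lock acquisition instead of calling Keys() and then Get() per key. Sketched on the illustrative lruCache from the earlier sketch, not the PR's actual code:

// Range calls fn for every cached entry, oldest first, while holding the
// lock once for the whole traversal. Note that fn runs with the lock held,
// so it must not call back into the cache.
func (c *lruCache[K, V]) Range(fn func(key K, val V) bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for e := c.ll.Back(); e != nil; e = e.Prev() {
		p := e.Value.(*pair[K, V])
		if !fn(p.key, p.val) {
			return
		}
	}
}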

@charlievieth (Contributor) commented:

@crockeo One idea that I had many years ago while working on this library, and actually wrote the code for, was to tag counters and timers as used/unused and remove ones that go unused N times between flushes. The commit is here: 3b65338, and I made a PR so that it's easier to track: #160.

The advantage of this approach is that you eliminate the need for locking around any of the Counter/Timer methods. Also, feel free to steal all the code you want from this commit/PR or just totally ignore it.
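
A rough, simplified sketch of that idea (the linked commit/PR is the authoritative version; trackedCounter and maxIdleFlushes are stand-ins, s.counters is assumed to remain a sync.Map as in the current code, and it uses sync/atomic):

const maxIdleFlushes = 3 // drop an entry after this many flushes with no activity

type trackedCounter struct {
	value uint64 // updated atomically on the hot path
	prev  uint64 // only touched by the flush goroutine
	idle  int32  // consecutive flushes without activity
}

func (c *trackedCounter) Add(v uint64) {
	atomic.AddUint64(&c.value, v)
	atomic.StoreInt32(&c.idle, 0) // mark as used; no lock needed on the hot path
}

func (s *statStore) flushCounters() {
	s.counters.Range(func(k, v interface{}) bool {
		c := v.(*trackedCounter)
		cur := atomic.LoadUint64(&c.value)
		if delta := cur - c.prev; delta != 0 {
			c.prev = cur
			s.sink.FlushCounter(k.(string), delta)
			return true
		}
		if atomic.AddInt32(&c.idle, 1) > maxIdleFlushes {
			s.counters.Delete(k) // went unused for N flushes: remove it
		}
		return true
	})
}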
