[RFC] replace `sync.Map` with `lru.Cache` #158
base: master
Conversation
As per the [Golang release policy](https://go.dev/doc/devel/release#policy):

> Each major Go release is supported until there are two newer major releases.

`gostats` doesn't need to support `1.17` or `1.18` anymore, so:

- Change the unit test suite to target currently supported versions of Go.
- Change the `go.mod` version to be `1.18` to mark that we will need generics support in #158.

Also going through and fixing lint errors with 325decb so that this PR can be merged.
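For reference, the resulting `go.mod` directives would look something like this (the module path is assumed to be the upstream repository's, which isn't spelled out in this thread):

```
module github.com/lyft/gostats

go 1.18
```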
I'm OK with this change. I don't love that it involves getting a new dependency into the stat lib; I didn't like having envconfig there either. I was hoping we could get to zero.

Real talk: what would it take to copy some of the bits of the LRU cache into one file here?
Fixed in 1fa1009.

Yeah, this was always a risk with using
stats.go (Outdated)

```go
// do not flush counters that are set to zero
if value := v.(*counter).latch(); value != 0 {
	s.sink.FlushCounter(key.(string), value)

for _, name := range s.counters.Keys() {
```
Since we're copying in the LRU code, can we add a method that returns a slice of keys and values (this only increases the slice's memory footprint by 50%)? Otherwise we have to lock/unlock a mutex and search for the value in the LRU map on each iteration.
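A minimal sketch of what such a method might look like, assuming the copied-in LRU is a mutex-guarded map plus `container/list` in the style of hashicorp/golang-lru (all names here are illustrative, not the actual gostats internals):

```go
package lru

import (
	"container/list"
	"sync"
)

// entry pairs a key with its stored value inside the eviction list.
type entry struct {
	key   string
	value interface{}
}

// Cache is a minimal mutex-guarded LRU; only the parts needed for
// snapshotting are shown.
type Cache struct {
	mu    sync.Mutex
	ll    *list.List
	items map[string]*list.Element
}

// Items returns parallel slices of every key and value under a single
// lock acquisition, so the flush loop can iterate without re-locking
// and re-searching the map for each key.
func (c *Cache) Items() (keys []string, values []interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys = make([]string, 0, len(c.items))
	values = make([]interface{}, 0, len(c.items))
	for e := c.ll.Front(); e != nil; e = e.Next() {
		ent := e.Value.(*entry)
		keys = append(keys, ent.key)
		values = append(values, ent.value)
	}
	return keys, values
}
```

The flush loop could then range over the two returned slices directly instead of calling `Keys()` and re-locking for a lookup on every name.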
Yep, seems reasonable.
Agreed that this isn't the intended use case, and yes, it was fixed in the service. My thought is that we should design the library to be resilient to misuse as long as that doesn't add performance issues or UX friction for the golden path. Closing a PR is easier than merging it, though, so please let me know if you disagree.
@crockeo I tried running the benchmarks for a comparison, but
Looks like I missed some test cases initializing it (line 471 in e6b88e6). I'll make sure to fix those before merging 🙂
Benching this with the below function shows a 3443.27% slowdown (mutex contention plus the list update), but ~5ns is a low baseline and the new time of ~171ns shouldn't be an issue (though the mutex contention does worry me a bit). That said, it's been a while since I was at Lyft, so I have no idea what's acceptable these days (I also made this thing as stupid fast as possible partly for my own enjoyment).
New benchmark that exclusively stresses the store:

```go
func BenchmarkStoreNewCounterParallel(b *testing.B) {
	s := NewStore(nullSink{}, false)
	t := time.NewTicker(time.Hour) // don't flush
	defer t.Stop()
	go s.Start(t)

	names := new([2048]string)
	for i := 0; i < len(names); i++ {
		names[i] = "counter_" + strconv.Itoa(i)
	}

	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for i := 0; pb.Next(); i++ {
			s.NewCounter(names[i%len(names)])
		}
	})
}
```
The `range` thing looks like it does a good job at bringing down contention, but I'm afraid it's broken. I'll do more testing.
@crockeo One idea that I had many years ago while working on this library, and for which I actually wrote the code, was to tag counters and timers as used/unused and remove ones that went unused N times between flushes. The commit is here: 3b65338, and I made a PR so that it's easier to track: #160. The advantage of this approach is that you eliminate the need for locking around any of the Counter/Timer methods. Also, feel free to steal all the code you want from this commit/PR, or just totally ignore it.
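A rough sketch of that idea, assuming a counter with an atomic flag that `Add` sets and the single flush goroutine clears; this is one reading of the approach, not the actual code from 3b65338:

```go
package stats

import "sync/atomic"

// counter tracks its value plus an atomic "used" flag that Add sets
// and the flush loop clears.
type counter struct {
	count uint64
	used  uint32 // 1 if touched since the last flush
	idle  int    // consecutive idle flushes; flush goroutine only
}

func (c *counter) Add(delta uint64) {
	atomic.AddUint64(&c.count, delta)
	atomic.StoreUint32(&c.used, 1)
}

// expired is called once per flush from the single flush goroutine.
// It reports true once the counter has gone maxIdle flushes untouched,
// at which point the store can drop it.
func (c *counter) expired(maxIdle int) bool {
	if atomic.SwapUint32(&c.used, 0) == 1 {
		c.idle = 0
		return false
	}
	c.idle++
	return c.idle >= maxIdle
}
```

Because only the flush goroutine reads `idle` and swaps `used`, the hot `Add` path stays lock-free.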
I was looking at an application which uses gostats earlier today and found that it was leaking memory because it was misusing the gostats API. At a high level:

- The application was calling `NewTimerWithTags` with each request.
- `statStore` uses an unbounded cache, so `timers` was growing with each request.

This PR attempts to fix these issues by replacing the existing, unbounded cache with a bounded LRU cache. We can only do so for counters and timers because they are resilient to cache eviction:
- Counters report the diff between the last sent value and the current value. Because we flush at the time of eviction, there is no difference between `value = x` -> eviction -> `add y` and `value = x` -> `add y`. In both cases the next report will contain `y`.
- Timers don't maintain local state; instead they immediately call `Sink.FlushTimer` when a value is added. They can be evicted from the cache because any two `AddValue` calls to the same timer are independent, and so they will behave the same across two instances of the same timer.

Notably, we cannot replace the gauges' `sync.Map` with `lru.Cache`, because a gauge is client-side stateful. Users can still encounter memory issues if they produce gauges with unique tags.
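To make the counter argument concrete, here is a sketch of flush-on-evict wired through an eviction callback, in the style of hashicorp/golang-lru v2's `NewWithEvict`; the `counter` type and the print standing in for the sink call are simplified stand-ins for the gostats internals:

```go
package main

import (
	"fmt"
	"sync/atomic"

	lru "github.com/hashicorp/golang-lru/v2"
)

// counter reports the diff between the current and last-sent values,
// matching the reasoning above: flushing at eviction time means no
// increments are ever lost.
type counter struct {
	curr uint64
	sent uint64
}

func (c *counter) Add(delta uint64) { atomic.AddUint64(&c.curr, delta) }

// latch returns the unreported delta and marks it as sent.
func (c *counter) latch() uint64 {
	curr := atomic.LoadUint64(&c.curr)
	return curr - atomic.SwapUint64(&c.sent, curr)
}

func main() {
	// Flush a counter's outstanding delta the moment it is evicted, so
	// "value = x -> evict -> add y" and "value = x -> add y" both report
	// y on the next flush.
	cache, _ := lru.NewWithEvict[string, *counter](128, func(name string, c *counter) {
		if v := c.latch(); v != 0 {
			fmt.Printf("flush on evict: %s=%d\n", name, v)
		}
	})

	c := &counter{}
	cache.Add("requests", c)
	c.Add(5)
	cache.Purge() // force eviction; prints "flush on evict: requests=5"
}
```

Here the eviction hook plays the same role as the flush loop's latch step, so an eviction is effectively just an early flush.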