This repository has been archived by the owner on Apr 19, 2024. It is now read-only.
Releases: mailgun/gubernator
v2.4.0
What's Changed
- MegaFix global behavior bugs by @Baliedge in #225
  - Every call to `GetRateLimits` would reset the `ResetTime` but not the `Remaining` counter. This would cause counters to eventually deplete and never fully reset. The solution involved fixing two issues:
    - The `Duration` value was never properly propagated in global behavior. This was added to the global broadcast logic.
      - The changes in PR #219 fixed propagation issues in `UpdatePeerGlobals` during a global broadcast but neglected to propagate `Duration`.
      - As a result, logic in algorithms.go would detect a change in `Duration` to zero and trigger a reset of the `ResetTime`. This code path does not reset the `Remaining` counter because it's meant for cases where an existing rate limit has been extended or abbreviated in duration.
      - I had wondered why this was never a problem before that PR. That's because that PR fixed a global broadcast bug that was setting the wrong data type in a `CacheItem` struct. Logic in algorithms.go would ignore it, causing it to short circuit around the logic that checks `Duration`. Once the data type was corrected, the `Duration` bug was revealed.
    - The `ResetTime` generated by the owning and non-owning peers did not always match exactly.
      - The value would vary slightly depending on network lag and system time synchronization because peers were generating `ResetTime` in multiple places based on `clock.Now()`.
      - This isn't a showstopper normally, but it does prevent writing a unit test to ensure `ResetTime` doesn't change due to the above bug.
      - `GetRateLimits()` will set a `requestTime` and pass it around so that any date/time computation to set `ResetTime` will always use the same base value instead of `clock.Now()`.
  - Fix race condition in `QueueUpdate()` used by peers to propagate updates to rate limits that they own.
    - Updates include ratelimit state, such as the `Remaining` counter. So, if the same key were updated multiple times, the updates might get added in non-chronological order. The last update wins, potentially passing a stale `Remaining` count, thereby dropping hits already applied.
    - The fix is to pass only ratelimit key info to `QueueUpdate()`. Then, when the timer fires to propagate the updates, get the current ratelimit state of each queued update just before sending to the peers.
  - Fix inconsistency with over limit status when calling `GetRateLimits` on a non-owner peer with global behavior.
    - The logic would always return a response with status `UNDER_LIMIT` no matter how many hits were applied.
    - This differs from when the same request reaches the owner peer, which will return the appropriate status.
    - The fix adds a check for hits > remaining and sets the status accordingly.
  - Optimize calls to `GetRateLimits` with zero hits to not trigger any global updates because nothing changed.
  - Add rigorous functional tests around global behavior to verify full peer-to-peer propagation after a call to `GetRateLimits`.
  - Fix double counting of metric `gubernator_over_limit_counter` on both non-owner and owner peers. Only count on the owner peer.
  - Fix double counting of metric `gubernator_getratelimit_counter`. When a non-owner uses Global behavior to process a request, do not increment the counter. After the non-owner forwards the hits to the owner, the owner will increment the counter. This counter shall be the accurate count of rate limits checked.
  - Remove redundant metric `gubernator_broadcast_counter`. Use `gubernator_broadcast_duration_count` instead.
  - Fix intermittent test error related to `TestHealthCheck` that causes the next test to fail because the Gubernator services were restarted and aren't always ready in time to accept requests.
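
The `QueueUpdate()` race fix described above can be illustrated with a minimal sketch. This is not Gubernator's actual code: the `updateQueue` and `state` types and the `lookup` callback are hypothetical names chosen for illustration. The idea shown is that the queue records only which keys changed, and the state is read once, at flush time, so a stale `Remaining` snapshot can never be sent.

```go
package main

import (
	"fmt"
	"sync"
)

// state is a hypothetical stand-in for the owner's cached rate limit state.
type state struct {
	Remaining int64
}

// updateQueue (hypothetical) records only WHICH keys changed,
// never a copy of their state.
type updateQueue struct {
	mu      sync.Mutex
	pending map[string]struct{}
}

func newUpdateQueue() *updateQueue {
	return &updateQueue{pending: make(map[string]struct{})}
}

// Queue marks a key as needing a broadcast. Queuing the same key twice
// is a no-op, so the order in which callers arrive cannot matter.
func (q *updateQueue) Queue(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending[key] = struct{}{}
}

// Flush drains the queue and reads the current state of each key
// just before sending, via the caller-supplied lookup function.
func (q *updateQueue) Flush(lookup func(key string) (state, bool)) map[string]state {
	q.mu.Lock()
	keys := q.pending
	q.pending = make(map[string]struct{})
	q.mu.Unlock()

	out := make(map[string]state, len(keys))
	for k := range keys {
		if s, ok := lookup(k); ok {
			out[k] = s
		}
	}
	return out
}

func main() {
	cache := map[string]state{"account:1": {Remaining: 10}}
	q := newUpdateQueue()

	// Two hits on the same key before the timer fires.
	cache["account:1"] = state{Remaining: 9}
	q.Queue("account:1")
	cache["account:1"] = state{Remaining: 8}
	q.Queue("account:1")

	updates := q.Flush(func(k string) (state, bool) {
		s, ok := cache[k]
		return s, ok
	})
	fmt.Println(updates["account:1"].Remaining) // prints 8
}
```

In a real service the lookup would go through the owner's cache under its own locking. The plain map here only keeps the sketch short.
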
- Fix mutex deadlocks in PeerClient by @miparnisari in #223
- Fix goroutine leaks by @miparnisari in #221
- Add test for global rate limiting with load balancing by @Baliedge and @philipgough in #224
- Update protobufs and Makefile by @miparnisari in #211
  - Update versions and run `buf mod update` and `make proto`
  - Fix version of gateway
  - Generate reverse proxy for peers v1
- Change global behavior by @thrawn01 in #219
  - This changes how GLOBAL behavior operates. Previously, the owner of the rate limit would broadcast the computed result of the rate limit to other peers, and the computed result from the owner was returned to clients who submitted hits to a peer. However, after some great feedback on #218 and #216, it has become clear that we should instead allow the local peers to compute the result of the request based on the hits broadcast by the owning peer.
  - In the new behavior, a peer will compute the result of the rate limit request and immediately return that computed result. This means that the non-owning peer will compute the result with the Remaining value it currently has in its cache. To put it another way, the peer cache will no longer hold the computed result from the owner.
  - In order to facilitate this change, I've added many more tests around global functionality, which should help ensure we don't break behavior going forward.
- Add docs in `global.go` for global behavior by @miparnisari in #213
- `SetupDaemonConfig` no longer needs a file by @miparnisari in #214
- Update GitHub Action dep versions by @miparnisari in #201
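
The GLOBAL behavior change from #219, combined with the over limit check from #225, can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: `cachedLimit`, `response`, `applyLocal`, and the status constants are hypothetical names, and leaving `Remaining` untouched on an over limit request is an assumption of this sketch.

```go
package main

import "fmt"

// Status mirrors the UNDER_LIMIT / OVER_LIMIT distinction from the notes.
// The Go names here are hypothetical.
type Status int

const (
	UnderLimit Status = iota
	OverLimit
)

// cachedLimit is a hypothetical view of what a non-owning peer
// holds in its cache for one rate limit key.
type cachedLimit struct {
	Limit     int64
	Remaining int64
}

type response struct {
	Status    Status
	Remaining int64
}

// applyLocal computes the result on the non-owning peer from its cached
// Remaining value, instead of replaying a result computed by the owner.
func applyLocal(c *cachedLimit, hits int64) response {
	// Report over limit when the request asks for more than is left.
	// Assumption: the cached counter is left unchanged in that case.
	if hits > c.Remaining {
		return response{Status: OverLimit, Remaining: c.Remaining}
	}
	c.Remaining -= hits
	return response{Status: UnderLimit, Remaining: c.Remaining}
}

func main() {
	c := &cachedLimit{Limit: 5, Remaining: 2}

	first := applyLocal(c, 1)
	fmt.Println(first.Status == UnderLimit, first.Remaining) // prints: true 1

	second := applyLocal(c, 3)
	fmt.Println(second.Status == OverLimit, second.Remaining) // prints: true 1
}
```

A request with zero hits leaves the cached state unchanged in this sketch, which is consistent with the note that such calls need not trigger any global update.
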
v2.3.2
v2.3.1
v2.3.0
v2.2.1
v2.2.0
v2.1.4
v2.1.3
What's Changed
- Patch for CVE-2023-45142 by @Baliedge in #194
  - OpenTelemetry-Go Contrib has DoS vulnerability in otelhttp due to unbound cardinality metrics
  - https://www.cve.org/CVERecord?id=CVE-2023-45142
v2.1.2
What's Changed
- Security update golang.org/x/net and Tidy code by @Baliedge in #192
  - Improve `TestGlobalRateLimits` that was not checking for exact behavior.
    - Required moving metric variables into `GlobalManager` fields so that tests would not read from global metric variables that were impacted by other tests.
  - Tidy up code.
  - Update default for `GlobalSyncWait` from 500ms to 100ms
    - This applies to both `runAsyncHits()` and `runBroadcasts()`. When a ratelimit is hit, it could take up to 2x this setting before it's replicated to each peer.