Create base error type for ingester per-instance errors and remove logging for them #5585
Conversation
Force-pushed 6765f74 to b4935bd
Force-pushed 93b4276 to 2295ff5
pkg/ingester/instance_limits.go (Outdated)
```go
func newInstanceLimitError(err error) error {
	return &instanceLimitErr{
		// Errors from hitting per-instance limits are always "unavailable" for gRPC
		status: status.New(codes.Unavailable, err.Error()),
```
The motivation for this is being able to use Unavailable for errors that count towards a circuit breaker or error throttling mechanism (discussed internally).
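For context, a minimal sketch of what such an error type could look like and how a caller-side breaker might key off the Unavailable code. Only the constructor above is quoted from the PR; the Error()/GRPCStatus() methods and the countsTowardsCircuitBreaker helper are illustrative assumptions, not the exact Mimir implementation.

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// instanceLimitErr is a sketch of a per-instance limit error carrying a gRPC
// status; the real type in the PR may have different fields and methods.
type instanceLimitErr struct {
	status *status.Status
}

func newInstanceLimitError(err error) error {
	return &instanceLimitErr{
		// Errors from hitting per-instance limits are always "unavailable" for gRPC.
		status: status.New(codes.Unavailable, err.Error()),
	}
}

func (e *instanceLimitErr) Error() string { return e.status.Message() }

// GRPCStatus lets google.golang.org/grpc/status.FromError recover the status.
func (e *instanceLimitErr) GRPCStatus() *status.Status { return e.status }

// countsTowardsCircuitBreaker shows how a client-side circuit breaker or
// throttling mechanism could treat Unavailable responses as "server overloaded".
func countsTowardsCircuitBreaker(err error) bool {
	if st, ok := status.FromError(err); ok {
		return st.Code() == codes.Unavailable
	}
	return false
}

func main() {
	err := newInstanceLimitError(fmt.Errorf("max tenants reached"))
	fmt.Println(countsTowardsCircuitBreaker(err)) // true
}
```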
Force-pushed 2295ff5 to 26d0e0d
This allows us to decorate them with extra information for gRPC responses and our logging middleware (to prevent them from being logged, which increases resource usage).

Related #5581
Related weaveworks/common#299

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Force-pushed 26d0e0d to aec69ee
```go
func (e *instanceLimitErr) ShouldLog(context.Context, time.Duration) bool {
	// We increment metrics when hitting per-instance limits and so there's no need to
	// log them, the error doesn't contain any interesting information for us.
	return false
}
```
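For reference, this is roughly how a gRPC logging middleware can honor such a method: it checks for an optional interface on the error before deciding to log. The weaveworks/common change referenced in the commit message is about this; the interface and function names below are an assumption, not the middleware's actual API.

```go
package middlewaresketch

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// optionalLogging mirrors the kind of interface a logging middleware can look
// for before emitting an error log line (exact name/shape is an assumption).
type optionalLogging interface {
	ShouldLog(ctx context.Context, duration time.Duration) bool
}

// maybeLogError sketches the middleware-side decision: log the error unless it
// opts out via ShouldLog, even when it's wrapped somewhere in the chain.
func maybeLogError(ctx context.Context, err error, duration time.Duration) {
	var opt optionalLogging
	if errors.As(err, &opt) && !opt.ShouldLog(ctx, duration) {
		return // the error asked not to be logged (e.g. per-instance limits)
	}
	fmt.Printf("error processing request: %v (took %s)\n", err, duration)
}
```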
If I read this correctly, then this will remove all logging, and as such should be prominently noted in the PR description.
I have an alternative proposal in #5584, to log 1 in N. I think there is useful information in the specific labels that cause problems.
Correct, it will remove all logging for these per-instance errors. We already don't log anything for in-flight requests and all of these increment metrics.
> I have an alternative proposal in #5584, to log 1 in N. I think there is useful information in the specific labels that cause problems.

I don't understand; there are no labels here. This is only for per-instance (ingester) limits, so there's no extra information to log beyond "this thing happened" -- unlike per-tenant limits, which #5584 seems to be addressing (and I agree, sampling is useful for that case).
```
@@ -760,6 +760,11 @@ func (i *Ingester) PushWithCleanup(ctx context.Context, pushReq *push.Request) (

	db, err := i.getOrCreateTSDB(userID, false)
	if err != nil {
		// Check for a particular per-instance limit and return that error directly
		// since it contains extra information for gRPC and our logging middleware.
		if errors.Is(err, errMaxTenantsReached) {
```
The logging middleware can handle wrapped errors, but unfortunately gRPC FromError() does not. I'm wondering if what we should do here, to make the code more generic, is check whether the error implements GRPCStatus() and, if so, not wrap it.
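A sketch of that suggestion, assuming a hypothetical wrapWithUser-style helper (the real function in the codebase differs, and this was made moot by the follow-up below):

```go
package ingestersketch

import (
	"fmt"

	"google.golang.org/grpc/status"
)

// wrapWithUserUnlessStatus is a hypothetical variant of wrapWithUser that
// skips wrapping when the error already carries a gRPC status, so
// status.FromError can still recover the code.
func wrapWithUserUnlessStatus(err error, userID string) error {
	type grpcStatuser interface {
		GRPCStatus() *status.Status
	}
	if _, ok := err.(grpcStatuser); ok {
		// Already decorated for gRPC; return it untouched.
		return err
	}
	// Mirrors the existing behaviour discussed below: build a new error that
	// copies the message rather than wrapping the original.
	return fmt.Errorf("user=%s: %s", userID, err.Error())
}
```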
I'm wrong here as well. FromError() can unwrap too.
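A small demonstration of that point: in reasonably recent grpc-go versions, status.FromError walks the wrap chain (via errors.As), so it recovers the status even from an error wrapped with %w. This is a standalone sketch, not the PR's code, and the behaviour depends on the grpc-go version in use.

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func main() {
	base := status.Error(codes.Unavailable, "max tenants reached")

	// Wrap with %w so the gRPC status is buried one level down.
	wrapped := fmt.Errorf("failed to push to ingester: %w", base)

	// FromError still finds the status in newer grpc-go releases; older
	// releases would return (Unknown, false) here.
	st, ok := status.FromError(wrapped)
	fmt.Println(ok, st.Code()) // true Unavailable
}
```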
So why do we need this logic at all?
Could you improve the existing tests to assert on the returned error and check that gRPC status.FromError() returns the expected code?
> So why do we need this logic at all?

Because wrapWithUser() as it exists today doesn't actually "wrap" the error. It does something like errors.New(err.Error()) and so it returns a brand new error that has no relation to the original error and doesn't implement the GRPCStatus() or ShouldLog() methods. I could change it, but it seemed like it purposefully didn't wrap the existing error.

> Could you improve the existing tests to assert on the returned error and check that gRPC status.FromError() returns the expected code?

Sure, will do.
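A hedged sketch of the kind of assertion being asked for, reusing the newInstanceLimitError constructor quoted earlier and testify-style assertions; the test name and setup are placeholders, not the tests that were actually added.

```go
package ingester

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// TestInstanceLimitErrGRPCCode is a placeholder-style test: build a
// per-instance limit error and assert on the gRPC code it exposes.
func TestInstanceLimitErrGRPCCode(t *testing.T) {
	err := newInstanceLimitError(errors.New("max tenants reached"))
	require.Error(t, err)

	st, ok := status.FromError(err)
	require.True(t, ok, "expected a gRPC status error")
	assert.Equal(t, codes.Unavailable, st.Code())
}
```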
> Because wrapWithUser() as it exists today doesn't actually "wrap" the error.

Ouf... you're right, again.

> I could change it but it seemed like it purposefully didn't wrap the existing error.

No, don't do it. I think it would be unsafe. There's some context in cortexproject/cortex#2004, but the TL;DR is that some errors carry the series labels, and those labels are only safe to read during the execution of push() because they're unmarshalled from the write request into a pool. So before returning, we effectively "make a copy" to ensure the error will be safe after push() has returned.
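To make the distinction concrete, a small standalone sketch: copying the message severs the error chain (which is why the intentional copy keeps pooled label data from escaping, but also drops interfaces like GRPCStatus() and ShouldLog()), while %w preserves the chain and the original value.

```go
package main

import (
	"errors"
	"fmt"
)

var errMaxTenantsReached = errors.New("max tenants reached")

func main() {
	// What wrapWithUser effectively does today: a brand new error that only
	// shares the message text, so errors.Is can no longer see the original.
	copied := errors.New(fmt.Sprintf("user 1234: %s", errMaxTenantsReached))
	fmt.Println(errors.Is(copied, errMaxTenantsReached)) // false

	// A real wrap keeps the chain, so errors.Is (and errors.As for interfaces
	// like GRPCStatus or ShouldLog) still work -- but it also keeps the
	// original value alive, which is exactly what's unsafe for pooled labels.
	wrapped := fmt.Errorf("user 1234: %w", errMaxTenantsReached)
	fmt.Println(errors.Is(wrapped, errMaxTenantsReached)) // true
}
```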
The errMaxTenantsReached and errMaxInMemorySeriesReached errors don't carry any labels, so they're expected to be safe.

As a retrospective, we could have better handled the returned error, maybe implementing a specific interface for unsafe error messages, because it's currently very obscure. But that's for another day :)
> As a retrospective, we could have better handled the returned error, maybe implementing a specific interface for unsafe error messages, because it's currently very obscure. But that's for another day :)

IMO we should create an issue and tackle that, as otherwise this is a recipe for disaster, and it's just a matter of time before it explodes somewhere.

Errors are meant to escape your function; we shouldn't return errors that aren't safe to escape beyond some point.

Edit: issue #6008
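One possible shape for the idea tracked in that issue, purely hypothetical and not what Mimir implements: a marker interface for errors whose data must be copied before it escapes the push path.

```go
package ingestersketch

// unsafeToRetain is a hypothetical marker for errors that reference pooled
// data (e.g. series labels unmarshalled from the write request) and therefore
// must not outlive the push() call that produced them.
type unsafeToRetain interface {
	error
	// Retain returns a copy of the error that is safe to keep and return to
	// callers after push() has finished.
	Retain() error
}

// sanitize would be called at the push() boundary before an error escapes.
func sanitize(err error) error {
	if u, ok := err.(unsafeToRetain); ok {
		return u.Retain()
	}
	return err
}
```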
Approving because the logic is OK, but I'm not sure we actually need to avoid wrapping the per-instance limit errors.
Sorry I got confused between per-instance and per-tenant.
What this PR does
Create a base type for ingester per-instance limit errors. This allows us to decorate them with extra information for gRPC responses and our logging middleware (to prevent them from being logged, which increases resource usage).
Which issue(s) this PR fixes or relates to
Related #5581
Related #5551
Related weaveworks/common#299
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]