Create base error type for ingester per-instance errors and remove logging for them #5585
Conversation
Force-pushed 6765f74 to b4935bd
Force-pushed 93b4276 to 2295ff5
pkg/ingester/instance_limits.go (Outdated)
```go
func newInstanceLimitError(err error) error {
	return &instanceLimitErr{
		// Errors from hitting per-instance limits are always "unavailable" for gRPC
		status: status.New(codes.Unavailable, err.Error()),
```
The motivation for this is being able to use Unavailable for errors that count towards a circuit breaker or error throttling mechanism (discussed internally).
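For context, a minimal sketch of what such an error type could look like and how a caller-side breaker might key off the Unavailable code. Only the constructor above is quoted from the PR; the Error()/GRPCStatus() methods and the countsTowardsCircuitBreaker helper are illustrative assumptions, not the exact Mimir implementation.

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// instanceLimitErr is a sketch of a per-instance limit error carrying a gRPC
// status; the real type in the PR may have different fields and methods.
type instanceLimitErr struct {
	status *status.Status
}

func newInstanceLimitError(err error) error {
	return &instanceLimitErr{
		// Errors from hitting per-instance limits are always "unavailable" for gRPC.
		status: status.New(codes.Unavailable, err.Error()),
	}
}

func (e *instanceLimitErr) Error() string { return e.status.Message() }

// GRPCStatus lets google.golang.org/grpc/status.FromError recover the status.
func (e *instanceLimitErr) GRPCStatus() *status.Status { return e.status }

// countsTowardsCircuitBreaker shows how a client-side circuit breaker or
// throttling mechanism could treat Unavailable responses as "server overloaded".
func countsTowardsCircuitBreaker(err error) bool {
	if st, ok := status.FromError(err); ok {
		return st.Code() == codes.Unavailable
	}
	return false
}

func main() {
	err := newInstanceLimitError(fmt.Errorf("max tenants reached"))
	fmt.Println(countsTowardsCircuitBreaker(err)) // true
}
```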
Force-pushed 2295ff5 to 26d0e0d
This allows us to decorate them with extra information for gRPC responses and our logging middleware (to prevent them from being logged, which increases resource usage).

Related #5581
Related weaveworks/common#299

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Force-pushed 26d0e0d to aec69ee
```go
func (e *instanceLimitErr) ShouldLog(context.Context, time.Duration) bool {
	// We increment metrics when hitting per-instance limits and so there's no need to
	// log them, the error doesn't contain any interesting information for us.
	return false
}
```
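For reference, this is roughly how a gRPC logging middleware can honor such a method: it checks for an optional interface on the error before deciding to log. The weaveworks/common change referenced in the commit message is about this; the interface and function names below are an assumption, not the middleware's actual API.

```go
package middlewaresketch

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// optionalLogging mirrors the kind of interface a logging middleware can look
// for before emitting an error log line (exact name/shape is an assumption).
type optionalLogging interface {
	ShouldLog(ctx context.Context, duration time.Duration) bool
}

// maybeLogError sketches the middleware-side decision: log the error unless it
// opts out via ShouldLog, even when it's wrapped somewhere in the chain.
func maybeLogError(ctx context.Context, err error, duration time.Duration) {
	var opt optionalLogging
	if errors.As(err, &opt) && !opt.ShouldLog(ctx, duration) {
		return // the error asked not to be logged (e.g. per-instance limits)
	}
	fmt.Printf("error processing request: %v (took %s)\n", err, duration)
}
```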
If I read this correctly, then this will remove all logging, and as such should be prominently noted in the PR description.
I have an alternative proposal in #5584, to log 1 in N. I think there is useful information in the specific labels that cause problems.
Correct, it will remove all logging for these per-instance errors. We already don't log anything for in-flight requests and all of these increment metrics.
> I have an alternative proposal in #5584, to log 1 in N. I think there is useful information in the specific labels that cause problems.

I don't understand; there are no labels here. This is only for per-instance (ingester) limits, so there's no extra information to log beyond "this thing happened" -- unlike per-tenant limits, which #5584 seems to be addressing (and I agree, sampling is useful for that case).
```
@@ -760,6 +760,11 @@ func (i *Ingester) PushWithCleanup(ctx context.Context, pushReq *push.Request) (

	db, err := i.getOrCreateTSDB(userID, false)
	if err != nil {
		// Check for a particular per-instance limit and return that error directly
		// since it contains extra information for gRPC and our logging middleware.
		if errors.Is(err, errMaxTenantsReached) {
```
The logging middleware can handle wrapped errors, but unfortunately gRPC FromError() does not. I'm wondering if what we should do here, to make the code more generic, is check whether the error implements GRPCStatus() and, if so, not wrap it.
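A sketch of that suggestion, assuming a hypothetical wrapWithUser-style helper (the real function in the codebase differs, and this was made moot by the follow-up below):

```go
package ingestersketch

import (
	"fmt"

	"google.golang.org/grpc/status"
)

// wrapWithUserUnlessStatus is a hypothetical variant of wrapWithUser that
// skips wrapping when the error already carries a gRPC status, so
// status.FromError can still recover the code.
func wrapWithUserUnlessStatus(err error, userID string) error {
	type grpcStatuser interface {
		GRPCStatus() *status.Status
	}
	if _, ok := err.(grpcStatuser); ok {
		// Already decorated for gRPC; return it untouched.
		return err
	}
	// Mirrors the existing behaviour discussed below: build a new error that
	// copies the message rather than wrapping the original.
	return fmt.Errorf("user=%s: %s", userID, err.Error())
}
```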
I'm wrong here as well. FromError() can unwrap too.
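A small demonstration of that point: in reasonably recent grpc-go versions, status.FromError walks the wrap chain (via errors.As), so it recovers the status even from an error wrapped with %w. This is a standalone sketch, not the PR's code, and the behaviour depends on the grpc-go version in use.

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func main() {
	base := status.Error(codes.Unavailable, "max tenants reached")

	// Wrap with %w so the gRPC status is buried one level down.
	wrapped := fmt.Errorf("failed to push to ingester: %w", base)

	// FromError still finds the status in newer grpc-go releases; older
	// releases would return (Unknown, false) here.
	st, ok := status.FromError(wrapped)
	fmt.Println(ok, st.Code()) // true Unavailable
}
```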
So why do we need this logic at all?
Could you improve the existing tests to assert on the returned error and check that gRPC status.FromError() returns the expected code?
> So why do we need this logic at all?

Because wrapWithUser() as it exists today doesn't actually "wrap" the error. It does something like errors.New(err.Error()) and so it returns a brand new error that has no relation to the original error and doesn't implement the GRPCStatus() or ShouldLog() methods. I could change it, but it seemed like it purposefully didn't wrap the existing error.

> Could you improve the existing tests to assert on the returned error and check that gRPC status.FromError() returns the expected code?

Sure, will do.
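A hedged sketch of the kind of assertion being asked for, reusing the newInstanceLimitError constructor quoted earlier and testify-style assertions; the test name and setup are placeholders, not the tests that were actually added.

```go
package ingester

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// TestInstanceLimitErrGRPCCode is a placeholder-style test: build a
// per-instance limit error and assert on the gRPC code it exposes.
func TestInstanceLimitErrGRPCCode(t *testing.T) {
	err := newInstanceLimitError(errors.New("max tenants reached"))
	require.Error(t, err)

	st, ok := status.FromError(err)
	require.True(t, ok, "expected a gRPC status error")
	assert.Equal(t, codes.Unavailable, st.Code())
}
```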
> Because wrapWithUser() as it exists today doesn't actually "wrap" the error.

Ouf... you're right, again.

> I could change it but it seemed like it purposefully didn't wrap the existing error.

No, don't do it. I think it would be unsafe. There's some context in cortexproject/cortex#2004, but the TL;DR is that some errors carry the series labels, and those labels are only safe to read during the execution of push() because they're unmarshalled from the write request into a pool. So before returning, we effectively "make a copy" to ensure the error will be safe after push() has returned.
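To make the distinction concrete, a small standalone sketch: copying the message severs the error chain (which is why the intentional copy keeps pooled label data from escaping, but also drops interfaces like GRPCStatus() and ShouldLog()), while %w preserves the chain and the original value.

```go
package main

import (
	"errors"
	"fmt"
)

var errMaxTenantsReached = errors.New("max tenants reached")

func main() {
	// What wrapWithUser effectively does today: a brand new error that only
	// shares the message text, so errors.Is can no longer see the original.
	copied := errors.New(fmt.Sprintf("user 1234: %s", errMaxTenantsReached))
	fmt.Println(errors.Is(copied, errMaxTenantsReached)) // false

	// A real wrap keeps the chain, so errors.Is (and errors.As for interfaces
	// like GRPCStatus or ShouldLog) still work -- but it also keeps the
	// original value alive, which is exactly what's unsafe for pooled labels.
	wrapped := fmt.Errorf("user 1234: %w", errMaxTenantsReached)
	fmt.Println(errors.Is(wrapped, errMaxTenantsReached)) // true
}
```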
The errMaxTenantsReached and errMaxInMemorySeriesReached errors don't carry any labels, so they're expected to be safe.

As a retrospective, we could have better handled the returned error, maybe implementing a specific interface for unsafe error messages, because it's currently very obscure. But that's for another day :)
> As a retrospective, we could have better handled the returned error, maybe implementing a specific interface for unsafe error messages, because it's currently very obscure. But that's for another day :)

IMO we should create an issue and tackle that, as otherwise this is a recipe for disaster, and it's just a matter of time before it explodes somewhere.

Errors are meant to escape your function; we shouldn't return errors that aren't safe to escape beyond some point.

Edit: issue #6008
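One possible shape for the idea tracked in that issue, purely hypothetical and not what Mimir implements: a marker interface for errors whose data must be copied before it escapes the push path.

```go
package ingestersketch

// unsafeToRetain is a hypothetical marker for errors that reference pooled
// data (e.g. series labels unmarshalled from the write request) and therefore
// must not outlive the push() call that produced them.
type unsafeToRetain interface {
	error
	// Retain returns a copy of the error that is safe to keep and return to
	// callers after push() has finished.
	Retain() error
}

// sanitize would be called at the push() boundary before an error escapes.
func sanitize(err error) error {
	if u, ok := err.(unsafeToRetain); ok {
		return u.Retain()
	}
	return err
}
```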
Approving because the logic is OK, but I'm not sure we actually need to avoid wrapping the per-instance limit errors.
Sorry I got confused between per-instance and per-tenant.
What this PR does
Create a base type for ingester per-instance limit errors. This allows us to decorate them with extra information for gRPC responses and our logging middleware (to prevent them from being logged, which increases resource usage).
Which issue(s) this PR fixes or relates to
Related #5581
Related #5551
Related weaveworks/common#299
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]