Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[consumer] Allow annotating consumer errors with metadata #9041

Closed
wants to merge 45 commits into from
Closed
Show file tree
Hide file tree
Changes from 35 commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
58704ec
Add new consumererror error types
evan-bradley Dec 5, 2023
a4f738d
impi
evan-bradley Jul 3, 2024
6a932c1
crosslink
evan-bradley Jul 3, 2024
2888dbc
tidy
evan-bradley Jul 3, 2024
4447c00
Merge branch 'main' into issue-7047
evan-bradley Jul 9, 2024
7ddbc0d
tidy
evan-bradley Jul 10, 2024
9e27e93
crosslink
evan-bradley Jul 10, 2024
f4216c5
Update consumer/consumererror/README.md
evan-bradley Jul 14, 2024
a1ca2e1
Address PR feedback
evan-bradley Jul 14, 2024
3fb5426
Merge remote-tracking branch 'upstream/main' into issue-7047
evan-bradley Jul 24, 2024
3b80b56
Update design doc
evan-bradley Jul 24, 2024
fe1802b
Update design doc
evan-bradley Jul 25, 2024
ea959b6
Update design doc
evan-bradley Jul 25, 2024
fc40a8a
Add example
evan-bradley Jul 25, 2024
05d6fa9
Address PR feedback
evan-bradley Jul 26, 2024
d19438a
Include obsreport as an error consumer
evan-bradley Jul 26, 2024
d5f7a37
Draft implementation of error type
evan-bradley Jul 26, 2024
9fe7880
Update
evan-bradley Jul 26, 2024
53cbc05
Improve metadata
evan-bradley Jul 26, 2024
0d5af43
Add metadata to interface
evan-bradley Jul 26, 2024
2aa9e86
Address PR feedback
evan-bradley Jul 30, 2024
8c05443
Merge remote-tracking branch 'upstream/main' into issue-7047
evan-bradley Jul 30, 2024
c8c40ef
Address PR feedback
evan-bradley Aug 5, 2024
c26e31a
Update consumer/consumererror/README.md
evan-bradley Aug 5, 2024
3aadc29
Address PR feedback
evan-bradley Aug 5, 2024
a0b8bfd
Merge remote-tracking branch 'upstream/main' into issue-7047
evan-bradley Aug 5, 2024
6f85be2
Fix interface implementation
evan-bradley Aug 7, 2024
b55f661
go mod tidy
evan-bradley Aug 7, 2024
ef40c0d
Reorder imports
evan-bradley Aug 7, 2024
dba92c6
make crosslink
evan-bradley Aug 7, 2024
dd69f4a
Fix CI
evan-bradley Aug 7, 2024
7300bda
Polish and tests
evan-bradley Aug 8, 2024
e953920
Update changelog
evan-bradley Aug 8, 2024
a3fb82d
Minor cleanup
evan-bradley Aug 8, 2024
12ffaba
Merge remote-tracking branch 'upstream/main' into issue-7047
evan-bradley Aug 8, 2024
b3cbde5
Update consumer/consumererror/error.go
evan-bradley Aug 9, 2024
1d1affc
Address PR feedback
evan-bradley Aug 9, 2024
e776426
Merge remote-tracking branch 'upstream/main' into issue-7047
evan-bradley Aug 29, 2024
404ebd5
Combine ErrorContainer and Error into a single type
evan-bradley Aug 29, 2024
be9b41a
Address PR feedback
evan-bradley Aug 29, 2024
ebe59e6
Add more tests
evan-bradley Aug 30, 2024
0bf7ee0
Merge branch 'main' into issue-7047
evan-bradley Aug 30, 2024
951b019
Minor updates to readme
evan-bradley Aug 30, 2024
3a56990
Clarify example
evan-bradley Sep 3, 2024
55ea049
Merge remote-tracking branch 'upstream/main' into issue-7047
evan-bradley Sep 3, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 27 additions & 0 deletions .chloggen/issue-7047.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: enhancement

# The name of the component, or a single word describing the area of concern, (e.g. otlpreceiver)
component: consumererror

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Introduce an `Error` type that allows recording contextual information

# One or more tracking issues or pull requests related to the change
issues: [7047]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext: |
Currently allows recording status codes on consumer errors,
but will be expanded in the future to record additional data.

# Optional: The change log or logs in which this entry should be included.
# e.g. '[user]' or '[user, api]'
# Include 'user' if the change is relevant to end users.
# Include 'api' if there is a change to a library API.
# Default: '[user]'
change_logs: [api]
6 changes: 3 additions & 3 deletions client/go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -18,9 +18,9 @@ require (
github.com/modern-go/reflect2 v1.0.2 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
go.uber.org/multierr v1.11.0 // indirect
golang.org/x/net v0.25.0 // indirect
golang.org/x/sys v0.20.0 // indirect
golang.org/x/text v0.15.0 // indirect
golang.org/x/net v0.26.0 // indirect
golang.org/x/sys v0.21.0 // indirect
golang.org/x/text v0.16.0 // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20240528184218-531527333157 // indirect
google.golang.org/grpc v1.65.0 // indirect
google.golang.org/protobuf v1.34.2 // indirect
Expand Down
12 changes: 6 additions & 6 deletions client/go.sum

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

306 changes: 306 additions & 0 deletions consumer/consumererror/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,306 @@
# Consumer errors

This package contains error types that should be returned by a consumer when an
error occurs while processing telemetry. The error types included in this
package provide functionality for communicating details about the error for use
upstream in the pipeline. Ideally the error returned by a component in its
`consume` function should be from this package.

## Error classes

**Retryable**: Errors are retryable if re-submitting data to a sink may result
in a successful submission.
Comment on lines +11 to +12
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to carry some informations:

  • [required] what to retry on;
  • [optional] retry interval?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a classification of error, I cover the details of what retryable errors contain here: https://github.com/open-telemetry/opentelemetry-collector/pull/9041/files#diff-e0fa8222784a2c0c2683b70e2b6a7ccf2c54b6477acd7e2518839e38060bdcf5R90.

We discussed this last week and determined that the caller can provide the data to retry, but we're going to focus on exactly what goes into a retryable error after this PR. Right now we're mainly focusing on the HTTP/gRPC errors.


**Permanent**: Errors are permanent if submission will always fail for the
current data. Errors are considered permanent unless they are explicitly marked
as retryable.
Comment on lines +14 to +16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For permanent errors, we should probably always return the "number of not accepted" entries? Is that reasonable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to do that since the caller will already know how many items failed.


## Use cases

**Retry logic**: Errors should be allowed to include information necessary to
perform retries.

**Indicating partial success**: Errors can indicate that not all items were
accepted, for example as in an OTLP partial success message. OTLP backends will
return failed item counts if a partial success occurs, and this information can
be propagated up to a receiver and returned to the caller.

**Communicating network error codes**: Errors should allow embedding information
necessary for the Collector to act as a proxy for a backend, i.e. relay a status
code returned from a backend in a response to a system upstream from the
Collector.
Comment on lines +28 to +31
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @jpkrohling had a more detailed use-case here, but would be interesting to hear what problem we are solving. If I remember correctly was something that the collector should not retry and all retries to be done by the caller, but cannot remember exactly, so better to document in details here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can provide more examples if you like, but the simplest one is the one I described here: the Collector can act as a proxy and relay a code from a backend back to the caller. This works for retries too, if the code is e.g. an HTTP 429 status code.


## Current targets for using errors
evan-bradley marked this conversation as resolved.
Show resolved Hide resolved

**Receivers**: Receivers should be able to consume multiple errors downstream
and determine the best course of action based on the user's configuration. This
may entail either keeping the retry queue inside of the Collector by having the
receiver keep track of retries, or may involve having the caller manage the
retry queue by returning a retryable network code with relevant information.

**scraperhelper**: The scraper helper can use information about errors from
downstream to affect scraping. For example, if the backend is struggling with
the volume of data, scraping could be slowed, or the amount of data collected
could be reduced.

**exporterhelper**: When an exporter returns a retryable error, the
exporterhelper can use this information to retry. Permanent errors will be
forwarded back up the pipeline.

**obsreport**: Recording partial success information can ensure we correctly
track the number of failed telemetry records in the pipeline. Right now, all
records will be considered to be failed, which isn't accurate when partial
successes occur.

## Creating Errors

Errors can be created by calling `consumererror.New(err, opts...)` where `err`
is the underlying error, and `opts` is one of the provided options for supplying
additional metadata:

- `consumererror.WithGRPCStatus`
- `consumererror.WithHTTPStatus`
Comment on lines +61 to +62
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question, can these 2 be merged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a component uses both of these, the assumption is that it has its own HTTP<->gRPC translation, and wants to specify both statuses. The error will produce the given status if it has it, otherwise will convert from the status from the other transport.


The following options are not currently available, but may be made available in
the future:

- `consumererror.WithRetry`
- `consumererror.WithPartial`
- `consumererror.WithMetadata`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not seeing any use-case for this, why do we need it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for extensibility. I don't think we would add this until it was requested, but basically a custom exporter and receiver could communicate using a custom struct type by putting the struct inside the error with WithMetadata.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove it for now? We should problably ensure the model is extendable but not over-prescriptive for something we don't have yet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If extensibility is an explicit goal, I'd like to keep this in the design doc. I've added warnings to things that don't actually exist right now.


All options can be combined, we assume that the component knows what it is doing
when seemingly conflicting options.

Two examples:

- `WithRetry` and `WithPartial` are included together: Partial successes are
considered permanent errors in OTLP, which conflicts with making an error
retryable by including `WithRetry`. However, per our definition of what makes
a permanent error, this error has been marked as retryable, and therefore we
assume the component producing this error supports retyable partial success
errors.
Comment on lines +84 to +89
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my previous question about this and what it means if I know that 5 out of 10 spans failed to send.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I know that 5/10 spans failed, then I can record that in obsreport, but will resubmit all 10; maybe the backend uses the span IDs to make span submission idempotent. I don't have a real use-case in mind for this, what I want to cover in this section is that we can use the rules established here to handle seemingly invalid error combinations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a real use-case in mind for this, what I want to cover in this section is that we can use the rules established here to handle seemingly invalid error combinations.

When in doubt leave it out. I want us to not add any API if we don't have a good use-case.

Copy link
Contributor Author

@evan-bradley evan-bradley Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This doesn't add any API, the point of this paragraph is just to explain how seemingly conflicting error combinations could be reconciled. WithRetry and WithPartial are intended to be used separately, and already have use cases in mind. We just needed to explain what happens if a component author attempts to use them together.

- `WithGRPCStatus` and `WithHTTPStatus` are included together: While the
component likely only sent data over one of these transports, our errors will
produce the given status if it is included on the error, otherwise it will
translate a status from the status for the other transport. If both of these
are given, we assume the component wanted to perform its own status
conversion, and we will simply give the status for the requested transport
without performing any conversion.

**Example**:

```go
consumererror.New(err,
consumererror.WithRetry(
consumerrerror.WithRetryDelay(10 * time.Second)
),
consumererror.WithGRPCStatus(codes.InvalidArgument),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example must be invalid. codes.InvalidArgument should not be retryable.

In general, it still seems like all these options provide too much flexibility, making it confusing to use and easy to misuse.

For example, why GRPC and HTTP have to be options? Why can't we just have different constructors like consumererror.NewFromGRPC(grpc.Status)? The status should have the retry information that we can retrieve if applicable. If we expect exporters to add extra info on top of grpc error, they can manually parse the grpc status and create a collector "internal" error another way.

Copy link
Contributor Author

@evan-bradley evan-bradley Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of an intentionally unusual option combination since we describe above how combinations that are invalid in OTLP are handled. I'll move this and change it to a more normal combination so we're not highlighting anything we wouldn't recommend. It is worth noting that this is only non-retryable per the OTLP spec and other protocols could consider this a retryable code.

I still think options are the best path forward even if they allow states that are not valid in OTLP, but I'm open to exploring different approaches. A few questions on how we would transition from HTTP/gRPC options to constructors:

Why can't we just have different constructors like consumererror.NewFromGRPC(grpc.Status)? The status should have the retry information that we can retrieve if applicable.

Wouldn't this still be susceptible to the same issue, where an exporter creates a gRPC status with both codes.InvalidArgument and retryable information? Also, where would this information be given when using an HTTP constructor like consumer.NewFromHTTP?

If we expect exporters to add extra info on top of grpc error, they can manually parse the grpc status and create a collector "internal" error another way.

Could you give an example of how we would do this?

```

### Retrying data submission

> [!WARNING] This function is currently in the design phase. It is not available
> and may not be added. The below is a design describing how this may work.

If an error is transient, the `WithRetry` option corresponding to the relevant
signal should be used to indicate that the error is retryable and to pass on any
retry settings. These settings can come from the data sink or be determined by
the component, such as through runtime conditions or from user settings.

The data for the retry will be provided by the component performing the retry.
This will require all processing to be completely redone; in the future,
including data from the failed component so as to not retry this processing may
be made as an available option.

To ensure only the failed pipeline branch is retried, the sequence of components
that created the error will be recorded by a pipeline utility as the error goes
back up the pipeline.

**Note**: If retry information is not included in an error, the error will be
considered permanent and will not be retried.

**Usage:**

```go
consumererror.WithRetry(
consumerrerror.WithRetryDelay(10 * time.Second)
)
```

The delay is an optional setting that can be provided if it is available.

### Indicating partial success

> [!WARNING] This function is currently in the design phase. It is not available
> and may not be added. The below is a design describing how this may work.

If the component receives an OTLP partial success message (or other indication
of partial success), it should include this information with a count of the
failed records.

**Usage:**

```go
consumererror.WithPartial(failedRecords)
```

### Indicating error codes from network transports

If the failure occurred due to a network transaction, the exporter should record
the status code of the message received from the backend. This information can
be then relayed to the receiver caller if necessary. Note that when the upstream
component reads a code, it will read a code for its desired transport, and the
code may be translated depending whether the input and output transports are
different. For example, a gRPC exporter may record a gRPC status. If a gRPC
receiver reads this status, it will be exactly the provided status. If an HTTP
receiver reads the status, it wants an HTTP status, and the gRPC status will be
converted to an equivalent HTTP code.

**Usage:**

```go
consumererror.WithGRPCStatus(codes.InvalidArgument)
consumererror.WithHTTPStatus(http.StatusTooManyRequests)
```

### Including custom data

> [!WARNING] This function is currently in the design phase. It is not available
> and may not be added. The below is a design describing how this may work.

Custom data can be included as well for any additional information that needs to
be propagated back up the pipeline. It is up to the consuming component if or
how this data will be used.

**Usage:**

```go
consumererror.WithMetadata(MyMetadataStuct{})
```

To keep error analysis simple when looking at an error upstream in a pipeline,
the component closest to the source of an error or set of errors should make a
decision about the nature of the error. The following are a few places where
special considerations may need to be made.

## Using errors

### Fanouts

When a fanout receives multiple errors, it will combine them with
`(consumererror.Error).Combine(errs...)` and pass them upstream. The upstream
component can then later pull all errors out for analysis.

### Retrieving errors

> [!WARNING] This functionality is currently experimental, and the description
> here is for design purposes. The code snippet may not work as-written.

When a receiver gets a response that includes an error, it can get the data out
by doing something similar to the following. Note that this uses the `ErrorData`
type, which is for reading error data, as opposed to the `Error` type, which is
for recording errors.

```go
cerr := consumererror.Error{}
var errData []consumerError.ErrorData

if errors.As(err, &cerr) {
errData := cerr.Data()

for _, data := range errData {
data.HTTPStatus()
data.Retryable()
data.Partial()
}
}
```

### Error data

> [!WARNING] The description below is a design proposal for how this
> functionality may work. See `error.go` within this package for the current
> functionality.

Obtaining data from an error can be done using an interface that looks something
like this:

```go
type ErrorData interface {
// Returns the underlying error
Error() error

// Second argument is `false` if no code is available.
HTTPStatus() (int, bool)

// Second argument is `false` if no code is available.
GRPCStatus() (*status.Status, bool)

// Second argument is `false` if no retry information is available.
Retryable() (Retryable, bool)

// Second argument is `false` if no partial counts were recorded.
Partial() (Partial, bool)
}

type Retryable struct {}

// Returns nil if no delay was set, indicating to use the default.
// This makes it so a delay of `0` indicates to resend immediately.
func (r *Retryable) Delay() *time.Duration {}

type Partial struct {}
```

## Other considerations

### Mixed error classes

When a receiver sees a mixture of permanent and retryable errors from downstream
in the pipeline, it must first consider whether retries are enabled within the
Collector.

**Retries are enabled**: Ignore the permanent errors, retry data submission for
only components that indicated retryable errors.

**Retries are disabled**: In an asynchronous pipeline, simply do not retry any
data. In a synchronous pipeline, the receiver should return a permanent error
code indicating to the caller that it should not retry the data submission. This
is intended to not induce extra failures where we know the data submission will
fail, but this behavior could be made configurable by the user.

### Signal conversion

When converting between signals in a pipeline, it is expected that the connector
performing the conversion should perform the translation necessary in the error
for any signal item counts. If the converted count cannot be determined, the
full count of pre-converted signals should be returned.

### Asynchronous processing
djaglowski marked this conversation as resolved.
Show resolved Hide resolved

The use of any components that do asynchronous processing in a pipeline will cut
off error backpropagation at the asynchronous component. The asynchronous
component may communicate error information using the Collector's own signals.

## Transitioning

> [!WARNING] This functionality is currently in the design phase. It is not
> available and may not be added. The below is a design describing how this may
> work.

The following describes how to transition to these error types:

- `NewPermanent`: To transition to new permanent errors, call
`consumererror.New` with the relevant metadata included in the invocation.
Errors will be permanent by default going forward.
- `New[Traces|Metrics|Logs]`: These functions will be deprecated in favor of
having the caller provide the data to retry. Current uses can invoke
`consumererror.New` with the `WithRetry` option to retry a request.
- `exporterhelper.NewThrottleRetry`: This will be replaced with `WithRetry`, and
can follow a similar approach as above.

`consumererror.IsPermanent` will be deprecated in favor of checking whether
retry information is available, and only retrying if it has been provided. This
will be possible by calling `ErrorData.Retryable()` and checking for retry
information.
Loading
Loading