error handling proposal #153

Merged
24 commits
1b6737a
error handling proposal
SergeyKanzhelev Jun 21, 2019
ac14996
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Jun 21, 2019
f90437e
added self-monitoring
SergeyKanzhelev Jun 21, 2019
d30f90a
Merge branch 'exceptionsHandling' of https://github.com/SergeyKanzhel…
SergeyKanzhelev Jun 21, 2019
b4eee5a
Merge branch 'master' into exceptionsHandling
c24t Aug 20, 2019
58179c6
Reword principles
c24t Aug 20, 2019
03e937e
Reword guidance
c24t Aug 20, 2019
55bbad2
Reword diagnostics
c24t Aug 20, 2019
dcb740c
Reword exceptions
c24t Aug 20, 2019
6c87edf
Add note on logs, callbacks
c24t Aug 20, 2019
3812350
Formatting
c24t Aug 20, 2019
8c01355
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Sep 20, 2019
669052e
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Sep 30, 2019
4109c10
formatting and a mention of ToString
SergeyKanzhelev Sep 30, 2019
a10d8ea
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 1, 2019
ea993fd
dynamic languages
SergeyKanzhelev Oct 1, 2019
872ebd4
returning noops, not nulls
SergeyKanzhelev Oct 1, 2019
de7cb15
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 1, 2019
a8451f1
Reformat for #192
c24t Oct 1, 2019
e4ddb0c
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 3, 2019
eb4bc45
Update specification/error-handling.md
SergeyKanzhelev Oct 9, 2019
30bb855
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 9, 2019
6fed7dc
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 10, 2019
83f4477
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 11, 2019
46 changes: 46 additions & 0 deletions specification/error-handling.md
@@ -0,0 +1,46 @@
# Error handling in OpenTelemetry
SergeyKanzhelev marked this conversation as resolved.

OpenTelemetry generates telemetry data to help users monitor application code.
In most cases, the work that the library performs is not essential from the perspective of application business logic.
We assume that users would prefer to lose telemetry data rather than have the library significantly change the behavior of the instrumented application.
Member

This line is new, from @reyang's comment at open-telemetry/opentelemetry-python#11 (comment).


OpenTelemetry may be enabled via platform extensibility mechanisms, or dynamically loaded at runtime.
This makes the use of the library non-obvious for end users, and may even be outside of the application developer's control.
This makes for some unique requirements with respect to error handling.

## Basic error handling principles

OpenTelemetry implementations MUST NOT throw unhandled exceptions at run time.
Member

I originally changed this to read:

> The OpenTelemetry API package MUST NOT expose (i.e. throw, return, or otherwise leak) unhandled exceptions to end users at run time.
> OpenTelemetry implementations MUST NOT expose exceptions that are not documented in the API.

to address comments at 1b6737a#r296402616, but on reflection I think it is actually better not to have any checked exceptions in the API. It's also more consistent with @SergeyKanzhelev's original text.

The big unresolved problem here is NPEs.


1. API methods MUST NOT throw unhandled exceptions when used incorrectly by end users.
Member

"Throw unhandled exceptions" is kind of redundant, but may help distinguish between checked exceptions listed in docs/signatures and unchecked. Since I'm proposing outlawing checked exceptions too this might not be necessary.

Member

@OlivierAlbertini OlivierAlbertini Sep 30, 2019

This doc is great! Should we add how those exceptions can affect the span status?
Use case: As a user, I provide a callback in order to add custom attributes to my spans. This callback throws an exception. However, besides this exception, everything is OK. In Zipkin/Jaeger, should I see:

  • this span in error,
  • no exception to the span (CanonicalCode is OK)
  • CanonicalCode UNKNOWN with error attributes
  • error attributes but CanonicalCode is OK
  • an event describing the error and CanonicalCode is OK

I think that user should know that everything is slowing down his application. Let me know! Thanks in advance.

Member

👍 I think that such "internal instrumentation error" information on spans that describes errors not in the instrumented code but the instrumentation itself might be very useful but have not been considered at all yet.

Member Author

@OlivierAlbertini The SDK will never know the intention of a callback to add an attribute. I think the SDK design must try to avoid those kinds of uncertainties. Errors can be exposed as language-specific logs.

Member Author

Also #275

The API and SDK SHOULD provide safe defaults for missing or invalid arguments.
For instance, a name like `empty` may be used if the user passes in `null` as the span name argument during `Span` construction.
2. The API or SDK may _fail fast_ and cause the application to fail on initialization, e.g. because of a bad user config or environment, but MUST NOT cause the application to fail at run time, e.g. due to dynamic config settings received from the Agent.
Member

New line, added for @iredelmeier's comment at 1b6737a#r296935968.

Member

For me this contradicts the statement above:

> We assume that users would prefer to lose telemetry data rather than have the library significantly change the behavior of the instrumented application.

Do we really want to crash the application on startup if there's an issue with the configuration or if the monitoring backend can't be reached?
I'd rather provide means for the user to proactively check if OTel is correctly initialized. I think this check is something we can expect application developers to add to their startup routine. There they can add appropriate error handling (e.g. reporting the misconfiguration to their logging service - or crashing their app if they really want to).

Contributor

@arminru I think the motivation is that, if something fails at start time, you can easily and quickly check what's going on, as opposed to suddenly getting errors once the application has been running for a while.

That being said, we can also probably get more relaxed there, and only fail for fatal cases (again, only at the start).

Member

> Do we really want to crash the application on startup if there's an issue with the configuration or if the monitoring backend can't be reached?

I think yes if we've got a malformed config, no if the backend can't be reached -- and hopefully we're not making any network requests at startup.

3. The SDK MUST NOT throw unhandled exceptions for errors in its own operations.
For example, an exporter should not throw an exception when it cannot reach the endpoint to which it sends telemetry data.
Member

I removed

> SDK MUST NOT crash process by throwing exception or causing OutOfMemoryException when telemetry receiving endpoint cannot be reached

because performance guarantees seem out of scope for this doc.

Contributor

To me it sounds like it was an example of an Exception, instead of that actual OutOfMemoryException (then again, it's probably redundant and we can get rid of that, after your recent changes :) )

Member

I took it to mean that exporters should be careful about catching unchecked exceptions that they might cause, in this case by causing an OOM because the export queue grows unbounded.
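
To make principles 1 and 3 above concrete, here is a minimal Python sketch. It is an illustration only: `Span`, `Exporter`, and the `send` parameter are hypothetical names, not the real OpenTelemetry API. It assumes the fallback span name `empty` mentioned in the proposal and assumes that an exporter reports failure through its return value.

```python
import logging

logger = logging.getLogger("telemetry_sketch")

DEFAULT_SPAN_NAME = "empty"  # fallback name suggested in the proposal text


class Span:
    """Hypothetical span type, not the real OpenTelemetry class."""

    def __init__(self, name):
        # Principle 1: substitute a safe default instead of raising on bad input.
        if not isinstance(name, str) or not name:
            logger.warning("invalid span name %r, using %r", name, DEFAULT_SPAN_NAME)
            name = DEFAULT_SPAN_NAME
        self.name = name


class Exporter:
    """Hypothetical exporter; `send` stands in for a network call."""

    def __init__(self, send):
        self._send = send

    def export(self, spans):
        # Principle 3: report failure via the return value, never by raising.
        try:
            self._send(spans)
            return True
        except Exception:
            logger.exception("export failed, dropping %d span(s)", len(spans))
            return False
```

With this shape, `Span(None).name` is `"empty"`, and an exporter whose endpoint is unreachable returns `False` and logs instead of propagating the error to the application.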


## Guidance

1. API methods that accept external callbacks MUST handle all errors.
2. Background tasks (e.g. threads, asynchronous tasks, and spawned processes) should run in the context of a global error handler to ensure that exceptions do not affect the end user application.
3. Long-running background tasks should not fail permanently in response to internal errors.
Member

New line to expand on the point above.

In general, internal exceptions should only affect the execution context of the request that caused the exception.
4. Internal error handling should follow language-specific conventions.
In general, developers should minimize the scope of error handlers and add special processing for expected exceptions.
5. Beware external callbacks and overrideable interfaces: Expect them to throw.
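
The guidance above can be sketched as follows. This is one possible shape under stated assumptions, not a prescribed design: `call_safely` and `Worker` are hypothetical names. The error handler wraps a single task, so a failing callback affects only that task and the background loop keeps running.

```python
import logging
import queue
import threading

logger = logging.getLogger("telemetry_sketch")


def call_safely(callback, *args):
    """Run a user callback; log and swallow anything it raises."""
    try:
        return callback(*args)
    except Exception:
        logger.exception("user callback raised; ignoring")
        return None


class Worker:
    """Hypothetical background worker that survives failing tasks."""

    def __init__(self):
        self._tasks = queue.Queue()
        self.processed = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, task):
        self._tasks.put(task)

    def shutdown(self):
        self._tasks.put(None)  # sentinel: stop after earlier tasks have run
        self._thread.join()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:
                return
            # The handler scope is one task, so the loop never fails permanently.
            call_safely(task)
            self.processed += 1
```

Catching `Exception` rather than `BaseException` is a deliberate choice here: it lets interpreter-level signals such as `KeyboardInterrupt` pass through, which matches the advice to keep error handlers narrow.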

## Self-diagnostics

All OpenTelemetry libraries -- the API, SDK, exporters, instrumentation adapters, etc. -- are encouraged to expose self-troubleshooting metrics, spans, and other telemetry that can be easily enabled and filtered out by default.

One good example of such telemetry is a `Span` exporter that indicates how much time exporters spend uploading telemetry.
Another example may be a metric exposed by a `SpanProcessor` that describes the current queue size of telemetry data to be uploaded.
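
A rough sketch of both examples, assuming a hypothetical `SelfDiagnostics` holder and a simplified `BatchingProcessor` (neither is a real OpenTelemetry type). The bounded queue is an added assumption of this sketch; it reflects the review discussion about export queues growing without limit.

```python
import logging
import time
from collections import deque

logger = logging.getLogger("telemetry_sketch")


class SelfDiagnostics:
    """Hypothetical holder for self-monitoring values."""

    def __init__(self):
        self.queue_size = 0
        self.dropped = 0
        self.last_export_seconds = 0.0


class BatchingProcessor:
    """Hypothetical stand-in for a SpanProcessor with a bounded queue."""

    def __init__(self, export, diagnostics, max_queue=1000):
        self._export = export
        self._diag = diagnostics
        # A deque with maxlen evicts the oldest entry when it is full.
        self._queue = deque(maxlen=max_queue)

    def on_end(self, span):
        if len(self._queue) == self._queue.maxlen:
            self._diag.dropped += 1  # the oldest entry is about to be evicted
        self._queue.append(span)
        self._diag.queue_size = len(self._queue)

    def flush(self):
        batch = list(self._queue)
        self._queue.clear()
        self._diag.queue_size = 0
        start = time.monotonic()
        try:
            self._export(batch)
        except Exception:
            logger.exception("export failed")
        finally:
            # Record upload time whether or not the export succeeded.
            self._diag.last_export_seconds = time.monotonic() - start
```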

Whenever the library suppresses an error that would otherwise have been exposed to the user, the library SHOULD log the error using language-specific conventions.
Member

This is a new line, I thought logging was worth mentioning separately from callbacks.

SDKs MAY expose callbacks to allow end users to handle self-diagnostics separately from application code.
Member

New line for @jmacd and @arminru's comments at #153 (comment).
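
One way the logging and callback lines could fit together, sketched with hypothetical names (`add_error_handler` and `report_suppressed` are assumptions for illustration, not part of any OpenTelemetry package). Suppressed errors are always logged, and optional user handlers are themselves treated as external callbacks that may throw.

```python
import logging

logger = logging.getLogger("telemetry_sketch")

_error_handlers = []  # hypothetical registry of user-supplied handlers


def add_error_handler(handler):
    """Register a callable that receives each suppressed exception."""
    _error_handlers.append(handler)


def report_suppressed(exc):
    """Log a suppressed error, then notify any registered handlers."""
    logger.error("suppressed internal error: %r", exc)
    for handler in list(_error_handlers):
        try:
            handler(exc)
        except Exception:
            # A user handler is an external callback too, so expect it to throw.
            logger.exception("error handler raised; ignoring")
```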



## Exceptions to the rule

SDK authors MAY supply settings that allow end users to change the library's default error handling behavior.
Member

Removed

> There are situations when end-user wants to know whether API/SDK are used
> correctly. For instance, it may be desirable to not deploy an app with the
> malformed monitoring configuration.

since we say we're allowed to fail fast on configuration errors above.

Application developers may want to run with strict error handling in a staging environment to catch invalid uses of the API, or malformed config.
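
As a sketch of such a setting, assume a hypothetical `TELEMETRY_STRICT` environment variable (not a real OpenTelemetry option) that switches invalid API use from "log and fall back" to "raise". The function names are also illustrative.

```python
import logging
import os

logger = logging.getLogger("telemetry_sketch")


def strict_mode_enabled(environ=None):
    """Read the hypothetical TELEMETRY_STRICT setting."""
    environ = os.environ if environ is None else environ
    return environ.get("TELEMETRY_STRICT", "").lower() in ("1", "true")


def handle_misuse(message, strict):
    """Raise in strict mode; otherwise log and let the caller fall back."""
    if strict:
        raise ValueError(message)
    logger.warning("%s", message)


def start_span(name, strict=False):
    """Hypothetical API entry point returning a plain dict as the span."""
    if not isinstance(name, str) or not name:
        handle_misuse("span name must be a non-empty string", strict)
        name = "empty"  # default behavior: safe fallback
    return {"name": name}
```

In a staging environment a developer might call `start_span(name, strict=strict_mode_enabled())` so that invalid usage surfaces as an exception, while production keeps the default behavior.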