Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error handling proposal #153

Merged
Merged
Changes from 18 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
1b6737a
error handling proposal
SergeyKanzhelev Jun 21, 2019
ac14996
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Jun 21, 2019
f90437e
added self-monitoring
SergeyKanzhelev Jun 21, 2019
d30f90a
Merge branch 'exceptionsHandling' of https://github.com/SergeyKanzhel…
SergeyKanzhelev Jun 21, 2019
b4eee5a
Merge branch 'master' into exceptionsHandling
c24t Aug 20, 2019
58179c6
Reword principles
c24t Aug 20, 2019
03e937e
Reword guidance
c24t Aug 20, 2019
55bbad2
Reword diagnostics
c24t Aug 20, 2019
dcb740c
Reword exceptions
c24t Aug 20, 2019
6c87edf
Add note on logs, callbacks
c24t Aug 20, 2019
3812350
Formatting
c24t Aug 20, 2019
8c01355
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Sep 20, 2019
669052e
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Sep 30, 2019
4109c10
formatting and a mention of ToString
SergeyKanzhelev Sep 30, 2019
a10d8ea
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 1, 2019
ea993fd
dynamic languages
SergeyKanzhelev Oct 1, 2019
872ebd4
returning noops, not nulls
SergeyKanzhelev Oct 1, 2019
de7cb15
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 1, 2019
a8451f1
Reformat for #192
c24t Oct 1, 2019
e4ddb0c
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 3, 2019
eb4bc45
Update specification/error-handling.md
SergeyKanzhelev Oct 9, 2019
30bb855
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 9, 2019
6fed7dc
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 10, 2019
83f4477
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 11, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
87 changes: 87 additions & 0 deletions specification/error-handling.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
# Error handling in OpenTelemetry
SergeyKanzhelev marked this conversation as resolved.
Show resolved Hide resolved

OpenTelemetry generates telemetry data to help users monitor application code.
In most cases, the work that the library performs is not essential from the
perspective of application business logic. We assume that users would prefer to
lose telemetry data rather than have the library significantly change the
behavior of the instrumented application.

OpenTelemetry may be enabled via platform extensibility mechanisms, or
dynamically loaded at runtime. This makes the use of the library non-obvious for
end users, and may even be outside of the application developer's control. This
makes for some unique requirements with respect to error handling.

## Basic error handling principles

OpenTelemetry implementations MUST NOT throw unhandled exceptions at run time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally changed this to read:

The OpenTelemetry API package MUST NOT expose (i.e. throw, return, or otherwise leak) unhandled exceptions to end users at run time.
OpenTelemetry implementations MUST NOT expose exceptions that are not documented in the API.

to address comments at 1b6737a#r296402616, but on reflection I think it is actually better not to have any checked exceptions in the API. It's also more consistent with @SergeyKanzhelev's original text.

The big unresolved problem here is NPEs.


1. API methods MUST NOT throw unhandled exceptions when used incorrectly by end
users. The API and SDK SHOULD provide safe defaults for missing or invalid
arguments. For instance, a name like `empty` may be used if the user passes
in `null` as the span name argument during `Span` construction.
2. The API or SDK may _fail fast_ and cause the application to fail on
initialization, e.g. because of bad user config or a environment, but MUST
NOT cause the application to fail at run time, e.g. due to dynamic config
settings received from the Collector.
3. The SDK MUST NOT throw unhandled exceptions for errors in their own
operations. For example, an exporter should not throw an exception when it
cannot reach the endpoint to which it sends telemetry data.

## Guidance

1. API methods that accept external callbacks MUST handle all errors.
2. Background tasks (e.g. threads, asynchronous tasks, and spawned processes)
should run in the context of a global error handler to ensure that exceptions
do not affect the end user application.
3. Long-running background tasks should not fail permanently in response to
internal errors. In general, internal exceptions should only affect the
execution context of the request that caused the exception.
4. Internal error handling should follow language-specific conventions. In
general, developers should minimize the scope of error handlers and add
special processing for expected exceptions.
5. Beware external callbacks and overrideable interfaces: Expect them to throw.
6. Beware to call any methods that wasn't explicitly provided by API and SDK
users as a callbacks. Method `ToString` that SDK may decide to call on user
object may be badly implemented and lead to stack overflow. It is common that
the application never calls this method and this bad implementation would
never be caught by an application owner.
7. Whenever API call returns values that is expected to be non-`null` value - in
case of error in processing logic - SDK MUST return a "no-op" or any other
"default" object that was (_ideally_) pre-allocated and readily available.
This way API call sites will not crash on attempts to access methods and
properties of a `null` objects.

## Error handling and performance

Error handling and extensive input validation may cause performance degradation,
especially on dynamic languages where the input object types are not guaranteed
in compile time. Runtime type checks will impact performance and are error
prone, exceptions may occur despite the best effort.

It is recommended to have a global exception handling logic that will guarantee
that exceptions are not leaking to the user code. And make a reasonable trade
off of the SDK performance and fullness of type checks that will provide a
better on-error behavior and SDK errors troubleshooting.

## Self-diagnostics

All OpenTelemetry libraries -- the API, SDK, exporters, instrumentation
adapters, etc. -- are encouraged to expose self-troubleshooting metrics, spans,
and other telemetry that can be easily enabled and filtered out by default.

One good example of such telemetry is a `Span` exporter that indicates how much
time exporters spend uploading telemetry. Another example may be a metric
exposed by a `SpanProcessor` that describes the current queue size of telemetry
data to be uploaded.

Whenever the library suppresses an error that would otherwise have been exposed
to the user, the library SHOULD log the error using language-specific
conventions. SDKs MAY expose callbacks to allow end users to handle
self-diagnostics separately from application code.

## Exceptions to the rule

SDK authors MAY supply settings that allow end users to change the library's
default error handling behavior. Application developers may want to run with
strict error handling in a staging environment to catch invalid uses of the API,
or malformed config.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdandrutu you brought up the auditing use case in the SIG call, what do you think about expanding on strict error handling here? E.g.:

Under certain circumstances, it's preferable for an application to fail to complete an operation than to complete it but fail to produce telemetry data.

SDK authors SHOULD provide a configuration option that allows the end user to change the library's default error handling behavior. When this option is enabled, the SDK MUST NOT suppress errors or return placeholder objects.

Application developers may also want to run with strict error handling during tests, or in a staging environment, to catch invalid uses of the API, or malformed config.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reyang IIRC you're interested in this too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to standardize this I'd suggest to have a separate work stream on it. "MAY" can be enough for a first iteration