error handling proposal #153

Merged
Commits
1b6737a
error handling proposal
SergeyKanzhelev Jun 21, 2019
ac14996
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Jun 21, 2019
f90437e
added self-monitoring
SergeyKanzhelev Jun 21, 2019
d30f90a
Merge branch 'exceptionsHandling' of https://github.com/SergeyKanzhel…
SergeyKanzhelev Jun 21, 2019
b4eee5a
Merge branch 'master' into exceptionsHandling
c24t Aug 20, 2019
58179c6
Reword principles
c24t Aug 20, 2019
03e937e
Reword guidance
c24t Aug 20, 2019
55bbad2
Reword diagnostics
c24t Aug 20, 2019
dcb740c
Reword exceptions
c24t Aug 20, 2019
6c87edf
Add note on logs, callbacks
c24t Aug 20, 2019
3812350
Formatting
c24t Aug 20, 2019
8c01355
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Sep 20, 2019
669052e
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Sep 30, 2019
4109c10
formatting and a mention of ToString
SergeyKanzhelev Sep 30, 2019
a10d8ea
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 1, 2019
ea993fd
dynamic languages
SergeyKanzhelev Oct 1, 2019
872ebd4
returning noops, not nulls
SergeyKanzhelev Oct 1, 2019
de7cb15
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 1, 2019
a8451f1
Reformat for #192
c24t Oct 1, 2019
e4ddb0c
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 3, 2019
eb4bc45
Update specification/error-handling.md
SergeyKanzhelev Oct 9, 2019
30bb855
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 9, 2019
6fed7dc
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 10, 2019
83f4477
Merge branch 'master' into exceptionsHandling
SergeyKanzhelev Oct 11, 2019
57 changes: 57 additions & 0 deletions specification/error-handling.md
@@ -0,0 +1,57 @@
# Error handling in OpenTelemetry

OpenTelemetry is a library that in many cases runs inside a customer
application, performing operations that are not essential to the application's
business logic. The OpenTelemetry SDK can also be enabled via platform
extensibility mechanisms, potentially only at runtime, which makes its use
non-obvious to the end user and sometimes even outside the application
developer's control.

This creates some unique requirements for OpenTelemetry error handling practices.

## Basic error handling principles

The OpenTelemetry SDK must not throw or leak unhandled or user-unhandled exceptions.

1. APIs must not throw unhandled exceptions when the API is used incorrectly by

Member

This is an interesting point. I think I have a different opinion:

  1. When a specific implementation is used, no different checks are applied. This means that if the API says a span name is valid when it is not null, then the implementation cannot throw an exception when it is empty.
  2. API can throw exception for things like null span name because that is documented.

Happy to be convinced otherwise, but my understanding is that as long as we apply the same checks across the API and SDK we can throw exception for obvious things.

Member Author

One problem you may run into is that the arguments you pass to the telemetry API may come from an external source, so you may end up with a null string. If you are smart and have read the doc, you will check for null and pass some random name to the API. If you are smart and haven't read the doc, the API will crash and potentially bring the request processing down with it.

Both outcomes can be worked around by applying the same smart default that the customer would need to apply anyway.

Member

Sure, but is running a "blind" process with no monitoring at all, because we cannot export data to the metrics backend, any better? I feel that a fail-fast approach should be used here, and crashing during initialization is very good in this case.

Member

I agree with @bogdandrutu on "API can throw exception for things like null span name because that is documented".

Contributor

Both outcomes can be worked around by applying the same smart default that the customer would need to apply anyway.

I agree with this. At the same time, we need a balance for reporting errors for truly fatal things (e.g. "cannot export data to the metrics backend"), but otherwise throwing exceptions should stay at a minimum level.

Specific case: I do think using a null name for a Span could fall back to a default name instead of throwing. Documenting it is nice, but providing defaults for these kinds of things is a good alternative ;)

Member

I agree re: failing fast at initialization/construction time, but not in ways that might crash the program (or thread or ____) at runtime.

the end user. Smart defaults should be used so that the SDK generally works.
For instance, a name like `empty` MUST be used when a `null` value is passed as

Member

Hmmm I think it's good practice to validate the required argument when end users call our API and throw exceptions when it violates certain constraints. Consider other cases like name being too long, I think it's better to throw exceptions than silently truncate it or use defaults.

Member Author

Would it be better if the application doesn't start because some environment variable was misconfigured (for the resource API, for instance), or if the application sends slightly inconsistent telemetry?

Same for the span name. Would you rather return a 500 to the customer or use `empty` for the null string? (See also the comment on @bogdandrutu's question.)

Member

We had some conversations in Python SDK open-telemetry/opentelemetry-python#11 (comment).

One thing to consider: there is no simple way to determine if span name is too long since different exporters/backends might have different limitations. We don't want the users to see the application running fine for exporter A, and start to see unhandled exceptions when they switch to exporter B.

Contributor

We don't want the users to see the application running fine for exporter A, and start to see unhandled exceptions when they switch to exporter B.

+1

a span name, instead of throwing `NullReferenceException`.
2. The SDK must not throw unhandled exceptions for configuration errors. A wrong
configuration file, environment variables, or config settings received from
the Agent MUST NOT bring the entire process down.
3. The SDK must not throw unhandled exceptions for errors in its own operations.
For example, the SDK MUST NOT crash the process by throwing an exception or
causing `OutOfMemoryException` when the telemetry receiving endpoint cannot be
reached.
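
The first two principles could be sketched roughly as follows. This is only an illustration in Python, not part of the proposal: `start_span`, `read_timeout`, the `SKETCH_EXPORT_TIMEOUT` variable, and the 10-second default are hypothetical names and values chosen for the example. Only the `empty` fallback name comes from the text above.

```python
import logging
import math
import os

logger = logging.getLogger("otel.sketch")

DEFAULT_SPAN_NAME = "empty"      # fallback name taken from the text above
DEFAULT_TIMEOUT_SECONDS = 10.0   # hypothetical default, for illustration only


def start_span(name):
    """Return a span-like dict; a missing or non-string name never raises."""
    if not isinstance(name, str):
        logger.warning("invalid span name %r, using %r", name, DEFAULT_SPAN_NAME)
        name = DEFAULT_SPAN_NAME
    return {"name": name}


def read_timeout(environ=os.environ):
    """Read a hypothetical timeout setting, falling back to the default on bad input."""
    raw = environ.get("SKETCH_EXPORT_TIMEOUT")
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        logger.warning("ignoring malformed timeout %r", raw)
        return DEFAULT_TIMEOUT_SECONDS
    if not (math.isfinite(value) and value > 0):
        logger.warning("ignoring out-of-range timeout %r", raw)
        return DEFAULT_TIMEOUT_SECONDS
    return value
```

In both functions the bad input is logged and replaced with a default, so the caller's request or process startup continues.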

## Guidance

1. Every API call that may call an external callback MUST handle all errors.
2. Every background operation callback, Task, or Thread method should have
global error handling set up (such as a `try{}catch` statement) to ensure
that an exception from this asynchronous operation does not affect the end-user app.
3. Error handling in other cases MUST follow standard language practice. This
typically means reducing the scope of the error handler and adding special
processing for the expected errors.
4. Beware of any call to external callbacks or overridable interfaces. Expect
them to throw.
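
As a minimal sketch of items 1, 2, and 4, assuming a Python SDK: `call_safely` and `export_loop` are hypothetical helper names, and `export` stands for any user-supplied or overridable exporter callback. Errors are logged rather than propagated.

```python
import logging
import queue
import threading

logger = logging.getLogger("otel.sketch")


def call_safely(callback, *args):
    """Invoke an external callback; log and swallow anything it raises."""
    try:
        return callback(*args)
    except Exception:
        logger.exception("callback %r failed", callback)
        return None


def export_loop(items, export):
    """Body of a background thread; no exception from export() escapes it."""
    while True:
        item = items.get()
        if item is None:  # sentinel value: stop the worker
            return
        try:
            export(item)
        except Exception:
            # Drop the item and keep the worker alive for the next one.
            logger.exception("export failed; dropping item")


def start_worker(export):
    """Start the background worker and return its queue and thread."""
    items = queue.Queue()
    worker = threading.Thread(target=export_loop, args=(items, export), daemon=True)
    worker.start()
    return items, worker
```

The handlers catch `Exception` rather than `BaseException`, so interpreter shutdown signals such as `KeyboardInterrupt` still propagate. Whether that is the right boundary is a per-language decision.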

## SDK self-diagnostics

Member

While very valuable, self-diagnostics are inherently complicated. This feels like something that needs a larger discussion.

Contributor

I believe that user-facing instrumentation APIs should never return errors (or throw them).
One reason that users should never receive these is that they'll be tempted to turn around and call the instrumentation library with the error.
The SDK will, however, encounter errors, and a user has a right to know, but not through in-line code. In the opentracing Go library we addressed this by letting the application register a callback for receiving self diagnostics, including errors, that would allow the application to handle self diagnostics in a separate module.

Member

We are doing exactly the same at Dynatrace with our agent SDK:
The SDK API never throws, since we do not want to change the application's behavior and risk crashing it if they don't catch everything properly, just because monitoring wouldn't work. To avoid leaving our SDK users in the dark and to give them guidance for troubleshooting, we also offer a diagnostic callback. It provides immediate logging output in case of errors (see https://github.com/Dynatrace/OneAgent-SDK#logging-callback). This turned out to work well and be pretty helpful in the past.


All OpenTelemetry libraries (API, SDK, exporters, instrumentation adapters,
etc.) are encouraged to expose self-troubleshooting metrics, spans, and other
telemetry that can be easily enabled and is filtered out by default.

A good example of such telemetry is a `Span` from the Zipkin exporter that
indicates how much time the exporter spent uploading telemetry. Another example
is a metric exposed by a SpanProcessor that reports the current queue size of
telemetry waiting to be uploaded.
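
One possible shape for such self-diagnostics is sketched below. It is hedged throughout: `SketchSpanProcessor`, its `self_diagnostics` flag, and the metric names are hypothetical, and a plain list stands in for a real metrics pipeline. The processor records its queue size and export duration only when diagnostics are switched on.

```python
import time
from collections import deque


class SketchSpanProcessor:
    """Hypothetical processor that buffers spans and reports on itself."""

    def __init__(self, export, self_diagnostics=False):
        self._export = export
        self._pending = deque()
        self._self_diagnostics = self_diagnostics  # off by default
        self.diagnostics = []  # (name, value) pairs; stand-in for real metrics

    def on_end(self, span):
        self._pending.append(span)
        self._record("queue_size", len(self._pending))

    def flush(self):
        batch = list(self._pending)
        self._pending.clear()
        started = time.monotonic()
        try:
            self._export(batch)
        except Exception:
            # An export failure becomes a diagnostic, not an exception.
            self._record("export_failures", 1)
        finally:
            self._record("export_seconds", time.monotonic() - started)

    def _record(self, name, value):
        if self._self_diagnostics:
            self.diagnostics.append((name, value))
```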

## Exceptions from the rule

There are situations when the end user wants to know whether the API and SDK are
used correctly. For instance, it may be desirable not to deploy an app with a
malformed monitoring configuration, or to catch an invalid use of the
OpenTelemetry API.

SDK authors may supply a setting that allows changing the default
error handling behavior.
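
Such a setting might look like the sketch below. The proposal does not define one, so `SketchErrorPolicy`, its `strict` flag, and `set_attribute` are hypothetical names used only to show the default swallow-and-log behavior next to an opt-in fail-fast mode.

```python
import logging

logger = logging.getLogger("otel.sketch")


class SketchErrorPolicy:
    """Hypothetical switch between swallow-and-log and fail-fast behavior."""

    def __init__(self, strict=False):
        self.strict = strict  # default: never raise into user code

    def handle(self, error):
        if self.strict:
            raise error
        logger.warning("telemetry error ignored: %s", error)


def set_attribute(span, key, value, policy):
    """Set an attribute on a span-like dict, reporting bad keys via the policy."""
    if not isinstance(key, str) or not key:
        policy.handle(ValueError(f"invalid attribute key: {key!r}"))
        return span  # default mode: leave the span unchanged
    span[key] = value
    return span
```

A test suite or pre-deployment check could construct the policy with `strict=True` to surface misuse, while production keeps the default.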