-
Notifications
You must be signed in to change notification settings - Fork 896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
error handling proposal #153
error handling proposal #153
Conversation
specification/error-handling.md
Outdated
|
||
1. APIs must not throw unhandled exceptions when the API is used incorrectly by | ||
the end user. Smart defaults should be used so that the SDK generally works. | ||
For instance, name like `empty` MUST be used when `null` value was passed as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm I think it's good practice to validate the required argument when end users call our API and throw exceptions when it violates certain constraints. Consider other cases like name being too long, I think it's better to throw exceptions than silently truncate it or use defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better if application doesn't start due to some environment variable was misconfigured (for resource API for instance) or if application sent a slightly inconsistent telemetry?
Same for span name. Would you rather return 500
to customer or use empty
for the null
string? (see also comment for @bogdandrutu 's question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had some conversations in Python SDK open-telemetry/opentelemetry-python#11 (comment).
One thing to consider: there is no simple way to determine if span name is too long since different exporters/backends might have different limitations. We don't want the users to see the application running fine for exporter A, and start to see unhandled exceptions when they switch to exporter B.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want the users to see the application running fine for exporter A, and start to see unhandled exceptions when they switch to exporter B.
+1
specification/error-handling.md
Outdated
|
||
OpenTelemetry SDK must not throw or leak unhandled or user unhandled exceptions. | ||
|
||
1. APIs must not throw unhandled exceptions when the API is used incorrectly by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting point. I think I have a different opinion:
- When a specific implementation is used no different checks are applied - means that if the API says a span name is valid if not null, then implementation cannot throw exception if empty.
- API can throw exception for things like null span name because that is documented.
Happy to be convinced otherwise, but my understanding is that as long as we apply the same checks across the API and SDK we can throw exception for obvious things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one problem that you may run into is that arguments you are passing to telemetry API may be received from external source. So you have a null
string. If you are smart and read the doc - you will check for null
and pass some random name to the API. If you are smart and haven't read the doc - API will crash and potentially bring this request processing with you.
Both outcomes can be worked around for by applying the same smart default as customer would need to anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but running a "blind" process with no monitor at all because we cannot export data to the metrics backend, is that better? I feel that fail fast approach should be used here and crashing during the initialization is very good in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @bogdandrutu on "API can throw exception for things like null span name because that is documented".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both outcomes can be worked around for by applying the same smart default as customer would need to anyway.
I agree with this. At the same time, we need a balance for reporting errors for truly fatal things (i.e. " cannot export data to the metrics backend"), but otherwise throwing exceptions should stay at a minimum level.
Specific case: I do think using a null name for Span
could fallback to using a default name, instead of throwing. Documenting it is nice, but providing defaults for these kind of things is a good alternative ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree re: failing fast at initialization/construction time, but not in ways that might crash the program (or thread or ____) at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc currently mingles API and SDK error handling in a way that I expect to be confusing to multiple parties, particularly anyone either working on or using an alternative to the SDK.
Part of the problem may be upstream, i.e., that the API spec is missing logic and so the SDK is forced to backfill (and therefore error handle) it.
specification/error-handling.md
Outdated
4. Beware of any call to external callbacks or override-able interface. Expect | ||
them to throw. | ||
|
||
## SDK self-diagnostics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While very valuable, self-diagnostics are inherently complicated. This feels like something that needs a larger discussion.
specification/error-handling.md
Outdated
4. Beware of any call to external callbacks or override-able interface. Expect | ||
them to throw. | ||
|
||
## SDK self-diagnostics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that user-facing instrumentation APIs should never return errors (or throw them).
One reason that users should never receive these, is that the'll be tempted to turn around and call the instrumentation library with the error.
The SDK will, however, encounter errors, and it a user has a right to know, but not with in-line code. In the opentracing Go library we addressed this by letting the application register a callback for receiving self diagnostics, including errors, that would allow the application to handle self diagnostics in a separate module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are doing exactly the same at Dynatrace with our agent SDK:
The SDK API never throws, since we do not want to change the application's behavior and risk crashing it if they don't catch everything properly, just because monitoring wouldn't work. In order to not let our SDK users in the dark and give them guidance for troubleshooting we also offer a diagnostic callback. It provides immediate logging output in case of errors (see https://github.com/Dynatrace/OneAgent-SDK#logging-callback). This turned out to work well and be pretty helpful in the past.
This has gone a bit stale. I believe @SergeyKanzhelev is on paternity leave. Any opinions on how we resolve this, or otherwise move it forwards? |
@SergeyKanzhelev: I like Error Handing and Performance paragraph you added above. I think the approach you describe in #153 (comment) makes sense, although, I don't see that explicitly mentioned in the document. Would it make sense to be more explicit, or are we leaving that out to give more freedom in implementation. |
@mwear I've added a note into guidance on returning no-op instead of |
Thanks @SergeyKanzhelev I'll bring this up at the Ruby SIG so folks are aware of it. |
I think there's a fourth option here: that the SDK can rescue errors and return non-null defaults when they're encountered. If it's possible to provide defaults for every argument ("do runtime type checks and provide defaults when they fail") then presumably it's possible to provide the default return value that you'd get by calling the function with all the default args. I don't mean to say this is the right solution, just that it's worth considering with the others. |
specification/error-handling.md
Outdated
SDK authors MAY supply settings that allow end users to change the library's | ||
default error handling behavior. Application developers may want to run with | ||
strict error handling in a staging environment to catch invalid uses of the API, | ||
or malformed config. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bogdandrutu you brought up the auditing use case in the SIG call, what do you think about expanding on strict error handling here? E.g.:
Under certain circumstances, it's preferable for an application to fail to complete an operation than to complete it but fail to produce telemetry data.
SDK authors SHOULD provide a configuration option that allows the end user to change the library's default error handling behavior. When this option is enabled, the SDK MUST NOT suppress errors or return placeholder objects.
Application developers may also want to run with strict error handling during tests, or in a staging environment, to catch invalid uses of the API, or malformed config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reyang IIRC you're interested in this too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to standardize this I'd suggest to have a separate work stream on it. "MAY" can be enough for a first iteration
@c24t Is your comment the same as what is being suggested here: #153 (comment). If so, that is the approach that this document currently recommends. |
Yup. We're on the same page, #153 (comment) seemed to suggest those were the only three options. |
@iredelmeier, thank you for review. I feel that it is addressed now and ask to review again. This proposal was discussed on today's spec SIG and a few before this. And there is a general feeling that we need to document error handling principles.
This doc asks all implementations to be consistent in their error handling practices: "OpenTelemetry implementations MUST NOT throw unhandled exceptions at run time.". Even though it may be hard to control, this is what we believe must simplify operation with the OpenTelemetry given it's nature of a monitoring, not a business critical tool in most cases.
What forum do you suggest this discussion to be held on? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @SergeyKanzhelev, this latest version looks great.
Co-Authored-By: Armin Ruech <armin.ruech@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (even though my approval means nothing here 😄)
Thanks a lot for taking care of this, Sergey and Chris!
this PR had a lot of bake time after was approved. Merging |
* error handling proposal * added self-monitoring * Reword principles * Reword guidance * Reword diagnostics * Reword exceptions * Add note on logs, callbacks * Formatting * formatting and a mention of ToString * dynamic languages * returning noops, not nulls * Reformat for open-telemetry#192 * Update specification/error-handling.md Co-Authored-By: Armin Ruech <armin.ruech@gmail.com>
* error handling proposal * added self-monitoring * Reword principles * Reword guidance * Reword diagnostics * Reword exceptions * Add note on logs, callbacks * Formatting * formatting and a mention of ToString * dynamic languages * returning noops, not nulls * Reformat for open-telemetry#192 * Update specification/error-handling.md Co-Authored-By: Armin Ruech <armin.ruech@gmail.com>
Fix #141
This proposal suggests the change comparing to the current API/SDK implementation in Java.