Add Performance and Blocking specification #130
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# Performance and Blocking of OpenTelemetry API | ||
|
||
This document defines common principles that will help designers create language libraries that are safe to use. | ||
|
||
## Key principles | ||
|
||
Here are the key principles: | ||
|
||
- **The library should not block the end-user application by default.** | ||
- **The library should not consume unbounded memory.** | ||
Are there libraries for which this is not true?

For blocking, opencensus-java blocks end-user application when a queue gets full: census-instrumentation/opencensus-java#1837 (this is why I get concerned about those matters). An easy solution to the blocking matter is to use an unbounded queue to avoid blocking. In compensation, it consumes memory. I don't know the monitoring library which uses an unbounded queue, but I think clarifying it is meaningful. Also, unbounded memory usage matter is related to the log volume matter described in "End-user application should aware of the size of logs" section: https://github.com/open-telemetry/opentelemetry-specification/pull/130/files#diff-44dc82a7e6286380ed89736215beda74R33 .

why do we single out memory only? CPU and latency impact are often more important.

It is true that computation overhead (CPU usage, latency) is also a possible cause of unwelcome blocking as discussed in #130 (comment) . Could you read the above comment? If we need to deep dive into CPU and latency impact matters, it is better to create separate PR/issue, I feel. |
||
|
||
Although monitoring inevitably adds some overhead, the API should degrade the end-user application as little as possible. In particular, it should neither block the end-user application nor consume too much memory. | ||
|
||
See also [Concurrency and Thread-Safety](concurrency.md) if the implementation supports concurrency. | ||
|
||
### Tradeoff between non-blocking and memory consumption | ||
|
||
Incomplete asynchronous I/O tasks or background tasks may consume memory to preserve their state. In such a case, there is a tradeoff between dropping some tasks to prevent memory starvation and keeping all tasks to prevent information loss. | ||
|
||
If a language library faces such a tradeoff, it should provide the following options to the end user: | ||
|
||
- **Prevent information loss**: Preserve all information, at the cost of potentially consuming many resources. | ||
- **Prevent blocking**: Drop some information under overwhelming load, and emit a warning log when information loss starts and when it recovers. | ||
  - Should provide an option to change the dropping threshold. | ||
  - Ideally provides a metric that represents the effective sampling ratio. | ||
  - The language library might provide this option for Logging. | ||
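
As an illustration of the two options, here is a minimal Python sketch. The `ExportBuffer` class, its `drop_threshold` parameter, and the logger name are hypothetical and not part of any OpenTelemetry API. With no threshold the buffer is unbounded (prevent information loss); with a threshold it drops new items once full, warns when dropping starts and when it recovers, and exposes an effective sampling ratio (prevent blocking).

```python
import logging
import queue
import threading

logger = logging.getLogger("telemetry.buffer")  # hypothetical logger name


class ExportBuffer:
    """Hypothetical buffer between the application and an exporter."""

    def __init__(self, drop_threshold=None):
        # drop_threshold=None -> unbounded queue: nothing is lost, memory may grow.
        # drop_threshold=N    -> at most N pending items: extra items are dropped.
        if drop_threshold is not None and drop_threshold < 1:
            raise ValueError("drop_threshold must be a positive integer or None")
        self._queue = queue.Queue(maxsize=drop_threshold or 0)
        self._lock = threading.Lock()
        self._dropping = False
        self.accepted = 0
        self.dropped = 0

    def offer(self, item):
        """Buffer the item without ever blocking; return False if it was dropped."""
        try:
            self._queue.put_nowait(item)
        except queue.Full:
            with self._lock:
                self.dropped += 1
                if not self._dropping:
                    self._dropping = True
                    logger.warning("buffer full: dropping telemetry")
            return False
        with self._lock:
            self.accepted += 1
            if self._dropping:
                self._dropping = False
                logger.warning("buffer recovered: no longer dropping telemetry")
        return True

    def take(self):
        """Remove and return the oldest item (called by the exporting side)."""
        return self._queue.get_nowait()

    def effective_sampling_ratio(self):
        """Fraction of offered items that were actually buffered."""
        with self._lock:
            total = self.accepted + self.dropped
            return 1.0 if total == 0 else self.accepted / total
```

Neither mode blocks the caller in this sketch; the choice is between unbounded memory growth and bounded memory with visible, measurable loss.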
|
||
### End-user application should be aware of the size of logs | ||
|
||
Logging could consume a lot of memory by default if the end-user application emits too many logs. This default behavior is intended to preserve logs rather than drop them. To keep resource usage bounded, the end user should consider reducing the logs that are passed to the exporters. | ||
|
||
Therefore, the language library should provide a way to filter which logs OpenTelemetry captures. End-user applications may want to write many logs to a log file or stdout (or somewhere else) without sending all of them to OpenTelemetry exporters. | ||
|
||
The documentation of the language library should point out that, by default, too many logs consume many resources, and then explain how to filter logs. | ||
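
One way such filtering could look, sketched with Python's standard `logging` module. `ExportFilter` and `ListExportHandler` are hypothetical names; the handler merely stands in for whatever would feed an OpenTelemetry exporter. The application logs at every level, but only records at or above a minimum level from selected loggers reach the export path.

```python
import logging


class ExportFilter(logging.Filter):
    """Hypothetical filter: pass only records worth sending to an exporter."""

    def __init__(self, min_level=logging.WARNING, prefixes=("app.",)):
        super().__init__()
        self.min_level = min_level
        self.prefixes = tuple(prefixes)

    def filter(self, record):
        # Keep the record only if it is severe enough and comes from a
        # logger whose name starts with one of the configured prefixes.
        return (record.levelno >= self.min_level
                and record.name.startswith(self.prefixes))


class ListExportHandler(logging.Handler):
    """Stand-in for a handler that would pass records to an exporter."""

    def __init__(self):
        super().__init__()
        self.exported = []

    def emit(self, record):
        self.exported.append(record.getMessage())


def build_logger(name, export_handler):
    """Attach the filtered export handler to the named logger."""
    app_logger = logging.getLogger(name)
    app_logger.setLevel(logging.DEBUG)  # the application still logs everything
    export_handler.addFilter(ExportFilter())
    app_logger.addHandler(export_handler)
    return app_logger
```

Other handlers (a file or stdout handler, for example) can be attached to the same logger without the filter, so local logs stay complete while the export path stays small.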
|
||
### Shutdown and explicit flushing could block | ||
|
||
The language library could block the end-user application when it shuts down, because it has to flush data on shutdown to prevent information loss. If it blocks on shutdown, the language library should support a user-configurable timeout. | ||
|
||
If the language library supports an explicit flush operation, that operation could also block, but it should support a configurable timeout. | ||
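
A bounded wait on flush and shutdown might be sketched as follows. `BackgroundExporter`, `force_flush`, `shutdown`, and the `timeout` parameter are hypothetical names chosen for illustration, not an API defined by this document. Both calls block for at most `timeout` seconds and report whether all pending data was exported.

```python
import collections
import threading
import time


class BackgroundExporter:
    """Hypothetical exporter that sends items from a background thread."""

    def __init__(self, export_fn):
        self._export_fn = export_fn
        self._items = collections.deque()
        self._cond = threading.Condition()
        self._in_flight = 0
        self._closed = False
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, item):
        """Queue an item without blocking; return False after shutdown."""
        with self._cond:
            if self._closed:
                return False
            self._items.append(item)
            self._cond.notify_all()
            return True

    def _run(self):
        while True:
            with self._cond:
                self._cond.wait_for(lambda: self._items or self._closed)
                if not self._items:
                    return  # closed and fully drained
                item = self._items.popleft()
                self._in_flight += 1
            try:
                self._export_fn(item)
            except Exception:
                pass  # a failing export must not kill the worker; real code would log it
            finally:
                with self._cond:
                    self._in_flight -= 1
                    self._cond.notify_all()

    def force_flush(self, timeout=5.0):
        """Wait up to `timeout` seconds; return True if everything was exported."""
        with self._cond:
            return self._cond.wait_for(
                lambda: not self._items and self._in_flight == 0, timeout)

    def shutdown(self, timeout=5.0):
        """Stop accepting items, then flush with the same bounded wait."""
        with self._cond:
            self._closed = True
            self._cond.notify_all()
        return self.force_flush(timeout)
```

A `False` return tells the caller that the timeout expired with data still pending, so the application can decide whether to wait longer or exit and accept the loss.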
|
||
## Documentation | ||
|
||
If a language-specific implementation has special characteristics that are not described in this document, those characteristics should be documented. |
Consider putting "Instrumentation cannot be a failure modality".

Hmm... I agree with the policy itself: "Instrumentation cannot be a failure modality". But this file/PR focuses on performance and blocking matters.
Could you file it as a separate GitHub issue or something?
The topic relates to error handling, recovery, retry, logging, handling information loss, etc. It does not seem to be so simple.