-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Observability Vision document #188
Add Observability Vision document #188
Conversation
Codecov Report
@@ Coverage Diff @@
## master #188 +/- ##
=======================================
Coverage 73.29% 73.29%
=======================================
Files 108 108
Lines 6329 6329
=======================================
Hits 4639 4639
Misses 1456 1456
Partials 234 234 Continue to review full report at Codecov.
|
- End-to-end latency (from receiver input to exporter output). Note that with multiple receivers/exporters we potentially have NxM data paths, each with different latency (plus different pipelines in the future), so realistically we should likely expose the average of all data paths (perhaps broken down by pipeline). | ||
- Latency broken down by pipeline elements (including exporter network roundtrip latency for request/response protocols). | ||
|
||
“Rate” values must reflect the average rate of the last 10 seconds. Rates must exposed in bytes/sec and units/sec (e.g. spans/sec). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can defer the calculation of rates to metric backend. The service observability can simply record and export cumulative values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's also what is suggested in the text as a possible option / open question:
Note: some of the current values may be calculated as derivatives of cumulative values in the backend, so it is an open question if we want to expose them separately or no.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the reasons when rates may be desirable to be calculated in the service is when you have a deployment where services are not monitored using a metrics monitoring system. Own calculations may not be as good as a full-blown metrics backend can provide but it may be better than nothing to have the rates dumped to service's log file periodically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@songy23 let me know if you think the text is unclear and you want changes there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current statement looks good :)
- Incoming connection rate (new connections per second), broken down by receiver. | ||
- In-memory queue size (in bytes and in units). | ||
- Persistent queue size (when supported). | ||
- End-to-end latency (from receiver input to exporter output). Note that with multiple receivers/exporters we potentially have NxM data paths, each with different latency (plus different pipelines in the future), so realistically we should likely expose the average of all data paths (perhaps broken down by pipeline). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another challenge here is batching: we could record it for each element of the batch but that increases the overall tracking cost
docs/observability.md
Outdated
|
||
### Aggregate Values | ||
|
||
For the following metrics we want to track min, average and max since start: TBD. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We likely want some histograms for some metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point is more to have a consistent set of histograms applicable to specific kind of metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Histogram calculation can be also more expensive. Do we want it to be done in the service? With histograms we are really moving into the realm of full-blown metric backends, perhaps crossing the line of what should be done in-house by the service itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @pjanotti. IMO calculating histogram aggregation is not essentially different from calculating cumulative counts/bytes/etc. It may be more difficult to figure out the rate of histograms but I think that's the case where we can defer it to the backend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this section for now since it is incomplete. We can add it in a future PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. We can take the discussion about histograms in separate.
This is a guidance document that developers can consult with when working on improving own observability of OpenTelemetry Service.
a10268c
to
fc0ea4f
Compare
@songy23 PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* Migrate updates to Prometheus receiver from opencensus-service * Remove unnecessary configuration for adjusting metrics * Update exporters README (#210) Remove stale content, put in place a simple skeleton, and put information for the Zipkin exporter. * Added Jaeger gRPC receiver (#197) * Added Jaeger gRPC receiver Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de> * Addressed first review round Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de> * Changes based on the feedback from the second review Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de> * Fixed import order, check for same ip:port on endpoints Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de> * Reverted the port-check Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de> * Fix README - remove broken links and commands (#211) Signed-off-by: Hui Kang <kangh@us.ibm.com> * Refactor opencensus exporter to make it easily extensible (#212) Refactored oc exporter code to make it easier to import in derived builds without copying all the code. Example derived exporter: https://github.com/owais/example-derived-oc-exporter - Moved internal/compression to root - Split opencensusexporter factory's `CreateTraceExporter` method into - `OCAgentOptions` and `CreateOCAgent` * Use sha256 instead of md5 in node batcher processor (#202) MD5 is insecure cryptographic hash functions, and are therefore unsuitable for security applications. * Add goimports check and fix import order for all files (#209) * Add goimports check and fix import order for all files Updates #155. * Try specifying a version for goimports * Show the diff instead of files and fix import * Fix default in Makefile * Add factory and update config for tail sampling processor (#200) * Add factory and update config for tail sampling processor Updates #146 * Move to package processor * Remove an unnecessary check * Move CreateDefaultConfig to factory and add unit tests * Fix test failure * Remove commented code * Add misspell check and fix all typos (#214) * Add misspell check and fix all typos Updates #155. * Format * Include yaml files * Move tail sampling config to its own package and remove stale configs (#217) Follow up of #200. Second step of #146. * Add Observability Vision document (#188) * Add Observability Vision document This is a guidance document that developers can consult with when working on improving own observability of OpenTelemetry Service. * PR fixes * Add Zipkin exporter factory (#207) * Add Zipkin exporter factory Add the Zipkin exporter configuration factory using the new configuration format. This does not bring many of the settings from the old configuration since they won't be used. In a sub-sequent PR the code of the exporter itself will be changed. * PR Feedback: add 1 test plus some comments * Rename endpoint to http-url and related field * Fix goimports issue * Remove TODO commented code Replaced the TODO commented code with an issue. * Rename the field used to specify destination * Update README to address reviewer comments. * Update module dependencies and import order for metricfamily * Improve OC Receiver Documentation (#203) * First round of receiver and opencensus receiver documentation. * Undo go mod/sum changes * Address initial set of comments. * Address next set of comments. * Address next set of comments. * Fix use of server instead of receiver in comment and explain settings can be mix and matched. * Merged master and fixed mispell error caugh with new tools * Add static check and fix all errors (#218) * Add static check Fixes #155. * Fix most staticcheck errors * More fixes * Fix id_batcher * Rename exporterhelp parameter (#220) The paramater was named exportFormat but it actually used as the exporter name. * Fix build: lower casing error message (#224) * Add jaeger grpc exporter (#219) * Add Jaeger gRPC exporter Adds a Jaeger gRPC exporter. * Update exporters/README.md * Fixes on the exporter/README.md * Add doc.go with package description. * Fix import order on config_test.go * Migrate updates to Prometheus receiver from opencensus-service * Remove unnecessary configuration for adjusting metrics * Update README to address reviewer comments. * Update module dependencies and import order for metricfamily * Fix goimports * Fix staticcheck issues
Since these methods are about the span itself rather than specifically *events* on the span, it makes sense to drop "events" from their names. [Closes open-telemetry#33]
This is a guidance document that developers can consult with when
working on improving own observability of OpenTelemetry Service.