Add exporter API and more details about SDK implementation #186
Conversation
specification/library-guidelines.md
- Minimize burden of implementation for protocol-dependant telemetry exporters. The Telemetry Exporter is expected to be primarily a simple telemetry data encoder and transmitter.
End users should be given the flexibility of making many of the decisions regarding the queuing, retrying, tagging, batching functionality that make the most sense for their application. For example, if an application is deployed with a locally running Agent daemon, the end user may prefer to not have a Retrier since local daemon is highly available. As opposed to that an application may require a reliable delivery of telemetry data to a remote backend that has no guaranteed availability. The end user may choose to use a persistent local queue and a Retrier in this case.
For reliable delivery:
- Do we provide both sync and async exporters? The definition of "sync" is: the exporter gets called at the moment telemetry data is provided, within the execution context of the user's code, and the user's code is blocked until the exporter returns. @24 if you have more input here.
- For queuing and retrying, do we consider a local cache to offload memory (e.g. instead of keeping all the data in memory for retry, swap it to local disk; both memory and disk would need a configurable quota though)?
- We did consider having both sync and async calling conventions in the Exporter interface and decided that supporting async at the interface level complicates things. Exporters that want non-blocking behavior should implement it internally by whatever means they want.
- That is in the realm of language library authors. The queuing (whether in-memory or persistent) is recommended to be implemented via composable exporters that can be chosen by end users. This document intentionally prescribes neither to language library authors nor to end users the type of queue they must implement/use (the reason is specified in this paragraph: it must be the end user's choice depending on their environment / use case).
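To make the composition idea above concrete, here is a minimal Go sketch of such a composable helper — an in-memory queuing exporter that wraps any other exporter. All type and function names are hypothetical illustrations, not definitions from this PR:

```go
package export

import "errors"

// SpanData and SpanExporter are stand-ins for the types discussed in this
// PR; the exact shapes are up to language library authors.
type SpanData struct{}

type SpanExporter interface {
	Export(batch []*SpanData) error
	Shutdown()
}

// QueueExporter is a hypothetical composable helper: it buffers batches in
// memory (up to a configurable capacity) and forwards them to the wrapped
// exporter from a background goroutine, so the caller is never blocked by
// the network. A persistent (on-disk) variant could expose the same shape.
type QueueExporter struct {
	next  SpanExporter
	queue chan []*SpanData
	done  chan struct{}
}

func NewQueueExporter(next SpanExporter, capacity int) *QueueExporter {
	q := &QueueExporter{
		next:  next,
		queue: make(chan []*SpanData, capacity),
		done:  make(chan struct{}),
	}
	go q.run()
	return q
}

func (q *QueueExporter) Export(batch []*SpanData) error {
	select {
	case q.queue <- batch:
		return nil
	default:
		// Quota exceeded: drop rather than block the caller.
		return errors.New("queue full: batch dropped")
	}
}

func (q *QueueExporter) run() {
	for batch := range q.queue {
		_ = q.next.Export(batch) // errors could be counted or retried here
	}
	close(q.done)
}

func (q *QueueExporter) Shutdown() {
	close(q.queue) // run() exits after draining the remaining batches
	<-q.done
	q.next.Shutdown()
}
```

An end user who needs reliable delivery could chain such helpers (for example, wrap a protocol exporter in a retrier and then in a queue), while a user exporting to a local agent could use the protocol exporter directly.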
I see, thanks.
I think we should support in-memory for sure; disk is optional, but I'm open to being convinced otherwise.
specification/library-guidelines.md
- Minimize burden of implementation for protocol-dependant telemetry exporters. The Telemetry Exporter is expected to be primarily a simple telemetry data encoder and transmitter.
End users should be given the flexibility of making many of the decisions regarding the queuing, retrying, tagging, batching functionality that make the most sense for their application. For example, if an application is deployed with a locally running Agent daemon, the end user may prefer to not have a Retrier since local daemon is highly available. As opposed to that an application may require a reliable delivery of telemetry data to a remote backend that has no guaranteed availability. The end user may choose to use a persistent local queue and a Retrier in this case.
For the in-memory queue / persistent local queue - do we expect FIFO, or could there be LIFO?
That is outside the scope of this recommendation document. Either can be implemented and provided as a composable helper by language library authors.
The moment we mention a queue, it already assumes LIFO, FIFO, a priority queue, etc.
I do think there are cases where people need a collection of unsent items without following a strict order. It would be great to stay neutral in the spec.
specification/sdk-exporter.md
## Interface Definition

The exporter must support two functions: **Export** and **Shutdown**. In strongly typed languages typically there will be 2 separate Exporter interfaces, one that accepts Spans (SpanExporter) and one that accepts Metrics (MetricsExporter).
There could be an exporter for logs as well.
I do not think logs are currently supported by OpenTelemetry. In the future, yes, definitely. This actually raises a good question that I have had in my mind for a while:
As we add more telemetry data types we need to think about the proliferation of similar-looking interfaces with similar functionality, and whether we need to aim at reducing conceptual duplication and code duplication.
We've explored logs in OpenCensus: https://github.com/census-instrumentation/opencensus-python/tree/master/contrib/opencensus-ext-azure/opencensus/ext/azure/log_exporter.
One struggle that I have: given that most languages already have established logging infrastructures, should we align with the existing language/library/framework approach, or should we try to align among OpenTelemetry SDKs? In OpenCensus Python I chose the former (instead of a log exporter, we have a log handler which follows the Python paradigm).
Also, considering performance/reliability, most users might want to store the logs to a file and send them to the backend through an agent, instead of using SDKs which run inside the process.
I haven't put significant thought into the OpenTelemetry+logs topic, so it will be difficult for me to engage in a meaningful discussion. Perhaps it is worth discussing logs separately.
> Also, considering performance/reliability, most users might want to store the logs to a file and send them to the backend through an agent, instead of using SDKs which run inside the process.
Yes, there are log collection products and log collection agents that do exactly this (BTW, one of my previous jobs was working on a log collection agent).
#### Go SpanExporter Interface

```go
type SpanExporter interface {
```
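The excerpt above is truncated by the diff view. Based on the surrounding text (an Export function that accepts a batch of spans, plus Shutdown), a completed interface would presumably look roughly like the following sketch; the method signatures and placeholder types are illustrative, not the PR's literal content:

```go
package export

// Placeholder types standing in for whatever the SDK ultimately defines.
type SpanData struct{}
type ExportResult struct{}

// SpanExporter sketches the shape implied by the prose above: one function
// to export a batch of spans and one to shut the exporter down.
type SpanExporter interface {
	// Export receives a batch of spans and reports the outcome of sending
	// them to the destination.
	Export(batch []*SpanData) ExportResult

	// Shutdown is called once when the SDK shuts down, giving the exporter
	// a chance to clean up.
	Shutdown()
}
```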
Should we call it `TraceExporter`?
Open to suggestions, I have no strong preference. It seems to me that since it accepts a batch of Spans, SpanExporter may be a better name, but would be great to know other opinions.
specification/sdk-exporter.md
const (
    Success ExportResultCode = iota
    BadData
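The excerpt cuts off after `BadData`. Judging from the result codes discussed later in this thread (success, a retryable failure, and a non-retryable one, with the exact names still under debate), a complete enumeration might look something like this hypothetical sketch:

```go
package export

// ExportResultCode classifies the outcome of an Export call.
type ExportResultCode int

const (
	// Success: the batch was delivered to the destination.
	Success ExportResultCode = iota
	// FailedRetryable: delivery failed, but the same batch may be retried
	// later (e.g. the destination was temporarily unavailable).
	FailedRetryable
	// FailedNotRetryable: delivery failed and the batch must be dropped,
	// for example because it contained bad data. (Name is illustrative;
	// the thread also discusses BadData and FailedNonRetryable variants.)
	FailedNotRetryable
)
```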
This is a difficult problem, we do have backends which can return HTTP 206 for partial success.
Right. It is hard to tackle this by trying to figure out which part of the data is bad and which is good and should be re-sent. The goal here is to have at least a minimal notion of bad data as a concept so that the senders avoid retrying it. Whether that means the entire batch gets dropped, or the receiver is smart enough to have accepted the good part of the data, is an implementation detail of the specific exporter.
In OpenCensus Python, the export function takes the input data without returning anything. We leave error handling and retry completely to exporters (which gives flexibility, but also means more work in exporters).
If we intend to have Retry in the core SDK (or the common part that can be used to compose the exporter), we need to design it in a way which works with most of the backends, versus a joint-set of backends.
> We leave error handling and retry completely to exporters (which gives flexibility, but also means more work in exporters).
Sorry, I have to disagree. I believe a similar approach was taken in OpenCensus Agent/Collector and we ended up with exporters that don't handle sending errors and just drop the data silently with zero visibility.
Also, you are right that it means more work in exporters. It is unnecessary to implement all this common functionality in every single exporter. The exporters should be as simple as possible so that it is less work for vendors to support OpenTelemetry project.
Error handling should be part of Exporter, and the expectation for Exporter to return proper error code should be made completely obvious.
> Error handling should be part of Exporter, and the expectation for Exporter to return proper error code should be made completely obvious.
Awesome! We're on the same page. Next step is "we need to design it in a way which works with most of the backends, versus a joint-set of backends".
> Error handling should be part of Exporter, and the expectation for Exporter to return proper error code should be made completely obvious.
I just realized that this may not necessarily reflect my thoughts correctly.
What I mean to say is that when the Exporter sends the data to its destination and sending fails, the exporter must clearly process this erroneous situation, must not conceal it and must return an error code that clearly communicates this failure to the caller of Export() function.
That is the purpose of the return result of Export() function. The exporters should not try to do clever things and handle the errors in a way that is invisible to the caller and hide the errors, etc.
What would the exporter expect its caller to do with the return value?
There are some tricky scenarios. For example, the exporter sent data to the backend, the backend accepted the data, then a connection reset happened and the exporter never heard back from the backend. In this case, should the exporter retry by itself, or return an error code and rely on the core SDK to retry (which means the core SDK will keep the data and call the exporter again at some point)? If the exporter expects the SDK to retry, it has to return enough information so the SDK knows how to retry - for example, which items failed, what's the retry policy/period, is there a de-dupe mechanism (for traces we might be able to use trace-id and span-id to de-dupe if we trust the data).
> What would the exporter expect its caller to do with the return value?
There are several possible uses for the return value. For example, if the caller is a "Retrier" helper, then it can retry sending on failures. If the caller is the SDK, it can increment an internal metric counter indicating dropped data, etc.
> There are some tricky scenarios, for example, the exporter sent data to the backend, and the backend accepted the data, then connection reset happened and exporter never heard back from the backend.
This is a good question. I believe it is out of scope for this PR; we should discuss this in our OpenTelemetry protocol scope. I have a specific opinion on what should be done in this case, but to avoid sidetracking this PR I suggest that we discuss it in the "protocol" meeting or PRs.
> If the exporter expects the SDK to retry, it has to return enough information so the SDK knows how to retry - for example, which items failed, what's the retry policy/period, is there a de-dupe mechanism (for traces we might be able to use trace-id and span-id to de-dupe if we trust the data).
IMO this is not an SDK concern. Logic like this should be in separate helper exporters that can be chained. All these questions that you asked are very valid, and we as OpenTelemetry authors are not in the best position to answer them. I believe they should be in the realm of the end user, who knows their own environment and can compose/customize the best behavior for their use case.
However, our concern as OpenTelemetry authors is to empower the end users so that they are able to do this if they want. If we don't have the error in the return value, none of this is possible anymore. The only way would be to include all this functionality in the protocol exporter itself and make the protocol exporter configurable. We would lose composability if we did that and would place a very significant burden on implementors of protocol exporters to support all this functionality. The experience with exporters in OpenCensus Service shows us that this is not going to happen; we would end up with half-baked and non-uniform behavior of protocol exporters.
I believe it is important that we make OpenTelemetry attractive to vendors so that it is very easy for them to add their protocol support to OpenTelemetry. This proposal makes it possible.
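As an illustration of the chained helper exporters described in this comment, here is a rough Go sketch of a "Retrier" that wraps a protocol exporter and retries only batches reported as retryable. The type names echo the examples in this PR, but the field names and retry policy are hypothetical:

```go
package export

import "time"

// Minimal stand-ins for the SDK types discussed in this PR.
type SpanData struct{}

type ExportResultCode int

const (
	Success ExportResultCode = iota
	FailedRetryable
	FailedNotRetryable
)

type ExportResult struct {
	Code         ExportResultCode
	WrappedError error
}

type SpanExporter interface {
	Export(batch []*SpanData) ExportResult
	Shutdown()
}

// Retrier is a hypothetical composable helper exporter. It forwards each
// batch to the wrapped exporter and retries a fixed number of times, but
// only when the wrapped exporter reports a retryable failure.
type Retrier struct {
	Next     SpanExporter
	Attempts int
	Backoff  time.Duration
}

func (r *Retrier) Export(batch []*SpanData) ExportResult {
	var res ExportResult
	for i := 0; i < r.Attempts; i++ {
		res = r.Next.Export(batch)
		if res.Code != FailedRetryable {
			return res // success or permanent failure: stop retrying
		}
		time.Sleep(r.Backoff) // a real helper would likely use exponential backoff
	}
	return res // give up; the caller sees the last failure
}

func (r *Retrier) Shutdown() { r.Next.Shutdown() }
```

The protocol exporter stays a simple encoder/transmitter; the end user decides whether to wrap it in a Retrier, a queue, or nothing at all.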
type ExportResult struct {
    Code ExportResultCode
    WrappedError error
Would this `WrappedError` include information about how the SDK should retry?
Retry policy could vary among backends.
So, as the text says, these are just examples for illustration, and language library authors need to think about the details of how exactly the interface and its input/output data types should look.
If this creates more questions than it answers I would be inclined to completely remove the section. There is no intent here to define an Exporter interface that is idiomatic, complete or even correct at all.
The content and examples look great, but some prose and formatting nits:
- technical terms like "Exporter" and "Retrier" are capitalized but not formatted as `code`
- some lines are hard-wrapped, some aren't
- article use is inconsistent: "SDK" vs "the SDK"
I was surprised to see that there's no requirement to export batches in a particular order, or order spans within a batch, but I think it does make sense to leave this up to the implementation.
I think we should define the batching order and the delivery order, but we can do that in a separate PR. For example, in Java we have a SpanProcessor which is the interface with the SDK, and we provide a batching and a non-batching version that calls into the exporter interface. The main reason to have the SpanProcessor is to give vendors more hooks (access to start and end events, for example), but for the Exporter only the end event is required. Also, the SpanProcessor is called synchronously, while the exporter may be called asynchronously depending on how it is configured.
Fixed.
Paragraphs should be hard-wrapped, the rest shouldn't. Can you point to a specific place?
As a non-native English speaker I struggle with the articles. Help is appreciated. I removed "the" for now.
I don't think ordering of spans should be in the requirements. In a distributed system the global ordering is not guaranteed anyway, and local ordering is unlikely to provide any benefits that are not already provided by simple sorting using timestamps, if needed, at the final place of use.
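For readers unfamiliar with the Java SDK's SpanProcessor mentioned above, a rough Go rendering of the idea might look like the sketch below. The names are illustrative only; the actual interface lives in the Java implementation and is not defined by this PR:

```go
package sdk

// Span and SpanExporter are stand-ins for the SDK types under discussion.
type Span struct{}

type SpanExporter interface {
	Export(batch []*Span)
}

// SpanProcessor is the SDK-facing hook described above: it is called
// synchronously on span start and end, which gives vendors access to both
// events even though an exporter only needs finished spans.
type SpanProcessor interface {
	OnStart(s *Span)
	OnEnd(s *Span)
	Shutdown()
}

// SimpleSpanProcessor is the non-batching variant: it hands each finished
// span straight to the exporter (which may itself be asynchronous).
type SimpleSpanProcessor struct {
	Exporter SpanExporter
}

func (p *SimpleSpanProcessor) OnStart(s *Span) {} // nothing to do on start
func (p *SimpleSpanProcessor) OnEnd(s *Span)   { p.Exporter.Export([]*Span{s}) }
func (p *SimpleSpanProcessor) Shutdown()       {}
```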
specification/sdk-exporter.md
The goals of the interface are:

- Minimize burden of implementation for protocol-dependant telemetry exporters. The protocol exporter is expected to be primarily a simple telemetry data encoder and transmitter.
Suggested change: "protocol-dependant" → "protocol-dependent":
- Minimize burden of implementation for protocol-dependent telemetry exporters. The protocol exporter is expected to be primarily a simple telemetry data encoder and transmitter.
specification/sdk-exporter.md
Outdated
|
||
- Success - batch is successfully exported. For protocol exporters this typically means that the data is sent over the wire and delivered to the destination server.

- BadData - batch contains bad data and cannot be sent. The caller must not retry exporting the same batch. The batch must be dropped.
Should we drop the whole batch or export as much "good data" as possible?
Drop the whole batch. It is difficult to identify "good data".
Reviewers, I made fixes. Please have another look.
### Shutdown()

Shuts down the exporter. Called when SDK is shut down. This is an opportunity for exporter to do any cleanup required.
Need more clarification:
1. `Shutdown` is called when the SDK is shutting down; it should block until the clean up is done (or failed).
2. `Shutdown` should be called only once for a given exporter.
3. When the SDK calls `Shutdown`, there should be no further calls to `Export`. The exporter should make sure that subsequent calls to `Export` return failure immediately.
I will add 2 and 3 to the spec. Good points.
Not sure about 1. For example, if `Shutdown` is called for the purpose of restarting the application, it may be undesirable to block for too long. If `Shutdown` is required to block until all data contained in the exporter is delivered to the destination, and there is network unavailability, this may be too much.
I'd like to see some opinions and learn about use cases.
In OpenCensus Python we have a `grace_period` which controls the maximum wait time. Might be something to consider.
We can say "Shutdown should not block indefinitely" and let language authors decide if they want to make the shutdown timeout configurable.
That's a good balance.
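A sketch of how "should not block indefinitely" could translate into code: a shutdown that waits for pending work up to a configurable grace period. The mechanism and names are illustrative only, not part of the spec:

```go
package export

import (
	"context"
	"time"
)

// Exporter stands in for any exporter with pending asynchronous work.
type Exporter struct {
	done chan struct{} // closed when all pending exports have completed
	// GracePeriod bounds how long Shutdown may block; zero means "don't wait".
	GracePeriod time.Duration
}

// Shutdown waits for in-flight data, but never longer than the grace period.
func (e *Exporter) Shutdown() error {
	ctx, cancel := context.WithTimeout(context.Background(), e.GracePeriod)
	defer cancel()

	select {
	case <-e.done: // cleanup finished in time
		return nil
	case <-ctx.Done(): // grace period expired; give up rather than block forever
		return ctx.Err()
	}
}
```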
- FailedRetryable - cannot export to the destination. The caller should record the error and may retry exporting the same batch after some time. This for example can happen when the destination is unavailable, there is a network error or endpoint does not exist.

### Shutdown()
Do we need a `flush`?
Since we have `shutdown`, the use case for flush is not clear to me.
We can add it if we see a clear use case. Any thoughts?
FYI - in this PR shutdown and flush are both mentioned. There is no general agreement on whether we need flush or not at this moment.
I think we should keep the barrier high for adding new functions. If we don't have a good use case for `flush`, let's keep it out for now.
Makes sense.
LGTM in general. We might want to revisit some names (e.g. `SpanExporter` or `TraceExporter`; `FailedNoneRetryable`, `FailedNonRetryable`, or `FailedNonretryable`), which don't have to block this PR though.
Thanks @tigrannajaryan for driving this!
Approvers, please merge this PR, it has 2 approvals now.
This clarifies what functionality should be the core part of the SDK and what should be available for optional usage and also defines the API that exporters need to define.
Approvers, please merge.
- Remove duplication with open-telemetry#186
- Mention about configurable timeout of flush operation
* Add Performance and Blocking specification
  Performance and Blocking specification is specified in a separate document and is linked from Language Library Design principles document. Implements issue: #94
* PR fix (#94).
  - Write about Metrics & Logging to cover entire API
  - Write about shut down / flush operations
  - Leave room for blocking implementation options (should not block "as default behavior")
  - Grammar & syntax fix
* PR fix (#94).
  - Not limit for tracing, metrics.
* PR fix (#94).
  - Mentioned about inevitable overhead
  - Shutdown may block, but it should support configurable timeout also
* PR fix (#94)
  - s/traces/telemetry data/
  - Syntax fix
  Co-Authored-By: Yang Song <songy23@users.noreply.github.com>
* PR fix (#130)
  - Remove duplication with #186
  - Mention about configurable timeout of flush operation
* PR fix (#130)
  - Not specify default strategy (blocking or information loss)
…metry#186)
* Add exporter API and more details about SDK implementation
  This clarifies what functionality should be the core part of the SDK and what should be available for optional usage and also defines the API that exporters need to define.
* PR fixes
This clarifies what functionality should be the core part of
the SDK and what should be available for optional usage and
also defines the API that exporters need to define.
Co-authored with @bogdandrutu