Add exporter API and more details about SDK implementation #186

Merged

Conversation

tigrannajaryan
Member

@tigrannajaryan tigrannajaryan commented Jul 18, 2019

This clarifies which functionality should be a core part of
the SDK and which should be available for optional use, and
also defines the API that exporters need to implement.

Co-authored with @bogdandrutu


- Minimize burden of implementation for protocol-dependant telemetry exporters. The Telemetry Exporter is expected to be primarily a simple telemetry data encoder and transmitter.
End users should be given the flexibility of making many of the decisions regarding the queuing, retrying, tagging, batching functionality that make the most sense for their application. For example, if an application is deployed with a locally running Agent daemon, the end user may prefer to not have a Retrier since local daemon is highly available. As opposed to that an application may require a reliable delivery of telemetry data to a remote backend that has no guaranteed availability. The end user may choose to use a persistent local queue and a Retrier in this case.
Member

For reliable delivery:

  1. Do we provide both sync and async exporters? The definition of "sync" is: the exporter gets called at the moment telemetry data is provided, within the execution context of the user's code, and the user's code is blocked until the exporter returns. @c24t if you have more input here.
  2. For queuing and retrying, do we consider a local cache to offload memory (e.g. instead of keeping all the data in memory for retry, swap it to local disk; both memory and disk would need a configurable quota)?

Member Author
@tigrannajaryan tigrannajaryan Jul 18, 2019

@reyang

  1. We did consider having both sync and async calling conventions in the Exporter interface and decided that supporting async at the interface level complicates things. Exporters that want non-blocking behavior should implement it internally by whatever means they want.

  2. That is in the realm of language library authors. The queuing (whether in-memory or persistent) is recommended to be implemented via composable exporters that can be chosen by end users (see the sketch after this thread). This document intentionally does not prescribe, to either language library authors or end users, the type of queue they must implement/use. The reason is stated in this paragraph: it must be the end user's choice, depending on their environment / use case.

Member

I see, thanks.

Member

I think we should support in-memory for sure; disk is optional, but I am open to being convinced otherwise.
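
To make the composability idea discussed in this thread concrete, here is a minimal sketch (not part of this PR; names such as `Span`, `SpanExporter` and `RetryingExporter` are assumptions for illustration) of a helper exporter that wraps any protocol exporter and retries retryable failures:

```go
package exporterhelper

import (
	"log"
	"time"
)

// Span stands in for the SDK's span data type.
type Span struct{ Name string }

// ExportResultCode mirrors the result codes discussed in this PR.
type ExportResultCode int

const (
	Success ExportResultCode = iota
	BadData
	FailedRetryable
)

// SpanExporter is the minimal "encode and transmit" interface.
type SpanExporter interface {
	Export(batch []Span) ExportResultCode
	Shutdown()
}

// RetryingExporter is a composable helper an end user can opt into:
// it wraps any SpanExporter and retries batches that fail with a
// retryable error, keeping the protocol exporter itself simple.
type RetryingExporter struct {
	Next     SpanExporter
	Attempts int
	Backoff  time.Duration
}

func (r *RetryingExporter) Export(batch []Span) ExportResultCode {
	code := FailedRetryable
	for i := 0; i < r.Attempts; i++ {
		code = r.Next.Export(batch)
		if code != FailedRetryable {
			return code // Success or BadData: never retry.
		}
		log.Printf("export failed, will retry batch of %d spans", len(batch))
		time.Sleep(r.Backoff)
	}
	return code
}

func (r *RetryingExporter) Shutdown() { r.Next.Shutdown() }
```

An application talking to a highly available local agent could simply not wrap its exporter, while one sending to a remote backend could add this (or a persistent-queue equivalent) without the protocol exporter changing.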


- Minimize burden of implementation for protocol-dependant telemetry exporters. The Telemetry Exporter is expected to be primarily a simple telemetry data encoder and transmitter.
End users should be given the flexibility of making many of the decisions regarding the queuing, retrying, tagging, batching functionality that make the most sense for their application. For example, if an application is deployed with a locally running Agent daemon, the end user may prefer to not have a Retrier since local daemon is highly available. As opposed to that an application may require a reliable delivery of telemetry data to a remote backend that has no guaranteed availability. The end user may choose to use a persistent local queue and a Retrier in this case.
Member

For the in-memory queue / persistent local queue: do we expect FIFO, or could there be LIFO?

Member Author

That is outside the scope of this recommendation document. Either can be implemented and provided as a composable helper by language library authors.

Member

The moment we mention a queue, it already implies LIFO, FIFO, a priority queue, etc.
I do think there are cases where people need a collection of unsent items that does not follow a strict order. It would be great to stay neutral in the spec.


## Interface Definition

The exporter must support two functions: **Export** and **Shutdown**. In strongly typed languages typically there will be 2 separate Exporter interfaces, one that accepts Spans (SpanExporter) and one that accepts Metrics (MetricsExporter).
Member

There could be an exporter for logs as well.

Member Author

I do not think logs are currently supported by OpenTelemetry. In the future, yes, definitely. This actually raises a good question that I have had in mind for a while:

As we add more telemetry data types we need to think about the proliferation of similar-looking interfaces with similar functionality, and whether we should aim to reduce conceptual duplication and code duplication.

Member

We've explored logs in OpenCensus https://github.com/census-instrumentation/opencensus-python/tree/master/contrib/opencensus-ext-azure/opencensus/ext/azure/log_exporter.

One struggle that I have: given that most languages already have established logging infrastructures, should we align with the existing language/library/framework approach, or should we try to align across OpenTelemetry SDKs? In OpenCensus Python I chose the former (instead of a log exporter, we have a log handler, which follows the Python paradigm).

Also, considering performance/reliability, most users might want to store the logs in a file and send them to the backend through an agent, instead of using SDKs that run inside the process.

Member Author

I haven't put significant thought into the OpenTelemetry + logs topic, so it will be difficult for me to engage in a meaningful discussion. Perhaps it is worth discussing logs separately.

Also, considering performance/reliability, most users might want to store the logs in a file and send them to the backend through an agent, instead of using SDKs that run inside the process.

Yes, there are log collection products and log collection agents that do exactly this (BTW, one of my previous jobs involved working on a log collection agent).

#### Go SpanExporter Interface

```go
type SpanExporter interface {
```
Member

Should we call it TraceExporter?

Member Author

Open to suggestions; I have no strong preference. It seems to me that since it accepts a batch of Spans, SpanExporter may be the better name, but it would be great to hear other opinions.
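
For context in this thread, here is a hedged sketch of what the full pair of interfaces described under "Interface Definition" might look like (illustration only, not the PR's exact definition; `SpanData`, `MetricsData` and the result type are placeholders):

```go
package exporter

// Placeholder telemetry data types; the real SDK types would be used here.
type SpanData struct{}
type MetricsData struct{}

// ExportResultCode is the minimal set of outcomes discussed in this PR.
type ExportResultCode int

const (
	Success ExportResultCode = iota
	BadData
	FailedRetryable
)

// SpanExporter accepts batches of spans.
type SpanExporter interface {
	Export(batch []SpanData) ExportResultCode
	Shutdown()
}

// MetricsExporter accepts batches of metrics.
type MetricsExporter interface {
	Export(batch []MetricsData) ExportResultCode
	Shutdown()
}
```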


const (
Success ExportResultCode = iota
BadData
Member

This is a difficult problem; we do have backends that can return HTTP 206 for partial success.

Member Author

Right. It is hard to tackle this by trying to figure out which part of the data is bad and which is good and should be re-sent. The goal here is to have at least a minimal notion of bad data as a concept, so that senders avoid retrying it. Whether that means the entire batch gets dropped, or the receiver is smart enough to have accepted the good part of the data, is an implementation detail of the specific exporter.

Member

In OpenCensus Python, the export function takes the input data without returning anything. We leave error handling and retry completely to exporters (which gives flexibility, but also means more work in exporters).
If we intend to have Retry in the core SDK (or the common part that can be used to compose the exporter), we need to design it in a way which works with most of the backends, versus a joint-set of backends.

Member Author

We leave error handling and retry completely to exporters (which gives flexibility, but also means more work in exporters).

Sorry, I have to disagree. I believe a similar approach was taken in the OpenCensus Agent/Collector, and we ended up with exporters that don't handle sending errors and just drop the data silently, with zero visibility.

Also, you are right that it means more work in exporters. It is unnecessary to implement all this common functionality in every single exporter. The exporters should be as simple as possible so that it is less work for vendors to support the OpenTelemetry project.

Error handling should be part of Exporter, and the expectation for Exporter to return proper error code should be made completely obvious.

Member

Error handling should be part of Exporter, and the expectation for Exporter to return proper error code should be made completely obvious.

Awesome! We're on the same page. Next step is "we need to design it in a way which works with most of the backends, versus a joint-set of backends".

Member Author
@tigrannajaryan tigrannajaryan Jul 18, 2019

Error handling should be part of Exporter, and the expectation for Exporter to return proper error code should be made completely obvious.

I just realized that this may not necessarily reflect my thoughts correctly.
What I meant to say is that when the Exporter sends the data to its destination and sending fails, the exporter must clearly handle this erroneous situation, must not conceal it, and must return an error code that clearly communicates this failure to the caller of the Export() function.

That is the purpose of the return value of the Export() function. Exporters should not try to do clever things and handle errors in a way that is invisible to the caller, hiding the errors, etc.

Member

What would the exporter expect its caller to do with the return value?

There are some tricky scenarios. For example: the exporter sent data to the backend, the backend accepted the data, then a connection reset happened and the exporter never heard back from the backend. In this case, should the exporter retry by itself, or return an error code and rely on the core SDK to retry (which means the core SDK will keep the data and call the exporter again at some point)? If the exporter expects the SDK to retry, it has to return enough information so the SDK knows how to retry: for example, which items failed, what the retry policy/period is, and whether there is a de-dupe mechanism (for traces we might be able to use trace-id and span-id to de-dupe, if we trust the data).

Member Author

What would the exporter expect its caller to do with the return value?

There are several possible uses for the return value. For example, if the caller is a "Retrier" helper, it can retry sending on failure. If the caller is the SDK, it can increment an internal metric counter indicating dropped data, etc.

There are some tricky scenarios. For example: the exporter sent data to the backend, the backend accepted the data, then a connection reset happened and the exporter never heard back from the backend.

This is a good question. I believe it is out of scope for this PR; we should discuss it within the OpenTelemetry protocol scope. I have a specific opinion on what should be done in this case, but to avoid sidetracking this PR I suggest that we discuss it in the "protocol" meeting or PRs.

If the exporter expects the SDK to retry, it has to return enough information so the SDK knows how to retry: for example, which items failed, what the retry policy/period is, and whether there is a de-dupe mechanism (for traces we might be able to use trace-id and span-id to de-dupe, if we trust the data).

IMO this is not an SDK concern. Logic like this should live in separate helper exporters that can be chained. All the questions you asked are very valid, and we as OpenTelemetry authors are not in the best position to answer them. I believe they belong in the realm of the end user, who knows their own environment and can compose/customize the best behavior for their use case.

However, our concern as OpenTelemetry authors is to empower end users so that they are able to do this if they want. If we don't have the error in the return value, none of this is possible anymore. The only way would be to include all this functionality in the protocol exporter itself and make the protocol exporter configurable. We would lose composability if we did that, and we would place a very significant burden on implementors of protocol exporters to support all this functionality. The experience with exporters in the OpenCensus Service shows us that this is not going to happen; we would end up with half-baked and non-uniform behavior of protocol exporters.

I believe it is important that we make OpenTelemetry attractive to vendors, so that it is very easy for them to add support for their protocol to OpenTelemetry. This proposal makes that possible.


type ExportResult struct {
Code ExportResultCode
WrappedError error
Member

Would this WrappedError include information about how the SDK should retry?
The retry policy could vary among backends.

Member Author

As the text says, these are just examples for illustration, and language library authors need to think about the details of how exactly the interface and its input/output data types should look.

If this creates more questions than it answers, I would be inclined to remove the section completely. There is no intent here to define an Exporter interface that is idiomatic, complete, or even correct at all.
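
Purely as an illustration of the discussion above (these names are assumptions, not the spec's final shape), here is a sketch of how a caller, whether the SDK or a hypothetical retrying helper, might act on the result codes:

```go
package main

import (
	"errors"
	"fmt"
)

type ExportResultCode int

const (
	Success ExportResultCode = iota
	BadData
	FailedRetryable
)

// ExportResult mirrors the struct excerpted above: a code plus the
// underlying error for visibility.
type ExportResult struct {
	Code         ExportResultCode
	WrappedError error
}

// handle shows how a caller (SDK or a chained helper) might react.
func handle(res ExportResult, retry func() ExportResult) {
	switch res.Code {
	case Success:
		// Nothing to do.
	case BadData:
		// Drop the batch: retrying the same data would fail again.
		fmt.Println("dropping batch:", res.WrappedError)
	case FailedRetryable:
		// Record the error; a retrying helper may try again later.
		fmt.Println("will retry:", res.WrappedError)
		_ = retry()
	}
}

func main() {
	res := ExportResult{Code: FailedRetryable, WrappedError: errors.New("connection reset")}
	handle(res, func() ExportResult { return ExportResult{Code: Success} })
}
```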

@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/libguide branch 2 times, most recently from 824ceae to 670eb10 on July 18, 2019 20:54
Member
@c24t c24t left a comment

The content and examples look great, but some prose and formatting nits:

  • technical terms like "Exporter" and "Retrier" are capitalized but not formatted as code
  • some lines are hard-wrapped, some aren't
  • article use is inconsistent: "SDK" vs "the SDK"

I was surprised to see that there's no requirement to export batches in a particular order, or order spans within a batch, but I think it does make sense to leave this up to the implementation.

@bogdandrutu
Member

I think we should define the batching order and the delivery order, but we can do that in a separate PR. For example, in Java we have a SpanProcessor, which is the interface to the SDK, and we provide a batching and a non-batching version that call into the exporter interface.

The main reason to have the SpanProcessor is to give vendors more hooks (access to start and end events, for example), but for the Exporter only the end event is required. Also, the SpanProcessor is called synchronously, while the exporter may be called asynchronously depending on how it is configured.
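
For readers unfamiliar with the pattern described above, here is a rough Go sketch of the processor-vs-exporter split (all names are assumptions for illustration; this is not the Java SDK's actual API):

```go
package processor

// Span stands in for the SDK's span data type.
type Span struct{}

// SpanExporter only needs finished spans.
type SpanExporter interface {
	Export(batch []Span)
}

// SpanProcessor is the SDK-facing hook: it sees span start and end events.
type SpanProcessor interface {
	OnStart(s Span)
	OnEnd(s Span)
}

// SimpleProcessor forwards each finished span to the exporter immediately.
type SimpleProcessor struct{ Exporter SpanExporter }

func (p *SimpleProcessor) OnStart(Span) {}
func (p *SimpleProcessor) OnEnd(s Span) { p.Exporter.Export([]Span{s}) }

// BatchingProcessor accumulates finished spans and exports them in batches.
type BatchingProcessor struct {
	Exporter SpanExporter
	MaxBatch int
	buf      []Span
}

func (p *BatchingProcessor) OnStart(Span) {}
func (p *BatchingProcessor) OnEnd(s Span) {
	p.buf = append(p.buf, s)
	if len(p.buf) >= p.MaxBatch {
		p.Exporter.Export(p.buf)
		p.buf = nil
	}
}
```

In this sketch the processor hooks are invoked synchronously by the SDK, while the batching variant is the natural place to hand work to an exporter asynchronously.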

@tigrannajaryan
Member Author

@c24t

technical terms like "Exporter" and "Retrier" are capitalized but not formatted as code

Fixed.

some lines are hard-wrapped, some aren't

Paragraphs should be hard-wrapped, the rest shouldn't. Can you point to a specific place?

article use is inconsistent: "SDK" vs "the SDK"

As a non-native English speaker I struggle with articles. Help is appreciated. I removed "the" for now.

I was surprised to see that there's no requirement to export batches in a particular order, or order spans within a batch, but I think it does make sense to leave this up to the implementation.

I don't think ordering of spans should be in the requirements. In a distributed system global ordering is not guaranteed anyway, and local ordering is unlikely to provide any benefit that is not already provided by simply sorting by timestamp at the final place of use, if needed.


The goals of the interface are:

- Minimize burden of implementation for protocol-dependant telemetry exporters. The protocol exporter is expected to be primarily a simple telemetry data encoder and transmitter.
Member

Suggested change
- Minimize burden of implementation for protocol-dependant telemetry exporters. The protocol exporter is expected to be primarily a simple telemetry data encoder and transmitter.
- Minimize burden of implementation for protocol-dependent telemetry exporters. The protocol exporter is expected to be primarily a simple telemetry data encoder and transmitter.


- Success - batch is successfully exported. For protocol exporters this typically means that the data is sent over the wire and delivered to the destination server.

- BadData - batch contains bad data and cannot be sent. The caller must not retry exporting the same batch. The batch must be dropped.
Member

Should we drop the whole batch or export as much "good data" as possible?

Member Author

Drop the whole batch. It is difficult to identify "good data".

@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/libguide branch 2 times, most recently from 44ed113 to a4133b1 on July 22, 2019 16:30
@tigrannajaryan
Member Author

Reviewers, I made fixes. Please have another look.


### Shutdown()

Shuts down the exporter. Called when SDK is shut down. This is an opportunity for exporter to do any cleanup required.
Member

Need more clarification:

  1. Shutdown is called when the SDK is shutting down; it should block until the cleanup is done (or has failed).
  2. Shutdown should be called only once for a given exporter.
  3. When the SDK calls Shutdown, there should be no further calls to Export. The exporter should make sure that any subsequent calls to Export return failure immediately.

Member Author

I will add 2 and 3 to the spec. Good points.

Not sure about 1. For example, if Shutdown is called for the purpose of restarting the application, it would be undesirable to block for too long. If Shutdown is required to block until all data contained in the exporter is delivered to the destination, and the network is unavailable, this may be too much.

I'd like to see some opinions and learn about use cases.

Member

In OpenCensus Python we have a grace_period which controls the maximum wait time. Might be something to consider.

Member Author

We can say "Shutdown should not block indefinitely" and let language authors decide whether they want to make the shutdown timeout configurable.

Member

That's a good balance.
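
A small sketch of what "should not block indefinitely" could look like in Go, assuming a hypothetical exporter whose background worker signals completion on a channel (the context-based timeout shown here is just one possible mechanism, not the spec's prescribed one):

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

type myExporter struct {
	done chan struct{} // closed once all buffered data has been drained
}

// Shutdown waits for in-flight data to be drained, but gives up once
// the caller-supplied deadline (here a context timeout) expires.
func (e *myExporter) Shutdown(ctx context.Context) error {
	select {
	case <-e.done:
		return nil
	case <-ctx.Done():
		return errors.New("shutdown timed out before all data was exported")
	}
}

func main() {
	e := &myExporter{done: make(chan struct{})}
	go func() { time.Sleep(10 * time.Millisecond); close(e.done) }()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := e.Shutdown(ctx); err != nil {
		log.Println(err)
	}
}
```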


- FailedRetryable - cannot export to the destination. The caller should record the error and may retry exporting the same batch after some time. This for example can happen when the destination is unavailable, there is a network error or endpoint does not exist.

### Shutdown()
Member

Do we need a flush?

Member Author

Since we have Shutdown, the use case for flush is not clear to me.

We can add it if we see a clear use case. Any thoughts?

Member

FYI - in this PR shutdown and flush are both mentioned. There is no general agreement on whether we need flush or not at this moment.

Member Author

I think we should keep the barrier high for adding new functions. If we don't have a good use case for flush let's keep it out for now.

Member

Makes sense.

@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/libguide branch from a4133b1 to c83e303 on July 22, 2019 19:06
Member
@reyang reyang left a comment

LGTM in general; we might want to revisit some names (e.g. SpanExporter vs TraceExporter, FailedNoneRetryable vs FailedNonRetryable vs FailedNonretryable), which doesn't have to block this PR though.

Thanks @tigrannajaryan for driving this!

@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/libguide branch from c83e303 to 9cf8edc on July 22, 2019 19:14
@tigrannajaryan
Member Author

Approvers, please merge this PR, it has 2 approvals now.

This clarifies what functionality should be the core part of
the SDK and what should be available for optional usage and
also defines the API that exporters need to define.
@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/libguide branch from 9cf8edc to 5c13670 on July 25, 2019 00:27
@tigrannajaryan
Member Author

Approvers, please merge.

saiya added a commit to saiya/opentelemetry-specification that referenced this pull request Jul 25, 2019
- Remove duplication with open-telemetry#186
- Mention about configurable timeout of flush operation
@bogdandrutu bogdandrutu merged commit c73e63f into open-telemetry:master Jul 25, 2019
@tigrannajaryan tigrannajaryan deleted the feature/tigran/libguide branch July 25, 2019 23:44
carlosalberto pushed a commit that referenced this pull request Jul 31, 2019
* Add Performance and Blocking specification

Performance and Blocking specification is specified in a separate document and
is linked from Language Library Design principles document.

Implements issue: #94

* PR fix (#94).

- Write about Metrics & Logging to cover entire API
- Write about shut down / flush operations
- Leave room for blocking implementation options (should not block "as default behavior")
- Grammar & syntax fix

* PR fix (#94).

- Not limit for tracing, metrics.

* PR fix (#94).

- Mentioned about inevitable overhead
- Shutdown may block, but it should support configurable timeout also

* PR fix (#94)

- s/traces/telemetry data/
- Syntax fix

Co-Authored-By: Yang Song <songy23@users.noreply.github.com>

* PR fix (#130)

- Remove duplication with #186
- Mention about configurable timeout of flush operation

* PR fix (#130)

- Not specify default strategy (blocking or information loss)