Add exporter API and more details about SDK implementation #186
Conversation
specification/library-guidelines.md
- Minimize burden of implementation for protocol-dependant telemetry exporters. The Telemetry Exporter is expected to be primarily a simple telemetry data encoder and transmitter.
End users should be given the flexibility of making many of the decisions regarding the queuing, retrying, tagging, batching functionality that make the most sense for their application. For example, if an application is deployed with a locally running Agent daemon, the end user may prefer to not have a Retrier since local daemon is highly available. As opposed to that an application may require a reliable delivery of telemetry data to a remote backend that has no guaranteed availability. The end user may choose to use a persistent local queue and a Retrier in this case.
For reliable delivery:
- Do we provide both sync and async exporters? The definition of "sync" is: the exporter gets called at the moment telemetry data is provided, within the execution context of the user's code, and the user's code is blocked until the exporter returns. @24 if you have more input here.
- For queuing and retrying, do we consider a local cache to offload memory (e.g. instead of keeping all the data in memory for retry, swap it to local disk; both memory and disk would need a configurable quota though)?
- We did consider having both sync and async calling conventions in the Exporter interface and decided that supporting async at the interface level complicates things. Exporters that want non-blocking behavior should implement it internally by whatever means they want.
- That is in the realm of language library authors. The queuing (whether in-memory or persistent) is recommended to be implemented via composable exporters that can be chosen by end users. This document intentionally prescribes neither to language library authors nor to end users the type of queue they must implement/use (the reason is specified in this paragraph: it must be the end user's choice depending on their environment / use case).
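To make the composition idea above concrete, here is a minimal Go sketch of such a composable helper — an in-memory queuing exporter that wraps any other exporter. All type and function names are hypothetical illustrations, not definitions from this PR:

```go
package export

import "errors"

// SpanData and SpanExporter are stand-ins for the types discussed in this
// PR; the exact shapes are up to language library authors.
type SpanData struct{}

type SpanExporter interface {
	Export(batch []*SpanData) error
	Shutdown()
}

// QueueExporter is a hypothetical composable helper: it buffers batches in
// memory (up to a configurable capacity) and forwards them to the wrapped
// exporter from a background goroutine, so the caller is never blocked by
// the network. A persistent (on-disk) variant could expose the same shape.
type QueueExporter struct {
	next  SpanExporter
	queue chan []*SpanData
	done  chan struct{}
}

func NewQueueExporter(next SpanExporter, capacity int) *QueueExporter {
	q := &QueueExporter{
		next:  next,
		queue: make(chan []*SpanData, capacity),
		done:  make(chan struct{}),
	}
	go q.run()
	return q
}

func (q *QueueExporter) Export(batch []*SpanData) error {
	select {
	case q.queue <- batch:
		return nil
	default:
		// Quota exceeded: drop rather than block the caller.
		return errors.New("queue full: batch dropped")
	}
}

func (q *QueueExporter) run() {
	for batch := range q.queue {
		_ = q.next.Export(batch) // errors could be counted or retried here
	}
	close(q.done)
}

func (q *QueueExporter) Shutdown() {
	close(q.queue) // run() exits after draining the remaining batches
	<-q.done
	q.next.Shutdown()
}
```

An end user who needs reliable delivery could chain such helpers (for example, wrap a protocol exporter in a retrier and then in a queue), while a user exporting to a local agent could use the protocol exporter directly.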
I see, thanks.
I think we should support in-memory for sure; disk is optional, but I'm open to being convinced otherwise.
specification/library-guidelines.md
- Minimize burden of implementation for protocol-dependant telemetry exporters. The Telemetry Exporter is expected to be primarily a simple telemetry data encoder and transmitter.
End users should be given the flexibility of making many of the decisions regarding the queuing, retrying, tagging, batching functionality that make the most sense for their application. For example, if an application is deployed with a locally running Agent daemon, the end user may prefer to not have a Retrier since local daemon is highly available. As opposed to that an application may require a reliable delivery of telemetry data to a remote backend that has no guaranteed availability. The end user may choose to use a persistent local queue and a Retrier in this case.
For the in-memory queue / persistent local queue - do we expect FIFO, or could there be LIFO?
That is outside the scope of this recommendation document. Either can be implemented and provided as a composable helper by language library authors.
The moment we mention a queue, it already assumes LIFO, FIFO, a priority queue, etc.
I do think there are cases where people need a collection of unsent items without following a strict order. It would be great to stay neutral in the spec.
specification/sdk-exporter.md
## Interface Definition

The exporter must support two functions: **Export** and **Shutdown**. In strongly typed languages typically there will be 2 separate Exporter interfaces, one that accepts Spans (SpanExporter) and one that accepts Metrics (MetricsExporter).
There could be an exporter for logs as well.
I do not think logs are currently supported by OpenTelemetry. In the future, yes, definitely. This actually raises a good question that I have had in my mind for a while:
As we add more telemetry data types we need to think about the proliferation of similar-looking interfaces with similar functionality, and whether we need to aim at reducing conceptual duplication and code duplication.
We've explored logs in OpenCensus: https://github.com/census-instrumentation/opencensus-python/tree/master/contrib/opencensus-ext-azure/opencensus/ext/azure/log_exporter.
One struggle that I have: given that most languages already have established logging infrastructures, should we align with the existing language/library/framework approach, or should we try to align among OpenTelemetry SDKs? In OpenCensus Python I chose the former (instead of a log exporter, we have a log handler which follows the Python paradigm).
Also, considering performance/reliability, most users might want to store the logs to a file and send them to the backend through an agent, instead of using SDKs which run inside the process.
I haven't put significant thought into the OpenTelemetry+logs topic, so it will be difficult for me to engage in a meaningful discussion. Perhaps it is worth discussing logs separately.
> Also, considering performance/reliability, most users might want to store the logs to a file and send them to the backend through an agent, instead of using SDKs which run inside the process.
Yes, there are log collection products and log collection agents that do exactly this (BTW, one of my previous jobs was working on a log collection agent).
#### Go SpanExporter Interface

```go
type SpanExporter interface {
```
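The excerpt above is truncated by the diff view. Based on the surrounding text (an Export function that accepts a batch of spans, plus Shutdown), a completed interface would presumably look roughly like the following sketch; the method signatures and placeholder types are illustrative, not the PR's literal content:

```go
package export

// Placeholder types standing in for whatever the SDK ultimately defines.
type SpanData struct{}
type ExportResult struct{}

// SpanExporter sketches the shape implied by the prose above: one function
// to export a batch of spans and one to shut the exporter down.
type SpanExporter interface {
	// Export receives a batch of spans and reports the outcome of sending
	// them to the destination.
	Export(batch []*SpanData) ExportResult

	// Shutdown is called once when the SDK shuts down, giving the exporter
	// a chance to clean up.
	Shutdown()
}
```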
Should we call it `TraceExporter`?
Open to suggestions, I have no strong preference. It seems to me that since it accepts a batch of Spans, SpanExporter may be a better name, but would be great to know other opinions.
specification/sdk-exporter.md
const (
    Success ExportResultCode = iota
    BadData
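The excerpt cuts off after `BadData`. Judging from the result codes discussed later in this thread (success, a retryable failure, and a non-retryable one, with the exact names still under debate), a complete enumeration might look something like this hypothetical sketch:

```go
package export

// ExportResultCode classifies the outcome of an Export call.
type ExportResultCode int

const (
	// Success: the batch was delivered to the destination.
	Success ExportResultCode = iota
	// FailedRetryable: delivery failed, but the same batch may be retried
	// later (e.g. the destination was temporarily unavailable).
	FailedRetryable
	// FailedNotRetryable: delivery failed and the batch must be dropped,
	// for example because it contained bad data. (Name is illustrative;
	// the thread also discusses BadData and FailedNonRetryable variants.)
	FailedNotRetryable
)
```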
This is a difficult problem, we do have backends which can return HTTP 206 for partial success.
Right. It is hard to tackle this by trying to figure out which part of the data is bad and which is good and should be re-sent. The goal here is to have at least a minimal notion of bad data as a concept so that the senders avoid retrying it. Whether that means the entire batch gets dropped, or the receiver is smart enough to have accepted the good part of the data, is an implementation detail of the specific exporter.
In OpenCensus Python, the export function takes the input data without returning anything. We leave error handling and retry completely to exporters (which gives flexibility, but also means more work in exporters).
If we intend to have Retry in the core SDK (or the common part that can be used to compose the exporter), we need to design it in a way which works with most of the backends, versus a joint-set of backends.
> We leave error handling and retry completely to exporters (which gives flexibility, but also means more work in exporters).
Sorry, I have to disagree. I believe a similar approach was taken in OpenCensus Agent/Collector and we ended up with exporters that don't handle sending errors and just drop the data silently with zero visibility.
Also, you are right that it means more work in exporters. It is unnecessary to implement all this common functionality in every single exporter. The exporters should be as simple as possible so that it is less work for vendors to support OpenTelemetry project.
Error handling should be part of Exporter, and the expectation for Exporter to return proper error code should be made completely obvious.
> Error handling should be part of Exporter, and the expectation for Exporter to return proper error code should be made completely obvious.
Awesome! We're on the same page. Next step is "we need to design it in a way which works with most of the backends, versus a joint-set of backends".
> Error handling should be part of Exporter, and the expectation for Exporter to return proper error code should be made completely obvious.
I just realized that this may not necessarily reflect my thoughts correctly.
What I mean to say is that when the Exporter sends the data to its destination and sending fails, the exporter must clearly process this erroneous situation, must not conceal it and must return an error code that clearly communicates this failure to the caller of Export() function.
That is the purpose of the return result of Export() function. The exporters should not try to do clever things and handle the errors in a way that is invisible to the caller and hide the errors, etc.
What would the exporter expect its caller to do with the return value?
There are some tricky scenarios. For example, the exporter sent data to the backend, the backend accepted the data, then a connection reset happened and the exporter never heard back from the backend. In this case, should the exporter retry by itself, or return an error code and rely on the core SDK to retry (which means the core SDK will keep the data and call the exporter again at some point)? If the exporter expects the SDK to retry, it has to return enough information so the SDK knows how to retry - for example, which items failed, what's the retry policy/period, is there a de-dupe mechanism (for traces we might be able to use trace-id and span-id to de-dupe if we trust the data).
> What would the exporter expect its caller to do with the return value?
There are several possible uses for the return value. For example, if the caller is a "Retrier" helper, then it can retry sending on failures. If the caller is the SDK, it can increment an internal metric counter indicating dropped data, etc.
> There are some tricky scenarios, for example, the exporter sent data to the backend, and the backend accepted the data, then connection reset happened and exporter never heard back from the backend.
This is a good question. I believe it is out of scope for this PR; we should discuss this in our OpenTelemetry protocol scope. I have a specific opinion on what should be done in this case, but to avoid sidetracking this PR I suggest that we discuss it in the "protocol" meeting or PRs.
> If the exporter expects the SDK to retry, it has to return enough information so the SDK knows how to retry - for example, which items failed, what's the retry policy/period, is there a de-dupe mechanism (for traces we might be able to use trace-id and span-id to de-dupe if we trust the data).
IMO this is not an SDK concern. Logic like this should be in separate helper exporters that can be chained. All these questions that you asked are very valid, and we as OpenTelemetry authors are not in the best position to answer them. I believe they should be in the realm of the end user, who knows their own environment and can compose/customize the best behavior for their use case.
However, our concern as OpenTelemetry authors is to empower the end users so that they are able to do this if they want. If we don't have the error in the return value, none of this is possible anymore. The only way would be to include all this functionality in the protocol exporter itself and make the protocol exporter configurable. We would lose composability if we did that and would place a very significant burden on implementors of protocol exporters to support all this functionality. The experience with exporters in OpenCensus Service shows us that this is not going to happen; we would end up with half-baked and non-uniform behavior of protocol exporters.
I believe it is important that we make OpenTelemetry attractive to vendors so that it is very easy for them to add their protocol support to OpenTelemetry. This proposal makes it possible.
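As an illustration of the chained helper exporters described in this comment, here is a rough Go sketch of a "Retrier" that wraps a protocol exporter and retries only batches reported as retryable. The type names echo the examples in this PR, but the field names and retry policy are hypothetical:

```go
package export

import "time"

// Minimal stand-ins for the SDK types discussed in this PR.
type SpanData struct{}

type ExportResultCode int

const (
	Success ExportResultCode = iota
	FailedRetryable
	FailedNotRetryable
)

type ExportResult struct {
	Code         ExportResultCode
	WrappedError error
}

type SpanExporter interface {
	Export(batch []*SpanData) ExportResult
	Shutdown()
}

// Retrier is a hypothetical composable helper exporter. It forwards each
// batch to the wrapped exporter and retries a fixed number of times, but
// only when the wrapped exporter reports a retryable failure.
type Retrier struct {
	Next     SpanExporter
	Attempts int
	Backoff  time.Duration
}

func (r *Retrier) Export(batch []*SpanData) ExportResult {
	var res ExportResult
	for i := 0; i < r.Attempts; i++ {
		res = r.Next.Export(batch)
		if res.Code != FailedRetryable {
			return res // success or permanent failure: stop retrying
		}
		time.Sleep(r.Backoff) // a real helper would likely use exponential backoff
	}
	return res // give up; the caller sees the last failure
}

func (r *Retrier) Shutdown() { r.Next.Shutdown() }
```

The protocol exporter stays a simple encoder/transmitter; the end user decides whether to wrap it in a Retrier, a queue, or nothing at all.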
type ExportResult struct {
    Code ExportResultCode
    WrappedError error
Would this `WrappedError` include information about how the SDK should retry?
Retry policy could vary among backends.
So, as the text says, these are just examples for illustration, and language library authors need to think about the details of how exactly the interface and its input/output data types should look.
If this creates more questions than it answers I would be inclined to completely remove the section. There is no intent here to define an Exporter interface that is idiomatic, complete or even correct at all.
The content and examples look great, but some prose and formatting nits:
- technical terms like "Exporter" and "Retrier" are capitalized but not formatted as `code`
- some lines are hard-wrapped, some aren't
- article use is inconsistent: "SDK" vs "the SDK"
I was surprised to see that there's no requirement to export batches in a particular order, or order spans within a batch, but I think it does make sense to leave this up to the implementation.
I think we should define the batching order and the delivery order, but we can do that in a separate PR. For example, in Java we have a SpanProcessor which is the interface with the SDK, and we provide a batching and a non-batching version that calls into the exporter interface. The main reason to have the SpanProcessor is to give vendors more hooks (access to start and end events, for example), but for the Exporter only the end event is required. Also, the SpanProcessor is called synchronously, while the exporter may be called asynchronously depending on how it is configured.
Fixed.
Paragraphs should be hard-wrapped, the rest shouldn't. Can you point to a specific place?
As a non-native English speaker I struggle with the articles. Help is appreciated. I removed "the" for now.
I don't think ordering of spans should be in the requirements. In a distributed system the global ordering is not guaranteed anyway, and local ordering is unlikely to provide any benefits that are not already provided by simple sorting using timestamps, if needed, at the final place of use.
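For readers unfamiliar with the Java SDK's SpanProcessor mentioned above, a rough Go rendering of the idea might look like the sketch below. The names are illustrative only; the actual interface lives in the Java implementation and is not defined by this PR:

```go
package sdk

// Span and SpanExporter are stand-ins for the SDK types under discussion.
type Span struct{}

type SpanExporter interface {
	Export(batch []*Span)
}

// SpanProcessor is the SDK-facing hook described above: it is called
// synchronously on span start and end, which gives vendors access to both
// events even though an exporter only needs finished spans.
type SpanProcessor interface {
	OnStart(s *Span)
	OnEnd(s *Span)
	Shutdown()
}

// SimpleSpanProcessor is the non-batching variant: it hands each finished
// span straight to the exporter (which may itself be asynchronous).
type SimpleSpanProcessor struct {
	Exporter SpanExporter
}

func (p *SimpleSpanProcessor) OnStart(s *Span) {} // nothing to do on start
func (p *SimpleSpanProcessor) OnEnd(s *Span)   { p.Exporter.Export([]*Span{s}) }
func (p *SimpleSpanProcessor) Shutdown()       {}
```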
specification/sdk-exporter.md
The goals of the interface are:

- Minimize burden of implementation for protocol-dependant telemetry exporters. The protocol exporter is expected to be primarily a simple telemetry data encoder and transmitter.
Suggested change: "protocol-dependant" → "protocol-dependent":
- Minimize burden of implementation for protocol-dependent telemetry exporters. The protocol exporter is expected to be primarily a simple telemetry data encoder and transmitter.
specification/sdk-exporter.md
Outdated
|
||
- Success - batch is successfully exported. For protocol exporters this typically means that the data is sent over the wire and delivered to the destination server.

- BadData - batch contains bad data and cannot be sent. The caller must not retry exporting the same batch. The batch must be dropped.
Should we drop the whole batch or export as much "good data" as possible?
Drop the whole batch. It is difficult to identify "good data".
Reviewers, I made fixes. Please have another look.
### Shutdown()

Shuts down the exporter. Called when SDK is shut down. This is an opportunity for exporter to do any cleanup required.
Need more clarification:
1. `Shutdown` is called when the SDK is shutting down; it should block until the clean up is done (or failed).
2. `Shutdown` should be called only once for a given exporter.
3. When the SDK calls `Shutdown`, there should be no further calls to `Export`. The exporter should make sure that subsequent calls to `Export` return failure immediately.
I will add 2 and 3 to the spec. Good points.
Not sure about 1. For example, if `Shutdown` is called for the purpose of restarting the application, it may be undesirable to block for too long. If `Shutdown` is required to block until all data contained in the exporter is delivered to the destination, and there is network unavailability, this may be too much.
I'd like to see some opinions and learn about use cases.
In OpenCensus Python we have a `grace_period` which controls the maximum wait time. Might be something to consider.
We can say "Shutdown should not block indefinitely" and let language authors decide if they want to make the shutdown timeout configurable.
That's a good balance.
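A sketch of how "should not block indefinitely" could translate into code: a shutdown that waits for pending work up to a configurable grace period. The mechanism and names are illustrative only, not part of the spec:

```go
package export

import (
	"context"
	"time"
)

// Exporter stands in for any exporter with pending asynchronous work.
type Exporter struct {
	done chan struct{} // closed when all pending exports have completed
	// GracePeriod bounds how long Shutdown may block; zero means "don't wait".
	GracePeriod time.Duration
}

// Shutdown waits for in-flight data, but never longer than the grace period.
func (e *Exporter) Shutdown() error {
	ctx, cancel := context.WithTimeout(context.Background(), e.GracePeriod)
	defer cancel()

	select {
	case <-e.done: // cleanup finished in time
		return nil
	case <-ctx.Done(): // grace period expired; give up rather than block forever
		return ctx.Err()
	}
}
```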
- FailedRetryable - cannot export to the destination. The caller should record the error and may retry exporting the same batch after some time. This for example can happen when the destination is unavailable, there is a network error or endpoint does not exist.

### Shutdown()
Do we need a `flush`?
Since we have `shutdown`, the use case for flush is not clear to me.
We can add it if we see a clear use case. Any thoughts?
FYI - in this PR shutdown and flush are both mentioned. There is no general agreement on whether we need flush or not at this moment.
I think we should keep the barrier high for adding new functions. If we don't have a good use case for `flush`, let's keep it out for now.
Makes sense.
LGTM in general. We might want to revisit some names (e.g. `SpanExporter` or `TraceExporter`; `FailedNoneRetryable`, `FailedNonRetryable`, or `FailedNonretryable`), which don't have to block this PR though.
Thanks @tigrannajaryan for driving this!
Approvers, please merge this PR, it has 2 approvals now.
This clarifies what functionality should be the core part of the SDK and what should be available for optional usage and also defines the API that exporters need to define.
Approvers, please merge.
- Remove duplication with open-telemetry#186
- Mention about configurable timeout of flush operation
* Add Performance and Blocking specification
  Performance and Blocking specification is specified in a separate document and is linked from Language Library Design principles document. Implements issue: #94
* PR fix (#94).
  - Write about Metrics & Logging to cover entire API
  - Write about shut down / flush operations
  - Leave room for blocking implementation options (should not block "as default behavior")
  - Grammar & syntax fix
* PR fix (#94).
  - Not limit for tracing, metrics.
* PR fix (#94).
  - Mentioned about inevitable overhead
  - Shutdown may block, but it should support configurable timeout also
* PR fix (#94)
  - s/traces/telemetry data/
  - Syntax fix
  Co-Authored-By: Yang Song <songy23@users.noreply.github.com>
* PR fix (#130)
  - Remove duplication with #186
  - Mention about configurable timeout of flush operation
* PR fix (#130)
  - Not specify default strategy (blocking or information loss)
…metry#186)
* Add exporter API and more details about SDK implementation
  This clarifies what functionality should be the core part of the SDK and what should be available for optional usage and also defines the API that exporters need to define.
* PR fixes
This clarifies what functionality should be the core part of
the SDK and what should be available for optional usage and
also defines the API that exporters need to define.
Co-authored with @bogdandrutu