Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[META 605] Add span links for message tracing #2489

Closed
felixbarny opened this issue Mar 2, 2022 · 8 comments · Fixed by #2610
Closed

[META 605] Add span links for message tracing #2489

felixbarny opened this issue Mar 2, 2022 · 8 comments · Fixed by #2610
Assignees
Labels
Milestone

Comments

@felixbarny
Copy link
Member

felixbarny commented Mar 2, 2022

Meta issue:

Spec issue:

Summary of spec changes:

  • If your agent instruments any of Azure Service Bus, JMS, Kafka, RabbitMQ, SQS, or SNS, then instrumentation for receiving messages or batches of messages "SHOULD":
    • Create a transaction (with trace-context from message metadata, if any) for the handling of each message, if possible.
    • Otherwise, when creating the transaction or span for the message or batch of messages, look for trace-context in metadata for up to 1000 messages and create a span link for each.
  • If your agent supports AWS Lambda instrumentation, then in addition to the creation of span links, this spec change has a number of changes to the context captured for SQS and SNS-triggered invocations. (The special-casing of a message batch with a single message has been dropped.)
@eyalkoren
Copy link
Contributor

Consider adding instrumentation for Spring Kafka batch listener (#2601) and use it as a test case for span links

@eyalkoren
Copy link
Contributor

eyalkoren commented May 1, 2022

So far, of all supported messaging system clients, only the Kafka plugin had to deal with batch processing. In this plugin, we wrap the batch iteration, allowing us to create a transaction-per-record.
Hopefully, we can apply the same approach for SNS and SQS clients without the need for special-casing single-message-batches.
In addition, Lambda functions triggered by SNS/SQS rely on events that provide lists of records.

Therefore, as far as I can see, the scope of this spec implementation for the Java agent is as follows:

  • Polling spans: if the instrumented polling API returns a message with a tracecontext header - add a link to the sender span
  • Record-batch iteration (Kafka plugin): we already instrument org.apache.kafka.clients.consumer.KafkaConsumer#poll, which returns org.apache.kafka.clients.consumer.ConsumerRecords that we instrument as well to wrap the iteration. We will need to apply different iteration-wrapper strategy for each ConsumerRecords instance, based on the state at OnMetodExit of the KafkaConsumer#poll instrumentation - if there is an active span, the wrapper will only add a link per record with tracecontext to the active span, otherwise- we will use the same wrapper we use now to create transaction per record.
  • SNS/SQS-triggered Lambda functions- add span links to the single function transaction. We can either iterate over the list for that, or wrap the list with our own, so that we add the links while the application is iterating. The latter is a bit more efficient, but way more complex and would not work if we fail to activate the transaction on the iterating thread.

@eyalkoren
Copy link
Contributor

eyalkoren commented May 2, 2022

Correction: we also have this one to support similarly to Kafka:

@trentm
Copy link
Member

trentm commented May 2, 2022

We can either iterate over the list for that, or wrap the list with our own, so that we add the links while the application is iterating.

I'm not sure how strict this rule is, but the OTel spec on span links says that Links cannot be added after Span creation. My understanding of their reasoning for that is that all span links are passed into the OTel SDK's "Sampling" API so that the registered Sampling implementation can use those links in its processing. I'm not sure if any of this applies to your "we add the links while the application is iterating".

@eyalkoren
Copy link
Contributor

Thanks for pointing out!
I am not even sure there is a meaningful advantage for us in doing that through iteration vs. at batch processing start.
Either way, we discussed it yesterday and decided there are two approaches we would apply:

  1. create an encapsulation batch processing span/transaction, where we can typically add links at creation time, as in Lambda triggered by SNS/SQS, or if we instrument a batch processing API like org.springframework.amqp.core.MessageListener#onMessageBatch(List)
  2. create a span-per-message if our instrumentation already supports batch-iteration instrumentation with a single span link (added at creation time), if there is an active transaction when the iteration starts. For example, in Add support of kafka batch consumer #2601, a user asked for inherent instrumentation for a Spring batch processing wrapper around Kafka's polling client API. If we add such instrumentation, we can do as described in 1. However, if we don't add such and the user adds a manual encapsulating batch-processing transaction, as described in the issue, then we will create a span per Kafka record with a link to the sending span.

So I think we are OK with this guidance.

@eyalkoren
Copy link
Contributor

@trentm the said above is true for transactions. However, adding span links to message-polling spans cannot typically follow this guidance, as by nature, polling APIs return the received message only at exit time. Assuming that we want polling span to measure the polling duration, it means that we create and start the span on polling method entry and end it on exit, which is only when we can look for the sender's tracecontext.
I am doing exactly this now. If you think that is an issue, let me know and we'll see what we can do about that.

@trentm
Copy link
Member

trentm commented May 4, 2022

@eyalkoren https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md#batch-receiving shows an example from OTel where there are separate spans for "message receive" and "message processing". In their example they add a span link on the "message processing" span. I don't know if that is relevant or useful for the case you describe. (I vaguely recall some discussion about how to model spans for messaging systems back when Russ was adding Azure support, but I wasn't following very closely at the time.)

If you think that is an issue

I suspect that it is probably fine to add span links late (after span creation) for our agents, if necessary. open-telemetry/opentelemetry-specification#454 is an OTel Spec discussion on potentially (re-)allowing this.

@eyalkoren
Copy link
Contributor

eyalkoren commented May 8, 2022

This OTel example is about a poll action that takes place when there is no active context. The guidance is then to create an orphan span for the receive action and a child span per message with a single span link. I don't mind doing that once our UI properly renders distributed traces based on links. But as it is now, if I change our tracing to do that, it will be an unfortunate breaking change for our existing users, as it will break their currently properly-presented full traces, because there will be no parent-child relationships to represent the distributed trace.
In the Java agent we actually have a distinction between receive transactions and message-handling transaction. The former type is discarded by default.

What I referred to is a polling span, i.e.- a polling action traced when there is an active context (transaction/span). As opposed to the former scenario, in this one there is no incentive to do all kind of sophisticated and risky assumptions to trace message handling as it will be automatically traced as child actions of the active transaction/span. So in this scenario, I would say a polling span is enough to represent the involvement of a messaging system and with the ability to link to the sending span, that's even better covered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants