Links from transactions and spans to multiple spans/transactions/traces #122

axw · 2019-07-24T10:00:18Z

Currently it is possible to define only one relationship between transactions/spans: a single parent. This covers the most common patterns (namely request/response), but it is not currently possible to trace others, such as:

batch processing, where multiple inputs are batched and processed in one operation (multi-parent tracing)
within a transaction, receiving and processing events (e.g. polling a message queue) originating from another trace (inter-trace linking)

Additionally, as described in the OpenTelemetry spec, there may be scenarios where a trace must be restarted (i.e. creating a new trace root), and in such cases the restarted trace could be linked to the originating trace.

Proposed changes

The first step is to extend the transaction and span model such that they can be linked to multiple other transactions or spans. Errors would continue to accommodate only a single parent transaction or span.

Intake API

I propose we add the following optional property to the intake schema:

Events: span and transaction
Field name: links (I'm also partial to refs and references, maybe even <your proposal>)
Field type: array, with items having the following type:

{
  "type": "object",
  "properties": {
    "id": {
      "description": "Hex encoded 64-bit random ID of the linked transaction or span.",
      "type": "string",
      "maxLength": 1024
    },
    "trace_id": {
      "description": "Hex encoded 128-bit random ID of the correlated trace.",
      "type": "string",
      "maxLength": 1024
    }
  },
  "required": ["id"]
}

Note that the trace_id field is optional. If it is empty, then the span or transaction's own trace_id is assumed.
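
To illustrate the defaulting rule above, here is a minimal sketch of how a consumer of the intake payload might resolve each link to a (trace_id, id) pair. The function name resolve_links and the sample IDs are hypothetical, and the event is assumed to be an already-decoded JSON object carrying its own trace_id; this is not APM Server code.

```python
from typing import Dict, List, Tuple


def resolve_links(event: Dict) -> List[Tuple[str, str]]:
    """Return (trace_id, id) pairs for every link on a span or transaction.

    A link whose trace_id is missing or empty inherits the trace_id of the
    event that carries it, as described in the proposal.
    """
    own_trace_id = event["trace_id"]
    resolved = []
    for link in event.get("links", []):
        trace_id = link.get("trace_id") or own_trace_id
        resolved.append((trace_id, link["id"]))
    return resolved


# Hypothetical transaction with one same-trace link and one cross-trace link.
transaction = {
    "id": "aaaaaaaaaaaaaaaa",
    "trace_id": "0123456789abcdef0123456789abcdef",
    "links": [
        {"id": "bbbbbbbbbbbbbbbb"},
        {"id": "cccccccccccccccc", "trace_id": "fedcba9876543210fedcba9876543210"},
    ],
}

for trace_id, linked_id in resolve_links(transaction):
    print(trace_id, linked_id)
```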

ES Mapping

We have two main options here: store as nested docs, or store as an array of objects.

Using nested means that for every link, there will be an additional document in ES, which could introduce performance issues. I don't think it's a good idea to go down this road.

Using an array of objects for the links means that we cannot search on both trace ID and span/transaction ID and have them match only links that have both fields that match. We could deal with this in one of two ways:

combine the ID/trace ID in the documents, so we end up storing it as something like: "links": ["trace_id:span_id", "trace_id:span_id"]
rely on the keys being individually random enough to make multiple matches highly unlikely. i.e. just store as an array of objects, do nothing special

The searches we're likely to run are of the form "find all spans linked to span X in trace Y" within the configured time-frame. I expect it is highly unlikely that we would ever find a repeated span ID AND have the same trace ID involved. So either approach is probably fine; the structured form (array of objects) is generally easier to deal with.
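
The cross-matching concern with a plain array of objects can be sketched in a few lines. This is a toy in-memory model, not Elasticsearch: it only assumes that a non-nested array of objects is indexed as one multi-valued field per key, so the pairing between a trace ID and a span ID is lost. All function names and IDs below are hypothetical.

```python
def flatten(links):
    """Mimic a non-nested object array: one multi-valued field per key."""
    return {
        "links.trace_id": {link["trace_id"] for link in links},
        "links.id": {link["id"] for link in links},
    }


def matches_flattened(doc, trace_id, span_id):
    # Each term is checked against its own field independently, so the two
    # values may come from different links.
    return trace_id in doc["links.trace_id"] and span_id in doc["links.id"]


def combine(links):
    """First option from the proposal: store "trace_id:span_id" strings."""
    return {f'{link["trace_id"]}:{link["id"]}' for link in links}


def matches_combined(keys, trace_id, span_id):
    return f"{trace_id}:{span_id}" in keys


links = [
    {"trace_id": "trace-a", "id": "span-1"},
    {"trace_id": "trace-b", "id": "span-2"},
]

# span-2 belongs to trace-b, so (trace-a, span-2) is not a real link.
print(matches_flattened(flatten(links), "trace-a", "span-2"))  # True (false positive)
print(matches_combined(combine(links), "trace-a", "span-2"))   # False
```

With randomly generated IDs the false positive requires a span ID collision across the linked traces, which is why the second option treats it as negligible.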

UI

Needs input from @elastic/apm-ui and design as to how we do it, but we should render the links in the UI, perhaps as a list in the transaction details and span details flyout. We can defer discussing the specifics, so long as we can come up with an ES mapping that is flexible enough.


axw · 2019-07-24T10:03:09Z

The main questions I think we need to discuss are:

how to store the links?
are there any other searchable fields we want to store on links? If we need to add additional searchable link fields, then that makes array-of-objects less tenable

felixbarny · 2019-07-24T11:43:48Z

SGTM in general

A few more questions:

Will the link to the parent also be in the links array or only additional links?
What about a type field for the links like child_of/follows_from?
What happens when there are multiple parent links? Which one will be in the array and which in the parent_id?
Should we only link to spans which are referenced in links or should they be part of the waterfall?

axw · 2019-07-25T07:25:10Z

Will the link to the parent also be in the links array or only additional links?

I didn't plan to have it in there, but I'm open to arguments.

What about a type field for the links like child_of/follows_from?

Not those specifically, unless we intend to do something with them. I do think we need to be more specific about the link types though (more at the end).

What happens when there are multiple parent links? Which one will be in the array and which in the parent_id?

My initial thought was to use the first parent observed in the parent field, all others in the links, but that might be a bit too naive. Not too sure on the answer here, depends on whether we want to visualise multi-parent relationships.
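
As a rough sketch of that "first parent wins" idea (explicitly the naive version, not a settled design), the split could look like the following. The function name split_parents and the input shape, a list of (trace_id, id) pairs in the order they were observed, are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def split_parents(
    parents: List[Tuple[str, str]],
) -> Tuple[Optional[str], List[Dict[str, str]]]:
    """Split observed parents into a single parent_id plus links.

    The first observed parent becomes parent_id; every other parent is
    recorded as a link with its trace_id and id.
    """
    if not parents:
        return None, []
    (_first_trace_id, first_id), rest = parents[0], parents[1:]
    links = [{"trace_id": trace_id, "id": span_id} for trace_id, span_id in rest]
    return first_id, links


parent_id, links = split_parents(
    [("trace-a", "span-1"), ("trace-a", "span-2"), ("trace-b", "span-3")]
)
print(parent_id)
print(links)
```

Note that this sketch drops the first parent's trace ID, which only makes sense if the event adopts that parent's trace.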

Should we only link to spans which are referenced in links or should they be part of the waterfall?

Again, not sure, but I think we'll need to figure this out before we can proceed after all. I can't imagine how we would extend our existing visualisation to account for multiple parents which may cross traces.

"Links" is too generic/vague a concept to be useful for visualisation in a tree anyway. At most we could use them for creating a list of links under transaction/span details. For some kind of DAG visualisation we would need to know the link type, specifically whether it's a parent or child (i.e. the arc direction).

I think perhaps instead of adding support for generic links, we should change this proposal to focus on adding support for multiple parents.


This PR introduces an exporter for [Elastic APM](https://www.elastic.co/apm). The exporter works by translating spans and metrics into the ND-JSON format expected by Elastic APM Server, and sending over HTTP. Currently only spans are supported. Code for translating metrics exists, but is not yet wired up to the exporter; we'll do that once the switch over to the new metrics model is done.

Not all of the OpenTelemetry model is covered by Elastic APM. In particular, there's currently no support for links or span events. We'll add support for events later, and most likely links too (see elastic/apm#122).

**Testing:** Unit tests added for translating resources, spans, and metrics to the Elastic APM model. This has been tested using a mock in-memory Elastic APM Server. Coverage is > 80%. Manually tested, sending to an [Elastic Cloud](https://cloud.elastic.co/) deployment.

**Documentation:** Added a README, which describes the exporter's config. Metrics are currently not exported; we'll wait for the data model changes to settle, so we can build the translation off the OTLP representation.

mitoihs · 2020-11-12T13:55:11Z

Lack of this feature was a blocker for us to use Elastic APM. We have microservices that perform a data processing pipeline with scatter & gather (fork & join) steps, so we need spans that can be part of multiple different traces.

SergeyKleyman · 2020-11-13T15:02:41Z

@mitoihs Have you considered using labels?

SergeyKleyman · 2020-11-13T15:09:27Z

Question 1: Are there use cases where we expect agents to fill in links automatically? Or do we expect links to be set via public API?
Question 2: Are there use cases where we expect the backend to use links in some way?
If the answer to both questions is no, then why do we need a special property vs letting users use labels to gather and store this information?

axw · 2020-11-16T01:43:09Z

Question 1: Are there use cases where we expect agents to fill in links automatically? Or do we expect links to be set via public API?

I think message queue instrumentation is one case where we would do this. e.g. receiving a message within a transaction would link said transaction to the span that published the message to the queue. @eyalkoren may have more to say on this.

Question 2: Are there use cases where we expect backend to use links in some way?

This question is unresolved, which is why this issue hasn't progressed yet. I would expect the links to show up in the UI, which is why I would expect them to have their own place in the data model.

graphaelli · 2020-11-16T02:33:37Z

ECS uses related for "pivoting around a piece of data" which might fit here as keyword fields related.id and related.trace_id based on the second mapping proposal in the description. A top level related.id sounds too general though. Two alternatives I can think of:

1. related.span_id and make it apply for transactions too
2. nest both under trace, for trace.related.id and trace.related.trace_id

My hesitation around 2 is whether it makes sense in a non-trace context, e.g. would a log event with log.trace.id and log.trace.related.trace.id make sense?

eyalkoren · 2020-11-16T07:27:49Z

I think message queue instrumentation is one case where we would do this. e.g. receiving a message within a transaction would link said transaction to the span that published the message to the queue. @eyalkoren may have more to say on this.

Indeed, for example when using a scheduled task (for which we create a transaction) that reads a message (or a bulk of messages) from a queue; or a send-and-reply scenario where the reply-receiving span has a parent and may be linked to the reply sender span in addition.
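
A hedged sketch of the batch-polling case: a transaction reads several messages, each possibly published from a different trace, and collects one link per distinct publisher. It assumes each message carries a W3C traceparent header in a plain dict under the key "traceparent"; that message shape, and the helper names, are hypothetical and not taken from any agent.

```python
import re
from typing import Dict, Iterable, List, Optional

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
# This sketch does not reject the all-zero IDs that the spec treats as invalid.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def link_from_traceparent(header: str) -> Optional[Dict[str, str]]:
    """Turn a traceparent header into a link object, or None if malformed."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        return None
    _version, trace_id, span_id, _flags = match.groups()
    return {"trace_id": trace_id, "id": span_id}


def links_for_batch(messages: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Collect one link per distinct publishing span in a batch of messages."""
    links: List[Dict[str, str]] = []
    seen = set()
    for message in messages:
        link = link_from_traceparent(message.get("traceparent", ""))
        if link is None:
            continue  # no usable trace context on this message
        key = (link["trace_id"], link["id"])
        if key not in seen:
            seen.add(key)
            links.append(link)
    return links


batch = [
    {"traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"},
    {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"},
    {"traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"},
    {"body": "message without trace context"},
]
print(links_for_batch(batch))
```

The resulting list could then be attached to the polling transaction as its links, leaving parent_id untouched.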

mitoihs · 2020-11-16T08:08:38Z

@mitoihs Have you considered using labels?

I didn't. I wanted to stay aligned with the OpenTelemetry specification, which uses links that are probably functionally similar. I don't want to depend on an Elastic APM-specific implementation. Using OpenTelemetry gives me the option to switch between multiple "backends" for tracing.

estolfo · 2020-11-25T11:33:25Z

Here is an example use case in Ruby with the background job processing library, Sidekiq.

nikhilbhaware007 · 2021-01-05T22:54:58Z

I have a similar requirement where multiple inputs are batched and processed in one operation (multi-parent tracing). Is there an ETA for this? @mitoihs what backend did you finally use to support this use case?

ghost · 2021-01-09T08:01:22Z

Yeah +1 to that. We have a similar batching requirement where we'd like to trace which batch events were indexed into.

At the moment we have a preamble process that iterates over the events that are part of the batch and begins and ends a transaction for each of them before we batch them, but as you can imagine there are a lot of things wrong with this approach.

mitoihs · 2021-01-11T11:14:37Z

@nikhilbhaware007 when I wrote that comment, I was scanning through available solutions to choose something. We don't use anything yet, but will use OpenTelemetry as a... well, not exactly a backend but a "protocol"? We'll probably store it in Elasticsearch and use a custom frontend to display our traces. Currently, only Jaeger (among the few solutions I've checked) has limited support for displaying such multi-parent traces, and it's not good enough for us.

This commit adds instrumentation for Azure Service Bus when an application is using Microsoft.Azure.ServiceBus 3.0.0+ or Azure.Messaging.ServiceBus 7.0.0+ nuget packages.

Two IDiagnosticListener implementations, one for Microsoft.Azure.ServiceBus and another for Azure.Messaging.ServiceBus, create transactions and spans for received and sent messages:

A new transaction is created when
- one or more messages are received from a queue or topic subscription.
- a message is receive deferred from a queue or topic subscription.

A new span is created when there is a current transaction, and when
- one or more messages are sent to a queue or topic.
- one or more messages are scheduled to a queue or a topic.

The diagnostic events do not expose details about sent or received messages. The trace ids of messages are exposed but are not currently captured in this implementation. Messages are often received in batches, and it is possible for each message to have its own trace id, but the APM implementation does not have a concept for capturing such data right now. See elastic/apm#122

A terraform template file is used to create a resource group, Azure Service Bus namespace resource in the resource group, and set RBAC rules to allow the Service Principal that issues the creation access to the resources. The Service Principal credentials are sourced from a .credentials.json file in the root of the repository for CI, and from an account authenticated with az for local development. A default location is set within the template, but all variables can be passed using standard Terraform input variable conventions.

Closes #1157

joshdover · 2022-07-12T12:27:19Z

I have a use case for this in Fleet Server's APM instrumentation. We have a bulk process that batches search and indexing requests from multiple incoming HTTP requests from Elastic Agents into a single _msearch or _bulk request against Elasticsearch. It'd be great to be able to connect each bulk request to its upstream incoming HTTP requests from Elastic Agents.

axw · 2022-07-18T01:06:38Z

@joshdover you (or whoever implements that) may want to subscribe to elastic/apm-agent-go#1243. Support exists in APM Server and Kibana; we're just lacking an API to add links in the Go agent.

felixbarny · 2022-07-25T10:24:19Z

Closing as duplicate of #594

axw added the discussion label Jul 24, 2019

graphaelli mentioned this issue May 7, 2020

Multiple parent tracing #72

Closed

axw mentioned this issue May 16, 2020

exporter/elasticexporter: add Elastic APM exporter open-telemetry/opentelemetry-collector-contrib#240

Merged

axw mentioned this issue Nov 5, 2020

Let applications enable support for follows-from spans elastic/apm-agent-go#843

Closed

russcam mentioned this issue Mar 16, 2021

Add instrumentation for Azure Service Bus elastic/apm-agent-dotnet#1225

Merged

maxekman mentioned this issue Dec 13, 2021

Fix / Register tracing context manually looplab/eventhorizon#372

Merged

axw mentioned this issue Feb 2, 2022

Define a data model for span links elastic/apm-server#7171

Closed

felixbarny added the 8.3-candidate label Feb 2, 2022

felixbarny closed this as not planned Jul 25, 2022

juliaElastic mentioned this issue Aug 11, 2022

[Scalability] Add APM instrumentation with Span Links elastic/fleet-server#1736

Closed
