Links from transactions and spans to multiple spans/transactions/traces #122

Closed
axw opened this issue Jul 24, 2019 · 17 comments

Comments

@axw
Member

axw commented Jul 24, 2019

Currently it is possible to define only one relationship between transactions/spans: a single parent. This covers the most common patterns (namely request/response), but it is not currently possible to trace others, such as:

  • batch processing, where multiple inputs are batched and processed in one operation (multi-parent tracing)
  • within a transaction, receiving and processing events (e.g. polling a message queue) originating from another trace (inter-trace linking)

Additionally, as described in the OpenTelemetry spec, there may be scenarios where a trace must be restarted (i.e. creating a new trace root), and in such cases the restarted trace could be linked to the originating trace.

Proposed changes

The first step is to extend the transaction and span model such that they can be linked to multiple other transactions or spans. Errors would continue to accommodate only a single parent transaction or span.

Intake API

I propose we add the following optional property to the intake schema:

  • Events: span and transaction
  • Field name: links (I'm also partial to refs and references, maybe even <your proposal>)
  • Field type: array, with items having the following type:
{
  "type": "object",
  "properties": {
    "id": {
      "description": "Hex encoded 64-bit random ID of the linked transaction or span.",
      "type": "string",
      "maxLength": 1024
    },
    "trace_id": {
      "description": "Hex encoded 128-bit random ID of the correlated trace.",
      "type": "string",
      "maxLength": 1024
    }
  },
  "required": ["id"]
}

Note that the trace_id field is optional. If it is omitted or empty, then the span or transaction's own trace_id is assumed.
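
To illustrate the defaulting rule above, here is a minimal sketch in Python. The function name `normalize_links` and the sample IDs are hypothetical, chosen only for illustration; this is not code from any agent or from APM Server, and it assumes the event carries its own `trace_id`.

```python
def normalize_links(event):
    """Return the event's links with a trace_id filled in on every link."""
    normalized = []
    for link in event.get("links", []):
        # "id" is the only required property in the proposed schema.
        if not link.get("id"):
            raise ValueError("link is missing the required 'id' field")
        normalized.append({
            "id": link["id"],
            # An absent or empty trace_id falls back to the event's own trace.
            "trace_id": link.get("trace_id") or event["trace_id"],
        })
    return normalized


# Hypothetical transaction with one same-trace link and one cross-trace link.
transaction = {
    "id": "00f067aa0ba902b7",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "links": [
        {"id": "b7ad6b7169203331"},
        {"id": "53995c3f42cd8ad8", "trace_id": "0af7651916cd43dd8448eb211c80319c"},
    ],
}
links = normalize_links(transaction)
```

The first link inherits the transaction's trace ID, while the second keeps its explicit one.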

ES Mapping

We have two main options here: store as nested docs, or store as an array of objects.

Using nested means that for every link, there will be an additional document in ES, which could introduce performance issues. I don't think it's a good idea to go down this road.

Using an array of objects for the links means that we cannot search on both trace ID and span/transaction ID and match only those links in which both fields match, because Elasticsearch flattens arrays of objects and loses the association between the fields of each object. We could deal with this in one of two ways:

  • combine the ID/trace ID in the documents, so we end up storing it as something like: "links": ["trace_id:span_id", "trace_id:span_id"]
  • rely on the keys being individually random enough to make multiple matches highly unlikely. i.e. just store as an array of objects, do nothing special

The searches we're most likely to run are of the form "find all spans linked to span X in trace Y" within the configured time-frame. I expect it is highly unlikely that we would ever find a repeated span ID AND have the same trace ID involved, so either approach is probably fine. Structured data is generally easier to deal with.
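
The trade-off between the two storage options can be sketched in plain Python. This is only a model of how a flattened array of objects behaves, not actual Elasticsearch queries; the helper names and the placeholder IDs are hypothetical.

```python
def flatten(links):
    """Model an array of objects as one multi-valued field per key."""
    return {
        "links.id": {link["id"] for link in links},
        "links.trace_id": {link["trace_id"] for link in links},
    }


def matches_flattened(doc_links, trace_id, span_id):
    """Both values are present somewhere, not necessarily in the same link."""
    fields = flatten(doc_links)
    return span_id in fields["links.id"] and trace_id in fields["links.trace_id"]


def matches_combined(doc_links, trace_id, span_id):
    """Match on a single "trace_id:span_id" key, so both halves must agree."""
    combined = {f'{link["trace_id"]}:{link["id"]}' for link in doc_links}
    return f"{trace_id}:{span_id}" in combined


doc_links = [
    {"trace_id": "trace-a", "id": "span-1"},
    {"trace_id": "trace-b", "id": "span-2"},
]

# span-2 belongs to trace-b, yet the flattened model reports a match.
false_positive = matches_flattened(doc_links, "trace-a", "span-2")
exact = matches_combined(doc_links, "trace-a", "span-2")
```

With real random IDs a cross-object match like this should be rare, which is the argument for storing plain objects and doing nothing special.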

UI

Needs input from @elastic/apm-ui and design as to how we do it, but we should render the links in the UI, perhaps as a list in the transaction details and span details flyout. We can defer discussing the specifics, so long as we can come up with an ES mapping that is flexible enough.

@axw axw added the discussion label Jul 24, 2019
@axw
Member Author

axw commented Jul 24, 2019

The main questions I think we need to discuss are:

  • how to store the links?
  • are there any other searchable fields we want to store on links? If we need to add additional searchable link fields, then that makes array-of-objects less tenable

@felixbarny
Member

SGTM in general

A few more questions:

  • Will the link to the parent also be in the links array or only additional links?
  • What about a type field for the links like child_of/follows_from?
  • What happens when there are multiple parent links? Which one will be in the array and which in the parent_id?
  • Should we only link to spans which are referenced in links or should they be part of the waterfall?

@axw
Member Author

axw commented Jul 25, 2019

Will the link to the parent also be in the links array or only additional links?

I didn't plan to have it in there, but I'm open to arguments.

What about a type field for the links like child_of/follows_from?

Not those specifically, unless we intend to do something with them. I do think we need to be more specific about the link types though (more at the end).

What happens when there are multiple parent links? Which one will be in the array and which in the parent_id?

My initial thought was to use the first parent observed in the parent field, all others in the links, but that might be a bit too naive. Not too sure on the answer here, depends on whether we want to visualise multi-parent relationships.
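
The naive option described above could look roughly like the following sketch. It assumes parents are available as `(trace_id, span_id)` pairs in the order they were observed; `split_parents` is a hypothetical helper, not an existing agent API.

```python
def split_parents(parents):
    """Split observed parents into (parent_id, links).

    The first observed parent becomes parent_id; the rest become links.
    """
    if not parents:
        return None, []
    first, *rest = parents
    parent_id = first[1]
    links = [{"trace_id": trace_id, "id": span_id} for trace_id, span_id in rest]
    return parent_id, links


# Hypothetical batch with three parents, one of them from another trace.
parent_id, links = split_parents([
    ("trace-a", "span-1"),
    ("trace-b", "span-2"),
    ("trace-a", "span-3"),
])
```

Note that the result depends entirely on observation order, which is part of why this may be too naive.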

Should we only link to spans which are referenced in links or should they be part of the waterfall?

Again, not sure, but I think we'll need to figure this out before we can proceed after all. I can't imagine how we would extend our existing visualisation to account for multiple parents which may cross traces.

"Links" is too generic/vague a concept to be useful for visualisation in a tree anyway. At most we could use them for creating a list of links under transaction/span details. For some kind of DAG visualisation we would need to know the link type, specifically whether it's a parent or child (i.e. the arc direction).

I think perhaps instead of adding support for generic links, we should change this proposal to focus on adding support for multiple parents.

axw added a commit to axw/opentelemetry-collector-contrib that referenced this issue Jun 1, 2020
Metrics are currently not exported; we'll wait for
the data model changes to settle, so we can build
the translation off the OTLP representation.

Not all of the OpenTelemetry model is covered by
Elastic APM. In particular, there's currently no
support for links or events. We'll add support for
events later, and most likely links too
(see elastic/apm#122).
tigrannajaryan pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue Jun 1, 2020
This PR introduces an exporter for [Elastic APM](https://www.elastic.co/apm). The exporter works by translating spans and metrics into the ND-JSON format expected by Elastic APM Server, and sending over HTTP.

Currently only spans are supported. Code for translating metrics exists, but is not yet wired up to the exporter; we'll do that once the switch over to the new metrics model is done.

Not all of the OpenTelemetry model is covered by Elastic APM. In particular, there's currently no support for links or span events. We'll add support for events later, and most likely links too (see elastic/apm#122).

**Testing:**

Unit tests added for translating resources, spans, and metrics to the Elastic APM model. This has been tested using a mock in-memory Elastic APM Server. Coverage is > 80%.

Manually tested, sending to an [Elastic Cloud](https://cloud.elastic.co/) deployment.

**Documentation:**

Added a README, which describes the exporter's config.
@mitoihs

mitoihs commented Nov 12, 2020

The lack of this feature was a blocker for us to use Elastic APM. We have microservices running a data processing pipeline with scatter & gather (fork & join) steps, so we need spans that can be part of multiple different traces.

@SergeyKleyman
Contributor

@mitoihs Have you considered using labels?

@SergeyKleyman
Contributor

Question 1: Are there use cases where we expect agents to fill in links automatically? Or do we expect links to be set via the public API?
Question 2: Are there use cases where we expect the backend to use links in some way?
If the answer to both questions is no, then why do we need a special property rather than letting users gather and store this information with labels?

@axw
Member Author

axw commented Nov 16, 2020

Question 1: Are there use cases where we expect agents to fill in links automatically? Or do we expect links to be set via the public API?

I think message queue instrumentation is one case where we would do this. e.g. receiving a message within a transaction would link said transaction to the span that published the message to the queue. @eyalkoren may have more to say on this.

Question 2: Are there use cases where we expect the backend to use links in some way?

This question is unresolved, which is why this issue hasn't progressed yet. I would expect the links to show up in the UI, which is why I would expect them to have their own place in the data model.

@graphaelli
Member

ECS uses related for "pivoting around a piece of data" which might fit here as keyword fields related.id and related.trace_id based on the second mapping proposal in the description. A top level related.id sounds too general though - two alternatives I can think of: 1. related.span_id and make it apply for transactions too 2. nest both under trace, for trace.related.id and trace.related.trace_id. My hesitation around 2 is whether it makes sense in non-trace context, eg would a log event with log.trace.id and log.trace.related.trace.id make sense?

@eyalkoren
Contributor

eyalkoren commented Nov 16, 2020

I think message queue instrumentation is one case where we would do this. e.g. receiving a message within a transaction would link said transaction to the span that published the message to the queue. @eyalkoren may have more to say on this.

Indeed, for example when using a scheduled task (for which we create a transaction) that reads a message (or a bulk of messages) from a queue; or a send-and-reply scenario where the reply-receiving span has a parent and may be linked to the reply sender span in addition.
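
As a rough sketch of the bulk-receive case, the receiving transaction could collect one link per message. This assumes, purely for illustration, that the publisher propagated its context in a W3C `traceparent` message header and that messages are dicts with a `headers` mapping; `links_from_messages` is a hypothetical helper, not an agent API.

```python
import re

# W3C traceparent layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT = re.compile(r"^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$")


def links_from_messages(messages):
    """Build a de-duplicated list of links from a batch of received messages."""
    links = []
    seen = set()
    for message in messages:
        header = message.get("headers", {}).get("traceparent", "")
        match = TRACEPARENT.match(header)
        if match is None:
            continue  # no usable context on this message, so no link
        trace_id, span_id = match.groups()
        if (trace_id, span_id) not in seen:
            seen.add((trace_id, span_id))
            links.append({"trace_id": trace_id, "id": span_id})
    return links


# Hypothetical batch: two distinct publishers, one repeat, one untraced message.
batch = [
    {"headers": {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}},
    {"headers": {"traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"}},
    {"headers": {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}},
    {"headers": {}},
]
links = links_from_messages(batch)
```

Each message can carry a different trace ID, which is why a single parent field is not enough for this scenario.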

@mitoihs

mitoihs commented Nov 16, 2020

@mitoihs Have you considered using labels?

I didn't. I wanted to stay aligned with the OpenTelemetry specification, which uses links that are probably functionally similar. I don't want to depend on an Elastic APM-specific implementation. Using OpenTelemetry gives me the option to switch between multiple "backends" for tracing.

@estolfo
Contributor

estolfo commented Nov 25, 2020

Here is an example use case in Ruby with the background job processing library, Sidekiq.

@nikhilbhaware007

nikhilbhaware007 commented Jan 5, 2021

I have a similar requirement, where multiple inputs are batched and processed in one operation (multi-parent tracing). Is there an ETA for this? @mitoihs which backend did you end up using to support this use case?

@ghost

ghost commented Jan 9, 2021

Yeah +1 to that. We have a similar batching requirement where we'd like to trace which batch events were indexed into.

At the moment we have a preamble process that iterates over the events that are part of the batch and begins and ends a transaction for each of them before we batch them, but as you can imagine there are a lot of things wrong with this approach.

@mitoihs

mitoihs commented Jan 11, 2021

@nikhilbhaware007 when I wrote that comment, I was scanning through the available solutions to choose one. We don't use anything yet, but we will use OpenTelemetry as a... well, not exactly a backend but a "protocol"? We'll probably store the data in Elasticsearch and use a custom frontend to display our traces. Currently, only Jaeger (among the few solutions I've checked) has limited support for displaying such multi-parent traces, and it's not good enough for us.

russcam added a commit to elastic/apm-agent-dotnet that referenced this issue Apr 6, 2021
This commit adds instrumentation for Azure Service Bus when an application is 
using Microsoft.Azure.ServiceBus 3.0.0+ or Azure.Messaging.ServiceBus 7.0.0+ nuget packages.

Two IDiagnosticListener implementations, one for Microsoft.Azure.ServiceBus 
and another for Azure.Messaging.ServiceBus, create transactions and spans for received 
and sent messages:

A new transaction is created when

- one or more messages are received from a queue or topic subscription.
- a message is receive deferred from a queue or topic subscription.

A new span is created when there is a current transaction, and when

- one or more messages are sent to a queue or topic.
- one or more messages are scheduled to a queue or a topic.

The diagnostic events do not expose details about sent or received messages.
The trace ids of messages are exposed but are not currently captured in this implementation.
Messages are often received in batches, and it is possible for each message to have its
own trace id, but the APM implementation does not have a concept for capturing such
data right now. See elastic/apm#122

A terraform template file is used to create a resource group, Azure Service Bus namespace 
resource in the resource group, and set RBAC rules to allow the Service Principal that issues
the creation access to the resources. The Service Principal credentials are sourced from
a .credentials.json file in the root of the repository for CI, and from an account authenticated
with az for local development. A default location is set within the template, but all variables 
can be passed using standard Terraform input variable conventions.

Closes #1157
@joshdover

I have a use case for this in Fleet Server's APM instrumentation. We have a bulk process that batches search and indexing requests from multiple incoming HTTP requests from Elastic Agents into a single _msearch or _bulk request against Elasticsearch. It'd be great to be able to connect each bulk request to its upstream incoming HTTP requests from Elastic Agents.

@axw
Member Author

axw commented Jul 18, 2022

@joshdover you (or whoever implements that) may want to subscribe to elastic/apm-agent-go#1243. Support exists in APM Server and Kibana; we're just lacking an API to add links in the Go agent.

@felixbarny
Member

Closing as duplicate of #594
