Add outcome to transactions and spans (#299)
felixbarny authored Aug 24, 2020
1 parent 0c78d54 commit 9c6dd55
Showing 4 changed files with 76 additions and 5 deletions.
7 changes: 7 additions & 0 deletions specs/agents/error-tracking.md
@@ -8,3 +8,10 @@ The agent support reporting exceptions/errors. Errors may come in one of two for
Agents should include exception handling in the instrumentation they provide, such that exceptions are reported to the APM Server automatically, without intervention. In addition, hooks into logging libraries may be provided such that logged errors are also sent to the APM Server.

Errors may or may not occur within the context of a transaction or span. If they do, then they will be associated with them by recording the trace ID and transaction or span ID. This enables the APM UI to annotate traces with errors.

### Impact on the `outcome`

Tracking an error that's related to a transaction does not impact its `outcome`.
A transaction might have multiple errors associated with it but still return with a 2xx status code.
Hence, the status code is a more reliable signal for the outcome of the transaction.
This, in turn, means that the `outcome` is always specific to the protocol.
11 changes: 7 additions & 4 deletions specs/agents/tracing-instrumentation-http.md
@@ -4,8 +4,10 @@ Agents should instrument HTTP request routers/handlers, starting a new transacti

- The transaction `type` should be `request`.
- The transaction `result` should be `HTTP Nxx`, where N is the first digit of the status code (e.g. `HTTP 4xx` for a 404)
- The transaction `outcome` should be `"success"` for HTTP status codes < 500 and `"failure"` for status codes >= 500. \
Status codes in the 4xx range (client errors) are not considered a `failure` as the failure has not been caused by the application itself but by the caller.
As there's no browser API to get the status code of a page load, the RUM agent always reports `"unknown"` for those transactions.
- The transaction `name` should be aggregatable, such as the route or handler name. Examples:

- `GET /users/{id}`
- `UsersController#index`
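
The `result` and `outcome` rules above can be sketched as follows. This is a minimal illustration, not any agent's actual implementation: the function names are hypothetical, and passing `None` for the status code is an assumed way to model the case where no status code is available (such as RUM page loads).

```python
from typing import Optional


def http_transaction_result(status_code: int) -> str:
    """Return the transaction result, e.g. 'HTTP 4xx' for a 404."""
    return f"HTTP {status_code // 100}xx"


def http_transaction_outcome(status_code: Optional[int]) -> str:
    """Map a server-side HTTP status code to a transaction outcome.

    Hypothetical helper: None models a missing status code,
    which is reported as "unknown".
    """
    if status_code is None:
        return "unknown"
    # Client errors (4xx) are not caused by the application itself,
    # so only status codes >= 500 count as a failure.
    return "success" if status_code < 500 else "failure"
```

With this sketch, a 404 yields the result `HTTP 4xx` and the outcome `"success"`, while a 503 yields the outcome `"failure"`.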

@@ -40,7 +42,8 @@ We capture spans for outbound HTTP requests. These should have a type of `extern

For outbound HTTP request spans we capture the following http-specific span context:

- `http.url` (the target URL)
- `http.status_code` (the response status code)
- `http.url` (the target URL) \
The captured URL should have the userinfo (username and password), if any, redacted.
- `http.status_code` (the response status code) \
The span's `outcome` should be set to `"success"` if the status code is lower than 400 and to `"failure"` otherwise.

The captured URL should have the userinfo (username and password), if any, redacted.
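
As a rough sketch of the two rules for outbound HTTP spans, the snippet below redacts userinfo from a URL and derives the span outcome from the response status code. The function names are hypothetical, and dropping the userinfo entirely is only one possible reading of "redacted"; an agent might substitute a placeholder instead.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_userinfo(url: str) -> str:
    """Return the URL without its userinfo (username and password).

    Assumption: redaction is done by removing the userinfo component;
    replacing it with a placeholder would also satisfy the rule.
    """
    parts = urlsplit(url)
    # Everything before the last "@" in the network location is userinfo.
    netloc = parts.netloc.rpartition("@")[2]
    return urlunsplit(parts._replace(netloc=netloc))


def http_span_outcome(status_code: int) -> str:
    """Map the response status code of an outbound request to a span outcome."""
    # Unlike transactions, client errors (4xx) count as a failure for spans.
    return "success" if status_code < 400 else "failure"
```

Note the difference from the transaction rule: a 404 response makes the calling span a `"failure"`, even though the transaction on the receiving service is a `"success"`.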
21 changes: 21 additions & 0 deletions specs/agents/tracing-spans.md
@@ -2,6 +2,27 @@


The agent should also have a sense of the most common libraries for these and instrument them without any further setup from the app developers.

#### Span outcome

The `outcome` property denotes whether the span represents a success or a failure.
It supports the same values as `transaction.outcome`.
The only semantic difference is that client errors set the `outcome` to `"failure"`.
Agents should try to determine the outcome for spans created by auto instrumentation,
which is especially important for exit spans (spans representing requests to other services).

While the transaction outcome lets you reason about the error rate from the service's point of view,
other services might have a different perspective on that.
For example, if there's a network error so that service A can't call service B,
the error rate of service B is 100% from service A's perspective.
However, as service B doesn't receive any requests, the error rate is 0% from service B's perspective.
The `span.outcome` also allows reasoning about error rates of external services.

#### Outcome API

Agents should expose an API to manually override the outcome.
This value must always take precedence over the automatically determined value.
The documentation should clarify that spans with `unknown` outcomes are ignored in the error rate calculation.

#### Span stack traces

Spans may have an associated stack trace, in order to locate the associated source code that caused the span to occur. If there are many spans being collected this can cause a significant amount of overhead in the application, due to the capture, rendering, and transmission of potentially large stack traces. It is possible to limit the recording of span stack traces to only spans that are slower than a specified duration, using the config variable `ELASTIC_APM_SPAN_FRAMES_MIN_DURATION`.
42 changes: 41 additions & 1 deletion specs/agents/tracing-transactions.md
@@ -4,4 +4,44 @@ Transactions are a special kind of span.
They represent the entry into a service.
They are sometimes also referred to as local roots or entry spans.

Transactions are created either by the built-in auto-instrumentation of an agent or the [tracer API](tracing-api.md).
Transactions are created either by the built-in auto-instrumentation of an agent or the [tracer API](tracing-api.md).

#### Transaction outcome

The `outcome` property denotes whether the transaction represents a success or a failure from the perspective of the entity that produced the event.
The APM Server converts this to the [`event.outcome`](https://www.elastic.co/guide/en/ecs/current/ecs-allowed-values-event-outcome.html) field.
This property is optional to preserve backwards compatibility.
If an agent doesn't report the `outcome` (or reports `null`), the APM Server sets the outcome to `"unknown"`.

- `"failure"`: Indicates that this transaction describes a failed result. \
Note that client errors (such as HTTP 4xx) don't fall into this category as they are not an error from the perspective of the server.
- `"success"`: Indicates that this transaction describes a successful result.
- `"unknown"`: Indicates that there's no information about the outcome.
This is the default value that applies when an outcome has not been set explicitly.
This may be the case when a user tracks a custom transaction without explicitly setting an outcome.
For existing auto-instrumentations, agents should set the outcome either to `"failure"` or `"success"`.

What counts as a failed or successful request depends on the protocol and does not depend on whether there are error documents associated with a transaction.

##### Error rate

The error rate of a transaction group is based on the `outcome` of its transactions.

error_rate = failure / (failure + success)

Note that when calculating the error rate,
transactions with an `unknown` or non-existent outcome are not considered.

The calculation just looks at the subset of transactions where the result is known and extrapolates the error rate for the total population.
This prevents `unknown` or non-existent outcomes from reducing the error rate,
which would happen when looking at a mix of old and new agents,
or when looking at RUM data (as page load transactions have an `unknown` outcome).
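
A minimal sketch of this calculation is shown below. The function name is hypothetical, and returning `None` when no transaction has a known outcome is an assumption; the spec does not say what to report in that case.

```python
from typing import Iterable, Optional


def error_rate(outcomes: Iterable[Optional[str]]) -> Optional[float]:
    """Compute failure / (failure + success) for a transaction group.

    Outcomes that are "unknown" or missing (None) are ignored.
    Assumption: returns None if no outcome is known, since the
    rate is undefined in that case.
    """
    failure = 0
    success = 0
    for outcome in outcomes:
        if outcome == "failure":
            failure += 1
        elif outcome == "success":
            success += 1
        # "unknown" and None do not contribute to either counter.
    known = failure + success
    return failure / known if known else None
```

For example, one failure, three successes, and any number of unknown outcomes give an error rate of 0.25, the same as if the unknown transactions had not been recorded at all.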

Also note that this only reflects the error rate as perceived from the application itself.
The error rate perceived by its clients is greater than or equal to that.

##### Outcome API

Agents should expose an API to manually override the outcome.
This value must always take precedence over the automatically determined value.
The documentation should clarify that transactions with `unknown` outcomes are ignored in the error rate calculation.
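
The precedence rule can be illustrated with the sketch below. The class and method names are hypothetical and do not reflect any agent's public API; the point is only that a manually set outcome wins over the automatically determined one, regardless of the order in which they are set.

```python
from typing import Optional

VALID_OUTCOMES = {"success", "failure", "unknown"}


class Transaction:
    """Hypothetical transaction object tracking two outcome sources."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._auto_outcome: Optional[str] = None  # set by instrumentation
        self._user_outcome: Optional[str] = None  # set via the public API

    def set_auto_outcome(self, outcome: str) -> None:
        """Called by auto-instrumentation, e.g. based on the status code."""
        self._check(outcome)
        self._auto_outcome = outcome

    def set_outcome(self, outcome: str) -> None:
        """Public API: manually override the outcome."""
        self._check(outcome)
        self._user_outcome = outcome

    @property
    def outcome(self) -> str:
        # The manual override always takes precedence over the
        # automatically determined value; "unknown" is the default.
        if self._user_outcome is not None:
            return self._user_outcome
        if self._auto_outcome is not None:
            return self._auto_outcome
        return "unknown"

    @staticmethod
    def _check(outcome: str) -> None:
        if outcome not in VALID_OUTCOMES:
            raise ValueError(f"invalid outcome: {outcome!r}")
```

Keeping the two values in separate fields, rather than overwriting a single one, means the instrumentation can set its value after the user has called the API without undoing the override.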
