Add transaction and span outcome gherkin feature file (#402)
* Add transaction and span outcome gherkin feature file

* update gherkin & written spec

* fix wording

* clarify span spec

* update gRPC statuses mapping for transactions

* add examples of non-deterministic outcomes

* fix wording

Co-authored-by: eyalkoren <41850454+eyalkoren@users.noreply.github.com>

* sync executable spec with spec for humans

* fix wording

Co-authored-by: Emily S <emily.s@elastic.co>

Co-authored-by: Sylvain Juge <sylvain.juge@elastic.co>
Co-authored-by: SylvainJuge <syl20j@gmail.com>
Co-authored-by: eyalkoren <41850454+eyalkoren@users.noreply.github.com>
4 people authored Feb 15, 2021
1 parent 18f499c commit 3898c7f
Showing 5 changed files with 202 additions and 12 deletions.
55 changes: 55 additions & 0 deletions specs/agents/tracing-instrumentation-grpc.md
@@ -19,6 +19,7 @@ Server and Client Unary request/response calls are instrumented. Support for oth
* **type**: `request`
* **trace_context**: \<trace-context\>
* **result**: [\<a-valid-result-value\>](https://github.com/grpc/grpc/blob/master/doc/statuscodes.md#status-codes-and-their-use-in-grpc), ex: `OK`
* **outcome**: See [Outcome](#outcome)

#### Span context

@@ -28,10 +29,64 @@ See [apm#180](https://github.com/elastic/apm/issues/180) and [apm#115](https://g
* **name**: \<method\>, ex: `/helloworld.Greeter/SayHello`
* **type**: `external`
* **subtype**: `grpc`
* **outcome**: See [Outcome](#outcome)
* **destination**:
* **address**: Either an IP (v4 or v6) or a host/domain name.
* **port**: A port number; Should report default ports.
* **service**:
* **resource**: Capture host, and port.
* **name**: Capture the scheme, host, and non-default port.
* **type**: Same as `span.type`

#### Outcome

With gRPC, transaction and span outcomes are set from the gRPC response status.

If such status is not available, then we default to the following:

- `failure` if an error is reported
- `success` otherwise

According to the [gRPC status codes reference spec](https://github.com/grpc/grpc/blob/master/doc/statuscodes.md), some
statuses are not used by the gRPC client & server libraries; their meaning is application specific, and some of them
should be treated as client-side errors.

The gRPC `UNKNOWN` status refers to an error that is not known, thus we should treat it as a `failure` and NOT map it to
an `unknown` outcome.

For gRPC spans (from the client):

- `OK` : `success`
- anything else: `failure`

For gRPC transactions (from the server):

This mapping can be quite subjective, as we know that some statuses are not used by the gRPC server & client
implementations and thus their meaning would be application specific. However, we attempt to report as `failure`
outcomes errors that might require attention from the server point of view and report as `success` all the statuses
that are only relevant on the client-side.

| status | outcome | justification |
| ------------------------- | --------- | ------------------------------------------------ |
| `OK` | `success` | |
| `CANCELLED` | `success` | Operation cancelled by client |
| `UNKNOWN` | `failure` | Error of an unknown type, but still an error |
| `INVALID_ARGUMENT` (*) | `success` | Client-side error |
| `DEADLINE_EXCEEDED` | `failure` | |
| `NOT_FOUND` (*) | `success` | Client-side error (similar to HTTP 404) |
| `ALREADY_EXISTS` (*) | `success` | Client-side error (similar to HTTP 409) |
| `PERMISSION_DENIED` (*) | `success` | Client-side authorization (similar to HTTP 403) |
| `RESOURCE_EXHAUSTED` (*) | `failure` | Likely used for server out of resources |
| `FAILED_PRECONDITION` (*) | `failure` | Similar to UNAVAILABLE |
| `ABORTED` (*) | `failure` | Similar to UNAVAILABLE |
| `OUT_OF_RANGE` (*) | `success` | Client-side error (similar to HTTP 416) |
| `UNIMPLEMENTED` | `success` | Client called a non-implemented feature |
| `INTERNAL` | `failure` | Internal error (similar to HTTP 500) |
| `UNAVAILABLE` | `failure` | Transient error, client may retry with backoff |
| `DATA_LOSS` (*) | `failure` | Lost data should always be reported |
| `UNAUTHENTICATED` (*) | `success` | Client-side authentication (similar to HTTP 401) |

The statuses marked with (*) are not used by gRPC libraries and thus their actual meaning is contextual to the
application.

Also, the gRPC status code for a given transaction should be reported in the `transaction.result` field, thus we still have the
capability to detect an abnormal rate of a given status, in a similar way as we do with HTTP 4xx and 5xx errors.
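
The mapping above can be sketched in a few lines of Python. This is only an illustration of the rules in this section, not an agent implementation: the function and constant names are hypothetical, statuses are passed as their upper-case names, and a missing status is represented by `None`. Falling back to the error-based default for a status name that is not in the table is an assumption of this sketch.

```python
from typing import Optional

# Transaction (server-side) outcome for each gRPC status, copied from the table above.
GRPC_TRANSACTION_OUTCOMES = {
    "OK": "success",
    "CANCELLED": "success",
    "UNKNOWN": "failure",
    "INVALID_ARGUMENT": "success",
    "DEADLINE_EXCEEDED": "failure",
    "NOT_FOUND": "success",
    "ALREADY_EXISTS": "success",
    "PERMISSION_DENIED": "success",
    "RESOURCE_EXHAUSTED": "failure",
    "FAILED_PRECONDITION": "failure",
    "ABORTED": "failure",
    "OUT_OF_RANGE": "success",
    "UNIMPLEMENTED": "success",
    "INTERNAL": "failure",
    "UNAVAILABLE": "failure",
    "DATA_LOSS": "failure",
    "UNAUTHENTICATED": "success",
}


def outcome_from_error(error_reported: bool) -> str:
    """Default used when no gRPC status is available."""
    return "failure" if error_reported else "success"


def grpc_span_outcome(status: Optional[str], error_reported: bool = False) -> str:
    """Client-side rule: only OK is a success."""
    if status is None:
        return outcome_from_error(error_reported)
    return "success" if status == "OK" else "failure"


def grpc_transaction_outcome(status: Optional[str], error_reported: bool = False) -> str:
    """Server-side rule: look the status up in the table."""
    if status is None:
        return outcome_from_error(error_reported)
    # Assumption: a status name missing from the table is handled like a missing status.
    return GRPC_TRANSACTION_OUTCOMES.get(status, outcome_from_error(error_reported))
```

For example, `CANCELLED` yields `failure` for a span but `success` for a transaction, while `UNKNOWN` yields `failure` for both and never the `unknown` outcome.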
21 changes: 16 additions & 5 deletions specs/agents/tracing-instrumentation-http.md
@@ -4,8 +4,8 @@ Agents should instrument HTTP request routers/handlers, starting a new transacti

- The transaction `type` should be `request`.
- The transaction `result` should be `HTTP Nxx`, where N is the first digit of the status code (e.g. `HTTP 4xx` for a 404)
- The transaction `outcome` should be `"success"` for HTTP status codes < 500 and `"failure"` for status codes >= 500. \
Status codes in the 4xx range (client errors) are not considered a `failure` as the failure has not been caused by the application itself but by the caller.
- The transaction `outcome` is set from response status code (see [Outcome](#outcome))

As there's no browser API to get the status code of a page load, the RUM agent always reports `"unknown"` for those transactions.
- The transaction `name` should be aggregatable, such as the route or handler name. Examples:
- `GET /users/{id}`
@@ -80,7 +80,18 @@ For outbound HTTP request spans we capture the following http-specific span cont

- `http.url` (the target URL) \
The captured URL should have the userinfo (username and password), if any, redacted.
- `http.status_code` (the response status code) \
The span's `outcome` should be set to `"success"` if the status code is lower than 400 and to `"failure"` otherwise.
If the request is aborted the `outcome` should be set to `unknown`.
- `http.status_code` (the response status code)
- `outcome` is set from response status code (see [Outcome](#outcome) for details)

## Outcome

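For HTTP transactions (from the server perspective), the `outcome` should be set to `"success"` if the status code is
lower than 500 and to `"failure"` otherwise. Status codes in the 4xx range (client errors) are not considered
a `failure` as the failure has not been caused by the application itself but by the caller.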

For HTTP spans (from the client perspective), the span's `outcome` should be set to `"success"` if the status code is
lower than 400 and to `"failure"` otherwise.

For both transactions and spans, if there is no HTTP status we set `outcome` from the reported error:

- `failure` if an error is reported
- `success` otherwise
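
As a rough illustration of these rules, the sketch below derives both outcomes from a status code. The function names are hypothetical and a missing status is represented by `None`; the 500 threshold for transactions follows from 4xx codes not being treated as failures on the server side.

```python
from typing import Optional


def _outcome_from_error(error_reported: bool) -> str:
    # Fallback when no HTTP status is available.
    return "failure" if error_reported else "success"


def http_transaction_outcome(status_code: Optional[int], error_reported: bool = False) -> str:
    """Server perspective: only 5xx responses count as failures."""
    if status_code is None:
        return _outcome_from_error(error_reported)
    return "failure" if status_code >= 500 else "success"


def http_span_outcome(status_code: Optional[int], error_reported: bool = False) -> str:
    """Client perspective: 4xx and 5xx responses count as failures."""
    if status_code is None:
        return _outcome_from_error(error_reported)
    return "success" if status_code < 400 else "failure"
```

A 404 response is therefore a `success` for the server-side transaction and a `failure` for the client-side span.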
29 changes: 23 additions & 6 deletions specs/agents/tracing-spans.md
@@ -14,13 +14,14 @@ for simpler and more performant UI queries.

### Span outcome

The `outcome` property denotes whether the span represents a success or a failure.
It supports the same values as `transaction.outcome`.
The only semantic difference is that client errors set the `outcome` to `"failure"`.
Agents should try to determine the outcome for spans created by auto instrumentation,
which is especially important for exit spans (spans representing requests to other services).
The `outcome` property denotes whether the span represents a success or a failure. It is used to compute error rates
of calls to external services (exit spans) from the monitored application. It supports the same values as `transaction.outcome`.

If an agent doesn't report the `outcome` (or reports `null`), the APM Server will set it based on `context.response.status_code`. If the status code is not available, then it will be set to `"unknown"`.
This property is optional to preserve backwards compatibility, thus it is allowed to omit it or use a `null` value.

If an agent does not report the `outcome` property (or uses a `null` value), then the outcome will be set according to the HTTP
response status if available, or to `unknown` otherwise. This allows a server-side fallback for existing
agents that might not report `outcome`.
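
The fallback described above could look roughly like the following sketch. It is not the APM Server code: the function name is hypothetical, and the `< 400` threshold is an assumption borrowed from the HTTP span rule, since the text only says the outcome is derived from the HTTP response status.

```python
from typing import Optional


def resolve_span_outcome(reported: Optional[str], http_status: Optional[int]) -> str:
    """Return the outcome to store for a span received from an agent."""
    if reported is not None:
        # The agent reported an outcome: keep it unchanged.
        return reported
    if http_status is None:
        # Nothing to infer from: stay neutral rather than guess.
        return "unknown"
    # Assumption: same threshold as the HTTP span rule (client perspective).
    return "success" if http_status < 400 else "failure"
```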

While the transaction outcome lets you reason about the error rate from the service's point of view,
other services might have a different perspective on that.
@@ -29,6 +30,22 @@ the error rate of service B is 100% from service A's perspective.
However, as service B doesn't receive any requests, the error rate is 0% from service B's perspective.
The `span.outcome` also allows reasoning about error rates of external services.

The following protocols get their outcome from protocol-level attributes:

- [gRPC](tracing-instrumentation-grpc.md#outcome)
- [HTTP](tracing-instrumentation-http.md#outcome)

For other protocols, we can default to the following behavior:

- `failure` when an error is reported
- `success` otherwise

Also, while we encourage most instrumentations to create spans that have deterministic outcomes, there are a few
examples for which we might still have to report `unknown` outcomes to prevent reporting any misleading information:
- Inferred spans created through a sampling profiler: those are not exit spans, and we can't know if they should be reported
as either `failure` or `success` because errors cannot be captured.
- External process execution: we can't know the `outcome` until the process has exited with an exit code.

### Outcome API

Agents should expose an API to manually override the outcome.
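
A minimal sketch of how such an override could take priority over the outcome computed by instrumentation is shown below. The `Span` class and its `set_outcome` and `end` methods are hypothetical names used for illustration only, and defaulting to `unknown` when nothing was set is an assumption of this sketch.

```python
from typing import Optional

VALID_OUTCOMES = ("success", "failure", "unknown")


class Span:
    """Toy span that tracks user-set and instrumentation-set outcomes separately."""

    def __init__(self) -> None:
        self._user_outcome: Optional[str] = None
        self._instrumentation_outcome: Optional[str] = None

    def set_outcome(self, outcome: str) -> None:
        """Public API: the value set here always wins."""
        if outcome not in VALID_OUTCOMES:
            raise ValueError(f"invalid outcome: {outcome!r}")
        self._user_outcome = outcome

    def end(self, outcome: Optional[str] = None) -> None:
        """Called by instrumentation when the span terminates."""
        self._instrumentation_outcome = outcome

    @property
    def outcome(self) -> str:
        if self._user_outcome is not None:
            return self._user_outcome
        if self._instrumentation_outcome is not None:
            return self._instrumentation_outcome
        return "unknown"
```

Keeping the two values in separate fields means the order of the calls does not matter: the user-set value is returned whether it was set before or after the span ended.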
12 changes: 11 additions & 1 deletion specs/agents/tracing-transactions.md
@@ -21,7 +21,17 @@ If an agent doesn't report the `outcome` (or reports `null`), the APM Server wil
This may be the case when a user tracks a custom transaction without explicitly setting an outcome.
For existing auto-instrumentations, agents should set the outcome either to `"failure"` or `"success"`.

What counts as a failed or successful request depends on the protocol and does not depend on whether there are error documents associated with a transaction.
What counts as a failed or successful request depends on the protocol.

The following protocols get their outcome from protocol-level attributes:

- [gRPC](tracing-instrumentation-grpc.md#outcome)
- [HTTP](tracing-instrumentation-http.md#outcome)

For other protocols, we can default to the following behavior:

- `failure` when an error is reported
- `success` otherwise

#### Error rate

97 changes: 97 additions & 0 deletions tests/agents/gherkin-specs/outcome.feature
@@ -0,0 +1,97 @@
Feature: Outcome

# ---- user set outcome

Scenario: User set outcome on span has priority over instrumentation
Given an agent
And an active span
And user sets span outcome to 'failure'
And span terminates with outcome 'success'
Then span outcome is 'failure'

Scenario: User set outcome on transaction has priority over instrumentation
Given an agent
And an active transaction
And user sets transaction outcome to 'unknown'
And transaction terminates with outcome 'failure'
Then transaction outcome is 'unknown'

# ---- span & transaction outcome from reported errors

Scenario: span with error
Given an agent
And an active span
And span terminates with an error
Then span outcome is 'failure'

Scenario: span without error
Given an agent
And an active span
And span terminates without error
Then span outcome is 'success'

Scenario: transaction with error
Given an agent
And an active transaction
And transaction terminates with an error
Then transaction outcome is 'failure'

Scenario: transaction without error
Given an agent
And an active transaction
And transaction terminates without error
Then transaction outcome is 'success'

# ---- HTTP

@http
Scenario Outline: HTTP transaction and span outcome
Given an agent
And an HTTP transaction with <status> response code
Then transaction outcome is "<server>"
Given an HTTP span with <status> response code
Then span outcome is "<client>"
Examples:
| status | client | server |
| 100 | success | success |
| 200 | success | success |
| 300 | success | success |
| 400 | failure | success |
| 404 | failure | success |
| 500 | failure | failure |
| -1 | failure | failure |
# last row with negative status represents the case where the status is not available
# for example when an exception/error is thrown without status (IO error, redirect loop, ...)

# ---- gRPC

# reference spec : https://github.com/grpc/grpc/blob/master/doc/statuscodes.md

@grpc
Scenario Outline: gRPC transaction and span outcome
Given an agent
And a gRPC transaction with '<status>' status
Then transaction outcome is "<server>"
Given a gRPC span with '<status>' status
Then span outcome is "<client>"
Examples:
| status | client | server |
| OK | success | success |
| CANCELLED | failure | success |
| UNKNOWN | failure | failure |
| INVALID_ARGUMENT | failure | success |
| DEADLINE_EXCEEDED | failure | failure |
| NOT_FOUND | failure | success |
| ALREADY_EXISTS | failure | success |
| PERMISSION_DENIED | failure | success |
| RESOURCE_EXHAUSTED | failure | failure |
| FAILED_PRECONDITION | failure | failure |
| ABORTED | failure | failure |
| OUT_OF_RANGE | failure | success |
| UNIMPLEMENTED | failure | success |
| INTERNAL | failure | failure |
| UNAVAILABLE | failure | failure |
| DATA_LOSS | failure | failure |
| UNAUTHENTICATED | failure | success |
| n/a | failure | failure |
# last row with 'n/a' status represents the case where status is not available
