Add transaction and span outcome gherkin feature file (#402)

* Add transaction and span outcome gherkin feature file
* update gherkin & written spec
* fix wording
* clarify span spec
* update gRPC statuses mapping for transactions
* add examples of non-deterministic outcomes
* fix wording

Co-authored-by: eyalkoren <41850454+eyalkoren@users.noreply.github.com>

* sync executable spec with spec for humans
* fix wording

Co-authored-by: Emily S <emily.s@elastic.co>
Co-authored-by: Sylvain Juge <sylvain.juge@elastic.co>
Co-authored-by: SylvainJuge <syl20j@gmail.com>
Co-authored-by: eyalkoren <41850454+eyalkoren@users.noreply.github.com>
elastic · Feb 15, 2021 · 3898c7f
1 parent 18f499c
Showing 5 changed files with 202 additions and 12 deletions.
diff --git a/specs/agents/tracing-instrumentation-grpc.md b/specs/agents/tracing-instrumentation-grpc.md
@@ -19,6 +19,7 @@ Server and Client Unary request/response calls are instrumented. Support for oth
 * **type**: `request`
 * **trace_context**: \<trace-context\>
 * **result**: [\<a-valid-result-value\>](https://github.com/grpc/grpc/blob/master/doc/statuscodes.md#status-codes-and-their-use-in-grpc), ex: `OK`
+* **outcome**: See [Outcome](#outcome)
 
 #### Span context
 
@@ -28,10 +29,64 @@ See [apm#180](https://github.com/elastic/apm/issues/180) and [apm#115](https://g
 * **name**: \<method\>, ex: `/helloworld.Greeter/SayHello`
 * **type**: `external`
 * **subtype**: `grpc`
+* **outcome**: See [Outcome](#outcome)
 * **destination**:
   * **address**: Either an IP (v4 or v6) or a host/domain name.
   * **port**: A port number; Should report default ports.
   * **service**:
     * **resource**: Capture host, and port.
     * **name**: Capture the scheme, host, and non-default port.
     * **type**: Same as `span.type`
+
+#### Outcome
+
+With gRPC, the transaction and span outcome is set from the gRPC response status.
+
+If no such status is available, we default to the following:
+
+- `failure` if an error is reported
+- `success` otherwise
+
+According to the [gRPC status codes reference spec](https://github.com/grpc/grpc/blob/master/doc/statuscodes.md), some
+statuses are never generated by the gRPC client & server libraries, so their meaning is application specific and some
+of them should be considered client-side errors.
+
+The gRPC `UNKNOWN` status refers to an error that is not known, thus we should treat it as a `failure` and NOT map it to
+an `unknown` outcome.
+
+For gRPC spans (from the client):
+
+- `OK` : `success`
+- anything else: `failure`
+
+For gRPC transactions (from the server):
+
+This mapping can be quite subjective, as we know that some statuses are not used by the gRPC server & client
+implementations and thus their meaning would be application specific. However, we attempt to report as `failure`
+the errors that might require attention from the server's point of view, and to report as `success` all the statuses
+that are only relevant on the client side.
+
+| status                    | outcome   | justification                                    |
+| ------------------------- | --------- | ------------------------------------------------ |
+| `OK`                      | `success` |                                                  |
+| `CANCELLED`               | `success` | Operation cancelled by client                    |
+| `UNKNOWN`                 | `failure` | Error of an unknown type, but still an error     |
+| `INVALID_ARGUMENT` (*)    | `success` | Client-side error                                |
+| `DEADLINE_EXCEEDED`       | `failure` |                                                  |
+| `NOT_FOUND` (*)           | `success` | Client-side error (similar to HTTP 404)          |
+| `ALREADY_EXISTS` (*)      | `success` | Client-side error (similar to HTTP 409)          |
+| `PERMISSION_DENIED` (*)   | `success` | Client lacks permission (similar to HTTP 403)    |
+| `RESOURCE_EXHAUSTED` (*)  | `failure` | Likely used for server out of resources          |
+| `FAILED_PRECONDITION` (*) | `failure` | Similar to UNAVAILABLE                           |
+| `ABORTED` (*)             | `failure` | Similar to UNAVAILABLE                           |
+| `OUT_OF_RANGE` (*)        | `success` | Client-side error (similar to HTTP 416)          |
+| `UNIMPLEMENTED`           | `success` | Client called a non-implemented feature          |
+| `INTERNAL`                | `failure` | Internal error (similar to HTTP 500)             |
+| `UNAVAILABLE`             | `failure` | Transient error, client may retry with backoff   |
+| `DATA_LOSS` (*)           | `failure` | Lost data should always be reported              |
+| `UNAUTHENTICATED` (*)     | `success` | Client-side authentication (similar to HTTP 401) |
+
+The statuses marked with (*) are not used by gRPC libraries and thus their actual meaning is contextual to the
+application.
+
+Also, the gRPC status code for a given transaction should be reported in the `transaction.result` field, so we can still
+detect an abnormal rate of a given status, much as we do with HTTP 4xx and 5xx errors.
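
The mapping above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from any agent: the names `grpc_outcome` and `SERVER_FAILURE_STATUSES` are hypothetical, the status is assumed to arrive as one of the 17 upper-case status names (or `None` when unavailable), and `error_reported` stands in for whatever error signal the instrumentation has.

```python
from typing import Optional

# Statuses that the table above reports as `failure` for server-side transactions.
SERVER_FAILURE_STATUSES = frozenset({
    "UNKNOWN",
    "DEADLINE_EXCEEDED",
    "RESOURCE_EXHAUSTED",
    "FAILED_PRECONDITION",
    "ABORTED",
    "INTERNAL",
    "UNAVAILABLE",
    "DATA_LOSS",
})


def grpc_outcome(status: Optional[str], is_transaction: bool,
                 error_reported: bool = False) -> str:
    """Return 'success' or 'failure' for a gRPC transaction or span."""
    if status is None:
        # No gRPC status available: fall back to the reported error.
        return "failure" if error_reported else "success"
    if status == "OK":
        return "success"
    if is_transaction:
        # Server side: only statuses that need attention from the server count as failures.
        return "failure" if status in SERVER_FAILURE_STATUSES else "success"
    # Client side (spans): anything other than OK is a failure.
    return "failure"
```

For instance, `grpc_outcome("NOT_FOUND", is_transaction=True)` yields `success`, while the same status on a span yields `failure`. Note that `UNKNOWN` maps to `failure` and never to the `unknown` outcome.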
diff --git a/specs/agents/tracing-instrumentation-http.md b/specs/agents/tracing-instrumentation-http.md
@@ -4,8 +4,8 @@ Agents should instrument HTTP request routers/handlers, starting a new transacti
 
 - The transaction `type` should be `request`.
 - The transaction `result` should be `HTTP Nxx`, where N is the first digit of the status code (e.g. `HTTP 4xx` for a 404)
-- The transaction `outcome` should be `"success"` for HTTP status codes < 500 and `"failure"` for status codes >= 500. \
-  Status codes in the 4xx range (client errors) are not considered a `failure` as the failure has not been caused by the application itself but by the caller.
+- The transaction `outcome` is set from the response status code (see [Outcome](#outcome))
+
   As there's no browser API to get the status code of a page load, the RUM agent always reports `"unknown"` for those transactions.
 - The transaction `name` should be aggregatable, such as the route or handler name. Examples:
     - `GET /users/{id}`
@@ -80,7 +80,18 @@ For outbound HTTP request spans we capture the following http-specific span cont
 
 - `http.url` (the target URL) \
   The captured URL should have the userinfo (username and password), if any, redacted.
-- `http.status_code` (the response status code) \
-  The span's `outcome` should be set to `"success"` if the status code is lower than 400 and to `"failure"` otherwise. 
-  If the request is aborted the `outcome` should be set to `unknown`.
+- `http.status_code` (the response status code)
+- `outcome` is set from the response status code (see [Outcome](#outcome) for details)
+
+## Outcome
+
+For HTTP transactions (from the server perspective), the transaction's `outcome` should be set to `"success"` if the
+status code is lower than 500 and to `"failure"` otherwise. Status codes in the 4xx range (client errors) are not
+considered a `failure` as the failure has not been caused by the application itself but by the caller.
+
+For HTTP spans (from the client perspective), the span's `outcome` should be set to `"success"` if the status code is
+lower than 400 and to `"failure"` otherwise.
+
+For both transactions and spans, if there is no HTTP status, we set `outcome` from the reported error:
 
+- `failure` if an error is reported
+- `success` otherwise
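
As a rough sketch of the HTTP rules above, the two perspectives differ only in the status threshold. The function name `http_outcome` and its parameters are hypothetical and chosen for illustration; a missing status is modelled here as `None`.

```python
from typing import Optional


def http_outcome(status_code: Optional[int], is_transaction: bool,
                 error_reported: bool = False) -> str:
    """Return 'success' or 'failure' for an HTTP transaction or span."""
    if status_code is None:
        # No HTTP status (IO error, redirect loop, ...): use the reported error.
        return "failure" if error_reported else "success"
    # Server side tolerates 4xx (caller's fault); client side does not.
    failure_threshold = 500 if is_transaction else 400
    return "success" if status_code < failure_threshold else "failure"
```

With this sketch a 404 response gives `success` for the transaction that served it and `failure` for the span that requested it.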
diff --git a/specs/agents/tracing-spans.md b/specs/agents/tracing-spans.md
@@ -14,13 +14,14 @@ for simpler and more performant UI queries.
 
 ### Span outcome
 
-The `outcome` property denotes whether the span represents a success or a failure.
-It supports the same values as `transaction.outcome`.
-The only semantic difference is that client errors set the `outcome` to `"failure"`.
-Agents should try to determine the outcome for spans created by auto instrumentation,
-which is especially important for exit spans (spans representing requests to other services).
+The `outcome` property denotes whether the span represents a success or a failure. It is used to compute the error rate
+of calls to external services (exit spans) made by the monitored application. It supports the same values as `transaction.outcome`.
 
-If an agent doesn't report the `outcome` (or reports `null`), the APM Server will set it based on `context.response.status_code`. If the status code is not available, then it will be set to `"unknown"`.
+This property is optional to preserve backwards compatibility, so agents may omit it or send a `null` value.
+
+If an agent does not report the `outcome` property (or uses a `null` value), then the APM Server sets the outcome from the
+HTTP response status if available, or to `unknown` otherwise. This provides a server-side fallback for existing
+agents that might not report `outcome`.
 
 While the transaction outcome lets you reason about the error rate from the service's point of view,
 other services might have a different perspective on that.
@@ -29,6 +30,22 @@ the error rate of service B is 100% from service A's perspective.
 However, as service B doesn't receive any requests, the error rate is 0% from service B's perspective.
 The `span.outcome` also allows reasoning about error rates of external services.
 
+The following protocols get their outcome from protocol-level attributes:
+
+- [gRPC](tracing-instrumentation-grpc.md#outcome)
+- [HTTP](tracing-instrumentation-http.md#outcome)
+
+For other protocols, we can default to the following behavior:
+
+- `failure` when an error is reported
+- `success` otherwise
+
+Also, while we encourage most instrumentations to create spans that have deterministic outcomes, there are a few
+cases in which we might still have to report `unknown` outcomes to avoid reporting misleading information:
+- Inferred spans created through a sampling profiler: those are not exit spans, and we can't know whether they should
+be reported as `failure` or `success` because we are unable to capture any errors.
+- External process execution: we can't know the `outcome` until the process has exited with an exit code.
+
 ### Outcome API
 
 Agents should expose an API to manually override the outcome.
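
A minimal sketch of how such an override could take priority over the instrumentation, assuming a hypothetical `Span` class with `set_outcome` and `end` methods (these names are illustrative and not taken from any agent):

```python
from typing import Optional

VALID_OUTCOMES = ("success", "failure", "unknown")


class Span:
    """Hypothetical span that tracks a user-set outcome separately."""

    def __init__(self) -> None:
        self.outcome = "unknown"
        self._user_outcome: Optional[str] = None

    def set_outcome(self, outcome: str) -> None:
        """Public API: an outcome set here wins over instrumentation."""
        if outcome not in VALID_OUTCOMES:
            raise ValueError(f"invalid outcome: {outcome!r}")
        self._user_outcome = outcome
        self.outcome = outcome

    def end(self, instrumentation_outcome: Optional[str] = None,
            error: Optional[BaseException] = None) -> None:
        """Called by the instrumentation when the span terminates."""
        if self._user_outcome is not None:
            return  # keep the user's choice
        if instrumentation_outcome is not None:
            self.outcome = instrumentation_outcome
        else:
            # Default behavior: failure when an error is reported, success otherwise.
            self.outcome = "failure" if error is not None else "success"
```

Keeping the user-set value in its own field means the instrumentation can call `end` unconditionally without needing to know whether the user intervened.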

diff --git a/specs/agents/tracing-transactions.md b/specs/agents/tracing-transactions.md
@@ -21,7 +21,17 @@ If an agent doesn't report the `outcome` (or reports `null`), the APM Server wil
   This may be the case when a user tracks a custom transaction without explicitly setting an outcome.
   For existing auto-instrumentations, agents should set the outcome either to `"failure"` or `"success"`.
 
-What counts as a failed or successful request depends on the protocol and does not depend on whether there are error documents associated with a transaction.
+What counts as a failed or successful request depends on the protocol.
+
+The following protocols get their outcome from protocol-level attributes:
+
+- [gRPC](tracing-instrumentation-grpc.md#outcome)
+- [HTTP](tracing-instrumentation-http.md#outcome)
+
+For other protocols, we can default to the following behavior:
+
+- `failure` when an error is reported
+- `success` otherwise
 
 #### Error rate
 

diff --git a/tests/agents/gherkin-specs/outcome.feature b/tests/agents/gherkin-specs/outcome.feature
@@ -0,0 +1,97 @@
+Feature: Outcome
+
+  # ---- user set outcome
+
+  Scenario: User set outcome on span has priority over instrumentation
+    Given an agent
+    And an active span
+    And user sets span outcome to 'failure'
+    And span terminates with outcome 'success'
+    Then span outcome is 'failure'
+
+  Scenario: User set outcome on transaction has priority over instrumentation
+    Given an agent
+    And an active transaction
+    And user sets transaction outcome to 'unknown'
+    And transaction terminates with outcome 'failure'
+    Then transaction outcome is 'unknown'
+
+  # ---- span & transaction outcome from reported errors
+
+  Scenario: span with error
+    Given an agent
+    And an active span
+    And span terminates with an error
+    Then span outcome is 'failure'
+
+  Scenario: span without error
+    Given an agent
+    And an active span
+    And span terminates without error
+    Then span outcome is 'success'
+
+  Scenario: transaction with error
+    Given an agent
+    And an active transaction
+    And transaction terminates with an error
+    Then transaction outcome is 'failure'
+
+  Scenario: transaction without error
+    Given an agent
+    And an active transaction
+    And transaction terminates without error
+    Then transaction outcome is 'success'
+
+  # ---- HTTP
+
+  @http
+  Scenario Outline: HTTP transaction and span outcome
+    Given an agent
+    And an HTTP transaction with <status> response code
+    Then transaction outcome is "<server>"
+    Given an HTTP span with <status> response code
+    Then span outcome is "<client>"
+    Examples:
+      | status | client  | server  |
+      | 100    | success | success |
+      | 200    | success | success |
+      | 300    | success | success |
+      | 400    | failure | success |
+      | 404    | failure | success |
+      | 500    | failure | failure |
+      | -1     | failure | failure |
+      # last row with negative status represents the case where the status is not available
+      # for example when an exception/error is thrown without status (IO error, redirect loop, ...)
+
+  # ---- gRPC
+
+  # reference spec : https://github.com/grpc/grpc/blob/master/doc/statuscodes.md
+
+  @grpc
+  Scenario Outline: gRPC transaction and span outcome
+    Given an agent
+    And a gRPC transaction with '<status>' status
+    Then transaction outcome is "<server>"
+    Given a gRPC span with '<status>' status
+    Then span outcome is "<client>"
+    Examples:
+      | status              | client  | server  |
+      | OK                  | success | success |
+      | CANCELLED           | failure | success |
+      | UNKNOWN             | failure | failure |
+      | INVALID_ARGUMENT    | failure | success |
+      | DEADLINE_EXCEEDED   | failure | failure |
+      | NOT_FOUND           | failure | success |
+      | ALREADY_EXISTS      | failure | success |
+      | PERMISSION_DENIED   | failure | success |
+      | RESOURCE_EXHAUSTED  | failure | failure |
+      | FAILED_PRECONDITION | failure | failure |
+      | ABORTED             | failure | failure |
+      | OUT_OF_RANGE        | failure | success |
+      | UNIMPLEMENTED       | failure | success |
+      | INTERNAL            | failure | failure |
+      | UNAVAILABLE         | failure | failure |
+      | DATA_LOSS           | failure | failure |
+      | UNAUTHENTICATED     | failure | success |
+      | n/a                 | failure | failure |
+    # last row with 'n/a' status represents the case where status is not available