Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[APM] Additional data telemetry changes #71112

Merged
merged 16 commits into from
Jul 14, 2020
Merged

Conversation

smith
Copy link
Contributor

@smith smith commented Jul 8, 2020

  • Add a date range of now-1d to the cloud query
  • Add a timeout of 5m to all queries (we'll investigate using async queries to improve this in the future.)
  • Factor out the date range filter into a variable
  • Fix a bug with the indices_stats tasks when it doesn't return data
  • Delete the merge mapping script and update instructions
  • Add a schema property to the usage collector
  • Add a complete JSON snapshot of our mapping
  • Add cardinality for client.geo.country_iso_code

@smith smith requested a review from a team as a code owner July 8, 2020 15:59
@botelastic botelastic bot added the Team:APM All issues that need APM UI Team support label Jul 8, 2020
@elasticmachine
Copy link
Contributor

Pinging @elastic/apm-ui (Team:apm)

@smith smith added v7.9.0 release_note:skip Skip the PR/issue when compiling release notes labels Jul 8, 2020
@dgieselaar
Copy link
Member

@smith I'm a bit worried about 10s. I expect some of the tasks to take significantly longer for our bigger users. We are also storing took for the tasks if I remember correctly. You could have a look at the telemetry data to see whether 10s is a good default.

@smith
Copy link
Contributor Author

smith commented Jul 8, 2020

@smith I'm a bit worried about 10s. I expect some of the tasks to take significantly longer for our bigger users. We are also storing took for the tasks if I remember correctly. You could have a look at the telemetry data to see whether 10s is a good default.

There were a few that were over 350s, and you were right, many in the 30s range. I'm going to set it to 5m.

@smith smith force-pushed the nls/bounded-tasks branch from 4c6fae0 to 9de1027 Compare July 8, 2020 20:46
* Add a date range of `now-1d` to the cloud query
* Add a timeout of 5m to all queries (we'll investigate using async queries to improve this in the future.)
* Factor out the date range filter into a variable
* Fix a bug with the `indices_stats` tasks when it doesn't return data
* Update the merge mapping script to create a migration file
@smith smith force-pushed the nls/bounded-tasks branch from 9de1027 to 043c722 Compare July 8, 2020 20:47
Copy link
Member

@sorenlouv sorenlouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

@dgieselaar
Copy link
Member

dgieselaar commented Jul 9, 2020

@smith can you add telemetry for aggregated transactions? I'd like to record the number of metric documents we can expect with the current implementation. We can simulate this by requesting data for 1m, adding a composite aggregation on a number of fields and counting the number of buckets. The resulting count will be an approximation of the amount of metric documents that will be created. We should record both the expected metric document count plus the transaction count for that time range.

I'd like to record several combinations:

  • current_implementation: observer.name, transaction.name, transaction.result, transaction.type, agent.name, service.environment, service.version, host.hostname, container.id, kubernetes.pod.name, user_agent.name
  • no_observer_name: transaction.name, transaction.result, transaction.type, agent.name, service.environment, service.version, host.hostname, container.id, kubernetes.pod.name, user_agent.name
  • no_rum: observer.name, transaction.name, transaction.result, transaction.type, agent.name, service.environment, service.version, host.hostname, container.id, kubernetes.pod.name (add a filter to exclude RUM data)
  • no_rum_no_observer_name: observer.name, transaction.name, transaction.result, transaction.type, agent.name, service.environment, service.version, host.hostname, container.id, kubernetes.pod.name (add a filter to exclude RUM data)
  • only_rum: observer.name, transaction.name, transaction.result, transaction.type, agent.name, service.environment, service.version, host.hostname, container.id, kubernetes.pod.name, user_agent.name (add a filter to exclude all non-RUM data)
  • only_rum_no_observer_name: transaction.name, transaction.result, transaction.type, agent.name, service.environment, service.version, host.hostname, container.id, kubernetes.pod.name, user_agent.name (add a filter to exclude all non-RUM data)

I'd also like to record the cardinality of client.geo.country_iso_code separately.

For an example of how to create the composite aggregation and paginate through the buckets, see scripts/aggregate-latency-metrics/index.ts.

@smith smith changed the title [APM] Add more bounds to telemetry tasks [APM] Additional data telemetry changes Jul 9, 2020
@smith
Copy link
Contributor Author

smith commented Jul 9, 2020

@dgieselaar I should be able to add all of that stuff on Monday but I'll do it in another PR.

@smith smith requested a review from a team July 9, 2020 22:28
Copy link
Contributor

@TinaHeiligers TinaHeiligers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled the code and ran the telemetry checker on the collector schema. Unfortunately, I'm getting type errors from the typescript parser that I can't seem to work around. I've asked the team to pitch in and help get this over the finish line.

Copy link
Member

@afharo afharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some NITs re the schema property 👍

@@ -55,15 +55,9 @@ The mapping for the telemetry data is here under `stack_stats.kibana.plugins.apm

The mapping used there can be generated with the output of the [`getTelemetryMapping`](../common/apm_telemetry.ts) function.

To make a change to the mapping, edit this function, run the tests to update the snapshots, then use the `merge_telemetry_mapping` script to merge the data into the telemetry repository.
The `schema` property of the `makeUsageCollector` call in the [`createApmTelemetry` function](../server/lib/apm_telemetry/index.ts) contains the output of `getTelemetryMapping`. Pull requests with changes to the schema should automatically notify the Telemetry team so they can update the mapping in the telemetry clusters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: In order for the Kibana Telemetry team to be notified, devs need to run node scripts/telemetry_check.js --fix. It will update the file containing all the schemas (and owned by the Telemetry team).

At the moment, APM is excluded so it doesn't fail (

"plugins/apm/server/lib/apm_telemetry/index.ts",
) and listed in this issue as pending to be fixed: #70180

I tried with removing the exclusion but it fails.
For progress, I'd say we merge this PR and tackle this issue separately 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the reason it's failing is because file with the usage collector call is parsed directly so the whole schema has to be there as a literal, whereas we're calling a function schema: getApmTelemetryMapping(), which it fails to parse as a valid schema.

Copy link
Contributor

@TinaHeiligers TinaHeiligers Jul 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smith there are two other issues I've found thus far while trying to resolve the telemetry-check failure:

  1. The schema needs to be what's displayed in the doc (as types, not actual values) and that the usage collector needs to be explicitly typed.
Here's the schema I got from your mapping
export const apmTelemetrySchema = {
  agents: {
    dotnet: {
      agent: {
        version: { type: 'keyword' },
      },
      service: {
        framework: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        language: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        runtime: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
      },
    },
    go: {
      agent: {
        version: { type: 'keyword' },
      },
      service: {
        framework: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        language: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        runtime: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
      },
    },
    java: {
      agent: {
        version: { type: 'keyword' },
      },
      service: {
        framework: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        language: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        runtime: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
      },
    },
    'js-base': {
      agent: {
        version: { type: 'keyword' },
      },
      service: {
        framework: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        language: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        runtime: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
      },
    },
    nodejs: {
      agent: {
        version: { type: 'keyword' },
      },
      service: {
        framework: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        language: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        runtime: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
      },
    },
    python: {
      agent: {
        version: { type: 'keyword' },
      },
      service: {
        framework: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        language: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        runtime: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
      },
    },
    ruby: {
      agent: {
        version: { type: 'keyword' },
      },
      service: {
        framework: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        language: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        runtime: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
      },
    },
    'rum-js': {
      agent: {
        version: { type: 'keyword' },
      },
      service: {
        framework: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        language: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
        runtime: {
          composite: { type: 'keyword' },
          name: { type: 'keyword' },
          version: { type: 'keyword' },
        },
      },
    },
  },
  cloud: {
    availability_zone: { type: 'keyword' },
    provider: { type: 'keyword' },
    region: { type: 'keyword' },
  },
  counts: {
    agent_configuration: {
      all: { type: 'long' },
    },
    error: {
      '1d': { type: 'long' },
      all: { type: 'long' },
    },
    max_error_groups_per_service: {
      '1d': { type: 'long' },
    },
    max_transaction_groups_per_service: {
      '1d': { type: 'long' },
    },
    metric: {
      '1d': { type: 'long' },
      all: { type: 'long' },
    },
    onboarding: {
      '1d': { type: 'long' },
      all: { type: 'long' },
    },
    services: {
      '1d': { type: 'long' },
    },
    sourcemap: {
      '1d': { type: 'long' },
      all: { type: 'long' },
    },
    span: {
      '1d': { type: 'long' },
      all: { type: 'long' },
    },
    traces: {
      '1d': { type: 'long' },
    },
    transaction: {
      '1d': { type: 'long' },
      all: { type: 'long' },
    },
  },
  cardinality: {
    user_agent: {
      original: {
        all_agents: {
          '1d': { type: 'long' },
        },
        rum: {
          '1d': { type: 'long' },
        },
      },
    },
    transaction: {
      name: {
        all_agents: {
          '1d': { type: 'long' },
        },
        rum: {
          '1d': { type: 'long' },
        },
      },
    },
  },
  has_any_services: {
    type: 'boolean',
  },
  indices: {
    all: {
      total: {
        docs: { count: { type: 'long' } },
        store: { size_in_bytes: { type: 'long' } },
      },
    },
    shards: { total: { type: 'long' } },
  },
  integrations: {
    ml: { all_jobs_count: { type: 'long' } },
  },
  retainment: {
    error: { ms: { type: 'long' } },
    metric: { ms: { type: 'long' } },
    onboarding: { ms: { type: 'long' } },
    span: { ms: { type: 'long' } },
    transaction: { ms: { type: 'long' } },
  },
  services_per_agent: {
    dotnet: { type: 'long' },
    go: { type: 'long' },
    java: { type: 'long' },
    'js-base': { type: 'long' },
    nodejs: { type: 'long' },
    python: { type: 'long' },
    ruby: { type: 'long' },
    'rum-js': { type: 'long' },
  },
  tasks: {
    agent_configuration: { took: { ms: { type: 'long' } } },
    agents: { took: { ms: { type: 'long' } } },
    cardinality: { took: { ms: { type: 'long' } } },
    cloud: { took: { ms: { type: 'long' } } },
    groupings: { took: { ms: { type: 'long' } } },
    indices_stats: { took: { ms: { type: 'long' } } },
    integrations: { took: { ms: { type: 'long' } } },
    processor_events: { took: { ms: { type: 'long' } } },
    services: { took: { ms: { type: 'long' } } },
    versions: { took: { ms: { type: 'long' } } },
  },
  version: {
    apm_server: {
      major: { type: 'long' },
      minor: { type: 'long' },
      patch: { type: 'long' },
    },
  },
};

(note: there are a few improvements we also need to make to the allowed schema types, including allowing ignore_above for keyword fields)
2. The usageCollector is complaining of a missing type, even though type: "apm" is there. I think it's because the collector isn't typed explicitly but we'll figure all these issues out in #70180

@smith
Copy link
Contributor Author

smith commented Jul 14, 2020

@smith can you add telemetry for aggregated transactions?

@dgieselaar yes, but opened #71593 to do it in a separate PR.

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Build metrics

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@smith smith merged commit c6705e8 into elastic:master Jul 14, 2020
@smith smith deleted the nls/bounded-tasks branch July 14, 2020 13:16
smith added a commit to smith/kibana that referenced this pull request Jul 14, 2020
* Add a date range of `now-1d` to the cloud query
* Add a timeout of 5m to all queries (we'll investigate using async queries to improve this in the future.)
* Factor out the date range filter into a variable
* Fix a bug with the `indices_stats` tasks when it doesn't return data
* Update the merge mapping script to create a migration file
smith added a commit that referenced this pull request Jul 14, 2020
* Add a date range of `now-1d` to the cloud query
* Add a timeout of 5m to all queries (we'll investigate using async queries to improve this in the future.)
* Factor out the date range filter into a variable
* Fix a bug with the `indices_stats` tasks when it doesn't return data
* Update the merge mapping script to create a migration file
gmmorris added a commit to gmmorris/kibana that referenced this pull request Jul 14, 2020
* master: (21 commits)
  [Maps] 7.9 design improvements (elastic#71563)
  [ML] Changing all calls to ML endpoints to use internal user (elastic#70487)
  [eventLog] prevent log writing when initialization fails (elastic#71339)
  [Observability] landing page always being displayed (elastic#71494)
  [IM] Address data stream copy feedback (elastic#71615)
  [Logs UI] Anomalies page dataset filtering (elastic#71110)
  [data.search.aggs] Remove `use_field_mapping` from top hits agg (elastic#71168)
  [ML] Anomaly swim lane embeddable navigation and filter actions (elastic#71082)
  Fixes typo in siem_cloudtrail job description (elastic#71569)
  Require granted API Keys to have a name (elastic#71623)
  Update  getUsageForCollection (elastic#71609)
  Only fetch saved elements once (elastic#71310)
  [SecuritySolution][Resolver] Adding siem index and guarding process ancestry (elastic#71570)
  [APM] Additional data telemetry changes (elastic#71112)
  [Visualize] Fix export table for table export links (elastic#71249)
  [Search] Server side search API (elastic#70446)
  use inclusive language (elastic#71607)
  [Security Solution] Hide timeline footer when Resolver is open (elastic#71516)
  [Index template wizard] Remove shadow and use border for components panels (elastic#71606)
  [ML] Kibana API endpoint for histogram chart data (elastic#70976)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_note:skip Skip the PR/issue when compiling release notes Team:APM All issues that need APM UI Team support v7.9.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants