
[Infra UI] Allow custom metrics calculation based on field values on node detail page charts #42687

Closed
skh opened this issue Aug 6, 2019 · 11 comments
Assignees
Labels
bug Fixes for quality problems that affect the customer experience Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services

Comments

@skh
Contributor

skh commented Aug 6, 2019

This issue came up during the implementation of #39282

AWS CloudWatch metrics, and by extension metricbeat's aws.ec2 metricset, contain fields that have to be interpreted differently depending on whether the user has enabled (and paid for) "detailed" CloudWatch monitoring:

NetworkOut:
The number of bytes sent out on all network interfaces by the instance. This metric identifies the volume of outgoing network traffic from a single instance.

The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.

Units: Bytes

(from https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html )

The affected metrics are NetworkIn, NetworkOut, DiskReadBytes, and DiskWriteBytes. They are mapped to the fields aws.ec2.network.in.bytes, aws.ec2.network.out.bytes, aws.ec2.diskio.read.bytes, and aws.ec2.diskio.write.bytes, respectively.

The information whether "detailed" monitoring is enabled is captured in the field aws.ec2.instance.monitoring.state.
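The adjustment described above can be sketched as a tiny helper (hypothetical; the Infra UI does not expose such a function):

```typescript
// Hypothetical helper illustrating the rate calculation described above.
// CloudWatch reports one value per period: 300s with basic monitoring,
// 60s with detailed monitoring (per aws.ec2.instance.monitoring.state,
// where "disabled" means only basic monitoring is active).
function bytesPerSecond(
  reportedBytes: number,
  monitoringState: 'disabled' | 'enabled'
): number {
  const periodSeconds = monitoringState === 'disabled' ? 300 : 60;
  return reportedBytes / periodSeconds;
}
```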

Currently there is no easy way to perform custom calculations on a metric that is fetched through TSVB models.

I did experiment with this code in the model definition (in kibana/x-pack/legacy/plugins/infra/server/lib/adapters/metrics/models/aws/aws_network_bytes.ts):

{
  id: 'tx',
  metrics: [
    {
      field: 'aws.ec2.network.out.bytes',
      id: 'max-net-out',
      type: InfraMetricModelMetricType.max,
    },
    {
      id: 'by-second-max-net-out',
      type: InfraMetricModelMetricType.calculation,
      variables: [
        { id: 'var-max', name: 'max', field: 'max-net-out' },
        {
          id: 'var-detailed-monitoring',
          name: 'detailed-monitoring',
          field: 'aws.ec2.instance.monitoring.state',
        },
      ],
      script: 'params.detailed-monitoring == "disabled" ? params.max / 300 : params.max / 60',
    },
  ],
  split_mode: 'everything',
}

This does not work, presumably because string comparison is not possible in the script field of a TSVB model. (The hyphenated variable name is likely also a problem: Painless identifiers cannot contain hyphens, so params.detailed-monitoring would be parsed as a subtraction rather than a single parameter reference.)


This feature request is about adding this functionality to the node detail page and the corresponding metrics GraphQL resolver, and performing the correct calculations for AWS metrics.

As long as this isn't implemented, we shouldn't show the affected metrics.

@skh added the Feature:Metrics UI and Team:Infra Monitoring UI - DEPRECATED labels Aug 6, 2019
@elasticmachine
Contributor

Pinging @elastic/infra-logs-ui

@skh
Contributor Author

skh commented Aug 6, 2019

It is possible that DiskWriteOps and DiskReadOps (aws.ec2.diskio.read.count and aws.ec2.diskio.write.count) are affected by this as well, though the documentation is unclear. (Cc: @kaiyan-sheng )

@skh
Contributor Author

skh commented Aug 6, 2019

@kaiyan-sheng Another idea, could this maybe be solved in metricbeat directly?

@simianhacker
Member

@skh What if we added an attribute to the metadata endpoint that indicated whether or not detailed monitoring is enabled? If we had that, the two approaches I could see are:

  1. Pass the metadata to the TSVB models, then use a template string to inject the correct value (60 or 300) into the script dynamically
  2. Return a series without the "scaling" and do it client-side based on the metadata.
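Approach 1 might look something like the following sketch. The metadata shape and the hasDetailedMonitoring attribute are assumptions (no such attribute exists yet); the point is that the divisor gets baked into the script string at model-build time, so the TSVB script never has to compare strings:

```typescript
// Hypothetical shapes; the real model types live in the Infra UI plugin.
enum InfraMetricModelMetricType {
  calculation = 'calculation',
}
interface NodeMetadata {
  hasDetailedMonitoring: boolean; // assumed new metadata attribute
}

// Build the calculation metric with the correct divisor (60 or 300)
// injected via a template string instead of a runtime comparison.
function buildRateMetric(metadata: NodeMetadata) {
  const periodSeconds = metadata.hasDetailedMonitoring ? 60 : 300;
  return {
    id: 'by-second-max-net-out',
    type: InfraMetricModelMetricType.calculation,
    variables: [{ id: 'var-max', name: 'max', field: 'max-net-out' }],
    script: `params.max / ${periodSeconds}`,
  };
}
```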

@kaiyan-sheng
Contributor

kaiyan-sheng commented Aug 7, 2019

It is possible that DiskWriteOps and DiskReadOps (aws.ec2.diskio.read.count and aws.ec2.diskio.write.count) are affected by this as well, though the documentation is unclear. (Cc: @kaiyan-sheng )

Yes, I tried yesterday to create an EC2 instance but couldn't get DiskWriteOps/DiskReadOps to be non-zero. I assume that for these two metrics, the definition of "a specified period of time" in the documentation is the same as for DiskReadBytes and DiskWriteBytes.

@kaiyan-sheng Another idea, could this maybe be solved in metricbeat directly?

@skh You mean calculating rate for all NetworkIn, NetworkOut, DiskReadBytes, DiskWriteBytes, DiskReadOps and DiskWriteOps? Since monitoringState is available, this shouldn't be hard.

@jasonrhodes
Member

You mean calculating rate for all NetworkIn, NetworkOut, DiskReadBytes, DiskWriteBytes, DiskReadOps and DiskWriteOps?

Yeah this sounds like the best solution since, as far as I can tell, we don't care about the "raw values" ever again. Then "whether the user has detailed monitoring" becomes completely irrelevant to all consumers of the data after it's been stored, if I understand correctly.

@skh
Contributor Author

skh commented Aug 8, 2019

Yeah this sounds like the best solution since, as far as I can tell, we don't care about the "raw values" ever again. Then "whether the user has detailed monitoring" becomes completely irrelevant to all consumers of the data after it's been stored, if I understand correctly.

Yes. I would also favour this solution.

@skh
Contributor Author

skh commented Aug 9, 2019

Another option (for now) would be to use a sum aggregation for these fields and always calculate the rate based on 300 seconds.

As we don't have variable bucket intervals yet (that would come with elastic/beats#12616) this would be correct in the following two cases:

  • user has basic monitoring and the 300s reporting period configured in metricbeat for the aws module (the default). In this case we have one value in our 300-second bucket and divide it by 300 to get the rate.
  • user has detailed monitoring and a 60s reporting period configured in metricbeat for the aws module. In this case we have five values in our 300-second bucket; we sum them up and then divide by 300 to get the rate. It would be the average rate over the last 300 seconds instead of the last 60 seconds, but it would be correct.

In the following case it would be incorrect:

  • user has detailed monitoring but still has the default 300s reporting period configured in metricbeat. In this case we have one value in our 300-second bucket, but it is the value for the last 60 seconds, not the last 300. We would divide by 300 and the result would be one-fifth of what it should be.

We could point out in documentation that the reporting period for the metricbeat aws module has to be adjusted when detailed monitoring is available.
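The three cases above can be checked numerically. A minimal sketch, assuming a constant true rate of 10 bytes/s and a 300-second query bucket (each reported value covers one CloudWatch period: 300s basic, 60s detailed):

```typescript
// Sum all reported values in a 300s bucket and divide by 300,
// as in the proposal above.
const sumOverBucket = (valuesInBucket: number[]): number =>
  valuesInBucket.reduce((a, b) => a + b, 0) / 300;

// Case 1: basic monitoring, 300s metricbeat period -> one 3000-byte value.
const case1 = sumOverBucket([3000]); // 10 bytes/s, correct
// Case 2: detailed monitoring, 60s period -> five 600-byte values.
const case2 = sumOverBucket([600, 600, 600, 600, 600]); // 10 bytes/s, correct
// Case 3: detailed monitoring but 300s metricbeat period -> one 600-byte
// value, because CloudWatch reported only the last 60 seconds.
const case3 = sumOverBucket([600]); // 2 bytes/s, one-fifth of the true rate
```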

@simianhacker @kaiyan-sheng what do you think?

@sgrodzicki added the bug label (Fixes for quality problems that affect the customer experience) Aug 12, 2019
@skh
Contributor Author

skh commented Aug 16, 2019

After further discussion, this is now implemented as:

Leave the bucket size hard-coded at >=5m

First, use a sum aggregation in every bucket. That way, if the user has detailed monitoring and a reporting period of 60s, we sum up even in the smallest possible buckets. If the user zooms out and the TSVB endpoint therefore uses larger buckets (note that the value we use is >=5m, not 5m), we also sum up if the user has only basic monitoring and a reporting period of 300s.

Next, calculate the cumulative sum in each bucket, i.e. each bucket will have the sum of all buckets before it, and its own value added on top. This turns the metric into a monotonically increasing counter.

Finally, take the derivative with unit: 1s to get the per-second rate regardless of bucket sizes.

Note that this will still be wrong if the user has detailed monitoring enabled in AWS but configured a 300s reporting period in metricbeat.
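The sum → cumulative sum → derivative pipeline above can be simulated numerically. A sketch, not the actual TSVB implementation; bucket widths are in seconds and values are per-bucket byte sums:

```typescript
// Simulate the pipeline: per-bucket sums -> cumulative_sum -> derivative
// with unit '1s'. In TSVB the derivative step divides the bucket-to-bucket
// difference by the bucket width to get a per-second rate.
function perSecondRates(bucketSums: number[], bucketSeconds: number): number[] {
  // cumulative_sum: each bucket carries the running total so far,
  // turning the metric into a monotonically increasing counter
  const cumulative: number[] = [];
  let running = 0;
  for (const v of bucketSums) {
    running += v;
    cumulative.push(running);
  }
  // derivative with unit '1s': difference between adjacent buckets,
  // scaled by the bucket width to a per-second rate
  return cumulative.slice(1).map((v, i) => (v - cumulative[i]) / bucketSeconds);
}

// Three 300s buckets: 3000, 3000, then 6000 bytes summed per bucket
const rates = perSecondRates([3000, 3000, 6000], 300); // [10, 20]
```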

@kaiyan-sheng
Contributor

Hi @skh, the PR for calculating rate metrics in the ec2 metricset has been merged 😬

@jasonrhodes
Member

@skh @simianhacker with the changes that landed in elastic/beats#13203 can we remove our workaround and just use the raw numbers? Will those beats changes have any effect on our workaround?
