
[Infra UI] Allow custom metrics calculation based on field values on node detail page charts #42687

Closed
skh opened this issue Aug 6, 2019 · 11 comments
Assignees
Labels
bug Fixes for quality problems that affect the customer experience Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services

Comments

@skh
Contributor

skh commented Aug 6, 2019

This issue came up during the implementation of #39282

AWS CloudWatch metrics, and by extension metricbeat's aws.ec2 metricset, contain fields that have to be interpreted differently depending on whether the user has enabled (and paid for) "detailed" CloudWatch monitoring:

NetworkOut:
The number of bytes sent out on all network interfaces by the instance. This metric identifies the volume of outgoing network traffic from a single instance.

The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.

Units: Bytes

(from https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html )

The affected metrics are NetworkIn, NetworkOut, DiskReadBytes, and DiskWriteBytes. They are mapped to the fields aws.ec2.network.in.bytes, aws.ec2.network.out.bytes, aws.ec2.diskio.read.bytes, and aws.ec2.diskio.write.bytes, respectively.

The information whether "detailed" monitoring is enabled is captured in the field aws.ec2.instance.monitoring.state.
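The adjustment described above can be sketched as a tiny helper (hypothetical; the Infra UI does not expose such a function):

```typescript
// Hypothetical helper illustrating the rate calculation described above.
// CloudWatch reports one value per period: 300s with basic monitoring,
// 60s with detailed monitoring (per aws.ec2.instance.monitoring.state,
// where "disabled" means only basic monitoring is active).
function bytesPerSecond(
  reportedBytes: number,
  monitoringState: 'disabled' | 'enabled'
): number {
  const periodSeconds = monitoringState === 'disabled' ? 300 : 60;
  return reportedBytes / periodSeconds;
}
```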

Currently there is no easy way to perform custom calculations on a metric that is fetched through TSVB models.

I did experiment with this code in the model definition (in kibana/x-pack/legacy/plugins/infra/server/lib/adapters/metrics/models/aws/aws_network_bytes.ts):

{
  id: 'tx',
  metrics: [
    {
      field: 'aws.ec2.network.out.bytes',
      id: 'max-net-out',
      type: InfraMetricModelMetricType.max,
    },
    {
      id: 'by-second-max-net-out',
      type: InfraMetricModelMetricType.calculation,
      variables: [
        { id: 'var-max', name: 'max', field: 'max-net-out' },
        {
          id: 'var-detailed-monitoring',
          name: 'detailed-monitoring',
          field: 'aws.ec2.instance.monitoring.state',
        },
      ],
      script: 'params.detailed-monitoring == "disabled" ? params.max / 300 : params.max / 60',
    },
  ],
  split_mode: 'everything',
}

This does not work, presumably because string comparison is not possible in the script field of a TSVB model. (The hyphenated variable name is likely also a problem: Painless identifiers cannot contain hyphens, so params.detailed-monitoring would be parsed as a subtraction rather than a single parameter reference.)


This feature request is about adding this functionality to the node detail page and the corresponding metrics GraphQL resolver, and performing the correct calculations for AWS metrics.

As long as this isn't implemented, we shouldn't show the affected metrics.

@skh added the Feature:Metrics UI and Team:Infra Monitoring UI - DEPRECATED labels Aug 6, 2019
@elasticmachine
Contributor

Pinging @elastic/infra-logs-ui

@skh
Contributor Author

skh commented Aug 6, 2019

It is possible that DiskWriteOps and DiskReadOps (aws.ec2.diskio.read.count and aws.ec2.diskio.write.count) are affected by this as well, though the documentation is unclear. (Cc: @kaiyan-sheng )

@skh
Contributor Author

skh commented Aug 6, 2019

@kaiyan-sheng Another idea, could this maybe be solved in metricbeat directly?

@simianhacker
Member

@skh What if we added an attribute to the metadata endpoint that indicated whether or not detailed monitoring is enabled? If we had that, the two approaches I could see are:

  1. Pass the metadata to the TSVB models, then use a template string to inject the correct value (60 or 300) into the script dynamically
  2. Return a series without the "scaling" and do it client-side based on the metadata.
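Approach 1 might look something like the following sketch. The metadata shape and the hasDetailedMonitoring attribute are assumptions (no such attribute exists yet); the point is that the divisor gets baked into the script string at model-build time, so the TSVB script never has to compare strings:

```typescript
// Hypothetical shapes; the real model types live in the Infra UI plugin.
enum InfraMetricModelMetricType {
  calculation = 'calculation',
}
interface NodeMetadata {
  hasDetailedMonitoring: boolean; // assumed new metadata attribute
}

// Build the calculation metric with the correct divisor (60 or 300)
// injected via a template string instead of a runtime comparison.
function buildRateMetric(metadata: NodeMetadata) {
  const periodSeconds = metadata.hasDetailedMonitoring ? 60 : 300;
  return {
    id: 'by-second-max-net-out',
    type: InfraMetricModelMetricType.calculation,
    variables: [{ id: 'var-max', name: 'max', field: 'max-net-out' }],
    script: `params.max / ${periodSeconds}`,
  };
}
```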

@kaiyan-sheng
Contributor

kaiyan-sheng commented Aug 7, 2019

It is possible that DiskWriteOps and DiskReadOps (aws.ec2.diskio.read.count and aws.ec2.diskio.write.count) are affected by this as well, though the documentation is unclear. (Cc: @kaiyan-sheng )

Yes, I tried yesterday to create an EC2 instance but couldn't get DiskWriteOps/DiskReadOps to be non-zero. I assume that for these two metrics, the definition of "a specified period of time" in the documentation is the same as for DiskReadBytes and DiskWriteBytes.

@kaiyan-sheng Another idea, could this maybe be solved in metricbeat directly?

@skh You mean calculating rate for all NetworkIn, NetworkOut, DiskReadBytes, DiskWriteBytes, DiskReadOps and DiskWriteOps? Since monitoringState is available, this shouldn't be hard.

@jasonrhodes
Member

You mean calculating rate for all NetworkIn, NetworkOut, DiskReadBytes, DiskWriteBytes, DiskReadOps and DiskWriteOps?

Yeah this sounds like the best solution since, as far as I can tell, we don't care about the "raw values" ever again. Then "whether the user has detailed monitoring" becomes completely irrelevant to all consumers of the data after it's been stored, if I understand correctly.

@skh
Contributor Author

skh commented Aug 8, 2019

Yeah this sounds like the best solution since, as far as I can tell, we don't care about the "raw values" ever again. Then "whether the user has detailed monitoring" becomes completely irrelevant to all consumers of the data after it's been stored, if I understand correctly.

Yes. I would also favour this solution.

@skh
Contributor Author

skh commented Aug 9, 2019

Another option (for now) would be to use a sum aggregation for these fields and always calculate the rate based on 300 seconds.

As we don't have variable bucket intervals yet (that would come with elastic/beats#12616) this would be correct in the following two cases:

  • user has basic monitoring and the 300s reporting period configured in metricbeat for the aws module (the default). In this case we have one value in our 300-second bucket and divide it by 300 to get the rate.
  • user has detailed monitoring and a 60s reporting period configured in metricbeat for the aws module. In this case we have five values in our 300-second bucket; we sum them up and then divide by 300 to get the rate. It would be the average rate over the last 300 seconds instead of the last 60 seconds, but it would be correct.

In the following case it would be incorrect:

  • user has detailed monitoring but still has the default 300s reporting period configured in metricbeat. In this case we have one value in our 300-second bucket, but it is the value for the last 60 seconds, not the last 300. We would divide by 300 and the result would be one-fifth of what it should be.

We could point out in documentation that the reporting period for the metricbeat aws module has to be adjusted when detailed monitoring is available.
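The three cases above can be checked numerically. A minimal sketch, assuming a constant true rate of 10 bytes/s and a 300-second query bucket (each reported value covers one CloudWatch period: 300s basic, 60s detailed):

```typescript
// Sum all reported values in a 300s bucket and divide by 300,
// as in the proposal above.
const sumOverBucket = (valuesInBucket: number[]): number =>
  valuesInBucket.reduce((a, b) => a + b, 0) / 300;

// Case 1: basic monitoring, 300s metricbeat period -> one 3000-byte value.
const case1 = sumOverBucket([3000]); // 10 bytes/s, correct
// Case 2: detailed monitoring, 60s period -> five 600-byte values.
const case2 = sumOverBucket([600, 600, 600, 600, 600]); // 10 bytes/s, correct
// Case 3: detailed monitoring but 300s metricbeat period -> one 600-byte
// value, because CloudWatch reported only the last 60 seconds.
const case3 = sumOverBucket([600]); // 2 bytes/s, one-fifth of the true rate
```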

@simianhacker @kaiyan-sheng what do you think?

@sgrodzicki added the bug label (Fixes for quality problems that affect the customer experience) Aug 12, 2019
@skh
Contributor Author

skh commented Aug 16, 2019

After further discussion, this is now implemented as:

Leave the bucket size hard-coded at >=5m

First, use a sum aggregation in every bucket. That way, if the user has detailed monitoring and a reporting period of 60s, we sum up even in the smallest possible buckets. If the user zooms out and the TSVB endpoint therefore uses larger buckets (note that the value we use is >=5m, not 5m), we also sum up if the user has only basic monitoring and a reporting period of 300s.

Next, calculate the cumulative sum in each bucket, i.e. each bucket will have the sum of all buckets before it, and its own value added on top. This turns the metric into a monotonically increasing counter.

Finally, take the derivative with unit: 1s to get the per-second rate regardless of bucket sizes.

Note that this will still be wrong if the user has detailed monitoring enabled in AWS but configured a 300s reporting period in metricbeat.
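The sum → cumulative sum → derivative pipeline above can be simulated numerically. A sketch, not the actual TSVB implementation; bucket widths are in seconds and values are per-bucket byte sums:

```typescript
// Simulate the pipeline: per-bucket sums -> cumulative_sum -> derivative
// with unit '1s'. In TSVB the derivative step divides the bucket-to-bucket
// difference by the bucket width to get a per-second rate.
function perSecondRates(bucketSums: number[], bucketSeconds: number): number[] {
  // cumulative_sum: each bucket carries the running total so far,
  // turning the metric into a monotonically increasing counter
  const cumulative: number[] = [];
  let running = 0;
  for (const v of bucketSums) {
    running += v;
    cumulative.push(running);
  }
  // derivative with unit '1s': difference between adjacent buckets,
  // scaled by the bucket width to a per-second rate
  return cumulative.slice(1).map((v, i) => (v - cumulative[i]) / bucketSeconds);
}

// Three 300s buckets: 3000, 3000, then 6000 bytes summed per bucket
const rates = perSecondRates([3000, 3000, 6000], 300); // [10, 20]
```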

@kaiyan-sheng
Contributor

Hi @skh, the PR for calculating rate metrics in the ec2 metricset has been merged 😬

@jasonrhodes
Member

@skh @simianhacker with the changes that landed in elastic/beats#13203 can we remove our workaround and just use the raw numbers? Will those beats changes have any effect on our workaround?
