
Collect cloudwatch stats in a more efficient and cost effective manner #5544

Merged
merged 24 commits into master from feature/5420
Apr 23, 2019

Conversation

glinton
Contributor

@glinton glinton commented Mar 7, 2019

Resolves #5420
Resolves #5609
Closes #3255

Due to the structure of the data returned by GetMetricData (as opposed to GetMetricStatistics), metrics will no longer be tagged with the unit type.

This version makes a single call to the AWS API and creates metrics based on the dimensions and metric names specified in the config file.

Config:

  [[inputs.cloudwatch.metrics]]
    names = ["NetworkIn"]
    [[inputs.cloudwatch.metrics.dimensions]]
      name = "AutoScalingGroupName"
      value = "k8s-ci"
  [[inputs.cloudwatch.metrics]]
    names = ["NetworkOut"]
    [[inputs.cloudwatch.metrics.dimensions]]
      name = "AutoScalingGroupName"
      value = "k8s-frontline"

Output:

cloudwatch_aws_ec2,AutoScalingGroupName=k8s-frontline,region=us-east-1 network_out_sum=6625229694,network_out_average=3312614847,network_out_maximum=3345284533,network_out_minimum=3279945161 1551919140000000000
cloudwatch_aws_ec2,AutoScalingGroupName=k8s-ci,region=us-east-1 network_in_minimum=1167141,network_in_sum=45580787,network_in_average=15193595.666666666,network_in_maximum=42813872 1551919140000000000

Todo:

  • tests
  • handle pagination
  • handle window better

@glinton glinton added feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin area/aws AWS plugins including cloudwatch, ecs, kinesis labels Mar 7, 2019
@glinton glinton added this to the 1.11.0 milestone Mar 7, 2019
@glinton glinton changed the title Collect cloudwatch stats in a more efficient and cost effective manner [WIP] Collect cloudwatch stats in a more efficient and cost effective manner Mar 7, 2019
@glinton glinton changed the title [WIP] Collect cloudwatch stats in a more efficient and cost effective manner Collect cloudwatch stats in a more efficient and cost effective manner Mar 8, 2019
@glinton glinton changed the title Collect cloudwatch stats in a more efficient and cost effective manner [WIP] Collect cloudwatch stats in a more efficient and cost effective manner Mar 8, 2019
@CpuID

CpuID commented Mar 22, 2019

Due to the structure of the data returned by GetMetricData (as opposed to GetMetricStatistics) metrics will no longer be tagged with the unit type.

❤️ I planned to rename this out anyway tbh

@danielnelson
Contributor

@glinton regarding the unit type, I'm not super attached to it but it seems we still know that information when we make the query, are you still thinking to remove it?

@CpuID CpuID left a comment

@glinton minor tweaks :)

@CpuID

CpuID commented Mar 22, 2019

@glinton regarding the unit type, I'm not super attached to it but it seems we still know that information when we make the query, are you still thinking to remove it?

I think the main upside to retaining it would be namespace compatibility for existing users. i.e., a user upgrades and things "just work" in terms of other subsystems/dashboards that expect a pre-existing metric namespace.

Maybe having a feature flag to disable the units would be extremely nice, though...

@glinton glinton self-assigned this Mar 25, 2019
@glinton
Contributor Author

glinton commented Mar 25, 2019

I'm not certain that we know the unit type at query time. What about adding a config option to set a unit type, since the user may know what units a metric/namespace uses? If it's not set, there would be no unit type tag.

@danielnelson
Contributor

Don't worry about the unit tag then; we will add it to the release notes. There are other ways one could manually attach tags.

@CpuID

CpuID commented Mar 25, 2019

@glinton can you drop some "napkin math" in here re how you foresee the "cost effective manner" part playing out?

background: I've been running this branch in "protoduction" for a few days, reading a reasonable volume of metrics from CW:

(4 inputs.cloudwatch stanzas: EBS (5m), ECS (1m), EC2 (5m), RDS (5m); most configured selectively, ECS with all dimensions/statistics)

2019-03-25T19:00:03Z D! [outputs.graphite] wrote batch of 352 metrics in 72.25408ms
2019-03-25T19:00:03Z D! [outputs.graphite] buffer fullness: 1 / 10000 metrics. 
2019-03-25T19:01:00Z D! [outputs.graphite] wrote batch of 593 metrics in 106.623415ms
2019-03-25T19:01:00Z D! [outputs.graphite] buffer fullness: 0 / 10000 metrics. 
2019-03-25T19:02:03Z D! [outputs.graphite] wrote batch of 664 metrics in 178.746237ms
2019-03-25T19:02:03Z D! [outputs.graphite] buffer fullness: 3 / 10000 metrics. 
2019-03-25T19:03:03Z D! [outputs.graphite] wrote batch of 636 metrics in 109.464365ms
2019-03-25T19:03:03Z D! [outputs.graphite] buffer fullness: 1 / 10000 metrics. 
2019-03-25T19:04:01Z D! [outputs.graphite] wrote batch of 376 metrics in 63.310413ms
2019-03-25T19:04:01Z D! [outputs.graphite] buffer fullness: 3 / 10000 metrics. 
2019-03-25T19:05:01Z D! [outputs.graphite] wrote batch of 260 metrics in 199.167616ms
2019-03-25T19:05:01Z D! [outputs.graphite] buffer fullness: 5 / 10000 metrics. 
2019-03-25T19:06:04Z D! [outputs.graphite] wrote batch of 772 metrics in 115.374741ms
2019-03-25T19:06:04Z D! [outputs.graphite] buffer fullness: 2 / 10000 metrics. 
2019-03-25T19:07:01Z D! [outputs.graphite] wrote batch of 481 metrics in 52.644945ms
2019-03-25T19:07:01Z D! [outputs.graphite] buffer fullness: 1 / 10000 metrics. 
2019-03-25T19:08:03Z D! [outputs.graphite] wrote batch of 689 metrics in 212.339554ms
2019-03-25T19:08:03Z D! [outputs.graphite] buffer fullness: 4 / 10000 metrics. 
2019-03-25T19:09:03Z D! [outputs.graphite] wrote batch of 467 metrics in 86.771158ms
2019-03-25T19:09:03Z D! [outputs.graphite] buffer fullness: 3 / 10000 metrics. 
2019-03-25T19:10:01Z D! [outputs.graphite] wrote batch of 160 metrics in 39.880934ms
2019-03-25T19:10:01Z D! [outputs.graphite] buffer fullness: 1 / 10000 metrics. 
2019-03-25T19:11:01Z D! [outputs.graphite] wrote batch of 697 metrics in 163.696045ms
2019-03-25T19:11:01Z D! [outputs.graphite] buffer fullness: 4 / 10000 metrics. 
2019-03-25T19:12:02Z D! [outputs.graphite] wrote batch of 592 metrics in 171.005482ms
2019-03-25T19:12:02Z D! [outputs.graphite] buffer fullness: 5 / 10000 metrics. 
2019-03-25T19:13:04Z D! [outputs.graphite] wrote batch of 687 metrics in 207.710244ms
2019-03-25T19:13:04Z D! [outputs.graphite] buffer fullness: 4 / 10000 metrics. 
2019-03-25T19:14:02Z D! [outputs.graphite] wrote batch of 407 metrics in 47.990755ms
2019-03-25T19:14:02Z D! [outputs.graphite] buffer fullness: 1 / 10000 metrics. 
2019-03-25T19:15:03Z D! [outputs.graphite] wrote batch of 329 metrics in 75.403892ms
2019-03-25T19:15:03Z D! [outputs.graphite] buffer fullness: 1 / 10000 metrics. 

When I enabled ECS only last week on telegraf master, it took our Cloudwatch usage from $2.12/day (unrelated usage, baseline) to $9~/day ($6.88~/day in telegraf usage, $206.40~/mth). I then enabled the other 3 stanzas using this branch, left it going over the weekend, and checked our daily usage: $10.26/day ($8.14~/day in telegraf usage, $244.20~/mth).

I'm wondering if my assumptions on how GetMetricData is billed are wrong here...? How do you interpret the pricing model?


My equivalent "napkin math", which I feel may be off:

Threw some print debugging in my local build of the #5544 branch, to see how many GetMetricData calls were fired off:

You can request up to five statistics for the same metric in a single GetMetricData API request. Additional statistics are billed as an additional metric.
$0.01/1,000 metrics requested using GetMetricData
$0.01/1,000 GetMetricStatistics, ListMetrics, PutMetricData, GetDashboard, ListDashboards, PutDashboard and DeleteDashboards requests

  • in the code, the breakout comment says // loop through metrics and send groups of 100 at a time to gatherMetrics
  • so the code does 100 metrics per GetMetricData call, vs a single metric per GetMetricStatistics call in the current "released" code
  • some math:
    • 235~ metrics per min divided by 5 = 47 billable GetMetricData API calls. Could be slightly higher depending on the qty of data queries per API call, but let's go with that for napkin math's sake
    • that's 2,030,400 billable GetMetricData API calls per month
    • 2,030.4 (thousands) x $0.01 = $20.304/mth
  • the ECS metrics are per minute; the EBS + RDS + EC2 metrics are per 5mins, so retrieved 80% less often, a minority of the overall usage.

what we are seeing right now:

  • 391~ x 1 GetMetricStatistics calls, per min
    • 40% higher due to no statistic filtering, so we get sum/min/max/avg/sample_count instead of just avg/max/sample_count
  • that's 16,891,200 billable GetMetricStatistics API calls per mth, for ECS metrics alone
  • $168.912/mth, or $5.6304/day
  • this lines up pretty closely with the $7~/day I was seeing for ECS + EC2 + RDS + EBS metrics (last 24hrs $9~ vs a couple of days ago $2~; the $2 is mostly cloudwatch-to-graphite)

and after deploying with the 5min metrics added also (3 extra stanzas):

  • all the other 5min AWS metrics enabled come to about 665,280 stats per month. The 5min metric count is something like 1k~; some napkin math tells me $17.28~/mth for those.

I'd say it's pretty safe to say that #5544 will bring our telegraf Cloudwatch cost down to something like $37~/mth, instead of $206~/mth for the latest released telegraf.
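
For the record, here is the arithmetic above as a runnable Go sketch. It assumes, as the bullets do, that GetMetricData is billed per API call (with 5 stats of one metric folded into one billable unit) and GetMetricStatistics per request; that billing assumption is exactly what's in question in this thread:

package main

import "fmt"

func main() {
    const (
        minutesPerMonth = 60 * 24 * 30 // 43,200
        costPer1000     = 0.01         // $ per 1,000 billable units
    )

    // GetMetricData: 235~ metrics/min, 5 stats fit per billable unit,
    // so 235/5 = 47 billable calls per minute.
    dataCalls := (235.0 / 5.0) * minutesPerMonth
    fmt.Printf("GetMetricData:       %.0f/mth = $%.3f/mth\n",
        dataCalls, dataCalls/1000*costPer1000)

    // GetMetricStatistics: 391~ single-metric requests per minute.
    statCalls := 391.0 * minutesPerMonth
    fmt.Printf("GetMetricStatistics: %.0f/mth = $%.3f/mth\n",
        statCalls, statCalls/1000*costPer1000)
}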

@glinton
Contributor Author

glinton commented Mar 26, 2019

When I started, I hadn't read the pricing docs, but they make it sound like the cost will be the same either way?

$0.01/1,000 metrics requested using GetMetricData
$0.01/1,000 GetMetricStatistics, ListMetrics, PutMetricData, GetDashboard, ListDashboards, PutDashboard and DeleteDashboards requests

Sure, we make up to 20-100x fewer calls using `GetMetricData`.

Say you want stats for 100 metrics. Using the GetMetricStatistics API, it's one call per metric, each returning the 5 stats (max, min, avg, sum, count), resulting in 100 API calls (the statistics API is billed per request).
Using GetMetricData with no filtering, it's 5 calls, because you can request 100 data queries per call (20 metrics with 5 stats per request = 100).
Using GetMetricData and requesting only the average of all 100 metrics, it's 1 request.
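
A rough sketch of that arithmetic (a hypothetical helper, not the plugin's code; the 100-data-queries-per-request limit is taken from the quoted docs):

package main

import (
    "fmt"
    "math"
)

// getMetricDataCalls returns how many GetMetricData requests are needed to
// fetch `metrics` metrics with `stats` statistics each, given the limit of
// 100 data queries per request. GetMetricStatistics always needs one request
// per metric, however many statistics it returns.
func getMetricDataCalls(metrics, stats int) int {
    return int(math.Ceil(float64(metrics*stats) / 100))
}

func main() {
    fmt.Println(getMetricDataCalls(100, 5)) // 5 (all five stats)
    fmt.Println(getMetricDataCalls(100, 1)) // 1 (average only)
    // vs 100 GetMetricStatistics requests in either case
}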

What has me confused now is that the pricing for GetMetricData is per metric requested. Sure it's more efficient, but the pricing doc makes it sound like it won't actually save any money.

In their knowledge-center, it says:

GetMetricData is cheaper. You can ask for more data points for a single metric by using GetMetricData rather than by using GetMetricStatistics. To get the same number of data points, you must perform 70 GetMetricStatistics calls.

Which makes it sound not only cheaper computationally, but financially as well.

I know docs are eventually consistent, so there's a chance (and it seems you've experienced it) that this new API call does save money while making fewer requests.

@CpuID

CpuID commented Mar 26, 2019

I know docs are eventually consistent, so there's a chance (and it seems you've experienced it) that this new API call does save money while making fewer requests.

I wonder if it's really only a saving for someone retrieving bulk metrics, ie more than 1 data point per interval? If you're willing to accept delayed retrieval, one route (still thinking through the exact combination) is to tune the configuration to pull, say, 5 x 1min data points once per 5mins using GetMetricData, since that API call permits it. That would reduce the retrieval cost substantially, at the expense of retrieval "latency".

I think I concur with your thoughts, from my initial reading. Admittedly I did just wake up though, so if this opinion changes... :)

@glinton
Contributor Author

glinton commented Mar 26, 2019

That's a great point... I'll think about a way to preserve the timestamps if and when multiple data points are returned.

Contributor

@goller goller left a comment

Hey @glinton I can't wait to use this plugin. I like what you have done. I've found a few bugs and put a variety of nit-picks in. Let me know what you think!!

@glinton
Contributor Author

glinton commented Apr 10, 2019

@druchoo @CpuID can you test this build? there are some caching improvements and modifications to how metrics are created:

@druchoo

druchoo commented Apr 11, 2019

Initial testing of the build looks good. I plan on taking a closer look at each namespace tomorrow. We're doing some cost analysis first. Notes I have so far:

  • It's been running ~1d and we're not really seeing favorable results for cost vs the GetMetricStatistics version. Cost Explorer is showing roughly the same cost. Granted, there's no filtering in any configs, but I would have thought there would be savings from just switching API calls. I was hopeful, at least from the discussion and napkin math above.
  • I believe the ratelimit section in the README needs to be updated. GetMetricData is 50 TPS vs 400 for GetMetricStatistics.
  • Can you elaborate a bit on the caching (cache_ttl)?

@glinton
Contributor Author

glinton commented Apr 11, 2019

  • Since the napkin math, I've been worried this was a waste of time. I do believe cost savings can be had, at the cost of not having immediate stats. Can you try increasing the interval to 5-10x what you currently have it set to, but leave the period the same (I assume it's `1m`)? This should limit the number of metrics requested, but will return all backfilled data at 1 minute intervals (see the window sketch below).

  • Thanks, I'll make that update to the README.

  • The caching applies to how frequently metrics are listed from a namespace. If you don't have any metrics defined for a namespace, this plugin lists the available metrics no more frequently than cache_ttl. This setting also serves as the TTL for other caches used in the plugin, to limit processing (see the cache sketch below).
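
To illustrate the interval-vs-period idea, a minimal Go sketch (hypothetical helper names, not the plugin's actual code): with interval = 5m and period = 1m, each gather requests one five-minute window of 1-minute data points, keeping per-minute resolution while making a fifth as many requests.

package main

import (
    "fmt"
    "time"
)

// window returns the query window for one gather cycle. With interval = 5m
// and period = 1m, a single GetMetricData call can backfill five 1-minute
// data points instead of five separate per-minute collections.
func window(now time.Time, interval, period time.Duration) (start, end time.Time) {
    end = now.Truncate(period) // align to period boundaries
    return end.Add(-interval), end
}

func main() {
    start, end := window(time.Now(), 5*time.Minute, time.Minute)
    fmt.Println(start, "to", end) // covers five 1-minute data points
}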
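
And a rough sketch of the cache_ttl behavior described in the last bullet (hypothetical types; the plugin's real implementation differs):

package main

import (
    "fmt"
    "time"
)

// metricCache holds the last ListMetrics result for a namespace.
type metricCache struct {
    fetched time.Time
    ttl     time.Duration
    metrics []string // metric names; stand-in for *cloudwatch.Metric
}

func (c *metricCache) fresh() bool {
    return c != nil && time.Since(c.fetched) < c.ttl
}

// listMetrics reuses the cached result until it is older than cache_ttl,
// and only then pays for another round of ListMetrics API calls.
func listMetrics(c *metricCache, ttl time.Duration, fetch func() []string) ([]string, *metricCache) {
    if c.fresh() {
        return c.metrics, c // no API call made
    }
    m := fetch() // paginated ListMetrics calls against the namespace
    return m, &metricCache{fetched: time.Now(), ttl: ttl, metrics: m}
}

func main() {
    fetch := func() []string { return []string{"NetworkIn", "NetworkOut"} }
    var cache *metricCache
    var m []string
    for i := 0; i < 3; i++ {
        m, cache = listMetrics(cache, time.Minute, fetch)
        fmt.Println(m) // fetch only runs on the first pass within the TTL
    }
}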

@druchoo

druchoo commented Apr 11, 2019

@glinton Totally not a waste of time :-). The performance/speed benefit alone is insane. Before, it wasn't practical to consume larger namespaces due to the time it took to complete.

Increasing the interval showed cost savings with GetMetricStatistics. I imagine it will be the same for GetMetricData. I think we'll start with actually filtering our metrics and dimensions. Depending on how that goes, we can play around with the interval.

Maybe worth mentioning... In many cases, not all available metrics are useful. Similarly, although multiple statistics are available for a given metric, there may be only one meaningful one. Unfortunately, you need to read the CloudWatch docs for each namespace.

Regardless of cost savings, thank you for working on this!

@danielnelson
Contributor

@druchoo Is this helping with #3255 at all?

@CpuID

CpuID commented Apr 12, 2019

@druchoo @CpuID can you test this build? there are some caching improvements and modifications to how metrics are created:

OOO, I can test next wk if required

@CpuID

CpuID commented Apr 12, 2019

ya I think the filtering of metrics helps a lot for $ reasons over the prior implementation.

and I think with longer intervals, $ savings are possible.

this method is definitely the more efficient way to bulk retrieve from CW, so I think it's definitely not a waste

@druchoo

druchoo commented Apr 15, 2019

@danielnelson, although TCP connections are way down, I don't think the underlying issue is resolved. Connections are still left open, but it looks very unlikely this would be an issue now. Just another plus for GetMetricData :-)

Not sure what to do with #3255. Close I guess?

@danielnelson
Contributor

@druchoo I marked this PR to close issue #3255 when merged.

@danielnelson danielnelson requested a review from goller April 17, 2019 22:06
@danielnelson danielnelson merged commit e334830 into master Apr 23, 2019
@danielnelson danielnelson deleted the feature/5420 branch April 23, 2019 00:36
@CpuID

CpuID commented Apr 23, 2019

🎉

hwaastad pushed a commit to hwaastad/telegraf that referenced this pull request Jun 13, 2019
bitcharmer pushed a commit to bitcharmer/telegraf that referenced this pull request Oct 18, 2019
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
idohalevi pushed a commit to idohalevi/telegraf that referenced this pull request Sep 29, 2020