Collect cloudwatch stats in a more efficient and cost effective manner #5544
Conversation
❤️ I planned to rename this out anyway tbh
@glinton regarding the unit type, I'm not super attached to it, but it seems we still know that information when we make the query. Are you still thinking of removing it?
@glinton minor tweaks :)
I think the main upside to retaining it would be namespace compatibility for existing users, i.e. a user upgrades and things "just work" in terms of other subsystems/dashboards that expect a pre-existing metric namespace. Maybe having a feature flag to disable the units would be extremely nice though...
I'm not certain that we know the unit type at query time. What about adding a config option to set a unit type, as the user may know what units a metric/namespace uses? If it's not set, there is no unit type tag.
Don't worry about the unit tag then; we will add it to the release notes, and there are other ways one could manually attach tags.
@glinton can you drop some "napkin math" in here re: how you foresee the "cost effective manner" part playing out? Background: I've been running this branch in "protoduction" for a few days, reading a reasonable volume of metrics from CW (4 stanzas).
When I enabled ECS only last week on telegraf master, it took our CloudWatch usage from $2.12/day (unrelated usage, baseline) to ~$9/day (~$6.88/day in telegraf usage, ~$206.40/mth). I then enabled the other 3 stanzas and used this branch, left it going over the weekend, and checked our daily usage: $10.26/day (~$8.14/day in telegraf usage, ~$244.20/mth). I'm wondering if my assumptions on how this is billed are off.

My equivalent "napkin math", which I feel may be off: I threw some print debugging in my local build of the #5544 branch to see how many calls it makes.
what we are seeing right now:
and after deploying with the 5min metrics added also (3 extra stanzas):
I'd say it's pretty safe to say that #5544 will bring our telegraf CloudWatch cost down to something like ~$37/mth, instead of ~$206/mth for the latest released telegraf.
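As a rough template for this kind of napkin math (not the exact figures from this thread, which didn't survive above), here is a minimal sketch; the call count and the per-1,000-request price are assumptions to replace with your own measured numbers and the current AWS price list:

```go
package main

import "fmt"

func main() {
	// Placeholder inputs: swap in your measured request counts and the
	// CloudWatch API price from the current AWS pricing page.
	callsPerInterval := 1200.0 // API requests telegraf makes per collection interval (assumed)
	intervalsPerDay := 288.0   // a 5-minute collection interval
	pricePer1000 := 0.01       // USD per 1,000 requests (assumed; verify against AWS pricing)

	daily := callsPerInterval * intervalsPerDay / 1000.0 * pricePer1000
	fmt.Printf("~$%.2f/day, ~$%.2f/month\n", daily, daily*30)
}
```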
When I started, I didn't read the pricing docs, but they make it sound like it'll be the same cost either way?
Sure, at best we make 20-100x fewer calls using `GetMetricData`. Say you want stats for 100 metrics: with `GetMetricStatistics` that's 100 separate requests, while `GetMetricData` can batch them into one. What has me confused now is the pricing for `GetMetricData` itself. Their knowledge-center article makes it sound not only cheaper computationally, but financially as well. I know docs are eventually consistent, so there's a chance (and it seems you've experienced it) that this new API call does save money while making fewer requests.
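To illustrate the batching behind that "20-100x fewer calls" figure, here is a rough sketch using the aws-sdk-go CloudWatch client. The ELB `Latency` metric is just a placeholder, and the 100-queries-per-call limit assumed here is the one that applied around the time of this PR:

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	svc := cloudwatch.New(session.Must(session.NewSession()))

	// One MetricDataQuery per metric/statistic pair. GetMetricData accepts
	// up to 100 queries per request, so stats for 100 metrics that would
	// have needed 100 GetMetricStatistics calls collapse into a single call.
	queries := []*cloudwatch.MetricDataQuery{
		{
			Id: aws.String("m0"),
			MetricStat: &cloudwatch.MetricStat{
				Metric: &cloudwatch.Metric{
					Namespace:  aws.String("AWS/ELB"),
					MetricName: aws.String("Latency"),
				},
				Period: aws.Int64(300),
				Stat:   aws.String("Average"),
			},
		},
		// ... more queries, up to the per-call limit ...
	}

	end := time.Now()
	out, err := svc.GetMetricData(&cloudwatch.GetMetricDataInput{
		StartTime:         aws.Time(end.Add(-5 * time.Minute)),
		EndTime:           aws.Time(end),
		MetricDataQueries: queries,
	})
	if err != nil {
		panic(err)
	}
	for _, r := range out.MetricDataResults {
		fmt.Println(*r.Id, aws.Float64ValueSlice(r.Values))
	}
}
```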
I wonder if it's really only a saving for someone retrieving bulk metrics, i.e. more than 1 data point per interval? I guess one route there, if you are willing to accept a delayed retrieval, is to tune the configuration (still thinking through the exact combination) to pull, say, 5 x 1min data points once per 5mins using a single `GetMetricData` call.

I think I concur with your thoughts, from my initial reading. Admittedly I did just wake up though, so if this opinion changes... :)
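As a concrete sketch of that delayed-retrieval idea, assuming the plugin's existing `period`, `interval`, and `delay` options (values are illustrative; whether multiple data points per metric are actually preserved is exactly the open question in the next comment):

```toml
[[inputs.cloudwatch]]
  region = "us-east-1"
  namespace = "AWS/ELB"

  # Ask CloudWatch for 1-minute resolution data points...
  period = "1m"
  # ...but only issue the (batched) query once every 5 minutes,
  # accepting that data arrives up to ~5 minutes late.
  interval = "5m"
  delay = "5m"
```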
That's a great point. I'll think about a way to preserve the timestamps when and if multiple data points are returned.
Hey @glinton I can't wait to use this plugin. I like what you have done. I've found a few bugs and put a variety of nit-picks in. Let me know what you think!!
Initial testing of the build looks good. I plan on taking a closer look at each namespace tomorrow. We're doing some cost analysis first. Notes I have now:
@glinton Totally not a waste of time :-). The performance/speed benefit alone is insane. Before, it wasn't practical to consume larger namespaces due to the time it took to complete. Increasing the interval showed cost savings, too.

Maybe worth mentioning: in many cases, all available metrics are not useful. Similarly, although multiple statistics are available for a given metric, there may only be one meaningful one. Unfortunately, you need to read the CloudWatch docs for each namespace. Regardless of cost savings, thank you for working on this!
Ya, I think the filtering of metrics helps a lot for $ reasons over the prior implementation, and I think with longer intervals, $ savings are possible. This method is definitely the more efficient way to bulk retrieve from CW, so it's definitely not a waste.
@danielnelson, although TCP connections are way down, I don't think the underlying issue is resolved. Connections are still left open, but it looks very unlikely this would be an issue now. Just another plus for GetMetricData :-) Not sure what to do with #3255. Close it, I guess?
🎉
Resolves #5420
Resolves #5609
Closes #3255
Due to the structure of the data returned by `GetMetricData` (as opposed to `GetMetricStatistics`), metrics will no longer be tagged with the unit type. This version makes one call to the AWS API and creates metrics based on the dimensions and specified metric names from the config file.
Config:
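A representative configuration of the kind the plugin accepts, sketched from its documented options; the namespace, metric names, and dimension values below are placeholders rather than the exact snippet from this PR:

```toml
[[inputs.cloudwatch]]
  region = "us-east-1"

  # Granularity of the requested data points and how often/late to collect them.
  period = "5m"
  delay = "5m"
  interval = "5m"

  namespace = "AWS/ELB"

  # Limit collection to named metrics and dimensions instead of discovering
  # every metric in the namespace.
  [[inputs.cloudwatch.metrics]]
    names = ["Latency", "RequestCount"]

    [[inputs.cloudwatch.metrics.dimensions]]
      name = "LoadBalancerName"
      value = "my-elb"
```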
Output:
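Likewise, an illustrative line of output, assuming the plugin's usual naming scheme (measurement `cloudwatch_` plus the snake-cased namespace, fields named `<metric>_<statistic>`) and, per the note above, no `unit` tag; the values are made up:

```text
cloudwatch_aws_elb,load_balancer_name=my-elb,region=us-east-1 latency_average=0.057,latency_maximum=0.312,request_count_sum=42 1555000200000000000
```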
Todo: