Feature Request: Re-model SQL Server Plugin Data Output #3233
This is one of the older plugins, and it seems like it might be time to revamp the output format here to fit better with the latest Telegraf best practices. In particular, I think we could make better use of tags and fields. I could also be talked into style changes so that the output looks more like other plugins. Everything needs to be done in a backwards-compatible fashion, but if you had a fresh slate, what changes would you make?
This is not a full spec by any means, it's just a summary of some thoughts I had on how this plugin could be reworked.
Modeling
I have been thinking about the output format quite a bit, and this is how I think I would model it. I have left out some of the tags (like host) for brevity. I think all of the modeling changes could be made in the TSQL code, so adding new queries and only using them if a config option is set would be a good approach to maintain compatibility.
Performance Counters
These should be treated as key/value pairs: the key being a combination of the object and counter name, the value being the value of that counter. The counter instance would be used as a tag. Besides the modeling changes we should also reduce the number of objects this query is returning.
|-- measurement--|-- tags ----------| -- field -- |
| object-counter | counter_instance | value |
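For illustration, a minimal TSQL sketch of a query shaped this way; the object list in the WHERE clause is just an example, and on named instances the object_name values carry an instance-specific prefix rather than 'SQLServer:':

```sql
-- Sketch only: pull raw counter values so the object/counter combination can
-- become the key and the instance can become a tag.
-- Columns are from sys.dm_os_performance_counters; the object filter is illustrative.
SELECT
    RTRIM(object_name)   AS [object],
    RTRIM(counter_name)  AS [counter],
    RTRIM(instance_name) AS [counter_instance],
    cntr_value           AS [value]
FROM sys.dm_os_performance_counters
WHERE object_name IN (
    N'SQLServer:Buffer Manager',
    N'SQLServer:SQL Statistics'
);
```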
Wait Stats
Wait stats would be captured per wait type instead of categorized. I would argue pretty hard for this: as new versions of SQL Server come out, new wait types come with them. If we only captured the top 15-20 at any given time it would keep the record counts under control while still capturing the most important info. We could also exclude common wait types that are harmless. The "measurement" would be static, just "waitstats" or similar. The wait type would be used as a tag, and wait time, waiting tasks, and signal wait time would be fields.
|-- measurement--|-- tags ----------| -- field ------------------------------------ |
| waitstats | wait_type | wait_time_ms, waiting tasks, signal wait time |
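A minimal sketch of such a query, assuming we keep the top 20 by waiting tasks; the excluded wait types are only examples of the "harmless" ones mentioned above (the query eventually proposed is linked later in this thread as a gist):

```sql
-- Sketch only: top 20 wait types by waiting tasks, skipping a few
-- well-known benign waits (exclusion list is illustrative, not complete).
SELECT TOP 20
    wait_type,
    wait_time_ms,
    waiting_tasks_count,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
    N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'XE_TIMER_EVENT',
    N'REQUEST_FOR_DEADLOCK_SEARCH', N'BROKER_TASK_STOP'
)
ORDER BY waiting_tasks_count DESC;
```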
General Modeling
The general approach would be to place any data we would ever want to group by in a tag, and any data we would ever want to aggregate in a field. There may be special cases we run into for the other existing queries, but I think they will likely look like one of the examples above.
Config
I would like a few new configuration options (these would all have sane defaults that would result in the plugin working as it does currently):
metric_version - As discussed, this could be used to switch to a new set of queries while maintaining backward compatibility.
perf_objects - This would be a list of additional objects to add to the existing default list. Example: perf_objects = [ 'User Settable','Workload Group Stats' ]
totals_only - This option would only grab the _total instance of any perf counters that have multiple instances, one of which being _total. In some cases getting ALL of the instances could create a LOT of data. As an example, I have 400+ databases on some instances; on those boxes I might opt not to collect all of those counters, but just get the total.
get_perfcounters/waitstats/databaseio/etc - It might be neat to have a config option per query that would allow you to NOT collect certain data. Example: get_perfcounters = 0
Delta metrics
This change would be more drastic. For a lot of these metrics where we need to capture deltas over time, it would be nice if the plugin handled the delta logic. The queries would then be simplified to just select the data. On first run the plugin would capture the data and store it in memory until the next run, then report the delta between the two. This would allow the user to collect data at a much longer interval while still capturing all of the data. It would also result in easier-to-maintain TSQL within the plugin, and more accurate output.
Custom Queries
This wouldn't be too difficult, but will take some thought to make it fit with the rest of the plugin. The user should be able to specify the path to a directory that contains custom SQL scripts that would be executed, or a list of scripts directly in the config. There have been many times where I run into an issue and want to start gathering data from a DMV I don't normally look at and graph some data out. With custom queries this would be possible. This could potentially be added as a separate project.
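To make the custom-queries idea concrete, this is the kind of ad-hoc script a user might drop into such a directory; the DMV and column choices are purely illustrative:

```sql
-- Illustrative custom script: count currently executing requests by wait type,
-- the sort of one-off DMV data you might want to graph while investigating an issue.
SELECT
    ISNULL(wait_type, N'NONE') AS wait_type,
    COUNT(*)                   AS request_count
FROM sys.dm_exec_requests
GROUP BY ISNULL(wait_type, N'NONE');
```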
We often use an include/exclude list for filtering, but with a short list of items just an exclude list works. All items should be independent of each other if we do this:
Can you tell me more about the metrics where we capture deltas over time? Most of the plugins prefer to report absolute values; these can easily be converted to deltas at query time and handle any Telegraf downtime nicely. I notice there are some "per second" metrics, are we computing these or do they come from the database? Sorry if this is obvious from the code, the TSQL is way over my head. Regarding the custom queries, @lucadistefano has a general purpose SQL plugin #2785 in the works that can read scripts from file, I believe it is waiting on me. Can you take a look and see if it would meet your needs?
Personally I don't like performing deltas while gathering metrics. I prefer to get the raw values and compute deltas using database functions, frontend functions, or a post-processing batch.
If we store deltas we lose the information about the absolute values. To get deltas you need to perform 2 queries for each poll; with raw values, only 1 query. And how do you choose the delta period? Some metrics are updated constantly by the database, others every 10 seconds, others maybe less frequently. The delta period could limit the frequency of polling.
It would be useful to have a way to select the metrics you gather, avoiding the execution of queries that are not needed.
Luca
I agree with the statement above about raw values. Some of the counters are static values and some require you to get the delta for them to make sense, so maybe a tag could be added to specify? Handling user-defined queries in a separate plugin would likely be cleaner, so I'll take a look at the general purpose plugin.
A couple of things I think would be useful are:
Use a common measurement name, or at least far fewer measurement names, perhaps based on the type field. This provides a way to more easily find all measurements from the sqlserver plugin if you have other plugins enabled.
Use fields to their full potential. In the performance counters we currently have a single value field per point.
Here is an example of shifting the counter name to the field; it is sort of a bad example since we should probably stop storing per-second aggregations and instead use raw values. Before:
After:
We could potentially remove totals if we did this, since you can then easily sum the cursor types at query time. Usually we only save aggregations when the source provides them; this is more flexible at query time.
If I understand correctly @m82labs, you are thinking about a way to determine the units of a value. I have seen tags used for storing units, but it can make it more difficult to use fields, since sometimes you want fields with differing units in the same series. Here is an example where someone wants to compute the number of bytes written per call to write. If you have a units tag:
Same series with no units tag, instead putting the unit in the field:
@danielnelson what you mention about units is what I was thinking. Depending on the type of perfmon counter though, you might have more complex calculations ( (A2 - A1) / (B2 - B1) ). I was thinking each perf counter would have a counter value and a base value. Sometimes the base value would be empty, and then you would tag the record with a counter type; this type is what would indicate to the user what type of calculation they need to do. So if the type is "Avg/Sec" you know you need to calculate the value using the formula I have above. I can more easily illustrate with a query; I have been working on some new queries for the plugin to help me think through the problem, and I can share some examples later this weekend on what I was thinking. As far as your comments on measurement names, I like that as well. I have had a few ideas on this (like having the same measurement name for everything in the plugin, and using the type to differentiate), though I do like the idea of just prefixing the measurement with 'sqlserver_' as well.
This page explains the types of calculations you have to deal with: https://blogs.msdn.microsoft.com/psssql/2013/09/23/interpreting-the-counter-values-from-sys-dm_os_performance_counters/ I would say skip it and just grab the perf counters using the Windows performance counters plugin, but this would not work for SQL on Linux users. In that environment the only way to get the performance counters is via the DMV.
That makes sense: one series per performance counter name, with one field for the value and, sometimes, one for the base:
I would do this only if you think you will want to do calculations across counter_names, but you could try to push the counter_name into the field, which would allow the series to hold multiple counters:
When you query you would have to know the type based on the counter name; I'd recommend against also having a
After some more thought, I think it might make sense to give all metrics collected by this plugin the same measurement, and use a 'type' tag to differentiate between the different types of metrics. This would allow some interesting possibilities, like dividing WRITELOG waits for a given period by the batches per second to get the average writelog wait per batch. There are other use cases for this, but this is one I could use right away. I have created a new wait stats query here: https://gist.github.com/m82labs/a73ca20395c41f9ef099f8b030e04035 It gets the top 20 wait stats ordered by waiting tasks count. It includes the wait type as well as the wait category (as defined by Microsoft for the new QDS wait stats categories).
Here is the perfmon query I would suggest: https://gist.github.com/m82labs/5abe0ece587f7090174ec76baab5448f The c_type tag would be used to determine if you need to treat it as a raw value or if you need to calculate the delta. After looking over all of the counters used, the most important ones are fairly easy to deal with, so the query will handle cases where you just need to divide a metric by a base metric, as is the case with some of the CPU metrics. Since I simplified it a bit, it removes the need to include a 'base' value. This is a modified version of the query used by Microsoft when monitoring SQL Server on Linux instances: https://github.com/Microsoft/mssql-monitoring/blob/master/collectd/collectd.conf I think these are really the only queries we need. The rest of the system-level stats can be collected by the appropriate plugins for that OS (for CPU, disk space/latency, and memory). This will keep the plugin lean, and prevent it from collecting data that is already collected elsewhere.
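This is not the query from the gist, just a minimal sketch of the divide-by-a-base idea described above, under the assumption that base counters can be matched by name plus the PERF_LARGE_RAW_BASE counter type (base-counter naming is not perfectly uniform, so a real query would need more care):

```sql
-- Sketch only: pair each ratio counter with its "... base" counter and emit the
-- computed ratio; fall back to the raw value when there is no usable base.
SELECT
    RTRIM(v.object_name)   AS [object],
    RTRIM(v.counter_name)  AS [counter],
    RTRIM(v.instance_name) AS [counter_instance],
    CASE
        WHEN b.cntr_value > 0 THEN CAST(v.cntr_value AS float) / b.cntr_value
        ELSE v.cntr_value
    END                    AS [value]
FROM sys.dm_os_performance_counters AS v
LEFT JOIN sys.dm_os_performance_counters AS b
    ON  b.object_name   = v.object_name
    AND b.instance_name = v.instance_name
    AND RTRIM(b.counter_name) = RTRIM(v.counter_name) + N' base'
    AND b.cntr_type = 1073939712   -- PERF_LARGE_RAW_BASE (assumed constant)
WHERE v.cntr_type <> 1073939712;   -- don't emit the base counters themselves
```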
This all sounds good to me, though I will have a better feel for it when I see it as line protocol. I like the idea of simplifying the queries a bit too. One concern I have is whether the system stats are needed when using a hosted server such as on Azure. @regevbr @kardianos @deluxor Would this be a problem?
@danielnelson I wonder if we could make the system stats an optional query? I can work out some queries against the Azure-specific DMVs for system metrics. For example: https://docs.microsoft.com/en-us/sql/relational-databases/system-catalog-views/sys-resource-stats-azure-sql-database This would be for Azure SQLDB; if you are just running an instance on a VM in Azure, the normal Windows/Linux OS collectors should suffice. Note: Azure SQLDB does NOT come with a system health extended event session, so things like CPU usage are not exposed outside of the special DMVs added for SQLDB.
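As a rough illustration of the idea (not the query that was later shared), resource consumption for an Azure SQLDB database could be read from sys.dm_db_resource_stats, the 15-second-granularity counterpart of the sys.resource_stats view linked above:

```sql
-- Sketch only: most recent resource-consumption snapshot for an Azure SQL DB.
SELECT TOP 1
    end_time,
    avg_cpu_percent,
    avg_data_io_percent,
    avg_log_write_percent,
    avg_memory_usage_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
```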
@danielnelson - here is an example of the output of the wait stats query:
And the perfmon query:
Here is a query that could be used for the Azure DB stats: https://gist.github.com/m82labs/753d0972e7c9240609e8fbeafc769c24
Looks great to me. Do you have an example of a perfmon query with a base and value? What do you think about having separate measurement names for wait stats vs perf counters? @zensqlmonitor I know you collect a lot of metrics on your servers; would this amount of data still be sufficient for you?
@danielnelson I altered the query to handle ratio metrics automatically. It doesn't handle the more complex counter types, but I also don't think any of those are worth collecting over time by default, so it ended up being quite simple. The reason I was thinking of keeping measurement names the same was so you could do things like dividing a certain wait type by batches/sec. It was my understanding that you would not be able to do this if they were in different measurements.
You can use functions on a field with just the measurement name being the same; these are usually aggregations over time. But if you want to use an operator, the values need to be on the same series (measurement + tagset) with different field names. The only way around this that I know of is if you have a set of tags that you can use to get just the two values:
This should work, but it falls apart when tags are used more appropriately, with a tag being set on multiple series; it is also somewhat hard to figure out how to do it. This is why we usually push people to use many field names.
@danielnelson ah, I understand. In that case, having different measurement names would work fine. Just to clarify, the only calculations I am doing in the queries I linked to are dividing a metric by its base, for ratio metrics (CPU% for example). Only if a counter has a base am I doing any calculations; the rest are raw values.
Okay, I'm not sure it is the same situation, but you might be interested to know that most of our advanced users prefer CPU time (in jiffies) over percentage. I believe it is more accurate when aggregating over time. Still, many others prefer the simplicity of percentage.
After a lot of tweaking, I have a pull request in for this: #3618
This request is based on this initial conversation: https://community.influxdata.com/t/plugin-modification-guidance/2355
Feature Request
Proposal:
The proposal is to re-model the data that is returned by the SQL Server plugin to be more user friendly. Most of the changes could be made in the TSQL code itself, so we should be able to keep backward compatibility for users by adding a metric_version option to the configuration.
Current behavior:
Currently there is a lot of data stored in the name of the measure, and some counters are included that don't really need to be.
Desired behavior:
All "unique" information about a metric should be moved into tags: things like database name, 'type' when dealing with locks or wait stats, etc. The number of performance counters should also be reduced, or be made configurable (possibly by including a list of performance counter objects in the config).
Use case:
I currently manage an environment with 300+ instances and thousands of databases. In its current form the SQL Server plugin makes it too difficult to graph data by database, resource governor workload group, etc., as these are not uniform across all instances. Making the above changes would make this data a lot easier to work with.