Sql server remodel #3618

m82labs · 2017-12-22T16:37:13Z

This is a rewrite of most of the data collection queries for the SQL Server plugin.

Changes:

Added AzureDB support
Added configuration option to exclude some queries
Added configuration option to switch between original and new queries
Queries are initialized once when the plugin is initialized, instead of each time metrics are gathered
Re-wrote README to include new config options and explain the v2 queries. Also removed the Grafana examples and metrics output dump to simplify it.
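As a rough illustration of the changes listed above (version switch, query exclusion, and one-time initialization), here is a minimal Python sketch. It is not the plugin's Go code: the class, the option names `query_version`, `exclude_query` and `azuredb`, the query names, and the placeholder query texts are all assumptions made up for this example.

```python
# Placeholder query texts; the plugin's real T-SQL is not reproduced here.
QUERIES_V1 = {
    "PerformanceCounters": "SELECT 'v1 performance counters'",
    "WaitStats": "SELECT 'v1 wait stats'",
}
QUERIES_V2 = {
    "PerformanceCounters": "SELECT 'v2 performance counters'",
    "WaitStats": "SELECT 'v2 wait stats'",
    "AzureDBResourceStats": "SELECT 'v2 AzureDB resource stats'",
}


class SqlServerInput:
    """Toy stand-in for the plugin; every name here is hypothetical."""

    def __init__(self, query_version=2, exclude_query=(), azuredb=False):
        self.query_version = query_version
        self.exclude_query = set(exclude_query)
        self.azuredb = azuredb
        self.init_count = 0  # only here to make the one-time init observable
        self._init_queries()  # build the query map once, at plugin init

    def _init_queries(self):
        source = QUERIES_V2 if self.query_version == 2 else QUERIES_V1
        queries = dict(source)
        if not self.azuredb:
            # Skip the AzureDB-only query on regular instances.
            queries.pop("AzureDBResourceStats", None)
        for name in self.exclude_query:
            queries.pop(name, None)
        self._queries = queries
        self.init_count += 1

    def gather(self):
        """Return the names of the queries that would run this interval."""
        return sorted(self._queries)  # reuses the map built in __init__


plugin = SqlServerInput(query_version=2, exclude_query=["WaitStats"])
plugin.gather()
print(plugin.gather(), plugin.init_count)
```

The point of the design is that `gather` does no setup work of its own, so the cost of building the query set is paid once regardless of the collection interval.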

Notes:

New queries were written with performance in mind (not readability :) )
New queries will return raw metrics whenever possible; deltas are not calculated in the T-SQL or plugin code
More human friendly names have been added for things like memory clerks
Wait stats have been categorized based on Microsoft published categories
These changes have been running on 200+ instances in a development environment for over a week

Required for all PRs:

Signed CLA.
Associated README.md updated.
Has appropriate unit tests.

m82labs · 2017-12-22T21:38:22Z

I do need someone to test this with Azure SQL DB. My free trial expired before I could fully test it.

danielnelson · 2017-12-28T00:34:58Z

@regevbr Do you think you would be able to help with Azure testing?

m82labs · 2017-12-28T02:13:53Z

@regevbr "AzureDB support" in this case simply means adding a query against an AzureDB-specific DMV for resource utilization, as well as adding logic throughout the V2 queries to avoid DMVs not available in AzureDB. Please suggest more queries that could be added if you can. I was thinking of adding per-database wait stats instead of the "server"-wide ones, since those can be misleading in AzureDB.

I would also be interested to know if AzureDB users would typically expect to set up telegraf by the database, or by the server.

zensqlmonitor · 2018-01-04T16:13:51Z

@m82labs could you please update/upload the grafana dashboard with the modifications you have done?

m82labs · 2018-01-04T16:41:50Z

Since the queries are so different in how the data is gathered, it would be a complete rewrite of the dashboard. I could upload a custom dashboard I use (it is actually 4 different dashboards), but it is very much tweaked for my own environment, and I assumed others would build their own from scratch. I should mention that the original queries are all still here and could continue to be used by those using the dashboard created around them.

danielnelson · 2018-01-04T19:13:30Z

While we will have the original queries for some time, when we eventually release Telegraf 2.0 I would like to drop the old version. This might be a ton of work, but maybe we should make a list of the queries we have in the current version and note whether we can still make them with this format?

m82labs · 2018-01-04T20:22:10Z

Just to make sure I understand, are you saying that we should see if we can rewrite the existing queries to return data in a more telegraf-friendly format? (No delta calculations, etc.) If so, I will try to fit the new query results to the existing dashboard and see what's missing and we can go from there. Since I added the ability to exclude specific queries, adding more queries to cover what the old queries were returning would likely meet everyone's needs here.

danielnelson · 2018-01-04T20:59:50Z

I'm interested in knowing if there are queries on the current dashboard that could not be rewritten with the new queries/format due to data not being collected. That doesn't necessarily mean we have to have equivalents, we just should have an idea what is no longer available.

m82labs · 2018-01-04T21:25:56Z

Ah, that makes sense. I will document what would no longer be available.

kerams · 2018-02-02T08:38:51Z

I'm excited to try this, because the old version generates an insane amount of measurements, most of which are of no real interest to me. Thanks for this.

danielnelson · 2018-02-02T20:22:38Z

@kerams Should be available in the nightly build now if you are interested: https://dl.influxdata.com/telegraf/nightlies/telegraf-nightly_windows_amd64.zip, would love to hear what you think.

m82labs · 2018-02-03T15:47:37Z

@kerams longer term I want to add better support for AzureDB. Prior to the release I want to add backup throughput counters as well as the user defined counters.

kerams · 2018-02-05T09:29:33Z

I've started porting the Grafana dashboard and these are the questions/notes so far:

- The database counts (online, offline, etc.) don't seem to include system databases
- Many performance counter readings have /sec in their name (and have the wrong c_type tag) even though they represent raw values, not deltas
- The CPU usage % counter has 2 tag values for instance - default and internal. What are these?
- Apart from the ones you mentioned above, I could not find the appropriate measurements to populate these panels:
  - Page file usage
  - Target memory
  - Used memory
  - Page file
  - Row/log writes and reads for the entire instance
  - Log used %
  - System log used %
- I'm running both versions of the plugin side-by-side on the same DB instance and several v1 and v2 counters seem to differ by a relatively wide margin. I'll keep an eye on this.

m82labs · 2018-02-05T12:53:09Z

- Good catch @kerams. Not sure why I am not including system databases; I will change that.
- Do you have a few examples of the c_type issues? I can take a look. I am calculating that field, so I am not sure what could be going wrong. (I based it on this doc: https://blogs.msdn.microsoft.com/psssql/2013/09/23/interpreting-the-counter-values-from-sys-dm_os_performance_counters/)
- The CPU usage included in this plugin is specifically for resource governor workload groups. So in this case it shows the internal and default workload groups. For more detailed CPU information you would need to use the CPU plugin.
- Page file usage would be another that would need to be grabbed from another plugin, as well as more detailed memory information. It does have total physical memory; I could include the max server memory as well.
- Row/log writes and reads are captured at the database level currently. It wouldn't be too difficult to alter the query to get a total as well; I can add that.
- Log Files Size and Used are both captured, so free % could be derived from that (same for system log if you wanted that in another graph).

In any cases where an existing plugin could capture better data I left the data out. I figured this was the best approach to keep this a pure SQL Server plugin. Some of the "server" counters might not make a whole lot of sense on a SQL on Linux instance, for example.

As far as the data being different, this is expected. The original plugin was capturing short pockets of time at a given interval, so it missed a lot of detail. This plugin relies on the user to do diff calculations so it can report at a much higher frequency, providing a lot more detail. Also, there is almost zero math happening in the plugin itself, so the data should be a lot more accurate.

When I deployed this in my production environment I noticed things I had never noticed before, but it matched my previous metrics for the most part (I was not using telegraf; I was doing raw perf counter captures). If you can give me a few examples of where they vary, I can try to provide a better explanation.
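The derivation of log used % and free % from the two captured log counters, mentioned in the reply above, might look like the following sketch. The function name and the assumption that both inputs are in the same unit (KB here) are mine, not taken from the plugin.

```python
def log_space_percentages(log_size_kb, log_used_kb):
    """Return (used %, free %) for a log file, or None if the size is unknown.

    Hypothetical helper: both arguments are assumed to be in the same unit.
    """
    if log_size_kb <= 0:
        return None  # avoid dividing by zero when no size was reported
    used_pct = 100.0 * log_used_kb / log_size_kb
    return used_pct, 100.0 - used_pct


print(log_space_percentages(1024, 256))
```

The same calculation could be done in the dashboard query itself rather than in code, which keeps the plugin free of derived values.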

kerams · 2018-02-05T13:56:44Z

> Do you have a few examples of the c_type issues?

For instance, the c_type of Batch Requests/sec and SQL Compilations/sec is rate, but they contain raw values that I feed to non_negative_derivative.

m82labs · 2018-02-05T14:35:53Z

Ah this sounds like a terminology issue on my part maybe. 'raw' means it can be reported directly, using avg or something similar, 'rate' means it has to be fed to 'non_negative_derivative'. It sounds confusing now that I write that. Maybe 'current' and 'cumulative' would be better?

kerams · 2018-02-05T14:44:31Z

I see. The rate and /sec combination is a bit unfortunate. Nonetheless, I don't really see the benefit of that tag (and object for that matter). Sure, it helps you design your queries at first, but its value for a counter never changes. Could this kind of meta information perhaps be more suited for the docs?

m82labs · 2018-02-05T14:48:40Z

I originally wanted to do that, but wanted to avoid adding all of the counters to the docs. After working with this a while though I DO feel having that extra tag on there is kind of a waste.

kerams · 2018-02-06T08:26:56Z

Take a look at https://i.imgur.com/C8wuyAo.png and https://i.imgur.com/OI4lHlh.png. Logouts and SQL (Re-)Compilations appear to be significantly different.

m82labs · 2018-02-06T11:21:14Z

Can you change your graphs for the V2 queries to calculate like this: `non_negative_derivative(last("value"),1s)`? This will give you a more accurate number. Using the mean on cumulative numbers like these will also under-report the value.
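To make the suggested calculation concrete, here is a small Python sketch of what a non-negative derivative does to a cumulative counter: it turns successive readings into per-second rates and drops any negative result (for example after a counter reset). It only illustrates the idea and is not InfluxDB's implementation; the function name, the sample data and the `unit_seconds` parameter are assumptions for this example.

```python
def non_negative_derivative(samples, unit_seconds=1.0):
    """Per-unit rates from (timestamp_seconds, cumulative_value) pairs.

    `samples` must be sorted by timestamp. Negative rates are dropped,
    which hides the drop caused by a counter reset.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rate = (v1 - v0) / dt * unit_seconds
        if rate >= 0:
            rates.append((t1, rate))
    return rates


# Cumulative "Batch Requests/sec"-style counter sampled every 10 seconds,
# with a reset between t=20 and t=30 (made-up numbers).
samples = [(0, 100), (10, 150), (20, 250), (30, 5), (40, 45)]
print(non_negative_derivative(samples))
```

With these made-up samples the result is a rate of 5.0 at t=10, 10.0 at t=20 and 4.0 at t=40; the reset interval ending at t=30 produces no point.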

kerams · 2018-02-06T12:06:16Z

Tried it, but the averages reported by Grafana remained unaffected.

m82labs · 2018-02-06T12:52:21Z

Have you tried using performance monitor on the instance itself to see which is closer to reality? I am going to do this now on one of my instances.

kerams · 2018-02-06T14:09:21Z

SQL Compilations and Batch requests in perfmon indeed seem to match v2 more closely.

m82labs · 2018-02-06T14:18:08Z

@kerams I ran this for ~ 1 hour and checked and the new results seem to line up with performance monitor:

kerams · 2018-02-06T14:23:01Z

Yeah, sorry for the false alarm. V2 looks good then.

m82labs · 2018-02-06T14:39:24Z

No worries, it forced me to double check. :)

DieterHi · 2018-04-04T14:22:52Z

Is it possible to get the values from sqlserver_performance as Integer and not as String?
SHOW FIELD KEYS
name: sqlserver_performance
fieldKey fieldType
value string

name: sqlserver_database_io
fieldKey fieldType
read_bytes integer
read_latency_ms integer
reads integer
write_bytes integer
write_latency_ms integer
writes integer

from telegraf test ->
sqlserver_performance,counter=Query,host=hamd,instance=User counter\ 9,object=SQLServer:User\ Settable,sql_instance=HAMD value="0.000000000000" 1522841856000000000

sqlserver_server_properties,host=ham,sql_instance=HAMD,sql_version=12.0.2000.8 db_suspect=0i,uptime=17249i,db_restoring=0i,server_memory=33553840i,cpu_count=4i,db_recovering=0i,db_online=8i,db_offline=0i,db_recoveryPending=0i 1522841856000000000
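The difference between the two outputs above comes down to how each field value is written in line protocol: a double-quoted value is stored as a string, a value with a trailing `i` as an integer, and a bare number as a float. The sketch below classifies a single field value by those rules and shows one way a numeric string could be coerced before being emitted. Both helpers are hypothetical and are not the fix that was applied to the plugin; the classifier is deliberately simplified and leans on Python's `float()` for the float case.

```python
import re

BOOLEAN_LITERALS = {"t", "T", "true", "True", "TRUE",
                    "f", "F", "false", "False", "FALSE"}


def field_type(raw):
    """Classify one line-protocol field value (simplified sketch)."""
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return "string"
    if re.fullmatch(r"[+-]?\d+i", raw):
        return "integer"
    if raw in BOOLEAN_LITERALS:
        return "boolean"
    try:
        float(raw)
        return "float"
    except ValueError:
        return "unknown"


def to_number(value):
    """Convert a numeric string to float; return anything else unchanged."""
    if isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            return value
    return value


print(field_type('"0.000000000000"'), field_type("0i"), field_type("0.5"))
print(to_number("0.000000000000"))
```

Because a field's type is fixed by its first write, a value that arrives as a string once can conflict with later numeric writes, which is why the quoting matters here.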

m82labs · 2018-04-04T14:32:41Z

This was not intentional. I'll take a look at it.

DieterHi · 2018-04-04T14:35:27Z

I use Telegraf 1.6 RC3 with the Kafka output.
Sorry, correction:
Telegraf v1.6.0~rc2 (git: release-1.6 1e95f97)

m82labs · 2018-04-06T15:18:34Z

@zensqlmonitor would you be able to shed a little light on how the plugin decides which data type to use? It seems any recent build is outputting strings, but the SQL data types being returned are not strings; they are typically of type numeric.

m82labs · 2018-04-06T17:06:20Z

@DieterHi I have a fix I am currently testing.

danielnelson · 2018-04-06T17:14:23Z

Might be worth looking into moving back to the main go-mssqldb repo: https://github.com/denisenkom/go-mssqldb

m82labs · 2018-04-06T18:50:06Z

@danielnelson good call. I can look at what's involved.

EDIT: A quick test and it looks like nothing would need to change, but I will do some further testing before doing a PR.

DieterHi · 2018-04-09T11:16:15Z

Hi Mark, I am now using the Telegraf nightly build, v1.7.0~a28de4b5 (git: master a28de4b), with your new code, and it works. Thanks for your good work and your support! Greetings, Dieter

m82labs · 2018-04-10T10:52:29Z

Good to hear @DieterHi !

m82labs and others added 17 commits October 1, 2017 22:59

Adding code to handle multiple query versions as well as one-time query init.

32d31ca

Merge remote-tracking branch 'upstream/master' into SqlServer-Remodel

439932c

Added new queries, a queryVersion config option, and altered the query initialization to only happen once.

212ca49

Added config option to specify if AzureDB resource metrics should be collected.

4d886e7

Fixed NULL values returned in performance metrics query.

63fa41c

Added logic to handle errors if "azuerDB" is set to true, but the instance is not running on AzureDB.

9d9a09d

Added ability to exclude queries.

a4e26ac

Added DatabaseIO replacement query.

c81b10b

Added DatabaseIO replacement query.

6697655

Syntax fixes, added database properties.

dba7f5b

Merge remote-tracking branch 'upstream/master' into SqlServer-Remodel

c7cef49

Added memory clerk query.

f0e7cd8

Added instance and server name. Added comment, added server properties.

3299d63

Added 'instance' and 'server' to all queries.

28e0ae6

Fixed a typo in the memory clerk query.

81e2848

Merge remote-tracking branch 'upstream/master' into SqlServer-Remodel

b2bffda

Updated readme.

1fccf86

m82labs mentioned this pull request Dec 22, 2017

Feature Request: Re-model SQL Server Plugin Data Output #3233

Closed

Made a few changes to make this work better for Azure SQL DB customers.

ffb0809

danielnelson added this to the 1.6.0 milestone Jan 4, 2018

danielnelson added the feat label Jan 4, 2018

danielnelson mentioned this pull request Mar 19, 2018

sqlserver plugin should have database name as tag and not in metric name #1816

Closed

danielnelson mentioned this pull request Apr 6, 2018

Some metrics disapears after upgrade to 1.6.x and above using Prometheus Ouput Plugin #3977

Closed

danielnelson mentioned this pull request Jun 5, 2018

#2861 - sqlserver input plugin not working with Azure SQL #2864

Closed

3 tasks

maxunt pushed a commit that referenced this pull request Jun 26, 2018

Add new sql server output data model (#3618)

43c092d
