Missing metrics {type="Database size"} for inputs.sqlserver when output to prometheus_client when running telegraf on windows. #3349

mplisov · 2017-10-17T23:24:54Z

Seeing strange behaviour with sqlserver metrics on telegraf 1.4.2 on Windows.
Prometheus output misses {type="Database size"} metrics block completely.
I can see those metrics in --test run of telegraf though.
Second note: Issue doesn't appear when running same version of telegraf on linux.

SQL servers and connection credentials are the same (checked with sql server 2008,2012,2014)

danielnelson · 2017-10-17T23:49:55Z

Can you show what the output looks like on Linux vs Windows? Can you also show the relevant output when using --test?

mplisov · 2017-10-18T00:11:39Z

Output is quite big.
Example of missing data
Here is the sample from --test run (its identical on windows and linux):

> Log\ size\ (bytes),type=Database\ size,servername=SQL 5nine_monitor=1064960i,smg_pmc=72679424i,tempdb=18677760i,enterprise20=33751040i,agroplan_prod=789250048i,master=1835008i,model=1048576i,conversion3=78577664i,smg_itil=1538260992i,conversion2=445382656i,msdb=2621440i,ReportServerTempDB=1064960i,kmszarauto=15162540032i,SQLSentry=158334976i,agroplan_copy=789250048i,smg_docmngcorp=27262976i,smg_hrm_82_regl=838402048i,AgroPlan=789250048i,smg_modeling=251330560i,ReportServer=9633792i,MdxStepByStep=40239104i,psp_dw=174325760i,smg_82=20578304i,smg_buh_20=37158912i,Sqlbi_ManyToMany_Multidimensional=5242880i,courseconstructor=2097152i 1508285358000000000
> Rows\ size\ (bytes),servername=SQL,type=Database\ size SQLSentry=367001600i,agroplan_copy=78708736i,model=4259840i,ReportServerTempDB=4259840i,conversion3=153092096i,smg_modeling=233897984i,smg_hrm_82_regl=319881216i,smg_82=1016070144i,smg_itil=1988165632i,master=5111808i,enterprise20=5460852736i,conversion2=2602631168i,courseconstructor=58785792i,agroplan_prod=133234688i,ReportServer=5308416i,Sqlbi_ManyToMany_Multidimensional=10485760i,tempdb=180682752i,5nine_monitor=4259840i,AgroPlan=133234688i,smg_docmngcorp=375586816i,MdxStepByStep=106561536i,msdb=225640448i,kmszarauto=6821773312i,smg_pmc=29425664i,psp_dw=104857600i,smg_buh_20=7089881088i 1508285358000000000

mplisov · 2017-10-18T09:47:47Z

Figured out than metrics of type="Database IO" also missing on Windows instance. It seems that bug somewhere in parsing data to Prometheus format, because in --test output they are present.
Example of missing data

> Log\ reads\ (bytes/sec),servername=SQL2012,type=Database\ IO,host=sql2012 model=0i,tempdb=0i,breedfarm=0i,newvysokovskoe=0i,crm_demo=0i,trade.standard=0i,erp2_demo=0i,enterprise20_copy=0i,trade11.standard=0i,newzalmilk=0i,erp2_agr_demo=0i,master=0i,erp2_demo1=0i,enterprise20_test=0i,erp_eurocond=0i,nmsztrade=0i,zalfoods=0i,docmngcorp_demo=0i,Total=0i,OpManagerDB=0i,consolidationprof=0i,learn_expert1c=0i,crm3_demo=0i,docmngcorp_test=0i,enterprise20_copy2=0i,msdb=0i,nmsztrade.sample=0i,enterprise20_demo=0i,enterprise_standard=0i 1508282208000000000

danielnelson · 2017-10-18T22:11:10Z

My access to a Windows machine is limited, but I will try to reproduce this Friday.

danielnelson · 2017-10-21T00:47:29Z

I created a powershell script that echos some of the lines from the --test output, and ran that as an exec plugin, but I wasn't missing any data from the prometheus output:

# HELP Log_reads__bytes_sec__enterprise20_demo Telegraf collected metric
# TYPE Log_reads__bytes_sec__enterprise20_demo untyped
Log_reads__bytes_sec__enterprise20_demo{host="sql2012",servername="SQL2012",type="Database IO"} 0

Can you enable a file output and check if the missing data is written?

mplisov · 2017-11-21T00:02:14Z

I think I've found the bug. For some strange reason if I specify connection host as fqdn/hostname, I get missing metrics, but when I specify IPv4 address like 127.0.0.1 it works as expected. This explains why metrics collected from telegraf on linux were ok, because linux machine was connecting using IPv4 address. Good question why this happens...

danielnelson · 2017-11-21T00:21:26Z

Wow, thats odd. Can you duplicate this just using curl or similar and diffing the output?

mplisov · 2017-11-21T00:34:05Z

Well, it was too quick decision. I was wrong on ipv6 thing, sorry.
Attempt #2. I can reproduce the error-behaviour when telegraf is running as windows service as LocalSystem account. When change LocalSystem account to my domain account or run it from comandline everything is ok. I will check it on couple other servers in two days and report back.

allangood · 2018-04-05T20:55:30Z

Hi Daniel,

I'm getting this exactly behavior, but after upgrade from 1.5.3 to 1.6.X or 1.7.X for both, Windows or Linux boxes.

Some outputs like inputs.mem and inputs.sqlserver just disappears from the prometheus exposed list (http://myserver/metrics) but works perfectly with --test switch on command line.

I've tried to run as different user (on windows) and using IP address instead of name as well, and none worked.
There's no relevant logs on log file or running with "debug" flag as true.

I will be glad to provide more information with you need.

Thank you.

My configuration:

[global_tags]
  collector = "telegraf"
  fqdn = "myserver.mydomain"
  location = "mylocation"
  ostype = "windows"
[[outputs.prometheus_client]]
  expiration_interval = "60s"
  listen = "0.0.0.0:9273"
[[inputs.mem]]
[[inputs.sqlserver]]
  query_version = 2
  servers = [ "Server=localhost\\instance;User Id=telegraf;Password=telegraf;app name=telegraf;encrypt=disable;log=0"]

Output from command line:

c:\Program Files\telegraf>telegraf.exe -config "c:\Program Files\telegraf\telegraf.conf" --config-directory "c:\Program Files\telegraf\telegraf.d" --test --input-filter mem

> mem,collector=telegraf,fqdn=myserver,host=myserver,location=mylocation,ostype=windows active=0i,available=8255950848i,available_percent=48.057264930773286,buffered=0i,cached=0i,free=i,inactive=0i,slab=0i,total=17179402240i,used=8923451392i,used_percent=51.942735069226714,wire=0i 1522960875000000000

Output from curl "http://myserver/metrics":

# TYPE mem_available_percent counter
mem_available_percent{collector="telegraf",fqdn="myserver",host="myserver",location="mylocation",ostype="windows"} 48.13911625367473
# HELP mem_used_percent Telegraf collected metric
# TYPE mem_used_percent counter
mem_used_percent{collector="telegraf",fqdn="myserver",host="myserver",location="mylocation",ostype="windows"} 51.86088374632527

Output from version 1.5.3:

curl http://myserver02/metrics

# HELP mem_active Telegraf collected metric
# TYPE mem_active counter
mem_active{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 0
# HELP mem_available Telegraf collected metric
# TYPE mem_available counter
mem_available{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 5.204361216e+09
# HELP mem_available_percent Telegraf collected metric
# TYPE mem_available_percent counter
mem_available_percent{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 17.31076795555045
# HELP mem_buffered Telegraf collected metric
# TYPE mem_buffered counter
mem_buffered{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 0
# HELP mem_cached Telegraf collected metric
# TYPE mem_cached counter
mem_cached{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 0
# HELP mem_free Telegraf collected metric
# TYPE mem_free counter
mem_free{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 0
# HELP mem_inactive Telegraf collected metric
# TYPE mem_inactive counter
mem_inactive{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 0
# HELP mem_slab Telegraf collected metric
# TYPE mem_slab counter
mem_slab{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 0
# HELP mem_total Telegraf collected metric
# TYPE mem_total counter
mem_total{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 3.0064300032e+10
# HELP mem_used Telegraf collected metric
# TYPE mem_used counter
mem_used{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 2.4859938816e+10
# HELP mem_used_percent Telegraf collected metric
# TYPE mem_used_percent counter
mem_used_percent{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 82.68923204444955

danielnelson · 2018-04-05T21:32:09Z

@allangood Thanks for the help testing, I can reproduce this and will work on a fix. I think this is different from the original issue though, can you open this as a new issue?

danielnelson · 2018-04-05T21:33:24Z

@mplisov I hope you were able to resolve the original problem, I'm going to assume no new is good news and close this issue, let me know if it should be reopened.

allangood · 2018-04-05T22:03:24Z

Hi @danielnelson,

I opened issue number #3977

Thank you!

mplisov changed the title ~~Missing metrics for inputs.sqlserver when output to prometheus_client when running telegraf on windows.~~ Missing metrics {type="Database size"} for inputs.sqlserver when output to prometheus_client when running telegraf on windows. Oct 17, 2017

danielnelson added area/prometheus bug unexpected problem or unintended behavior labels Oct 17, 2017

danielnelson closed this as completed Apr 5, 2018

danielnelson mentioned this issue Apr 5, 2018

Fix conversion of unsigned ints in prometheus output #3978

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Missing metrics {type="Database size"} for inputs.sqlserver when output to prometheus_client when running telegraf on windows. #3349

Missing metrics {type="Database size"} for inputs.sqlserver when output to prometheus_client when running telegraf on windows. #3349

mplisov commented Oct 17, 2017

danielnelson commented Oct 17, 2017

mplisov commented Oct 18, 2017 •

edited by danielnelson

Loading

mplisov commented Oct 18, 2017 •

edited

Loading

danielnelson commented Oct 18, 2017

danielnelson commented Oct 21, 2017

mplisov commented Nov 21, 2017 •

edited

Loading

danielnelson commented Nov 21, 2017

mplisov commented Nov 21, 2017

allangood commented Apr 5, 2018 •

edited

Loading

danielnelson commented Apr 5, 2018

danielnelson commented Apr 5, 2018

allangood commented Apr 5, 2018 •

edited

Loading

Missing metrics {type="Database size"} for inputs.sqlserver when output to prometheus_client when running telegraf on windows. #3349

Missing metrics {type="Database size"} for inputs.sqlserver when output to prometheus_client when running telegraf on windows. #3349

Comments

mplisov commented Oct 17, 2017

danielnelson commented Oct 17, 2017

mplisov commented Oct 18, 2017 • edited by danielnelson Loading

mplisov commented Oct 18, 2017 • edited Loading

danielnelson commented Oct 18, 2017

danielnelson commented Oct 21, 2017

mplisov commented Nov 21, 2017 • edited Loading

danielnelson commented Nov 21, 2017

mplisov commented Nov 21, 2017

allangood commented Apr 5, 2018 • edited Loading

danielnelson commented Apr 5, 2018

danielnelson commented Apr 5, 2018

allangood commented Apr 5, 2018 • edited Loading

mplisov commented Oct 18, 2017 •

edited by danielnelson

Loading

mplisov commented Oct 18, 2017 •

edited

Loading

mplisov commented Nov 21, 2017 •

edited

Loading

allangood commented Apr 5, 2018 •

edited

Loading

allangood commented Apr 5, 2018 •

edited

Loading