
Improve metrics using labels #4042

Closed

Conversation

pierresouchay
Contributor

This will create good-quality metrics for systems that support labels (for example, Prometheus).

@guidoiaquinti
Contributor

guidoiaquinti commented Apr 19, 2018

Example output (after patching master with this PR + #4016):

```
➜  ~ curl -s "localhost:8500/v1/agent/metrics?format=prometheus"
# HELP consul_autopilot_failure_tolerance consul_autopilot_failure_tolerance
# TYPE consul_autopilot_failure_tolerance gauge
consul_autopilot_failure_tolerance 0
# HELP consul_autopilot_healthy consul_autopilot_healthy
# TYPE consul_autopilot_healthy gauge
consul_autopilot_healthy 1
# HELP consul_catalog_register consul_catalog_register
# TYPE consul_catalog_register summary
consul_catalog_register{quantile="0.5"} NaN
consul_catalog_register{quantile="0.9"} NaN
consul_catalog_register{quantile="0.99"} NaN
consul_catalog_register_sum 2.5854299068450928
consul_catalog_register_count 1
# HELP consul_consul_autopilot_failure_tolerance consul_consul_autopilot_failure_tolerance
# TYPE consul_consul_autopilot_failure_tolerance gauge
consul_consul_autopilot_failure_tolerance 0
# HELP consul_consul_autopilot_healthy consul_consul_autopilot_healthy
# TYPE consul_consul_autopilot_healthy gauge
consul_consul_autopilot_healthy 1
# HELP consul_consul_catalog_register consul_consul_catalog_register
# TYPE consul_consul_catalog_register summary
consul_consul_catalog_register{quantile="0.5"} NaN
consul_consul_catalog_register{quantile="0.9"} NaN
consul_consul_catalog_register{quantile="0.99"} NaN
consul_consul_catalog_register_sum 2.774869918823242
consul_consul_catalog_register_count 1
# HELP consul_consul_fsm_autopilot consul_consul_fsm_autopilot
# TYPE consul_consul_fsm_autopilot summary
consul_consul_fsm_autopilot{quantile="0.5"} NaN
consul_consul_fsm_autopilot{quantile="0.9"} NaN
consul_consul_fsm_autopilot{quantile="0.99"} NaN
consul_consul_fsm_autopilot_sum 0.29412001371383667
consul_consul_fsm_autopilot_count 1
# HELP consul_consul_fsm_coordinate_batch_update consul_consul_fsm_coordinate_batch_update
# TYPE consul_consul_fsm_coordinate_batch_update summary
consul_consul_fsm_coordinate_batch_update{quantile="0.5"} 0.0741880014538765
consul_consul_fsm_coordinate_batch_update{quantile="0.9"} 0.0741880014538765
consul_consul_fsm_coordinate_batch_update{quantile="0.99"} 0.0741880014538765
consul_consul_fsm_coordinate_batch_update_sum 4.042929969727993
consul_consul_fsm_coordinate_batch_update_count 100
# HELP consul_consul_fsm_register consul_consul_fsm_register
# TYPE consul_consul_fsm_register summary
consul_consul_fsm_register{quantile="0.5"} NaN
consul_consul_fsm_register{quantile="0.9"} NaN
consul_consul_fsm_register{quantile="0.99"} NaN
consul_consul_fsm_register_sum 2.113976964727044
consul_consul_fsm_register_count 4
# HELP consul_consul_fsm_tombstone consul_consul_fsm_tombstone
# TYPE consul_consul_fsm_tombstone summary
consul_consul_fsm_tombstone{op="reap",quantile="0.5"} NaN
consul_consul_fsm_tombstone{op="reap",quantile="0.9"} NaN
consul_consul_fsm_tombstone{op="reap",quantile="0.99"} NaN
consul_consul_fsm_tombstone_sum{op="reap"} 0.07081200182437897
consul_consul_fsm_tombstone_count{op="reap"} 1
# HELP consul_consul_http_GET_v1_agent_metrics consul_consul_http_GET_v1_agent_metrics
# TYPE consul_consul_http_GET_v1_agent_metrics summary
consul_consul_http_GET_v1_agent_metrics{quantile="0.5"} NaN
consul_consul_http_GET_v1_agent_metrics{quantile="0.9"} NaN
consul_consul_http_GET_v1_agent_metrics{quantile="0.99"} NaN
consul_consul_http_GET_v1_agent_metrics_sum 7.05637001991272
consul_consul_http_GET_v1_agent_metrics_count 3
# HELP consul_consul_leader_barrier consul_consul_leader_barrier
# TYPE consul_consul_leader_barrier summary
consul_consul_leader_barrier{quantile="0.5"} NaN
consul_consul_leader_barrier{quantile="0.9"} NaN
consul_consul_leader_barrier{quantile="0.99"} NaN
consul_consul_leader_barrier_sum 14.676066040992737
consul_consul_leader_barrier_count 5
# HELP consul_consul_leader_establish_leadership consul_consul_leader_establish_leadership
# TYPE consul_consul_leader_establish_leadership summary
consul_consul_leader_establish_leadership{quantile="0.5"} NaN
consul_consul_leader_establish_leadership{quantile="0.9"} NaN
consul_consul_leader_establish_leadership{quantile="0.99"} NaN
consul_consul_leader_establish_leadership_sum 0.07118500024080276
consul_consul_leader_establish_leadership_count 1
# HELP consul_consul_leader_reconcile consul_consul_leader_reconcile
# TYPE consul_consul_leader_reconcile summary
consul_consul_leader_reconcile{quantile="0.5"} NaN
consul_consul_leader_reconcile{quantile="0.9"} NaN
consul_consul_leader_reconcile{quantile="0.99"} NaN
consul_consul_leader_reconcile_sum 1.0313669964671135
consul_consul_leader_reconcile_count 5
# HELP consul_consul_leader_reconcileMember consul_consul_leader_reconcileMember
# TYPE consul_consul_leader_reconcileMember summary
consul_consul_leader_reconcileMember{quantile="0.5"} NaN
consul_consul_leader_reconcileMember{quantile="0.9"} NaN
consul_consul_leader_reconcileMember{quantile="0.99"} NaN
consul_consul_leader_reconcileMember_sum 0.6272670105099678
consul_consul_leader_reconcileMember_count 5
# HELP consul_consul_rpc_accept_conn consul_consul_rpc_accept_conn
# TYPE consul_consul_rpc_accept_conn counter
consul_consul_rpc_accept_conn{client="127.0.0.1:50315"} 1
# HELP consul_consul_rpc_query consul_consul_rpc_query
# TYPE consul_consul_rpc_query counter
consul_consul_rpc_query 8
# HELP consul_consul_rpc_request consul_consul_rpc_request
# TYPE consul_consul_rpc_request counter
consul_consul_rpc_request{client="127.0.0.1:50315"} 144
# HELP consul_consul_session_ttl_active consul_consul_session_ttl_active
# TYPE consul_consul_session_ttl_active gauge
consul_consul_session_ttl_active 0
# HELP consul_fsm_autopilot consul_fsm_autopilot
# TYPE consul_fsm_autopilot summary
consul_fsm_autopilot{quantile="0.5"} NaN
consul_fsm_autopilot{quantile="0.9"} NaN
consul_fsm_autopilot{quantile="0.99"} NaN
consul_fsm_autopilot_sum 0.2513119876384735
consul_fsm_autopilot_count 1
# HELP consul_fsm_coordinate_batch_update consul_fsm_coordinate_batch_update
# TYPE consul_fsm_coordinate_batch_update summary
consul_fsm_coordinate_batch_update{quantile="0.5"} 0.044555000960826874
consul_fsm_coordinate_batch_update{quantile="0.9"} 0.044555000960826874
consul_fsm_coordinate_batch_update{quantile="0.99"} 0.044555000960826874
consul_fsm_coordinate_batch_update_sum 3.0443769991397858
consul_fsm_coordinate_batch_update_count 100
# HELP consul_fsm_register consul_fsm_register
# TYPE consul_fsm_register summary
consul_fsm_register{quantile="0.5"} NaN
consul_fsm_register{quantile="0.9"} NaN
consul_fsm_register{quantile="0.99"} NaN
consul_fsm_register_sum 1.983374036848545
consul_fsm_register_count 4
# HELP consul_fsm_tombstone consul_fsm_tombstone
# TYPE consul_fsm_tombstone summary
consul_fsm_tombstone{op="reap",quantile="0.5"} NaN
consul_fsm_tombstone{op="reap",quantile="0.9"} NaN
consul_fsm_tombstone{op="reap",quantile="0.99"} NaN
consul_fsm_tombstone_sum{op="reap"} 0.004244000185281038
consul_fsm_tombstone_count{op="reap"} 1
# HELP consul_http_GET_v1_agent_metrics consul_http_GET_v1_agent_metrics
# TYPE consul_http_GET_v1_agent_metrics summary
consul_http_GET_v1_agent_metrics{quantile="0.5"} NaN
consul_http_GET_v1_agent_metrics{quantile="0.9"} NaN
consul_http_GET_v1_agent_metrics{quantile="0.99"} NaN
consul_http_GET_v1_agent_metrics_sum 7.286432087421417
consul_http_GET_v1_agent_metrics_count 3
# HELP consul_leader_barrier consul_leader_barrier
# TYPE consul_leader_barrier summary
consul_leader_barrier{quantile="0.5"} NaN
consul_leader_barrier{quantile="0.9"} NaN
consul_leader_barrier{quantile="0.99"} NaN
consul_leader_barrier_sum 14.82359504699707
consul_leader_barrier_count 5
# HELP consul_leader_reconcile consul_leader_reconcile
# TYPE consul_leader_reconcile summary
consul_leader_reconcile{quantile="0.5"} NaN
consul_leader_reconcile{quantile="0.9"} NaN
consul_leader_reconcile{quantile="0.99"} NaN
consul_leader_reconcile_sum 0.9288930147886276
consul_leader_reconcile_count 5
# HELP consul_leader_reconcileMember consul_leader_reconcileMember
# TYPE consul_leader_reconcileMember summary
consul_leader_reconcileMember{quantile="0.5"} NaN
consul_leader_reconcileMember{quantile="0.9"} NaN
consul_leader_reconcileMember{quantile="0.99"} NaN
consul_leader_reconcileMember_sum 0.49364200234413147
consul_leader_reconcileMember_count 5
# HELP consul_memberlist_gossip consul_memberlist_gossip
# TYPE consul_memberlist_gossip summary
consul_memberlist_gossip{quantile="0.5"} 0.020315000787377357
consul_memberlist_gossip{quantile="0.9"} 0.041338998824357986
consul_memberlist_gossip{quantile="0.99"} 0.08215799927711487
consul_memberlist_gossip_sum 50.847086979076266
consul_memberlist_gossip_count 2058
# HELP consul_memberlist_msg_alive consul_memberlist_msg_alive
# TYPE consul_memberlist_msg_alive counter
consul_memberlist_msg_alive{node="test-guido"} 1
consul_memberlist_msg_alive{node="test-guido.dc1"} 1
# HELP consul_raft_apply consul_raft_apply
# TYPE consul_raft_apply counter
consul_raft_apply 15
# HELP consul_raft_barrier consul_raft_barrier
# TYPE consul_raft_barrier counter
consul_raft_barrier 5
# HELP consul_raft_commitTime consul_raft_commitTime
# TYPE consul_raft_commitTime summary
consul_raft_commitTime{quantile="0.5"} 1.2852790355682373
consul_raft_commitTime{quantile="0.9"} 1.2852790355682373
consul_raft_commitTime{quantile="0.99"} 1.2852790355682373
consul_raft_commitTime_sum 37.3045608997345
consul_raft_commitTime_count 21
# HELP consul_raft_fsm_apply consul_raft_fsm_apply
# TYPE consul_raft_fsm_apply summary
consul_raft_fsm_apply{quantile="0.5"} 0.12771299481391907
consul_raft_fsm_apply{quantile="0.9"} 0.12771299481391907
consul_raft_fsm_apply{quantile="0.99"} 0.12771299481391907
consul_raft_fsm_apply_sum 9.35883492976427
consul_raft_fsm_apply_count 106
# HELP consul_raft_leader_dispatchLog consul_raft_leader_dispatchLog
# TYPE consul_raft_leader_dispatchLog summary
consul_raft_leader_dispatchLog{quantile="0.5"} 1.2522809505462646
consul_raft_leader_dispatchLog{quantile="0.9"} 1.2522809505462646
consul_raft_leader_dispatchLog{quantile="0.99"} 1.2522809505462646
consul_raft_leader_dispatchLog_sum 31.954365968704224
consul_raft_leader_dispatchLog_count 21
# HELP consul_raft_state_candidate consul_raft_state_candidate
# TYPE consul_raft_state_candidate counter
consul_raft_state_candidate 1
# HELP consul_raft_state_follower consul_raft_state_follower
# TYPE consul_raft_state_follower counter
consul_raft_state_follower{leader=""} 1
# HELP consul_raft_state_leader consul_raft_state_leader
# TYPE consul_raft_state_leader counter
consul_raft_state_leader 1
# HELP consul_raft_transition_heartbeat_timeout consul_raft_transition_heartbeat_timeout
# TYPE consul_raft_transition_heartbeat_timeout counter
consul_raft_transition_heartbeat_timeout{leader=""} 1
# HELP consul_rpc_accept_conn consul_rpc_accept_conn
# TYPE consul_rpc_accept_conn counter
consul_rpc_accept_conn{client="127.0.0.1:50315"} 1
# HELP consul_rpc_query consul_rpc_query
# TYPE consul_rpc_query counter
consul_rpc_query 8
# HELP consul_rpc_request consul_rpc_request
# TYPE consul_rpc_request counter
consul_rpc_request{client="127.0.0.1:50315"} 144
# HELP consul_runtime_alloc_bytes consul_runtime_alloc_bytes
# TYPE consul_runtime_alloc_bytes gauge
consul_runtime_alloc_bytes 7.598168e+06
# HELP consul_runtime_free_count consul_runtime_free_count
# TYPE consul_runtime_free_count gauge
consul_runtime_free_count 268156
# HELP consul_runtime_gc_pause_ns consul_runtime_gc_pause_ns
# TYPE consul_runtime_gc_pause_ns summary
consul_runtime_gc_pause_ns{quantile="0.5"} NaN
consul_runtime_gc_pause_ns{quantile="0.9"} NaN
consul_runtime_gc_pause_ns{quantile="0.99"} NaN
consul_runtime_gc_pause_ns_sum 973745
consul_runtime_gc_pause_ns_count 12
# HELP consul_runtime_heap_objects consul_runtime_heap_objects
# TYPE consul_runtime_heap_objects gauge
consul_runtime_heap_objects 47081
# HELP consul_runtime_malloc_count consul_runtime_malloc_count
# TYPE consul_runtime_malloc_count gauge
consul_runtime_malloc_count 315237
# HELP consul_runtime_num_goroutines consul_runtime_num_goroutines
# TYPE consul_runtime_num_goroutines gauge
consul_runtime_num_goroutines 70
# HELP consul_runtime_sys_bytes consul_runtime_sys_bytes
# TYPE consul_runtime_sys_bytes gauge
consul_runtime_sys_bytes 1.7086712e+07
# HELP consul_runtime_total_gc_pause_ns consul_runtime_total_gc_pause_ns
# TYPE consul_runtime_total_gc_pause_ns gauge
consul_runtime_total_gc_pause_ns 973745
# HELP consul_runtime_total_gc_runs consul_runtime_total_gc_runs
# TYPE consul_runtime_total_gc_runs gauge
consul_runtime_total_gc_runs 12
# HELP consul_serf_events consul_serf_events
# TYPE consul_serf_events counter
consul_serf_events 1
# HELP consul_serf_events_consul:new_leader consul_serf_events_consul:new_leader
# TYPE consul_serf_events_consul:new_leader counter
consul_serf_events_consul:new_leader 1
# HELP consul_serf_member_join consul_serf_member_join
# TYPE consul_serf_member_join counter
consul_serf_member_join{node="test-guido"} 1
consul_serf_member_join{node="test-guido.dc1"} 1
# HELP consul_serf_queue_Event consul_serf_queue_Event
# TYPE consul_serf_queue_Event summary
consul_serf_queue_Event{quantile="0.5"} NaN
consul_serf_queue_Event{quantile="0.9"} NaN
consul_serf_queue_Event{quantile="0.99"} NaN
consul_serf_queue_Event_sum 9
consul_serf_queue_Event_count 18
# HELP consul_serf_queue_Intent consul_serf_queue_Intent
# TYPE consul_serf_queue_Intent summary
consul_serf_queue_Intent{quantile="0.5"} NaN
consul_serf_queue_Intent{quantile="0.9"} NaN
consul_serf_queue_Intent{quantile="0.99"} NaN
consul_serf_queue_Intent_sum 0
consul_serf_queue_Intent_count 18
# HELP consul_serf_queue_Query consul_serf_queue_Query
# TYPE consul_serf_queue_Query summary
consul_serf_queue_Query{quantile="0.5"} NaN
consul_serf_queue_Query{quantile="0.9"} NaN
consul_serf_queue_Query{quantile="0.99"} NaN
consul_serf_queue_Query_sum 0
consul_serf_queue_Query_count 18
# HELP consul_serf_snapshot_appendLine consul_serf_snapshot_appendLine
# TYPE consul_serf_snapshot_appendLine summary
consul_serf_snapshot_appendLine{quantile="0.5"} NaN
consul_serf_snapshot_appendLine{quantile="0.9"} NaN
consul_serf_snapshot_appendLine{quantile="0.99"} NaN
consul_serf_snapshot_appendLine_sum 0.31947699934244156
consul_serf_snapshot_appendLine_count 3
# HELP consul_session_ttl_active consul_session_ttl_active
# TYPE consul_session_ttl_active gauge
consul_session_ttl_active 0
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.5804e-05
go_gc_duration_seconds{quantile="0.25"} 5.2843e-05
go_gc_duration_seconds{quantile="0.5"} 6.7541e-05
go_gc_duration_seconds{quantile="0.75"} 8.7403e-05
go_gc_duration_seconds{quantile="1"} 0.000203939
go_gc_duration_seconds_sum 0.000973745
go_gc_duration_seconds_count 12
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 75
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.10"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 7.648608e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 3.7738944e+07
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.455814e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 268189
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction -6.757294620007126e-08
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 643072
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 7.648608e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 2.768896e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 9.814016e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 47308
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 0
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 1.2582912e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.5241598308303673e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 38
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 315497
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 6944
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 127680
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 147456
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 1.1512768e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.192498e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.048576e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.048576e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 1.7086712e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 15
```

@pierresouchay
Contributor Author

@guidoiaquinti Thank you, I was a bit too lazy to test before #4016 is merged :)

At first sight:

  • I just did the counters; adding the gauges/quantiles might be a good idea as well
  • I probably need to harmonise node/client

About the naming:

  • client: applies to RPC/HTTP clients requesting an HTTP endpoint (might be another agent in the case of RPC, but not sure)
  • node: I put agent when I am sure it is a Consul agent; agent might be better => any opinion?
  • server: I know it is a server (e.g. elections)
  • leader: when talking about the current server leader => when there is no leader, the tag is empty... remove it in that case?
  • dst: a destination address when sending UDP packets

Other labels:

  • HTTP catalog discovery: added dc, service, found=true|false, results => not sure about results, i.e. the number of responses to the query; it might be better to create another metric (if we follow the advice from https://prometheus.io/docs/practices/naming/#labels). If there is a tag, use tag. But I am not a fan of this one since some methods might have several tags... I don't know if we should use tags=list,of,tags instead => any opinion?
  • DNS service, query, tag: once again, not a big fan of tag; while DNS currently supports only a single tag, having several later might be nice.

Any advice regarding the naming of labels?
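
For reference, the go-metrics calls involved look roughly like this (a sketch against the labelled armon/go-metrics API; the metric keys and label values are illustrative, not the exact ones this PR emits):

```go
package example

import (
	"time"

	metrics "github.com/armon/go-metrics"
)

func instrument(start time.Time) {
	labels := []metrics.Label{{Name: "service", Value: "web"}}

	// What this PR mostly does today: labelled counters.
	metrics.IncrCounterWithLabels([]string{"catalog", "service", "query"}, 1, labels)

	// The labelled gauge and timing equivalents mentioned above.
	metrics.SetGaugeWithLabels([]string{"session_ttl", "active"}, 42, labels)
	metrics.MeasureSinceWithLabels([]string{"http", "request"}, start, labels)
}
```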

@guidoiaquinti
Contributor

Minor comments:

@guidoiaquinti Thank you, I was a bit too lazy to test before #4016 is merged :)

TIL: you can download the diff/patch of a PR by adding a .diff or .patch to the URL (example: https://github.com/hashicorp/consul/pull/4042.diff)

About the naming:
node: I put agent when sure it is a Consul agent, agent might be better => any opinion?

+1 for agent

leader : When talking about the current server leader => when there is not leader, the tag is empty... remove it in that case?

I think in that case you will end up increasing the metric cardinality; I think it makes sense to keep the empty label.

Thanks! 🙌

@iksaif

iksaif commented Apr 20, 2018

Can we get an example of the output of /metrics with all the new changes?

@pearkes pearkes added this to the 1.1.0 milestone Apr 20, 2018
@pierresouchay
Contributor Author

@iksaif Here is an output after a few requests

It can be simulated with:

```sh
consul agent -dev -hcl 'telemetry{ prometheus_retention_time="24h"}' -hcl 'telemetry{disable_hostname=true}'
```

and then by generating a few requests, for instance:

```sh
curl localhost:8500/v1/catalog/datacenters
```

```prometheus
# HELP consul_autopilot_failure_tolerance consul_autopilot_failure_tolerance
# TYPE consul_autopilot_failure_tolerance gauge
consul_autopilot_failure_tolerance 0
# HELP consul_autopilot_healthy consul_autopilot_healthy
# TYPE consul_autopilot_healthy gauge
consul_autopilot_healthy 1
# HELP consul_catalog_register consul_catalog_register
# TYPE consul_catalog_register summary
consul_catalog_register{quantile="0.5"} NaN
consul_catalog_register{quantile="0.9"} NaN
consul_catalog_register{quantile="0.99"} NaN
consul_catalog_register_sum 545.6957179456949
consul_catalog_register_count 2128
# HELP consul_catalog_service_query consul_catalog_service_query
# TYPE consul_catalog_service_query counter
consul_catalog_service_query{found="true",service="consul"} 1
# HELP consul_client_api_catalog_datacenters consul_client_api_catalog_datacenters
# TYPE consul_client_api_catalog_datacenters counter
consul_client_api_catalog_datacenters{agent="C02Q3L7GG8WP",client="127.0.0.1"} 1
# HELP consul_client_api_catalog_register consul_client_api_catalog_register
# TYPE consul_client_api_catalog_register counter
consul_client_api_catalog_register{agent="C02Q3L7GG8WP",client="127.0.0.1"} 2127
# HELP consul_client_api_catalog_service_nodes consul_client_api_catalog_service_nodes
# TYPE consul_client_api_catalog_service_nodes counter
consul_client_api_catalog_service_nodes{agent="C02Q3L7GG8WP",client="127.0.0.1"} 1
# HELP consul_client_api_catalog_services consul_client_api_catalog_services
# TYPE consul_client_api_catalog_services counter
consul_client_api_catalog_services{agent="C02Q3L7GG8WP",client="127.0.0.1"} 1
# HELP consul_client_api_success_catalog_datacenters consul_client_api_success_catalog_datacenters
# TYPE consul_client_api_success_catalog_datacenters counter
consul_client_api_success_catalog_datacenters{agent="C02Q3L7GG8WP",client="127.0.0.1",code="200"} 1
# HELP consul_client_api_success_catalog_register consul_client_api_success_catalog_register
# TYPE consul_client_api_success_catalog_register counter
consul_client_api_success_catalog_register{agent="C02Q3L7GG8WP",client="127.0.0.1",code="200"} 2127
# HELP consul_client_api_success_catalog_service_nodes consul_client_api_success_catalog_service_nodes
# TYPE consul_client_api_success_catalog_service_nodes counter
consul_client_api_success_catalog_service_nodes{agent="C02Q3L7GG8WP",client="127.0.0.1",code="200",consistency="leader"} 1
# HELP consul_client_api_success_catalog_services consul_client_api_success_catalog_services
# TYPE consul_client_api_success_catalog_services counter
consul_client_api_success_catalog_services{agent="C02Q3L7GG8WP",client="127.0.0.1",code="200",consistency="leader"} 1
# HELP consul_fsm_autopilot consul_fsm_autopilot
# TYPE consul_fsm_autopilot summary
consul_fsm_autopilot{quantile="0.5"} NaN
consul_fsm_autopilot{quantile="0.9"} NaN
consul_fsm_autopilot{quantile="0.99"} NaN
consul_fsm_autopilot_sum 0.04310600087046623
consul_fsm_autopilot_count 1
# HELP consul_fsm_coordinate_batch_update consul_fsm_coordinate_batch_update
# TYPE consul_fsm_coordinate_batch_update summary
consul_fsm_coordinate_batch_update{quantile="0.5"} NaN
consul_fsm_coordinate_batch_update{quantile="0.9"} NaN
consul_fsm_coordinate_batch_update{quantile="0.99"} NaN
consul_fsm_coordinate_batch_update_sum 0.4209139943122864
consul_fsm_coordinate_batch_update_count 7
# HELP consul_fsm_register consul_fsm_register
# TYPE consul_fsm_register summary
consul_fsm_register{quantile="0.5"} NaN
consul_fsm_register{quantile="0.9"} NaN
consul_fsm_register{quantile="0.99"} NaN
consul_fsm_register_sum 268.29275116324425
consul_fsm_register_count 2129
# HELP consul_health_service_query consul_health_service_query
# TYPE consul_health_service_query counter
consul_health_service_query{found="true",service="consul"} 1
# HELP consul_http_GET_v1_agent_metrics consul_http_GET_v1_agent_metrics
# TYPE consul_http_GET_v1_agent_metrics summary
consul_http_GET_v1_agent_metrics{quantile="0.5"} NaN
consul_http_GET_v1_agent_metrics{quantile="0.9"} NaN
consul_http_GET_v1_agent_metrics{quantile="0.99"} NaN
consul_http_GET_v1_agent_metrics_sum 31.180049896240234
consul_http_GET_v1_agent_metrics_count 19
# HELP consul_http_GET_v1_catalog_datacenters consul_http_GET_v1_catalog_datacenters
# TYPE consul_http_GET_v1_catalog_datacenters summary
consul_http_GET_v1_catalog_datacenters{quantile="0.5"} NaN
consul_http_GET_v1_catalog_datacenters{quantile="0.9"} NaN
consul_http_GET_v1_catalog_datacenters{quantile="0.99"} NaN
consul_http_GET_v1_catalog_datacenters_sum 0.17020699381828308
consul_http_GET_v1_catalog_datacenters_count 1
# HELP consul_http_GET_v1_catalog_service__ consul_http_GET_v1_catalog_service__
# TYPE consul_http_GET_v1_catalog_service__ summary
consul_http_GET_v1_catalog_service__{quantile="0.5"} 0.26043400168418884
consul_http_GET_v1_catalog_service__{quantile="0.9"} 0.26043400168418884
consul_http_GET_v1_catalog_service__{quantile="0.99"} 0.26043400168418884
consul_http_GET_v1_catalog_service___sum 0.26043400168418884
consul_http_GET_v1_catalog_service___count 1
# HELP consul_http_GET_v1_catalog_services consul_http_GET_v1_catalog_services
# TYPE consul_http_GET_v1_catalog_services summary
consul_http_GET_v1_catalog_services{quantile="0.5"} NaN
consul_http_GET_v1_catalog_services{quantile="0.9"} NaN
consul_http_GET_v1_catalog_services{quantile="0.99"} NaN
consul_http_GET_v1_catalog_services_sum 5.103577136993408
consul_http_GET_v1_catalog_services_count 1
# HELP consul_http_GET_v1_health_service__ consul_http_GET_v1_health_service__
# TYPE consul_http_GET_v1_health_service__ summary
consul_http_GET_v1_health_service__{quantile="0.5"} NaN
consul_http_GET_v1_health_service__{quantile="0.9"} NaN
consul_http_GET_v1_health_service__{quantile="0.99"} NaN
consul_http_GET_v1_health_service___sum 0.3441370129585266
consul_http_GET_v1_health_service___count 1
# HELP consul_http_PUT_v1_catalog_register consul_http_PUT_v1_catalog_register
# TYPE consul_http_PUT_v1_catalog_register summary
consul_http_PUT_v1_catalog_register{quantile="0.5"} NaN
consul_http_PUT_v1_catalog_register{quantile="0.9"} NaN
consul_http_PUT_v1_catalog_register{quantile="0.99"} NaN
consul_http_PUT_v1_catalog_register_sum 1021.447518080473
consul_http_PUT_v1_catalog_register_count 2127
# HELP consul_leader_barrier consul_leader_barrier
# TYPE consul_leader_barrier summary
consul_leader_barrier{quantile="0.5"} NaN
consul_leader_barrier{quantile="0.9"} NaN
consul_leader_barrier{quantile="0.99"} NaN
consul_leader_barrier_sum 0.271465003490448
consul_leader_barrier_count 3
# HELP consul_leader_reconcile consul_leader_reconcile
# TYPE consul_leader_reconcile summary
consul_leader_reconcile{quantile="0.5"} NaN
consul_leader_reconcile{quantile="0.9"} NaN
consul_leader_reconcile{quantile="0.99"} NaN
consul_leader_reconcile_sum 1.2004680037498474
consul_leader_reconcile_count 3
# HELP consul_leader_reconcileMember consul_leader_reconcileMember
# TYPE consul_leader_reconcileMember summary
consul_leader_reconcileMember{quantile="0.5"} NaN
consul_leader_reconcileMember{quantile="0.9"} NaN
consul_leader_reconcileMember{quantile="0.99"} NaN
consul_leader_reconcileMember_sum 0.9792210161685944
consul_leader_reconcileMember_count 3
# HELP consul_memberlist_gossip consul_memberlist_gossip
# TYPE consul_memberlist_gossip summary
consul_memberlist_gossip{quantile="0.5"} 0.016102999448776245
consul_memberlist_gossip{quantile="0.9"} 0.030741000548005104
consul_memberlist_gossip{quantile="0.99"} 0.07080300152301788
consul_memberlist_gossip_sum 54.35099599859677
consul_memberlist_gossip_count 3142
# HELP consul_memberlist_msg_alive consul_memberlist_msg_alive
# TYPE consul_memberlist_msg_alive counter
consul_memberlist_msg_alive{agent="C02Q3L7GG8WP"} 1
consul_memberlist_msg_alive{agent="C02Q3L7GG8WP.dc1"} 1
# HELP consul_raft_apply consul_raft_apply
# TYPE consul_raft_apply counter
consul_raft_apply 2137
# HELP consul_raft_barrier consul_raft_barrier
# TYPE consul_raft_barrier counter
consul_raft_barrier 3
# HELP consul_raft_commitTime consul_raft_commitTime
# TYPE consul_raft_commitTime summary
consul_raft_commitTime{quantile="0.5"} NaN
consul_raft_commitTime{quantile="0.9"} NaN
consul_raft_commitTime{quantile="0.99"} NaN
consul_raft_commitTime_sum 52.44972601532936
consul_raft_commitTime_count 2141
# HELP consul_raft_fsm_apply consul_raft_fsm_apply
# TYPE consul_raft_fsm_apply summary
consul_raft_fsm_apply{quantile="0.5"} NaN
consul_raft_fsm_apply{quantile="0.9"} NaN
consul_raft_fsm_apply{quantile="0.99"} NaN
consul_raft_fsm_apply_sum 309.2562921270728
consul_raft_fsm_apply_count 2137
# HELP consul_raft_leader_dispatchLog consul_raft_leader_dispatchLog
# TYPE consul_raft_leader_dispatchLog summary
consul_raft_leader_dispatchLog{quantile="0.5"} NaN
consul_raft_leader_dispatchLog{quantile="0.9"} NaN
consul_raft_leader_dispatchLog{quantile="0.99"} NaN
consul_raft_leader_dispatchLog_sum 18.89283002819866
consul_raft_leader_dispatchLog_count 2141
# HELP consul_raft_state_candidate consul_raft_state_candidate
# TYPE consul_raft_state_candidate counter
consul_raft_state_candidate 1
# HELP consul_raft_state_follower consul_raft_state_follower
# TYPE consul_raft_state_follower counter
consul_raft_state_follower{leader=""} 1
# HELP consul_raft_state_leader consul_raft_state_leader
# TYPE consul_raft_state_leader counter
consul_raft_state_leader 1
# HELP consul_raft_transition_heartbeat_timeout consul_raft_transition_heartbeat_timeout
# TYPE consul_raft_transition_heartbeat_timeout counter
consul_raft_transition_heartbeat_timeout{leader=""} 1
# HELP consul_rpc_accept_conn consul_rpc_accept_conn
# TYPE consul_rpc_accept_conn counter
consul_rpc_accept_conn{client="127.0.0.1:54534"} 1
# HELP consul_rpc_query consul_rpc_query
# TYPE consul_rpc_query counter
consul_rpc_query 9
# HELP consul_rpc_request consul_rpc_request
# TYPE consul_rpc_request counter
consul_rpc_request{client="127.0.0.1:54534"} 79
# HELP consul_runtime_alloc_bytes consul_runtime_alloc_bytes
# TYPE consul_runtime_alloc_bytes gauge
consul_runtime_alloc_bytes 1.3860224e+07
# HELP consul_runtime_free_count consul_runtime_free_count
# TYPE consul_runtime_free_count gauge
consul_runtime_free_count 2.389306e+06
# HELP consul_runtime_gc_pause_ns consul_runtime_gc_pause_ns
# TYPE consul_runtime_gc_pause_ns summary
consul_runtime_gc_pause_ns{quantile="0.5"} NaN
consul_runtime_gc_pause_ns{quantile="0.9"} NaN
consul_runtime_gc_pause_ns{quantile="0.99"} NaN
consul_runtime_gc_pause_ns_sum 2.759855e+06
consul_runtime_gc_pause_ns_count 34
# HELP consul_runtime_heap_objects consul_runtime_heap_objects
# TYPE consul_runtime_heap_objects gauge
consul_runtime_heap_objects 134307
# HELP consul_runtime_malloc_count consul_runtime_malloc_count
# TYPE consul_runtime_malloc_count gauge
consul_runtime_malloc_count 2.523613e+06
# HELP consul_runtime_num_goroutines consul_runtime_num_goroutines
# TYPE consul_runtime_num_goroutines gauge
consul_runtime_num_goroutines 68
# HELP consul_runtime_sys_bytes consul_runtime_sys_bytes
# TYPE consul_runtime_sys_bytes gauge
consul_runtime_sys_bytes 3.2205048e+07
# HELP consul_runtime_total_gc_pause_ns consul_runtime_total_gc_pause_ns
# TYPE consul_runtime_total_gc_pause_ns gauge
consul_runtime_total_gc_pause_ns 2.759855e+06
# HELP consul_runtime_total_gc_runs consul_runtime_total_gc_runs
# TYPE consul_runtime_total_gc_runs gauge
consul_runtime_total_gc_runs 34
# HELP consul_serf_events consul_serf_events
# TYPE consul_serf_events counter
consul_serf_events 1
# HELP consul_serf_events_consul:new_leader consul_serf_events_consul:new_leader
# TYPE consul_serf_events_consul:new_leader counter
consul_serf_events_consul:new_leader 1
# HELP consul_serf_member_join consul_serf_member_join
# TYPE consul_serf_member_join counter
consul_serf_member_join{agent="C02Q3L7GG8WP"} 1
consul_serf_member_join{agent="C02Q3L7GG8WP.dc1"} 1
# HELP consul_serf_queue_Event consul_serf_queue_Event
# TYPE consul_serf_queue_Event summary
consul_serf_queue_Event{quantile="0.5"} 0
consul_serf_queue_Event{quantile="0.9"} 1
consul_serf_queue_Event{quantile="0.99"} 1
consul_serf_queue_Event_sum 5
consul_serf_queue_Event_count 10
# HELP consul_serf_queue_Intent consul_serf_queue_Intent
# TYPE consul_serf_queue_Intent summary
consul_serf_queue_Intent{quantile="0.5"} 0
consul_serf_queue_Intent{quantile="0.9"} 0
consul_serf_queue_Intent{quantile="0.99"} 0
consul_serf_queue_Intent_sum 0
consul_serf_queue_Intent_count 10
# HELP consul_serf_queue_Query consul_serf_queue_Query
# TYPE consul_serf_queue_Query summary
consul_serf_queue_Query{quantile="0.5"} 0
consul_serf_queue_Query{quantile="0.9"} 0
consul_serf_queue_Query{quantile="0.99"} 0
consul_serf_queue_Query_sum 0
consul_serf_queue_Query_count 10
# HELP consul_session_ttl_active consul_session_ttl_active
# TYPE consul_session_ttl_active gauge
consul_session_ttl_active 0
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 4.3379e-05
go_gc_duration_seconds{quantile="0.25"} 5.6315e-05
go_gc_duration_seconds{quantile="0.5"} 7.0554e-05
go_gc_duration_seconds{quantile="0.75"} 8.8917e-05
go_gc_duration_seconds{quantile="1"} 0.000194053
go_gc_duration_seconds_sum 0.002759855
go_gc_duration_seconds_count 34
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 73
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.9.4"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.3912456e+07
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 1.67393232e+08
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.508992e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 2.389344e+06
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction -2.311222829797619e-07
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 1.097728e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 1.3912456e+07
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 4.75136e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 2.1331968e+07
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 134592
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 0
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 2.6083328e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.5246019638401039e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 4561
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 2.523936e+06
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 13888
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 367080
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 393216
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 2.2866096e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.925752e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.179648e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.179648e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 3.2205048e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 15
```

@banks
Member

banks commented Apr 25, 2018

@pierresouchay I'm a big fan of Prometheus and label-based metrics generally so this is fantastic work.

It's a bit of a nightmare to review though 😁

I think overall the biggest concern we have here is that other go-metrics sinks also support labels: statsite, dogstatsd, and influx do at the very least IIRC, and at least some of those (I think!!) support them by pushing them into the metric name.

We can't really change metric names as that will break lots of people's monitoring and dashboards when they upgrade.

Did you look into that? I haven't yet in detail, so it's quite possible you already made sure that can't happen with any of the other metrics sinks.

If not, I guess we need to find a way to test that other metrics sinks are not affected by this. If they are, then we'll need a migration strategy, like making these changes optional somehow for a few versions until the next major release, to give people time to migrate.

You may have thought all this through, I just wanted to raise it in case you already have it planned out before I go dig into the details of what will change.

@banks
Member

banks left a review comment

I think even apart from the issue of metric names changing for some other sinks, this is going to cause a huge explosion in the number of metrics output for a large cluster.

It might be those are wanted but I think we need more of a rollout plan that makes some parts opt-in.

```go
	return true, nil
}

func (s *HTTPServer) CatalogDeregister(resp http.ResponseWriter, req *http.Request) (interface{}, error) {
	metrics.IncrCounterWithLabels([]string{"client", "api", "catalog_deregister"}, 1,
		[]metrics.Label{{Name: "node", Value: s.nodeName()}})
	labels := []metrics.Label{{Name: "agent", Value: s.nodeName()}, {Name: "client", Value: findSourceIP(req)}}
```

@banks
Member

Hmm, I'm not sure if that is something we'd want to do. In Prometheus at least, every distinct label combination is a different series. Having something with such a high and potentially unbounded cardinality as the client IP in an HTTP handler seems like a bad thing in such a general place.

It might be something people who know their clients are well behaved could opt in to, but on by default it seems like it could really blow up the number of series Consul is generating.

Even splitting all these metrics by agent host might be a big deal for some of our users who have thousands of nodes in their cluster and will suddenly have many thousand times more distinct series after this rolls out.

@pierresouchay
Contributor Author

@banks Yes, I completely see your point.
From our perspective, client addresses are relatively safe to use as labels, but that might not be the case for users who expose their Consul agent to the internet, for instance.

There are a few things that bother me as well. Ideally, I would like to be able to "blacklist" some of those metrics in Prometheus by default, since labels already serve this purpose (example: "consul", "health", "service", "not-found" VS "consul", "health", "service" with label found). Since it happens in many places, do you have a suggestion on how to do this?
Add a label "not-prometheus" or something similar and add a filter?

Suggestions about cardinality:

  1. for metrics with a free-form client address, add a client label, but only put "localhost" or "remote" in it by default. Since having a local agent is the recommended setup for most installations, this will still let people see whether a remote Consul client is abusing their registrations and investigate the culprits in case of abuse. Add a configuration option use_ip_for_client_label to replace "remote" with the real IP of the client (see the sketch after this list). The cardinality will thus be bounded to 2 by default instead of the whole IPv6 address range
  2. we have the same kind of issue with RPC as well (especially when initiating the connection), since AFAIK there is no way to know whether the connection is a legitimate Consul peer or any other kind of client (malicious or mistaken). I don't know if you have an opinion about that; I tend to think it is less likely an issue than for HTTP, but you might have a different opinion
  3. we also have a similar issue with services not found: currently, I have no way to know whether a service exists except len(result) == 0. It means an attacker (or a code mistake) might forge an unlimited number of service names and create lots of cardinality through this mechanism. Again, not sure it is really a problem
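
A minimal sketch of suggestion 1 (the helper and option names are hypothetical; only the bounded localhost/remote behaviour is what is proposed above):

```go
package agent

import (
	"net"

	metrics "github.com/armon/go-metrics"
)

// clientLabel keeps the "client" label bounded to two values by default
// ("localhost" or "remote"); operators would opt in to real IPs via a
// use_ip_for_client_label configuration option (hypothetical name).
func clientLabel(remoteAddr string, useIPForClientLabel bool) metrics.Label {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr // remoteAddr carried no port component
	}
	if ip := net.ParseIP(host); ip != nil && ip.IsLoopback() {
		return metrics.Label{Name: "client", Value: "localhost"}
	}
	if useIPForClientLabel {
		return metrics.Label{Name: "client", Value: host}
	}
	return metrics.Label{Name: "client", Value: "remote"}
}
```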

What I propose to go forward:

  • remove client everywhere for now; for HTTP, just consider localhost VS remote as described in 1)
  • add the option to replace "remote" with something else in another PR - for later
  • for RPC, keep it that way (meaning that on Consul servers, it will be possible to know which Consul agents are using the most RPC easily) => assume no-one will try to attack RPC heavily to increase the number of metrics

What do you think?

Kind regards

@banks
Member

banks commented Apr 26, 2018

remove client everywhere for now; for HTTP, just consider localhost VS remote as described in 1)

👍

add the option to replace "remote" with something else in another PR - for later

👍

for RPC, keep it that way (meaning that on Consul servers, it will be possible to know which Consul agents are using the most RPC easily) => assume no-one will try to attack RPC heavily to increase the number of metrics

I can totally see how this is useful, and for most people RPC cardinality is reasonable, but I'm still a bit wary of turning it on by default. Some Consul clusters (like yours ;) ) have thousands of agents in them. That's quite a bit of metrics growth after an upgrade, from ~3 metrics per RPC defined to suddenly 3 * X thousand. That's the worst case, but assuming every agent eventually hits almost all RPCs, which isn't unreasonable, you'd get there very quickly.

I'm well aware that Prometheus and others are super performant with millions of metrics etc., but it's been my experience that you still want to control that growth for capacity-planning purposes, which makes me nervous about anything that significantly multiplies cardinality by default on upgrade.

What do you think? Should we make that optional too?

pierresouchay added a commit to pierresouchay/go-metrics that referenced this pull request May 2, 2018
This will allow for a more elegant solution to
hashicorp/consul#4042
@banks
Member

banks commented May 2, 2018

but it would allow metrics to split between local Consul calls (I mean, local clients performing registration/lookups) from clients not running on the local agent

Ah I'm with you. What do you think of making it client="remote" to distinguish it more clearly?

@pierresouchay
Contributor Author

@banks I am fine with "remote", I'll perform the change.

What do you think of this kind of change, hashicorp/go-metrics#77, for metrics filtering? Would it be good enough or do you want something more clever?

I basically only added a blacklist/whitelist of labels (not per metric, only a global white/black list); would it be good enough for you?

My plan is to:

  • add a white/black list of labels
  • when a metric has to be included, remove all labels that are in the blacklist, or, if a whitelist is specified (i.e. not nil), keep a label only if it is present in the whitelist (see the sketch after this list)
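
To make the intent concrete, here is a minimal sketch of that filtering step (plain string slices for the two lists; an illustration of the behaviour described above, not the actual hashicorp/go-metrics#77 code):

```go
package example

import metrics "github.com/armon/go-metrics"

// filterLabels drops every label named in the blacklist; if a whitelist is
// specified (non-nil), only labels named in the whitelist are kept instead.
func filterLabels(labels []metrics.Label, whitelist, blacklist []string) []metrics.Label {
	contains := func(list []string, name string) bool {
		for _, candidate := range list {
			if candidate == name {
				return true
			}
		}
		return false
	}
	kept := make([]metrics.Label, 0, len(labels))
	for _, label := range labels {
		if whitelist != nil {
			if contains(whitelist, label.Name) {
				kept = append(kept, label)
			}
		} else if !contains(blacklist, label.Name) {
			kept = append(kept, label)
		}
	}
	return kept
}
```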

Is that configurable enough for you, or do you prefer to be able to add specific labels per metric (in that case, the patch would be far more intrusive)?

Kind regards

@banks
Member

banks commented May 2, 2018

@pierresouchay I didn't review hashicorp/go-metrics#77 in detail but that seems like a reasonable path to me.

I think the Consul config story needs a bit of thought. My thinking runs something like:

  • The default config should be the current metrics and labels before this PR (this can be achieved by setting a default whitelist in our base/default config if needed)
  • I think it would be better UX to make config options somewhat higher level than raw white/black lists of metrics and labels, although I've not thought about that a lot.
    • Something like:

     metrics {
       enable_labels = true // opts in to the basic labelling additions in this PR
       label_http_remote_client_ip = true // high-level rules opting in to the labels we discussed as adding potentially large cardinality
       label_rpc_remote_client_ip = true
     }

What do you think?

@pierresouchay
Contributor Author

@banks Ok, I propose the following configuration (assuming white/blacklist of labels is accepted in hashicorp/go-metrics#77):

metrics: {
  labels: {
    http_remote_client_ip: true|false, # defaults to false
    rpc_remote_client_ip: true|false,  # defaults to false
    whitelist: <labels_white_list>,    # defaults to nil => no whitelist, everything not blacklisted is added
    blacklist: <labels_to_blacklist>,  # defaults to nil => no blacklist
  }
}

If the labels block is not set in the conf, initialize it with these values:

labels: {
  http_remote_client_ip: false,
  rpc_remote_client_ip: false,
  white_list: ["no_label_exported"]
}

Since white_list: ["no_label_exported"] would not match anything, no label would be exported.

As soon as the user defines the labels block, the config would export all labels by default, with the ability to either whitelist, or add everything except the blacklist.

Would that be fine for you?

@pierresouchay
Contributor Author

@banks Can someone have a look at hashicorp/go-metrics#77 so I can continue this PR?

@banks
Member

banks commented May 10, 2018

@pierresouchay really sorry, this is some way down my TODO list this week – we have a release deadline and next week will be travelling/in meetings. We really appreciate your work and attention on it.

Would it be fine for you ?

We discussed with the team and I think we'd be a lot more comfortable if we didn't expose low-level config like whitelist and blacklist to users. It is simple and flexible, but it also binds us for future changes: any metric name or label that might be in someone's config file becomes a config compatibility issue if we ever change it, even more so than the existing output compatibility issue we are trying to solve.

For example, if someone whitelists the client_ip label but we later add client labels to more metrics, or to a place with much higher cardinality, then we have the exact same problem as now.

We talked a bit about how we could potentially have some preset whitelist or blacklist configurations internally; the default one would match exactly what we have now (i.e. whitelist only the existing labels).

For now it feels like we could control them with three flags:

  • enhanced_labels: true would enable the bulk of this PR minus the high-cardinality client labels
  • http_remote_client_ip and rpc_remote_client_ip as suggested

Then we document those such that enhanced_labels is an opt-in preview of enhanced labels and should be treated as "beta" for now, meaning we might change the actual labels in any release to improve them until we consider them stable. Once stable, it can remain opt-in for a release or so; then we can switch the default to enhanced_labels: true (with the option to explicitly set it to false) on the next major release and document the change in the upgrade guide.

Does that sound reasonable? I think it gives reasonable flexibility to enhance this stuff without just letting people's dashboards break.

Can someone have a look on hashicorp/go-metrics#77 so I can continue this PR ?

Yeah I will take another look hopefully in the next week and I'll see if I can find another volunteer in the org (maybe Armon!) to glance over it too.

@pierresouchay
Contributor Author

Hello,

Could someone review hashicorp/go-metrics#77 so I can continue this PR ?

@zippolyte

Hey @pierresouchay, @banks, @iksaif! I'm looking forward to these improvements.
I'm working on the Consul integration for Datadog, and we want to scrape the Prometheus endpoint to get Consul metrics. In their current state, several metrics are not aptly named and contain parts that would be better as labels, which makes scraping the endpoint and converting to Datadog metrics troublesome.
I don't know if it is something you can tackle in this PR or if it is out of its scope, but I thought I'd mention that here.

The ones I can think of are:

For these, the peer ID should probably appear as a label. It should probably even be the node name instead, to be consistent with the label in other metrics.

Maybe there are more cases, but these are the ones that I could see by running a consul cluster in dev mode.

@pierresouchay
Contributor Author

@zippolyte Thank you so much for your feedback

I am still waiting for a review on hashicorp/go-metrics#77 in order to rework this PR in depth.

My plan is to:

Do not hesitate if you have other suggestions

@ofek

ofek commented Sep 18, 2018

@pierresouchay Have you had an opportunity to give this another shot?

@pierresouchay
Contributor Author

@ofek Not yet; the rebase will be painful, as will the requested changes (for instance, there are some places where we don't have any kind of config), so it is a big refactoring.

I'll try to have a look soon, but if you have other ideas, be my guest :)

Regards

@nahratzah

👍 This PR will help us (Slack) enormously with measuring request success/failure rate.

❓ I would very much like it if the HTTP API telemetry could be split between blocking and non-blocking requests (for example labelled with blocking="true" and blocking="false" respectively).
Then we'll get use out of the timing telemetry as well (currently it measures the P99 of all requests, which effectively just measures how long blocking requests block).
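
A minimal sketch of how that split could look, assuming the usual Consul convention that a blocking query is signalled by the index query parameter (the label name follows the suggestion above; the helper name is hypothetical):

```go
package agent

import (
	"net/http"

	metrics "github.com/armon/go-metrics"
)

// blockingLabel tags a request as blocking when it carries an "index"
// query parameter, which is how Consul blocking queries are expressed.
// Cardinality stays bounded at two values.
func blockingLabel(req *http.Request) metrics.Label {
	blocking := "false"
	if req.URL.Query().Get("index") != "" {
		blocking = "true"
	}
	return metrics.Label{Name: "blocking", Value: blocking}
}
```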

@banks
Member

banks commented Mar 21, 2019

I would very much like if the HTTP API telemetry can be split for blocking and non-blocking

Great idea @nahratzah!

We do this for RPC metrics so makes total sense.

@pierresouchay
Contributor Author

@banks I've restarted work on this, but I need settings (i.e. to decide whether we create metrics per RPC client or not) in the catalog and other places.
Do you think I should create a new object to pass those settings around, or is using a singleton enough?

@banks
Member

banks commented Apr 1, 2019

@pierresouchay great!

I've not looked in detail at where this is needed, but I'd suggest we try to pass the config through explicitly to avoid global state etc.

I was imagining that we'd do this somewhat centrally - e.g. add the new labels everywhere unconditionally but then use the new feature you added to go-metrics to allow filtering out those labels unless they are enabled. Wasn't that the point of that go-metrics change? That way we only need to plumb the config into the place that sets up go-metrics and not every place we instrument things in the code?

But maybe I missed something.
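
For illustration, the central plumbing could look something like this, assuming the label-filtering config fields that hashicorp/go-metrics#77 proposes to add (AllowedLabels/BlockedLabels are my reading of that proposal, not a released API):

```go
package main

import (
	metrics "github.com/armon/go-metrics"
	prometheussink "github.com/armon/go-metrics/prometheus"
)

func main() {
	// Instrument call sites unconditionally; filter labels centrally here,
	// in the one place that sets up go-metrics.
	conf := metrics.DefaultConfig("consul")
	conf.BlockedLabels = []string{"client"} // e.g. strip the high-cardinality client label

	sink, err := prometheussink.NewPrometheusSink()
	if err != nil {
		panic(err)
	}
	if _, err := metrics.NewGlobal(conf, sink); err != nil {
		panic(err)
	}
}
```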

@hanshasselberg
Member

@pierresouchay are you still trying to get this merged and if so, are you blocked by anything from our side?

@hanshasselberg hanshasselberg self-assigned this Dec 12, 2019
@pierresouchay
Contributor Author

@i0rek Too many changes have landed since the start of this patch; I probably need to restart almost from scratch :(
You can probably close it.

@hanshasselberg
Member

yeah, let me know in case you want to revive it.

@ghost

ghost commented Jan 25, 2020

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.

If you are still experiencing problems, or still have questions, feel free to open a new one 👍.

@ghost ghost locked and limited conversation to collaborators Jan 25, 2020