Improve metrics using labels #4042
Conversation
This will create good-quality metrics for systems that support labels (example: Prometheus)
Example output (after patching master with this PR + #4016):
@guidoiaquinti Thank you, I was a bit too lazy to test before #4016 is merged :) At first sight:
About the naming:
Other labels:
Any advice regarding the naming of the labels?
Minor comments:
TIL: you can download the diff/patch of a PR by adding a `.diff` or `.patch` suffix to its URL
+1 for agent
I think in that case you will end up increasing the metric cardinality; I think it makes sense to keep the empty label. Thanks! 🙌
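As a small illustrative sketch (not code from this PR): in Prometheus every distinct label set is a separate series, so always emitting the label, even with an empty value, keeps the label set consistent across code paths instead of splitting the same metric name into two variants. The helper below is hypothetical:

```go
package example

import (
	"strconv"

	metrics "github.com/armon/go-metrics"
)

// recordServiceQuery is a hypothetical helper, not code from this PR.
// It always attaches the "service" label, even when the value is empty,
// so every emission of the metric carries the same label set.
func recordServiceQuery(serviceName string, found bool) {
	metrics.IncrCounterWithLabels([]string{"health", "service", "query"}, 1,
		[]metrics.Label{
			{Name: "found", Value: strconv.FormatBool(found)},
			{Name: "service", Value: serviceName}, // may be "", but the label stays present
		})
}
```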
Can we get an example of the output of /metrics with all the new changes?
@iksaif Here is an output after a few requests. It can be simulated with `consul agent -dev -hcl 'telemetry{ prometheus_retention_time="24h"}' -hcl 'telemetry{disable_hostname=true}'`, then generating a few requests, for instance: `curl localhost:8500/v1/catalog/datacenters`
```prometheus
# HELP consul_autopilot_failure_tolerance consul_autopilot_failure_tolerance
# TYPE consul_autopilot_failure_tolerance gauge
consul_autopilot_failure_tolerance 0
# HELP consul_autopilot_healthy consul_autopilot_healthy
# TYPE consul_autopilot_healthy gauge
consul_autopilot_healthy 1
# HELP consul_catalog_register consul_catalog_register
# TYPE consul_catalog_register summary
consul_catalog_register{quantile="0.5"} NaN
consul_catalog_register{quantile="0.9"} NaN
consul_catalog_register{quantile="0.99"} NaN
consul_catalog_register_sum 545.6957179456949
consul_catalog_register_count 2128
# HELP consul_catalog_service_query consul_catalog_service_query
# TYPE consul_catalog_service_query counter
consul_catalog_service_query{found="true",service="consul"} 1
# HELP consul_client_api_catalog_datacenters consul_client_api_catalog_datacenters
# TYPE consul_client_api_catalog_datacenters counter
consul_client_api_catalog_datacenters{agent="C02Q3L7GG8WP",client="127.0.0.1"} 1
# HELP consul_client_api_catalog_register consul_client_api_catalog_register
# TYPE consul_client_api_catalog_register counter
consul_client_api_catalog_register{agent="C02Q3L7GG8WP",client="127.0.0.1"} 2127
# HELP consul_client_api_catalog_service_nodes consul_client_api_catalog_service_nodes
# TYPE consul_client_api_catalog_service_nodes counter
consul_client_api_catalog_service_nodes{agent="C02Q3L7GG8WP",client="127.0.0.1"} 1
# HELP consul_client_api_catalog_services consul_client_api_catalog_services
# TYPE consul_client_api_catalog_services counter
consul_client_api_catalog_services{agent="C02Q3L7GG8WP",client="127.0.0.1"} 1
# HELP consul_client_api_success_catalog_datacenters consul_client_api_success_catalog_datacenters
# TYPE consul_client_api_success_catalog_datacenters counter
consul_client_api_success_catalog_datacenters{agent="C02Q3L7GG8WP",client="127.0.0.1",code="200"} 1
# HELP consul_client_api_success_catalog_register consul_client_api_success_catalog_register
# TYPE consul_client_api_success_catalog_register counter
consul_client_api_success_catalog_register{agent="C02Q3L7GG8WP",client="127.0.0.1",code="200"} 2127
# HELP consul_client_api_success_catalog_service_nodes consul_client_api_success_catalog_service_nodes
# TYPE consul_client_api_success_catalog_service_nodes counter
consul_client_api_success_catalog_service_nodes{agent="C02Q3L7GG8WP",client="127.0.0.1",code="200",consistency="leader"} 1
# HELP consul_client_api_success_catalog_services consul_client_api_success_catalog_services
# TYPE consul_client_api_success_catalog_services counter
consul_client_api_success_catalog_services{agent="C02Q3L7GG8WP",client="127.0.0.1",code="200",consistency="leader"} 1
# HELP consul_fsm_autopilot consul_fsm_autopilot
# TYPE consul_fsm_autopilot summary
consul_fsm_autopilot{quantile="0.5"} NaN
consul_fsm_autopilot{quantile="0.9"} NaN
consul_fsm_autopilot{quantile="0.99"} NaN
consul_fsm_autopilot_sum 0.04310600087046623
consul_fsm_autopilot_count 1
# HELP consul_fsm_coordinate_batch_update consul_fsm_coordinate_batch_update
# TYPE consul_fsm_coordinate_batch_update summary
consul_fsm_coordinate_batch_update{quantile="0.5"} NaN
consul_fsm_coordinate_batch_update{quantile="0.9"} NaN
consul_fsm_coordinate_batch_update{quantile="0.99"} NaN
consul_fsm_coordinate_batch_update_sum 0.4209139943122864
consul_fsm_coordinate_batch_update_count 7
# HELP consul_fsm_register consul_fsm_register
# TYPE consul_fsm_register summary
consul_fsm_register{quantile="0.5"} NaN
consul_fsm_register{quantile="0.9"} NaN
consul_fsm_register{quantile="0.99"} NaN
consul_fsm_register_sum 268.29275116324425
consul_fsm_register_count 2129
# HELP consul_health_service_query consul_health_service_query
# TYPE consul_health_service_query counter
consul_health_service_query{found="true",service="consul"} 1
# HELP consul_http_GET_v1_agent_metrics consul_http_GET_v1_agent_metrics
# TYPE consul_http_GET_v1_agent_metrics summary
consul_http_GET_v1_agent_metrics{quantile="0.5"} NaN
consul_http_GET_v1_agent_metrics{quantile="0.9"} NaN
consul_http_GET_v1_agent_metrics{quantile="0.99"} NaN
consul_http_GET_v1_agent_metrics_sum 31.180049896240234
consul_http_GET_v1_agent_metrics_count 19
# HELP consul_http_GET_v1_catalog_datacenters consul_http_GET_v1_catalog_datacenters
# TYPE consul_http_GET_v1_catalog_datacenters summary
consul_http_GET_v1_catalog_datacenters{quantile="0.5"} NaN
consul_http_GET_v1_catalog_datacenters{quantile="0.9"} NaN
consul_http_GET_v1_catalog_datacenters{quantile="0.99"} NaN
consul_http_GET_v1_catalog_datacenters_sum 0.17020699381828308
consul_http_GET_v1_catalog_datacenters_count 1
# HELP consul_http_GET_v1_catalog_service__ consul_http_GET_v1_catalog_service__
# TYPE consul_http_GET_v1_catalog_service__ summary
consul_http_GET_v1_catalog_service__{quantile="0.5"} 0.26043400168418884
consul_http_GET_v1_catalog_service__{quantile="0.9"} 0.26043400168418884
consul_http_GET_v1_catalog_service__{quantile="0.99"} 0.26043400168418884
consul_http_GET_v1_catalog_service___sum 0.26043400168418884
consul_http_GET_v1_catalog_service___count 1
# HELP consul_http_GET_v1_catalog_services consul_http_GET_v1_catalog_services
# TYPE consul_http_GET_v1_catalog_services summary
consul_http_GET_v1_catalog_services{quantile="0.5"} NaN
consul_http_GET_v1_catalog_services{quantile="0.9"} NaN
consul_http_GET_v1_catalog_services{quantile="0.99"} NaN
consul_http_GET_v1_catalog_services_sum 5.103577136993408
consul_http_GET_v1_catalog_services_count 1
# HELP consul_http_GET_v1_health_service__ consul_http_GET_v1_health_service__
# TYPE consul_http_GET_v1_health_service__ summary
consul_http_GET_v1_health_service__{quantile="0.5"} NaN
consul_http_GET_v1_health_service__{quantile="0.9"} NaN
consul_http_GET_v1_health_service__{quantile="0.99"} NaN
consul_http_GET_v1_health_service___sum 0.3441370129585266
consul_http_GET_v1_health_service___count 1
# HELP consul_http_PUT_v1_catalog_register consul_http_PUT_v1_catalog_register
# TYPE consul_http_PUT_v1_catalog_register summary
consul_http_PUT_v1_catalog_register{quantile="0.5"} NaN
consul_http_PUT_v1_catalog_register{quantile="0.9"} NaN
consul_http_PUT_v1_catalog_register{quantile="0.99"} NaN
consul_http_PUT_v1_catalog_register_sum 1021.447518080473
consul_http_PUT_v1_catalog_register_count 2127
# HELP consul_leader_barrier consul_leader_barrier
# TYPE consul_leader_barrier summary
consul_leader_barrier{quantile="0.5"} NaN
consul_leader_barrier{quantile="0.9"} NaN
consul_leader_barrier{quantile="0.99"} NaN
consul_leader_barrier_sum 0.271465003490448
consul_leader_barrier_count 3
# HELP consul_leader_reconcile consul_leader_reconcile
# TYPE consul_leader_reconcile summary
consul_leader_reconcile{quantile="0.5"} NaN
consul_leader_reconcile{quantile="0.9"} NaN
consul_leader_reconcile{quantile="0.99"} NaN
consul_leader_reconcile_sum 1.2004680037498474
consul_leader_reconcile_count 3
# HELP consul_leader_reconcileMember consul_leader_reconcileMember
# TYPE consul_leader_reconcileMember summary
consul_leader_reconcileMember{quantile="0.5"} NaN
consul_leader_reconcileMember{quantile="0.9"} NaN
consul_leader_reconcileMember{quantile="0.99"} NaN
consul_leader_reconcileMember_sum 0.9792210161685944
consul_leader_reconcileMember_count 3
# HELP consul_memberlist_gossip consul_memberlist_gossip
# TYPE consul_memberlist_gossip summary
consul_memberlist_gossip{quantile="0.5"} 0.016102999448776245
consul_memberlist_gossip{quantile="0.9"} 0.030741000548005104
consul_memberlist_gossip{quantile="0.99"} 0.07080300152301788
consul_memberlist_gossip_sum 54.35099599859677
consul_memberlist_gossip_count 3142
# HELP consul_memberlist_msg_alive consul_memberlist_msg_alive
# TYPE consul_memberlist_msg_alive counter
consul_memberlist_msg_alive{agent="C02Q3L7GG8WP"} 1
consul_memberlist_msg_alive{agent="C02Q3L7GG8WP.dc1"} 1
# HELP consul_raft_apply consul_raft_apply
# TYPE consul_raft_apply counter
consul_raft_apply 2137
# HELP consul_raft_barrier consul_raft_barrier
# TYPE consul_raft_barrier counter
consul_raft_barrier 3
# HELP consul_raft_commitTime consul_raft_commitTime
# TYPE consul_raft_commitTime summary
consul_raft_commitTime{quantile="0.5"} NaN
consul_raft_commitTime{quantile="0.9"} NaN
consul_raft_commitTime{quantile="0.99"} NaN
consul_raft_commitTime_sum 52.44972601532936
consul_raft_commitTime_count 2141
# HELP consul_raft_fsm_apply consul_raft_fsm_apply
# TYPE consul_raft_fsm_apply summary
consul_raft_fsm_apply{quantile="0.5"} NaN
consul_raft_fsm_apply{quantile="0.9"} NaN
consul_raft_fsm_apply{quantile="0.99"} NaN
consul_raft_fsm_apply_sum 309.2562921270728
consul_raft_fsm_apply_count 2137
# HELP consul_raft_leader_dispatchLog consul_raft_leader_dispatchLog
# TYPE consul_raft_leader_dispatchLog summary
consul_raft_leader_dispatchLog{quantile="0.5"} NaN
consul_raft_leader_dispatchLog{quantile="0.9"} NaN
consul_raft_leader_dispatchLog{quantile="0.99"} NaN
consul_raft_leader_dispatchLog_sum 18.89283002819866
consul_raft_leader_dispatchLog_count 2141
# HELP consul_raft_state_candidate consul_raft_state_candidate
# TYPE consul_raft_state_candidate counter
consul_raft_state_candidate 1
# HELP consul_raft_state_follower consul_raft_state_follower
# TYPE consul_raft_state_follower counter
consul_raft_state_follower{leader=""} 1
# HELP consul_raft_state_leader consul_raft_state_leader
# TYPE consul_raft_state_leader counter
consul_raft_state_leader 1
# HELP consul_raft_transition_heartbeat_timeout consul_raft_transition_heartbeat_timeout
# TYPE consul_raft_transition_heartbeat_timeout counter
consul_raft_transition_heartbeat_timeout{leader=""} 1
# HELP consul_rpc_accept_conn consul_rpc_accept_conn
# TYPE consul_rpc_accept_conn counter
consul_rpc_accept_conn{client="127.0.0.1:54534"} 1
# HELP consul_rpc_query consul_rpc_query
# TYPE consul_rpc_query counter
consul_rpc_query 9
# HELP consul_rpc_request consul_rpc_request
# TYPE consul_rpc_request counter
consul_rpc_request{client="127.0.0.1:54534"} 79
# HELP consul_runtime_alloc_bytes consul_runtime_alloc_bytes
# TYPE consul_runtime_alloc_bytes gauge
consul_runtime_alloc_bytes 1.3860224e+07
# HELP consul_runtime_free_count consul_runtime_free_count
# TYPE consul_runtime_free_count gauge
consul_runtime_free_count 2.389306e+06
# HELP consul_runtime_gc_pause_ns consul_runtime_gc_pause_ns
# TYPE consul_runtime_gc_pause_ns summary
consul_runtime_gc_pause_ns{quantile="0.5"} NaN
consul_runtime_gc_pause_ns{quantile="0.9"} NaN
consul_runtime_gc_pause_ns{quantile="0.99"} NaN
consul_runtime_gc_pause_ns_sum 2.759855e+06
consul_runtime_gc_pause_ns_count 34
# HELP consul_runtime_heap_objects consul_runtime_heap_objects
# TYPE consul_runtime_heap_objects gauge
consul_runtime_heap_objects 134307
# HELP consul_runtime_malloc_count consul_runtime_malloc_count
# TYPE consul_runtime_malloc_count gauge
consul_runtime_malloc_count 2.523613e+06
# HELP consul_runtime_num_goroutines consul_runtime_num_goroutines
# TYPE consul_runtime_num_goroutines gauge
consul_runtime_num_goroutines 68
# HELP consul_runtime_sys_bytes consul_runtime_sys_bytes
# TYPE consul_runtime_sys_bytes gauge
consul_runtime_sys_bytes 3.2205048e+07
# HELP consul_runtime_total_gc_pause_ns consul_runtime_total_gc_pause_ns
# TYPE consul_runtime_total_gc_pause_ns gauge
consul_runtime_total_gc_pause_ns 2.759855e+06
# HELP consul_runtime_total_gc_runs consul_runtime_total_gc_runs
# TYPE consul_runtime_total_gc_runs gauge
consul_runtime_total_gc_runs 34
# HELP consul_serf_events consul_serf_events
# TYPE consul_serf_events counter
consul_serf_events 1
# HELP consul_serf_events_consul:new_leader consul_serf_events_consul:new_leader
# TYPE consul_serf_events_consul:new_leader counter
consul_serf_events_consul:new_leader 1
# HELP consul_serf_member_join consul_serf_member_join
# TYPE consul_serf_member_join counter
consul_serf_member_join{agent="C02Q3L7GG8WP"} 1
consul_serf_member_join{agent="C02Q3L7GG8WP.dc1"} 1
# HELP consul_serf_queue_Event consul_serf_queue_Event
# TYPE consul_serf_queue_Event summary
consul_serf_queue_Event{quantile="0.5"} 0
consul_serf_queue_Event{quantile="0.9"} 1
consul_serf_queue_Event{quantile="0.99"} 1
consul_serf_queue_Event_sum 5
consul_serf_queue_Event_count 10
# HELP consul_serf_queue_Intent consul_serf_queue_Intent
# TYPE consul_serf_queue_Intent summary
consul_serf_queue_Intent{quantile="0.5"} 0
consul_serf_queue_Intent{quantile="0.9"} 0
consul_serf_queue_Intent{quantile="0.99"} 0
consul_serf_queue_Intent_sum 0
consul_serf_queue_Intent_count 10
# HELP consul_serf_queue_Query consul_serf_queue_Query
# TYPE consul_serf_queue_Query summary
consul_serf_queue_Query{quantile="0.5"} 0
consul_serf_queue_Query{quantile="0.9"} 0
consul_serf_queue_Query{quantile="0.99"} 0
consul_serf_queue_Query_sum 0
consul_serf_queue_Query_count 10
# HELP consul_session_ttl_active consul_session_ttl_active
# TYPE consul_session_ttl_active gauge
consul_session_ttl_active 0
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 4.3379e-05
go_gc_duration_seconds{quantile="0.25"} 5.6315e-05
go_gc_duration_seconds{quantile="0.5"} 7.0554e-05
go_gc_duration_seconds{quantile="0.75"} 8.8917e-05
go_gc_duration_seconds{quantile="1"} 0.000194053
go_gc_duration_seconds_sum 0.002759855
go_gc_duration_seconds_count 34
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 73
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.9.4"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.3912456e+07
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 1.67393232e+08
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.508992e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 2.389344e+06
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction -2.311222829797619e-07
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 1.097728e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 1.3912456e+07
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 4.75136e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 2.1331968e+07
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 134592
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 0
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 2.6083328e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.5246019638401039e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 4561
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 2.523936e+06
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 13888
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 367080
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 393216
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 2.2866096e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.925752e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.179648e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.179648e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 3.2205048e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 15
```
@pierresouchay I'm a big fan of Prometheus and label-based metrics generally, so this is fantastic work. It's a bit of a nightmare to review though 😁

I think the biggest concern we have here is that other go-metrics sinks also support labels: statsite, dogstatsd and influx do at the very least IIRC, and at least some of those (I think!!) support them by pushing them into the metric name. We can't really change metric names as that will break lots of people's monitoring and dashboards when they upgrade. Did you look into that? I didn't yet in detail, so it's quite possible you already made sure that can't happen with any of the other metrics sinks.

If not, I guess we need to find a way to test that other metrics sinks are not affected by this. If they are, then we'll need a migration strategy, like making these changes optional somehow for a few versions until a next major release to give people time to migrate. You may have thought all this through; I just wanted to raise it in case you already have it planned out before I go dig into the details of what will change.
I think even apart from the issue of metric names changing for some other sinks, this is going to cause a huge explosion in the number of metrics output for a large cluster.
It might be that those are wanted, but I think we need more of a rollout plan that makes some parts opt-in.
```go
	return true, nil
}

func (s *HTTPServer) CatalogDeregister(resp http.ResponseWriter, req *http.Request) (interface{}, error) {
	metrics.IncrCounterWithLabels([]string{"client", "api", "catalog_deregister"}, 1,
		[]metrics.Label{{Name: "node", Value: s.nodeName()}})
	labels := []metrics.Label{{Name: "agent", Value: s.nodeName()}, {Name: "client", Value: findSourceIP(req)}}
```
Hmm, I'm not sure if that is something we'd want to do. In Prometheus at least, every distinct label combination is a different series. Having something with such a high and potentially unbounded cardinality as the client IP in an HTTP handler seems like a bad thing in such a general place.
It might be something people who know their clients are well behaved could opt in to but just on by default seems like it could really blow up the number of series Consul is generating.
Even splitting all these metrics by agent host might be a big deal for some of our users who have thousands of nodes in their cluster and will suddenly have many thousand times more distinct series after this rolls out.
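A minimal sketch of how the client IP label could be made opt-in, building on the excerpt above; the `EnableClientIPLabel` flag (and the config path used to reach it) is hypothetical and only illustrates the gating:

```go
// Hypothetical: only attach the remote client IP when the operator opted in,
// so the default label set stays low-cardinality (one series per agent).
labels := []metrics.Label{{Name: "agent", Value: s.nodeName()}}
if s.agent.config.Telemetry.EnableClientIPLabel { // hypothetical config flag
	labels = append(labels, metrics.Label{Name: "client", Value: findSourceIP(req)})
}
metrics.IncrCounterWithLabels([]string{"client", "api", "catalog_deregister"}, 1, labels)
```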
@banks Yes, I completely see your point. There are a few things that bother me as well. Ideally, I would like to be able to "blacklist" some of those metrics in Prometheus by default, since labels already serve this purpose (example: "consul", "health", "service", "not-found" vs. "consul", "health", "service" with the label found). Since it happens in many places, do you have a suggestion on how to do this? Suggestions about cardinality:
What I propose to go forward:
What do you think? Kind regards
👍
👍
I can totally see how this is useful, and for most people RPC cardinality is reasonable, but I'm still a bit wary of turning it on by default. Some Consul clusters (like yours ;) ) have thousands of agents in them. That's quite a bit of metrics growth after an upgrade, from ~3 metrics per RPC defined to suddenly 3 * X thousand. That's the worst case, but assuming every agent eventually hits almost all RPCs, which isn't unreasonable, you'd be there very quickly. I'm well aware that Prometheus and others are super performant with millions of metrics etc., but it's been my experience that you still want to control that growth for capacity-planning purposes, which makes me nervous about anything that significantly multiplies cardinality by default on upgrade. What do you think? Should we make that optional too?
This will allow for a more elegant solution to hashicorp/consul#4042
Ah, I'm with you. What do you think of making it `remote`?
@banks I am fine with "remote"; I'll perform the change. What do you think of this kind of change, hashicorp/go-metrics#77, for metrics filtering? Would it be good enough or do you want something more clever? I basically only added a blacklist/whitelist of labels (not per metric, only a global white/black list); would that be good enough for you? My plan is to:
Is that configurable enough for you, or do you prefer to be able to add specific labels per metric (in that case, the patch would be far more intrusive)? Kind regards
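For context, a minimal sketch of what the global label filtering proposed in hashicorp/go-metrics#77 could look like from the agent's side. The `AllowedLabels`/`BlockedLabels` field names are taken from that proposal and are an assumption here, as are the concrete values:

```go
package main

import (
	metrics "github.com/armon/go-metrics"
	"github.com/armon/go-metrics/prometheus"
)

func main() {
	sink, err := prometheus.NewPrometheusSink()
	if err != nil {
		panic(err)
	}

	cfg := metrics.DefaultConfig("consul")
	cfg.EnableHostname = false
	// Proposed global filter: drop the high-cardinality client IP label
	// everywhere, keep all other labels (field name assumed from go-metrics#77).
	cfg.BlockedLabels = []string{"client"}

	if _, err := metrics.NewGlobal(cfg, sink); err != nil {
		panic(err)
	}

	// The "client" label is stripped by the global filter before reaching the sink.
	metrics.IncrCounterWithLabels([]string{"client", "api", "catalog_register"}, 1,
		[]metrics.Label{
			{Name: "agent", Value: "node-1"},
			{Name: "client", Value: "127.0.0.1"},
		})
}
```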
@pierresouchay I didn't review hashicorp/go-metrics#77 in detail but that seems like a reasonable path to me. I think the Consul config story needs a bit of thought. My thinking runs something like:
What do you think?
@banks OK, I propose the following configuration (assuming the white/blacklist of labels is accepted in hashicorp/go-metrics#77):

```
metrics: {
  labels: {
    http_remote_client_ip: true|false,  # defaults to false
    rpc_remote_client_ip: true|false,   # defaults to false
    whitelist: <labels_white_list>,     # defaults to nil => no whitelist, everything not blacklisted is added
    blacklist: <labels_to_blacklist>,   # defaults to nil => no blacklist
  }
}
```

If the `labels` block is not set in the configuration, initialize it with the following values:

```
labels: {
  http_remote_client_ip: false,
  rpc_remote_client_ip: false,
  white_list: ["no_label_exported"]
}
```

Since `white_list: ["no_label_exported"]` would not match anything, no label would be exported. As soon as the user defines the `labels` block, its whitelist/blacklist takes effect. Would it be fine for you?
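To make that default behaviour concrete, here is a sketch of how the agent might map the `labels` block onto the go-metrics configuration. The `TelemetryLabels` struct is hypothetical, and the `AllowedLabels`/`BlockedLabels` fields are assumed from hashicorp/go-metrics#77:

```go
package agent

import (
	metrics "github.com/armon/go-metrics"
)

// TelemetryLabels mirrors the proposed `labels` block (hypothetical struct).
type TelemetryLabels struct {
	HTTPRemoteClientIP bool
	RPCRemoteClientIP  bool
	Whitelist          []string
	Blacklist          []string
}

// applyLabelConfig copies the proposed defaults into a go-metrics Config.
// When no labels block is configured, the whitelist contains a value that
// matches no real label, so no label is exported (the proposed default).
func applyLabelConfig(cfg *metrics.Config, l *TelemetryLabels) {
	if l == nil {
		cfg.AllowedLabels = []string{"no_label_exported"}
		return
	}
	cfg.AllowedLabels = l.Whitelist
	cfg.BlockedLabels = l.Blacklist
}
```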
@banks Can someone have a look at hashicorp/go-metrics#77 so I can continue this PR?
@pierresouchay really sorry, this is some way down my TODO list this week – we have a release deadline and next week will be travelling/in meetings. We really appreciate your work and attention on it.
We discussed with the team and I think we'd be a lot more comfortable if we didn't expose low-level config like that directly, for example letting someone whitelist arbitrary labels. We talked a bit about how we could potentially have, internally, some preset whitelist or blacklist configurations - the default one would match exactly what we have now (i.e. whitelist only the existing labels). For now it feels like we could control them with three flags:
Then document those. Does that sound reasonable? I think it gives reasonable flexibility to enhance this stuff without just letting people's dashboards break.
Yeah I will take another look hopefully in the next week and I'll see if I can find another volunteer in the org (maybe Armon!) to glance over it too.
Hello, could someone review hashicorp/go-metrics#77 so I can continue this PR?
Hey @pierresouchay, @banks, @iksaif! I'm looking forward to these improvements. The ones I can think of are:
For these, the peer ID should probably appear as a label, or even the node name instead, to be consistent with the label used in other metrics.
Maybe there are more cases, but these are the ones that I could see by running a Consul cluster in dev mode.
@zippolyte Thank you so much for your feedback. I am still waiting for a review on hashicorp/go-metrics#77 in order to rework this PR in depth. My plan is to:
Do not hesitate if you have other suggestions.
@pierresouchay Have you had an opportunity to give this another shot?
@ofek Not yet; the rebase will be painful, as well as the requested changes (for instance, there are some places where we don't have any kind of config), so it is a big refactoring. I'll try to have a look soon, but if you have other ideas, be my guest :) Regards
👍 This PR will help us (Slack) enormously with measuring request success/failure rates. ❓ I would very much like it if the HTTP API telemetry could be split for blocking and non-blocking requests (for example, labeled with
We do this for RPC metrics, so it makes total sense.
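A hedged sketch of what that split could look like in the HTTP handler, keying the label off the blocking-query wait index; `method`, `path`, `queryOpts`, and `start` are placeholders for whatever the surrounding handler already has in scope:

```go
// Hypothetical: label HTTP request timing as blocking vs. non-blocking,
// based on whether the request carried a non-zero wait index.
blocking := "false"
if queryOpts.MinQueryIndex > 0 {
	blocking = "true"
}
metrics.MeasureSinceWithLabels([]string{"http", method, path}, start,
	[]metrics.Label{{Name: "blocking", Value: blocking}})
```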
@banks I restarted working on this, but I need settings (i.e. to decide whether we create metrics per RPC client or not) in the catalog and other places.
@pierresouchay great! I've not looked in detail at where this is needed, but I'd suggest we try to pass the config through explicitly to avoid global state etc. I was imagining that we'd do this somewhat centrally - e.g. add the new labels everywhere unconditionally, but then use the new feature you added to go-metrics to filter them. But maybe I missed something.
@pierresouchay are you still trying to get this merged, and if so, are you blocked by anything from our side?
@i0rek Too many changes since the start of this patch; I probably need to restart almost from scratch :(
Yeah, let me know in case you want to revive it.
Hey there, this issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days. If you are still experiencing problems, or still have questions, feel free to open a new one 👍.