Set prometheus remoteTimeout to 5s #1199

Merged · 3 commits · Dec 1, 2020

23 changes: 23 additions & 0 deletions deploy/docs/Best_Practices.md
@@ -16,6 +16,7 @@
- [Missing labels](#missing-labels)
- [Configure Ignore_Older Config for Fluentbit](#configure-ignore_older-config-for-fluentbit)
- [Disable logs, metrics, or falco](#disable-logs-metrics-or-falco)
- [Load Balancing Prometheus traffic between Fluentds](#load-balancing-prometheus-traffic-between-fluentds)

## Multiline Log Support

@@ -475,3 +476,25 @@ respectively in the `values.yaml` file and run the `helm upgrade` command.
| `sumologic.logs.enabled` | false | disable logs collection |
| `sumologic.metrics.enabled` | false | disable metrics collection |
| `falco.enabled` | false | disable falco |

## Load Balancing Prometheus traffic between Fluentds

Equal utilization of the Fluentd pods is important for the collection process.
If a Fluentd pod is under high pressure, it may handle incoming connections with some delay.
To avoid this backpressure, the `remote_timeout` option of Prometheus' `remote_write` configuration can be used.
It defaults to `30s`, which means Prometheus waits that long for a connection to a specific Fluentd
before trying to reach another one.
This significantly decreases performance and can lead to Prometheus memory issues,
so we decided to override it with `5s`.

```yaml
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      remoteWrite:
        - url: # ...
          remoteTimeout: 5s
```
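
With the values file updated, rolling the new timeout out is the usual chart upgrade (a sketch; the release name `collection` and repo alias `sumologic` below are assumptions, substitute the names from your own install):

```sh
# Apply the override; "collection" (release) and "sumologic" (repo alias)
# are assumptions -- use the names from your own installation
helm upgrade collection sumologic/sumologic -f values.yaml
```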

**NOTE** We observed that changing this value increases metrics loss during Prometheus resharding,
but the traffic is much better balanced between Fluentds and Prometheus is more stable in terms of memory usage.
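
One way to sanity-check the result is to watch Prometheus' own remote write counters, such as `prometheus_remote_storage_succeeded_samples_total` (a sketch; the `prometheus-operated` Service name and port `9090` are Prometheus Operator defaults and may differ in your setup):

```sh
# Expose the Prometheus API locally (Service name is the operator default)
kubectl port-forward svc/prometheus-operated 9090 &

# Per-endpoint send rate; a sustained dip after resharding is the metrics
# loss mentioned in the note above
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_remote_storage_succeeded_samples_total[5m])'
```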
17 changes: 17 additions & 0 deletions deploy/helm/sumologic/values.yaml
@@ -1413,6 +1413,7 @@ kube-prometheus-stack:
## kube_hpa_status_current_replicas
## kube_hpa_status_desired_replicas
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.state
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: kube-state-metrics;(?:kube_statefulset_status_observed_generation|kube_statefulset_status_replicas|kube_statefulset_replicas|kube_statefulset_metadata_generation|kube_daemonset_status_current_number_scheduled|kube_daemonset_status_desired_number_scheduled|kube_daemonset_status_number_misscheduled|kube_daemonset_status_number_unavailable|kube_deployment_spec_replicas|kube_deployment_status_replicas_available|kube_deployment_status_replicas_unavailable|kube_node_info|kube_node_status_allocatable|kube_node_status_capacity|kube_node_status_condition|kube_hpa_spec_max_replicas|kube_hpa_spec_min_replicas|kube_hpa_status_current_replicas|kube_hpa_status_desired_replicas)
@@ -1432,6 +1433,7 @@ kube-prometheus-stack:
## kube_pod_container_status_waiting_reason
## kube_pod_status_phase
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.state
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: kube-state-metrics;(?:kube_pod_container_info|kube_pod_container_resource_requests|kube_pod_container_resource_limits|kube_pod_container_status_ready|kube_pod_container_status_terminated_reason|kube_pod_container_status_waiting_reason|kube_pod_container_status_restarts_total|kube_pod_status_phase)
@@ -1443,6 +1445,7 @@ kube-prometheus-stack:
## cloudprovider_aws_api_request_duration_seconds_count
## cloudprovider_aws_api_request_duration_seconds_sum
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.controller-manager
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: kubelet;cloudprovider_.*_api_request_duration_seconds.*
@@ -1458,6 +1461,7 @@ kube-prometheus-stack:
## scheduler_scheduling_algorithm_duration_seconds_bucket
## scheduler_scheduling_algorithm_duration_seconds_count
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.scheduler
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: kube-scheduler;scheduler_(?:e2e_scheduling|binding|scheduling_algorithm)_duration_seconds.*
@@ -1485,6 +1489,7 @@ kube-prometheus-stack:
## etcd_helper_cache_miss_count
## etcd_helper_cache_miss_total
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.apiserver
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: apiserver;(?:apiserver_request_(?:count|total)|apiserver_request_(?:duration_seconds|latencies)_(?:count|sum)|apiserver_request_latencies_summary(?:|_count|_sum)|etcd_request_cache_(?:add|get)_(?:duration_seconds|latencies_summary)_(?:count|sum)|etcd_helper_cache_(?:hit|miss)_(?:count|total))
@@ -1507,6 +1512,7 @@ kube-prometheus-stack:
## kubelet_runtime_operations_latency_microseconds_count
## kubelet_runtime_operations_latency_microseconds_sum
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.kubelet
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: kubelet;(?:kubelet_docker_operations_errors(?:|_total)|kubelet_(?:docker|runtime)_operations_duration_seconds_(?:count|sum)|kubelet_running_(?:container|pod)(?:_count|s)|kubelet_(:?docker|runtime)_operations_latency_microseconds(?:|_count|_sum))
@@ -1518,6 +1524,7 @@ kube-prometheus-stack:
## container_memory_working_set_bytes
## container_cpu_cfs_throttled_seconds_total
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.container
remoteTimeout: 5s
writeRelabelConfigs:
- action: labelmap
regex: container_name
@@ -1532,6 +1539,7 @@ kube-prometheus-stack:
## container_network_receive_bytes_total
## container_network_transmit_bytes_total
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.container
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: kubelet;(?:container_network_receive_bytes_total|container_network_transmit_bytes_total)
@@ -1544,6 +1552,7 @@ kube-prometheus-stack:
## node_load5
## node_load15
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.node
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: node-exporter;(?:node_load1|node_load5|node_load15|node_cpu_seconds_total)
@@ -1584,6 +1593,7 @@ kube-prometheus-stack:
## node:node_num_cpu:sum
## node_namespace_pod:kube_pod_info:
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.operator.rule
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: 'cluster_quantile:apiserver_request_duration_seconds:histogram_quantile|instance:node_filesystem_usage:sum|instance:node_network_receive_bytes:rate:sum|cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile|cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile|cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile|node_namespace_pod:kube_pod_info:|:kube_pod_info_node_count:|node:node_num_cpu:sum|:node_cpu_utilisation:avg1m|node:node_cpu_utilisation:avg1m|node:cluster_cpu_utilisation:ratio|:node_cpu_saturation_load1:|node:node_cpu_saturation_load1:|:node_memory_utilisation:|node:node_memory_bytes_total:sum|node:node_memory_utilisation:ratio|node:cluster_memory_utilisation:ratio|:node_memory_swap_io_bytes:sum_rate|node:node_memory_utilisation:|node:node_memory_utilisation_2:|node:node_memory_swap_io_bytes:sum_rate|:node_disk_utilisation:avg_irate|node:node_disk_utilisation:avg_irate|:node_disk_saturation:avg_irate|node:node_disk_saturation:avg_irate|node:node_filesystem_usage:|node:node_filesystem_avail:|:node_net_utilisation:sum_irate|node:node_net_utilisation:sum_irate|:node_net_saturation:sum_irate|node:node_net_saturation:sum_irate|node:node_inodes_total:|node:node_inodes_free:'
@@ -1646,6 +1656,7 @@ kube-prometheus-stack:
## prometheus_remote_storage_succeeded_samples_total
## up
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: (?:up|prometheus_remote_storage_.*|fluentd_.*|fluentbit.*|otelcol.*)
@@ -1664,6 +1675,7 @@ kube-prometheus-stack:
## process_open_fds
## process_resident_memory_bytes
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.control-plane.coredns
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: coredns;(?:coredns_cache_(size|(hits|misses)_total)|coredns_dns_request_duration_seconds_(count|sum)|coredns_(dns_request|dns_response_rcode|forward_request)_count_total|process_(cpu_seconds_total|open_fds|resident_memory_bytes))
@@ -1688,6 +1700,7 @@ kube-prometheus-stack:
## process_open_fds
## process_resident_memory_bytes
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.control-plane.kube-etcd
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: kube-etcd;(?:etcd_debugging_(mvcc_db_total_size_in_bytes|store_(expires_total|watchers))|etcd_disk_(backend_commit|wal_fsync)_duration_seconds_bucket|etcd_grpc_proxy_cache_(hits|misses)_total|etcd_network_client_grpc_(received|sent)_bytes_total|etcd_server_(has_leader|leader_changes_seen_total)|etcd_server_proposals_(pending|(applied|committed|failed)_total)|process_(cpu_seconds_total|open_fds|resident_memory_bytes))
@@ -1710,6 +1723,7 @@ kube-prometheus-stack:
## nginx_ingress_nginx_connections_writing
## nginx_ingress_nginx_http_requests_total
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.applications.nginx-ingress
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: (?:nginx_ingress_controller_ingress_resources_total|nginx_ingress_controller_nginx_(last_reload_(milliseconds|status)|reload(s|_errors)_total)|nginx_ingress_controller_virtualserver(|route)_resources_total|nginx_ingress_nginx_connections_(accepted|active|handled|reading|waiting|writing)|nginx_ingress_nginx_http_requests_total)
@@ -1724,6 +1738,7 @@ kube-prometheus-stack:
## nginx_waiting
## nginx_writing
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.applications.nginx
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: (?:nginx_(accepts|active|handled|reading|requests|waiting|writing))
@@ -1760,6 +1775,7 @@ kube-prometheus-stack:
## redis_used_memory_rss
## redis_used_memory_startup
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.applications.redis
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: (?:redis_((blocked_|)clients|cluster_enabled|cmdstat_calls|connected_slaves|(evicted|expired|tracking_total)_keys|instantaneous_ops_per_sec|keyspace_(hitrate|hits|misses)|(master|slave)_repl_offset|maxmemory|mem_fragmentation_(bytes|ratio)|rdb_changes_since_last_save|rejected_connections|total_commands_processed|total_net_(input|output)_bytes|uptime|used_(cpu_(sys|user)|memory(_overhead|_rss|_startup|))))
@@ -1827,6 +1843,7 @@ kube-prometheus-stack:
## java_lang_Threading_ThreadCpuTime*
## java_lang_Threading_TotalStartedThreadCount
- url: http://$(CHART).$(NAMESPACE).svc.cluster.local:9888/prometheus.metrics.applications.jmx
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
regex: (?:java_lang_(ClassLoading_(TotalL|Unl|L)oadedClassCount|Compilation_TotalCompilationTime|GarbageCollector_(Collection(Count|Time)|LastGcInfo_(GcThreadCount|duration|(memoryU|u)sage(After|Before)Gc_.*_used))|MemoryPool_(CollectionUsage(ThresholdSupported|_committed|_max|_used)|(Peak|)Usage_(committed|max|used)|UsageThresholdSupported)|Memory_((Non|)HeapMemoryUsage_(committed|max|used)|ObjectPendingFinalizationCount)|OperatingSystem_(AvailableProcessors|(CommittedVirtual|(Free|Total)(Physical|))MemorySize|(Free|Total)SwapSpaceSize|(Max|Open)FileDescriptorCount|ProcessCpu(Load|Time)|System(CpuLoad|LoadAverage))|Runtime_(BootClassPathSupported|Pid|Uptime|StartTime)|Threading_(CurrentThread(AllocatedBytes|(Cpu|User)Time)|(Daemon|Peak|TotalStarted|)ThreadCount|(ObjectMonitor|Synchronizer)UsageSupported|Thread(AllocatedMemory.*|ContentionMonitoring.*|CpuTime.*))))