Collect Docker Swarm metrics in docker input plugin #3141
@@ -14,6 +14,7 @@ import (
     "time"

     "github.com/docker/docker/api/types"
+    "github.com/docker/docker/api/types/swarm"
     "github.com/influxdata/telegraf"
     "github.com/influxdata/telegraf/filter"
     "github.com/influxdata/telegraf/internal"
@@ -35,6 +36,8 @@ type Docker struct {
     Endpoint       string
     ContainerNames []string

+    SwarmEnabled bool `toml:"swarm_enabled"`
+
     Timeout   internal.Duration
     PerDevice bool `toml:"perdevice"`
     Total     bool `toml:"total"`
@@ -82,6 +85,9 @@ var sampleConfig = `
   ## To use environment variables (ie, docker-machine), set endpoint = "ENV"
   endpoint = "unix:///var/run/docker.sock"

+  ## Set to true to collect Swarm metrics (desired_replicas, running_replicas)
+  swarm_enabled = false
+
   ## Only collect metrics for these containers, collect all if empty
   container_names = []
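For context, a sketch of how the new option would sit in a full telegraf.conf block (only swarm_enabled is added by this PR; the other keys already exist in the sample config, and the Swarm list endpoints the plugin calls are only served by Swarm manager nodes):

```toml
[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"
  ## Added by this PR: collect docker_swarm service metrics
  swarm_enabled = true
  container_names = []
```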
@@ -160,6 +166,13 @@ func (d *Docker) Gather(acc telegraf.Accumulator) error {
         acc.AddError(err)
     }

+    if d.SwarmEnabled {
+        err := d.gatherSwarmInfo(acc)
+        if err != nil {
+            acc.AddError(err)
+        }
+    }
+
     // List containers
     opts := types.ContainerListOptions{}
     ctx, cancel := context.WithTimeout(context.Background(), d.Timeout.Duration)
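The hunk above follows Telegraf's usual pattern: a failing sub-collector reports through the accumulator rather than aborting Gather, so container stats are still collected when the Swarm endpoint is unreachable. A minimal self-contained sketch of that pattern (the accumulator and gather functions here are hypothetical stand-ins, not Telegraf's real types):

```go
package main

import (
	"errors"
	"fmt"
)

// accumulator mimics the error-collecting part of telegraf.Accumulator.
type accumulator struct {
	errs []error
}

func (a *accumulator) AddError(err error) { a.errs = append(a.errs, err) }

// gatherSwarm stands in for d.gatherSwarmInfo failing (e.g. not a manager node).
func gatherSwarm() error { return errors.New("swarm endpoint unavailable") }

// gatherContainers stands in for the container-stats path, which succeeds.
func gatherContainers() error { return nil }

func gather(a *accumulator) {
	if err := gatherSwarm(); err != nil {
		a.AddError(err) // recorded, not returned: the next collector still runs
	}
	if err := gatherContainers(); err != nil {
		a.AddError(err)
	}
}

func main() {
	a := &accumulator{}
	gather(a)
	fmt.Println(len(a.errs)) // 1
}
```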
@@ -187,6 +200,73 @@ func (d *Docker) Gather(acc telegraf.Accumulator) error {
     return nil
 }

+func (d *Docker) gatherSwarmInfo(acc telegraf.Accumulator) error {
+    ctx, cancel := context.WithTimeout(context.Background(), d.Timeout.Duration)
+    defer cancel()
+    services, err := d.client.ServiceList(ctx, types.ServiceListOptions{})
+    if err != nil {
+        return err
+    }
+
+    if len(services) > 0 {
+        tasks, err := d.client.TaskList(ctx, types.TaskListOptions{})
+        if err != nil {
+            return err
+        }
+
+        nodes, err := d.client.NodeList(ctx, types.NodeListOptions{})
+        if err != nil {
+            return err
+        }
+
+        running := map[string]int{}
+        tasksNoShutdown := map[string]int{}
+
+        activeNodes := make(map[string]struct{})
+        for _, n := range nodes {
+            if n.Status.State != swarm.NodeStateDown {
+                activeNodes[n.ID] = struct{}{}
+            }
+        }
+
+        for _, task := range tasks {
+            if task.DesiredState != swarm.TaskStateShutdown {
+                tasksNoShutdown[task.ServiceID]++
+            }
+
+            if _, nodeActive := activeNodes[task.NodeID]; nodeActive && task.Status.State == swarm.TaskStateRunning {

Review comment: I'm new to Docker Swarm, but why do you check that the node is not down, instead of just recording the running status? It seems like almost always the task will not be running if the node is down.

+                running[task.ServiceID]++
+            }
+        }
+
+        for _, service := range services {
+            tags := map[string]string{}
+            fields := make(map[string]interface{})
+            now := time.Now()
+            tags["swarm_service_id"] = service.ID

Review comment: We should consider not including this, since it looks to be a random identifier string, which can cause high cardinality depending on how quickly it changes.

Reply: Each service has a unique ID, and services do not change as frequently as containers do, so I think we can keep this.

Reply: Since the measurement name is

+            tags["swarm_service_name"] = service.Spec.Name
+            if service.Spec.Mode.Replicated != nil && service.Spec.Mode.Replicated.Replicas != nil {
+                fields["swarm_service_mode"] = "replicated"

Review comment: I think

+                fields["swarm_tasks_running"] = running[service.ID]
+                fields["swarm_tasks_desired"] = *service.Spec.Mode.Replicated.Replicas
+            } else if service.Spec.Mode.Global != nil {
+                fields["swarm_service_mode"] = "global"
+                fields["swarm_tasks_running"] = running[service.ID]
+                fields["swarm_tasks_desired"] = tasksNoShutdown[service.ID]

Review comment: I don't understand why non-shutdown tasks are the desired number of tasks. Shouldn't this be equal to the number of nodes, since global services run on every node?

Reply: There is a chance that on one or a few nodes the task is not running. When the mode is "global", Swarm tries to deploy containers (tasks, for a Swarm service) on all nodes. However, a container might fail to start on a node for reasons such as the registry not being reachable from that node, or a corrupted /var/lib/docker/images directory.

+            }

Review comment: In case Replicated.Replicas is nil or another mode is added, we should have an else condition that continues and perhaps logs (depending on whether Replicas being nil is an error or not).

Reply: Handled this with "log".

+            // Add metrics
+            acc.AddFields("docker_swarm",
+                fields,
+                tags,
+                now)
+        }
+    }
+
+    return nil
+}
+
 func (d *Docker) gatherInfo(acc telegraf.Accumulator) error {
     // Init vars
     dataFields := make(map[string]interface{})
Review comment: Maybe this should be named gather_services = true or gather = ["services"], since the similar docker command is simply docker service.
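The core of gatherSwarmInfo is the two tallies walked through above: tasks whose desired state is not shutdown (the "desired" count for global services) and tasks actually running on nodes that are not down. A simplified, self-contained sketch of that counting logic, using plain structs as hypothetical stand-ins for the Docker API's swarm.Task and swarm.Node types:

```go
package main

import "fmt"

// task and node are trimmed-down stand-ins for the swarm API types.
type task struct {
	ServiceID    string
	NodeID       string
	DesiredState string // e.g. "running" or "shutdown"
	State        string // actual observed state
}

type node struct {
	ID    string
	State string // e.g. "ready" or "down"
}

// tally mirrors the PR's counting: per service, how many tasks are
// running on active nodes, and how many tasks are not desired-shutdown.
func tally(tasks []task, nodes []node) (running, tasksNoShutdown map[string]int) {
	running = map[string]int{}
	tasksNoShutdown = map[string]int{}

	// A task only counts as running if its node is not down.
	activeNodes := map[string]struct{}{}
	for _, n := range nodes {
		if n.State != "down" {
			activeNodes[n.ID] = struct{}{}
		}
	}

	for _, t := range tasks {
		if t.DesiredState != "shutdown" {
			tasksNoShutdown[t.ServiceID]++
		}
		if _, ok := activeNodes[t.NodeID]; ok && t.State == "running" {
			running[t.ServiceID]++
		}
	}
	return running, tasksNoShutdown
}

func main() {
	nodes := []node{{ID: "n1", State: "ready"}, {ID: "n2", State: "down"}}
	tasks := []task{
		{ServiceID: "web", NodeID: "n1", DesiredState: "running", State: "running"},
		// Reported running, but its node is down, so it is not counted:
		{ServiceID: "web", NodeID: "n2", DesiredState: "running", State: "running"},
		{ServiceID: "web", NodeID: "n1", DesiredState: "shutdown", State: "complete"},
	}
	running, desired := tally(tasks, nodes)
	fmt.Println(running["web"], desired["web"]) // 1 2
}
```

This also illustrates the reviewer's question: the node-down check only matters when a task's reported state lags reality, as with the second task above.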