
Removing mesos_tasks metrics. #1686

Merged
merged 3 commits into influxdata:master from the mesos_tag_fixes branch on Sep 28, 2016

Conversation

harnash (Contributor) commented Aug 30, 2016

Required for all PRs:

  • CHANGELOG.md updated (we recommend not updating this until the PR has been approved by a maintainer)
  • Sign CLA (if not already signed)
  • README.md updated (if adding a new plugin)

Tagging metrics with executor_id can create thousands of series in InfluxDB, which can easily exceed the default limit of 1M series per database. This commit adds the missing framework_id tag and sends executor_id as a normal field.
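For context, here is a minimal sketch of what that change amounts to inside a Telegraf input plugin, assuming the usual telegraf.Accumulator API; the measurement and field names are illustrative, not the plugin's exact schema:

package mesos

import "github.com/influxdata/telegraf"

// addTaskMetrics is an illustrative sketch, not the plugin's actual code:
// the high-cardinality executor_id is emitted as a field, while only the
// low-cardinality server and framework_id values remain as tags.
func addTaskMetrics(acc telegraf.Accumulator, server, frameworkID, executorID string, cpusLimit, memRSS float64) {
    tags := map[string]string{
        "server":       server,
        "framework_id": frameworkID,
    }
    fields := map[string]interface{}{
        "executor_id": executorID, // stored as a field, so it does not create new series
        "cpus_limit":  cpusLimit,
        "mem_rss":     memRSS,
    }
    acc.AddFields("mesos_tasks", fields, tags)
}

With executor_id stored as a field, InfluxDB creates one series per (server, framework_id) combination instead of one per executor.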

Tests were improved and documentation was also updated.

ping @tuier

sparrc (Contributor) commented Aug 30, 2016

@harnash I think we should actually completely remove that tag. Why would we want to expose the option to create massive cardinality? And especially why would we default that to true?

harnash (Contributor, Author) commented Aug 30, 2016

@sparrc I'll look into it first thing in the morning tomorrow (UTC+0200 time). I'll probably include those values as standard fields and leave only framework_id and server.

Tagging values by executor_id can create quite a lot of data series
in InfluxDB, so we should stick to framework_id and server.

harnash (Contributor, Author) commented Aug 31, 2016

@sparrc Is this what you had in mind?

r0bj commented Aug 31, 2016

@harnash @sparrc what about just normalizing this tag by default? This would avoid the intrusive behaviour and keep the functionality at the same time.

harnash changed the title from "Added option to normalize tags on mesos_tasks metrics." to "Fix tags on mesos_tasks metrics." Aug 31, 2016

sparrc (Contributor) commented Aug 31, 2016

please don't; Telegraf is meant to send well-behaved data. Sending massive-cardinality tags is a very bad practice, and I wouldn't call eliminating those "intrusive".

harnash (Contributor, Author) commented Sep 1, 2016

@sparrc Your call. I'll leave it as it is. We can still use continuous queries to group them by executor_id if needed.

r0bj commented Sep 1, 2016

Not having the task name as an Influx tag has some impact on using that kind of data, e.g. in Grafana we can use the group-by-tag feature to create graphs per task.

sparrc (Contributor) commented Sep 1, 2016

@r0bj you'll need to work around that if it's a high-cardinality series. Crashing the database is not an acceptable trade-off for easier graphing.

jwilder added the bug (unexpected problem or unintended behavior) label Sep 1, 2016

tuier (Contributor) commented Sep 5, 2016

The main idea of the tasks part of the Mesos plugin is to be able to collect metrics per task, so we can monitor them and troubleshoot if needed.
I agree that in some cases this could create a lot of metrics, and because of that the normalization should be enabled by default, but removing that part completely will remove half of the value of this plugin.
@sparrc @harnash

sparrc (Contributor) commented Sep 6, 2016

I think we're only talking about removing executor_id as a tag, does that present a huge problem to people?

harnash (Contributor, Author) commented Sep 6, 2016

Well, it is problematic, since we want to collect and analyze statistics per executor_id and InfluxDB can group data by tag. We could group it using continuous queries, but then we are facing the cardinality problem, since we can't normalize this data in InfluxDB. It is doable but problematic, and it adds an additional step to the workflow to process executor_id values (third-party tool, Kapacitor UDF, etc.).
Normalizing the executor_id tag would be a compromise and should be enough in most cases, I guess. If there is some better workaround, I'm not aware of it at the moment.

@sparrc @tuier @r0bj

sparrc (Contributor) commented Sep 6, 2016

forgive my ignorance, as I'm not very familiar with mesos, but some of you are talking about executor_id and others are talking about "tasks", is there a relationship between the two? Also @harnash what do you mean by "normalizing"? Removing the tag?

harnash (Contributor, Author) commented Sep 6, 2016

@sparrc Sorry for creating confusion. As far as I know, a mesos executor can spawn multiple tasks. For our discussion we can assume that executor_id identifies each task running on mesos.

By normalizing I mean stripping the UUIDs from executor_ids, effectively grouping series by task name.
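For illustration, a minimal sketch of that kind of normalization, assuming executor IDs follow the name.UUID pattern Marathon uses (which, as discussed below, is not guaranteed for every framework):

package main

import (
    "fmt"
    "regexp"
)

// uuidSuffix matches a trailing ".<UUID>" such as Marathon appends to task IDs.
// This format is an assumption, not something Mesos itself guarantees.
var uuidSuffix = regexp.MustCompile(`\.[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// normalizeExecutorID strips the UUID suffix so series group by task name.
func normalizeExecutorID(id string) string {
    return uuidSuffix.ReplaceAllString(id, "")
}

func main() {
    fmt.Println(normalizeExecutorID("hello_world.a9897537-7437-11e6-8492-56847afe9799"))
    // prints: hello_world
}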

sparrc (Contributor) commented Sep 6, 2016

OK, I see, I think I actually agree with what @r0bj suggested. You should strip the UUID off of the task_id and push the metric with a tag called "task_name".

From what I can tell you should also be removing the framework_id tag? What is the usefulness of that UUID as a tag?

harnash (Contributor, Author) commented Sep 6, 2016

Right now I'm not setting executor_id as a tag, just as a simple field value. I'm digging through the mesos documentation to make sure that if we want to group series by executor_id it will work for all frameworks launched on mesos... From my understanding, generating the task_id/executor_id is done at the framework level, so assuming we always get it in the format some_task_name.UUID might be wrong.

harnash (Contributor, Author) commented Sep 6, 2016

This seems related and can save us some trouble: d2iq-archive/marathon#1242

harnash (Contributor, Author) commented Sep 6, 2016

@sparrc @tuier After looking at the documentation and other resources I'm inclined to leave the current solution as it is: executor_id as a plain field.
In a complex setup we can have multiple frameworks running on one mesos cluster, and each framework can assign task IDs and executor IDs to its own liking, so we cannot assume the format of this field and process it properly.
One solution would be to create a regex substitution formula which would normalize this field, but then we would probably need a separate formula for each framework type/version, and we would somehow have to get the list of running frameworks and match their names (which are arbitrary) to formulas in telegraf's config.
So including this logic in telegraf seems like a bad decision; better to leave the processing to the users via telegraf's exec plugin, a Kapacitor UDF, or another third-party tool.

sparrc (Contributor) commented Sep 6, 2016

is there really no way to get the task name from mesos without the UUID?

sparrc (Contributor) commented Sep 6, 2016

@harnash @tuier @r0bj it would be helpful if someone could come up with a metric schema that doesn't include any UUIDs but is still capable of usefully separating metrics.

If the metrics cannot be usefully separated except by the UUID, then I think we should consider whether those metrics should be collected at all.

harnash (Contributor, Author) commented Sep 6, 2016

Yeah. We are still brainstorming on a solution. The problem is that executor_id is generated by the framework. Either way, we would probably need to transform those executor_ids into something that can identify the application/task being executed by the framework.

harnash (Contributor, Author) commented Sep 6, 2016

Alright. There is a way to do it in a framework-agnostic way. We can call the /master/tasks API endpoint to check what tasks are running on a given cluster. We should get something like this:

{
    "frameworks": [
        {
            "id": "20150323-183301-505808906-5050-14434-0001",
            "name": "marathon",
            "pid": "scheduler-2f7cdc09-84a0-4788-b501-5651065e273f@10.8.40.30:53051",
            "used_resources": {
                "disk": 139264.0,
                "mem": 170346.0,
                "gpus": 0.0,
                "cpus": 66.3,
                "ports": "[31026-31027, 31035-31036, 31059-31060, 31078-31078, 31096-31096, 31118-31120, 31134-31135, 31139-31139, 31167-31168, 31179-31180, 31189-31190, 31208-31209, 31215-31215, 31230-31231, 31258-31259, 31291-31291, 31296-31297, 31309-31310, 31312-31313, 31320-31320, 31323-31324, 31344-31345, 31401-31401, 31420-31420, 31436-31437, 31439-31440, 31456-31457, 31477-31478, 31490-31491, 31547-31547, 31558-31558, 31579-31579, 31594-31595, 31607-31608, 31614-31617, 31638-31639, 31664-31664, 31705-31706, 31709-31709, 31714-31715, 31720-31721, 31732-31733, 31745-31746, 31748-31751, 31765-31766, 31797-31798, 31813-31813, 31821-31821, 31835-31836, 31864-31865, 31869-31870, 31886-31887, 31899-31900, 31936-31938, 31950-31951, 31972-31973, 31993-31994, 31996-31997]"
            },
            "offered_resources": {
                "disk": 0.0,
                "mem": 0.0,
                "gpus": 0.0,
                "cpus": 0.0
            },
            "capabilities": [],
            "hostname": "mesos-s2",
            "webui_url": "http://mesos-s2:8080",
            "active": true,
            "user": "marathon",
            "failover_timeout": 604800.0,
            "checkpoint": true,
            "role": "*",
            "registered_time": 1471960678.14146,
            "unregistered_time": 0.0,
            "resources": {
                "disk": 139264.0,
                "mem": 170346.0,
                "gpus": 0.0,
                "cpus": 66.3,
                "ports": "[31026-31027, 31035-31036, 31059-31060, 31078-31078, 31096-31096, 31118-31120, 31134-31135, 31139-31139, 31167-31168, 31179-31180, 31189-31190, 31208-31209, 31215-31215, 31230-31231, 31258-31259, 31291-31291, 31296-31297, 31309-31310, 31312-31313, 31320-31320, 31323-31324, 31344-31345, 31401-31401, 31420-31420, 31436-31437, 31439-31440, 31456-31457, 31477-31478, 31490-31491, 31547-31547, 31558-31558, 31579-31579, 31594-31595, 31607-31608, 31614-31617, 31638-31639, 31664-31664, 31705-31706, 31709-31709, 31714-31715, 31720-31721, 31732-31733, 31745-31746, 31748-31751, 31765-31766, 31797-31798, 31813-31813, 31821-31821, 31835-31836, 31864-31865, 31869-31870, 31886-31887, 31899-31900, 31936-31938, 31950-31951, 31972-31973, 31993-31994, 31996-31997]"
            },
            "tasks": [
                {
                    "id": "hello_world.a9897537-7437-11e6-8492-56847afe9799",
                    "name": "hello_world.production",
                    "framework_id": "20150323-183301-505808906-5050-14434-0001",
                    "executor_id": "",
                    "slave_id": "78791579-8172-4509-8160-a5c76e0540af-S7",
                    "state": "TASK_RUNNING",
                    "resources": {
                        "disk": 0.0,
                        "mem": 550.0,
                        "gpus": 0.0,
                        "cpus": 0.2
                    },
                    "statuses": [
                        {
                            "state": "TASK_RUNNING",
                            "timestamp": 1473169310.86516,
                            "container_status": {
                                "network_infos": [
                                    {
                                        "ip_addresses": [
                                            {
                                                "ip_address": "10.8.40.30"
                                            }
                                        ]
                                    }
                                ]
                            }
                        }
                    ],
                    "discovery": {
                        "visibility": "FRAMEWORK",
                        "name": "hello_world.production",
                        "ports": {}
                    }
                }
            ]
        }
    ]
}

Now from this we can map the task name to the executor_id (which in my case is the same as tasks[i].id in the above JSON). The tricky part is that each telegraf instance which gathers metrics from slaves needs to know which master is the currently elected leader.
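A rough sketch of that mapping step, assuming the /master/tasks response has the shape shown above (the endpoint path and field names are taken from this example and may differ between Mesos versions):

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// tasksResponse is the minimal subset of the /master/tasks response we need.
type tasksResponse struct {
    Frameworks []struct {
        ID    string `json:"id"`
        Tasks []struct {
            ID   string `json:"id"`   // e.g. "hello_world.a9897537-...", matches executor_id here
            Name string `json:"name"` // e.g. "hello_world.production"
        } `json:"tasks"`
    } `json:"frameworks"`
}

// taskNamesByID asks the elected master for its task list and returns a map
// from task/executor ID to the human-readable task name.
func taskNamesByID(masterURL string) (map[string]string, error) {
    resp, err := http.Get(masterURL + "/master/tasks")
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    var tr tasksResponse
    if err := json.NewDecoder(resp.Body).Decode(&tr); err != nil {
        return nil, err
    }

    names := make(map[string]string)
    for _, fw := range tr.Frameworks {
        for _, t := range fw.Tasks {
            names[t.ID] = t.Name
        }
    }
    return names, nil
}

func main() {
    names, err := taskNamesByID("http://mesos-s2:5050")
    if err != nil {
        panic(err)
    }
    fmt.Println(names)
}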

sparrc (Contributor) commented Sep 6, 2016

"tasks": [
                {
                    "id": "hello_world.a9897537-7437-11e6-8492-56847afe9799",
                    "name": "hello_world.production",

why not just use tasks[i][name]?

harnash (Contributor, Author) commented Sep 6, 2016

Yup. We should find the executor_id in the current metrics, match it with the id from the above JSON, and use name as a tag for the metric.
We can detect the leader by using information we already pull from the master metrics, but it means that the telegraf config should always contain the full list of master nodes, so we can look for the leader among them and pass that information to the slave collector.
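One way that leader lookup could look, assuming each master's /master/state response reports its own pid and the elected leader's pid in the form "master@host:port" (an assumption about the Mesos state endpoint, not code from this plugin):

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "strings"
)

// state is the tiny part of /master/state needed for leader detection.
type state struct {
    Pid    string `json:"pid"`    // this master's own pid
    Leader string `json:"leader"` // pid of the currently elected leader
}

// findLeader asks each configured master who the leader is and returns the
// base URL of the first master that reports itself as the elected leader.
func findLeader(masters []string) (string, error) {
    for _, m := range masters {
        resp, err := http.Get(m + "/master/state")
        if err != nil {
            continue // this master may be down; try the next one
        }
        var s state
        err = json.NewDecoder(resp.Body).Decode(&s)
        resp.Body.Close()
        if err != nil {
            continue
        }
        if s.Pid != "" && s.Pid == s.Leader {
            return m, nil
        }
    }
    return "", fmt.Errorf("no elected leader found among %s", strings.Join(masters, ", "))
}

func main() {
    leader, err := findLeader([]string{"http://mesos-s1:5050", "http://mesos-s2:5050"})
    if err != nil {
        panic(err)
    }
    fmt.Println("elected master:", leader)
}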

harnash (Contributor, Author) commented Sep 7, 2016

@sparrc I have one issue with this solution: when we have, say, 4 instances of a task, all of their metrics will be sent as one data series, which makes it problematic to represent the data or do sensible alerting on it. One solution would be to enumerate tasks per task_name and assign them unique tags like task_name_1, task_name_2, etc. There will be at most twice as many tags per task as the number of instances running (depending on the upgrade policy), and a given tag won't always point to the same instance.

sparrc (Contributor) commented Sep 7, 2016

@harnash please try to spell it out a bit more simply, I truly have no experience with mesos, drawing out some diagrams or condensing what you're saying into bullet-points would help I think.

If we can't find a solution soon, I'm going to have to remove the mesos plugin from Telegraf. I don't like having plugins within Telegraf that send badly formed data, as I'm sure many users have already shot themselves in the foot with this plugin.

A short-term solution might be to remove the per-task metrics for the time being. We can also revisit per-task metrics when InfluxDB has fixed influxdata/influxdb#7151

harnash (Contributor, Author) commented Sep 7, 2016

@sparrc Alright. So we have the following situation:

In mesos there is a task named hello_world which has three instances running:

  • hello_world.a927cc04-1630-400a-8b95-a39fe192ce8e
  • hello_world.06d66c19-4a89-482a-985e-cfe6b1d23b5c
  • hello_world.d0a6a0b2-254f-47fd-b30e-bb4a3923a8ae

Telegraf will gather metrics for those three instances. To simplify, I'll use cpu_usage as an example.
This is the metrics snapshot at a given point in time:

  • executor_id=hello_world.a927cc04-1630-400a-8b95-a39fe192ce8e cpu_usage=0.3
  • executor_id=hello_world.06d66c19-4a89-482a-985e-cfe6b1d23b5c cpu_usage=0.1
  • executor_id=hello_world.d0a6a0b2-254f-47fd-b30e-bb4a3923a8ae cpu_usage=0.9

Now, if we send this data to Influx with a tag specifying only the task_name, we would get something like this:

  • task_name=hello_world cpu_usage=0.3
  • task_name=hello_world cpu_usage=0.1
  • task_name=hello_world cpu_usage=0.9

and it will be hard to do statistical analysis based on the tag value, since we would be mixing data from three instances, I think. I'm not sure how InfluxDB would handle this data (there might be a slight time offset between those data points, but that is a guess right now).

The idea is to map those executor_ids to a limited set of tags, in such a way that we would get metrics like this:

  • task_name=hello_world_1 cpu_usage=0.3
  • task_name=hello_world_2 cpu_usage=0.1
  • task_name=hello_world_3 cpu_usage=0.9

Of course you would also get executor_id as a normal field if you want to pinpoint the exact instance, but on metrics and graphs it would show as the n-th instance of the given task, and the order of those instances is determined by their executor_ids (they can move around when some instance in the middle gets killed).
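A sketch of that enumeration scheme, under the assumption that everything before the first dot in an executor ID is the task name; as noted above, an index is not pinned to one instance forever, since it is derived from the sort order of the executor IDs:

package main

import (
    "fmt"
    "sort"
    "strings"
)

// enumerateInstances maps each executor ID to an instance tag such as
// "hello_world_1", "hello_world_2", ... This illustrates the proposed scheme;
// it is not code from the plugin.
func enumerateInstances(executorIDs []string) map[string]string {
    byTask := make(map[string][]string)
    for _, id := range executorIDs {
        name := strings.SplitN(id, ".", 2)[0] // assume "<task_name>.<uuid>"
        byTask[name] = append(byTask[name], id)
    }

    tags := make(map[string]string)
    for name, ids := range byTask {
        sort.Strings(ids) // instance order follows executor ID order
        for i, id := range ids {
            tags[id] = fmt.Sprintf("%s_%d", name, i+1)
        }
    }
    return tags
}

func main() {
    ids := []string{
        "hello_world.a927cc04-1630-400a-8b95-a39fe192ce8e",
        "hello_world.06d66c19-4a89-482a-985e-cfe6b1d23b5c",
        "hello_world.d0a6a0b2-254f-47fd-b30e-bb4a3923a8ae",
    }
    for id, tag := range enumerateInstances(ids) {
        fmt.Println(tag, "<-", id)
    }
}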

I hope that clears the picture a bit.

As for removing functionality, I would like to keep the master/slave stats and just disable the task metrics, which are the problematic part.

sparrc (Contributor) commented Sep 7, 2016

got it, crystal clear, thanks @harnash.

My recommendation for now would be to remove task metrics, but leave the code as stubs as-is & in-place (still collecting task UUIDs).

Once influxdata/influxdb#7151 has been implemented, we could provide a configuration option to gather task metrics, but with very clear warnings that collecting per-task metrics would lead to infinitely-increasing cardinality, and recommend that this data be kept in a very short-term retention-policy.
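To make the suggestion concrete, here is a hypothetical sketch of how such an opt-in could look in the plugin's Go config struct and sample config; the option name gather_task_metrics and the field layout are invented for illustration, not part of the plugin:

package mesos

// Mesos sketches a future configuration in which per-task metrics stay off
// unless explicitly enabled, and the sample config carries a loud warning.
type Mesos struct {
    Masters           []string `toml:"masters"`
    Slaves            []string `toml:"slaves"`
    GatherTaskMetrics bool     `toml:"gather_task_metrics"`
}

func (m *Mesos) SampleConfig() string {
    return `
  ## Mesos masters and slaves to query.
  masters = ["localhost:5050"]
  slaves = []

  ## WARNING: per-task metrics are tagged with task identifiers and lead to
  ## ever-increasing series cardinality. If you enable this, keep the data in
  ## a short retention policy.
  # gather_task_metrics = false
`
}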

…h them.

Due to the very real problem of generating a vast number of data series through
mesos tasks metrics, this feature is disabled until a better solution is found.

harnash changed the title from "Fix tags on mesos_tasks metrics." to "Removing mesos_tasks metrics." Sep 23, 2016

harnash (Contributor, Author) commented Sep 23, 2016

@sparrc Alright. As decided, I have removed the mesos_tasks metrics from the plugin, but I've left most of the code commented out.

sparrc added this to the 1.1.0 milestone Sep 27, 2016

sparrc (Contributor) commented Sep 27, 2016

thanks @harnash, you can update the changelog now and I'll merge this change

sparrc merged commit 32268fb into influxdata:master Sep 28, 2016
harnash deleted the mesos_tag_fixes branch October 4, 2016 14:07

Labels: bug (unexpected problem or unintended behavior)
5 participants