Problem with MongoDB plug-in #5326

SteveH-US · 2019-01-22T18:33:26Z

Relevant telegraf.conf: 1.9.1

System info:

Ubuntu 14.04

Steps to reproduce:

deploy telegraf on MongoDB mongos server
observe results

Expected behavior:

"mongodb_shard_stats" measurement populated

Actual behavior:

The following error is produced

2019-01-22T17:26:40Z E! Error getting first oplog entry (Can't use 'local' database through mongos)

and no metrics are posted

Additional info:

I think I found a problem while reviewing the code. The same "gatherMetrics" method attempts to get oplog details from the "local.oplog.rs" collection and chunk details from the "chunk" collection. The only part of a sharded MongoDB cluster that has both replica set details and "chunk" details is the config replica set. However, the metrics produced by the config server are not representative of the load being put on the sharded cluster, which goes through the mongos and the shards.

Please advise. What am I missing?

danielnelson · 2019-01-23T00:17:46Z

Can you point the plugin at the mongod servers only?

SteveH-US · 2019-01-23T13:54:26Z

I believe you mean pointing telegraf at the shards (replica sets) themselves. Yes, I can. However, not only does the "mongodb_shard_stats" measurement not get populated, but even if it were, the JSON docs returned from the 'mongod' are empty and not interesting. In order for the "shardConnPoolStats" results to be useful, one would have to run the command from the 'mongos'. However, doing so produces an error when telegraf errantly tries to get 'oplog' details from that 'mongos', where they do not exist.

danielnelson · 2019-01-24T00:02:36Z

Would it be possible to comment out this line and see if any errors remain?

telegraf/plugins/inputs/mongodb/mongodb_server.go

Line 121 in 3de4737

oplogStats := s.gatherOplogStats()

SteveH-US · 2019-01-24T21:47:39Z

Once I get a deployment running again with the updated MongoDB plug-in, I'll let you know if there are other errors. Since the "shardConnPoolStats" admin command is running against the shard members, I would have expected metrics to be sent to the "mongodb_shard_stats" measurement, but that isn't happening. It could be because the data returned from that command on the shards would be empty anyway.

I have no doubt that the call to the "shardConnPoolStats" admin command will work with a mongos. As previously described, the code is simply not correct.

danielnelson · 2019-08-23T03:00:11Z

@SteveH-US I opened a pull request which essentially just skips over this error and continues.

SteveH-US · 2019-08-23T13:37:46Z

Hi Daniel,

Thanks for taking this on. However, it appears that you may still get an error at line 71 of "mongodb_server.go" when retrieving the "Timestamp". Even if the "op_first_time.Timestamp" property is initialized, the "stats" would be invalid. When reporting on these metrics, one would have to explicitly remove the mongos oplog metrics; otherwise, the stats from them would throw off the calculations.

IMHO, this plug-in ought not be reporting opLog metrics for mongos at all.

danielnelson · 2019-08-24T04:54:18Z

Thanks for taking a look, I see what you mean.

We already were doing a check to see if we are in a replica set, so I've updated the code to skip the oplog completely if we are not in a replica set. I also made it so the oplog field is not added if the oplog collection cannot be queried. Can you take another look?

Also, to follow up on your original comment about chunks, do you think we should do the same for these: only look them up if we are connected to a replica set member?
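The oplog handling described in this comment can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: `oplogWindow`, `buildFields`, `errOplogUnavailable`, the `queryOplog` callback, and the field name `oplog_window_sec` are all hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// oplogWindow is a hypothetical stand-in for the first and last oplog
// timestamps, expressed in seconds.
type oplogWindow struct {
	FirstSec int64
	LastSec  int64
}

// errOplogUnavailable is a hypothetical error used to simulate a failed
// oplog query, such as the one seen through a mongos.
var errOplogUnavailable = errors.New("can't use 'local' database through mongos")

// buildFields returns the fields to report. The oplog is queried only when
// the node is in a replica set, and the oplog field is left out entirely
// when the query fails. queryOplog is a hypothetical callback standing in
// for the real oplog lookup.
func buildFields(inReplicaSet bool, queryOplog func() (oplogWindow, error)) (map[string]interface{}, error) {
	fields := map[string]interface{}{}
	if !inReplicaSet {
		// Not a replica set member: skip the oplog without raising an error.
		return fields, nil
	}
	w, err := queryOplog()
	if err != nil {
		// Return the other fields untouched so the caller can log the
		// error and still report whatever else was gathered.
		return fields, fmt.Errorf("oplog unavailable: %w", err)
	}
	fields["oplog_window_sec"] = w.LastSec - w.FirstSec
	return fields, nil
}

func main() {
	failing := func() (oplogWindow, error) { return oplogWindow{}, errOplogUnavailable }

	// Outside a replica set the failing query is never called.
	fields, err := buildFields(false, failing)
	fmt.Println(len(fields), err)

	// Inside a replica set a failed query omits the field and returns the error.
	fields, err = buildFields(true, failing)
	fmt.Println(len(fields), err)
}
```

Leaving the field out, rather than reporting a zero value, keeps invalid oplog numbers from skewing calculations made downstream.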

SteveH-US · 2019-08-27T20:16:59Z

Actually, you can only look for the config.chunks if the cluster member you're running on is a mongos.
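One way to picture the gating discussed above is to classify the node from its `isMaster` reply and choose collectors from that. The sketch below assumes the reply has already been decoded into a map; it relies on a mongos reporting `msg: "isdbgrid"` and on replica set members reporting `setName`. The names `nodeRole`, `classifyNode`, and `collectorsFor` are hypothetical, and the collector choice simply follows the statements made in this thread rather than the plugin's real behavior.

```go
package main

import "fmt"

// nodeRole is a hypothetical classification of the server we are connected to.
type nodeRole int

const (
	roleStandalone nodeRole = iota
	roleReplicaSetMember
	roleMongos
)

// classifyNode inspects a decoded isMaster reply. A mongos reports
// msg == "isdbgrid"; a replica set member reports a non-empty setName.
func classifyNode(isMaster map[string]interface{}) nodeRole {
	if msg, ok := isMaster["msg"].(string); ok && msg == "isdbgrid" {
		return roleMongos
	}
	if name, ok := isMaster["setName"].(string); ok && name != "" {
		return roleReplicaSetMember
	}
	return roleStandalone
}

// collectorsFor lists which of the two disputed lookups to run for a role,
// following the thread: oplog only on replica set members, config.chunks
// only through a mongos.
func collectorsFor(role nodeRole) []string {
	switch role {
	case roleReplicaSetMember:
		return []string{"oplog"}
	case roleMongos:
		return []string{"config.chunks"}
	default:
		return nil
	}
}

func main() {
	mongos := map[string]interface{}{"ismaster": true, "msg": "isdbgrid"}
	shard := map[string]interface{}{"ismaster": true, "setName": "shard01"}

	fmt.Println(collectorsFor(classifyNode(mongos)))
	fmt.Println(collectorsFor(classifyNode(shard)))
}
```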

danielnelson · 2019-08-28T02:07:20Z

Okay, right now we are still reporting jumbo_chunks=0i when connected to a mongos. I think I will leave it as is for now; it seems technically correct, if not useful.

If someone reading this has a system with jumbo chunks, I would love to see the output of this on a mongos and on a shardsvr mongod when there are jumbo chunks.

> db.getSiblingDB("config").getCollection("chunks").find({"jumbo": true})

SteveH-US · 2019-08-28T14:59:11Z

The presence of jumbo chunks is a symptom, not a cause, and probably not something to alert about. IMHO, a more interesting stat to collect metrics on is chunk migration status and failures.

danielnelson · 2019-08-28T18:39:26Z

Yeah that makes a lot of sense based on my limited understanding. Do you think you would be able to research the queries we would need for this and create a new issue for this?

The mongodb plugin is near the point where we need to consider doing a redesign, as the library we are using is no longer under active development and there are some issues supporting newer MongoDB versions. We may want to make a clean break and do a v2 of this plugin, and it would be really helpful if we had a good list of what is important to bring (as well as what we should skip).

SteveH-US · 2019-08-28T20:15:43Z

Sure, I could.

As far as metrics, there is a "changelog" collection in the config database that has details about what the balancer is doing. Monitoring the balancer is probably the most interesting thing to monitor from the sharded cluster, other than changes to the shard cluster configuration itself.

Here's an example of the type of changes recorded in one of the config DB collections.

$ db.changelog.aggregate([{$match:{"time":{ "$gte" : new Date(ISODate().getTime() - 1000 * 3600 * 24 * 1) }}}, {$group:{_id:{ns:"$ns",what:"$what"},count:{$sum:1}}},{$sort:{_id:1}}])

{ "_id" : { "ns" : "MyDB.BankTransactions", "what" : "multi-split" }, "count" : 3 }
{ "_id" : { "ns" : "MyDB.JournalEntries", "what" : "moveChunk.commit" }, "count" : 24 }
{ "_id" : { "ns" : "MyDB.JournalEntries", "what" : "moveChunk.error" }, "count" : 405 }
{ "_id" : { "ns" : "MyDB.JournalEntries", "what" : "moveChunk.from" }, "count" : 429 }
{ "_id" : { "ns" : "MyDB.JournalEntries", "what" : "moveChunk.start" }, "count" : 429 }
{ "_id" : { "ns" : "MyDB.JournalEntries", "what" : "moveChunk.to" }, "count" : 24 }
{ "_id" : { "ns" : "MyDB.Pays", "what" : "moveChunk.commit" }, "count" : 320 }
{ "_id" : { "ns" : "MyDB.Pays", "what" : "moveChunk.error" }, "count" : 78 }
{ "_id" : { "ns" : "MyDB.Pays", "what" : "moveChunk.from" }, "count" : 398 }
{ "_id" : { "ns" : "MyDB.Pays", "what" : "moveChunk.start" }, "count" : 398 }
{ "_id" : { "ns" : "MyDB.Pays", "what" : "moveChunk.to" }, "count" : 320 }
{ "_id" : { "ns" : "MyDB.Pays", "what" : "multi-split" }, "count" : 108 }
{ "_id" : { "ns" : "MyDB.RawEvents", "what" : "moveChunk.commit" }, "count" : 6 }
{ "_id" : { "ns" : "MyDB.RawEvents", "what" : "moveChunk.error" }, "count" : 423 }
{ "_id" : { "ns" : "MyDB.RawEvents", "what" : "moveChunk.from" }, "count" : 429 }
{ "_id" : { "ns" : "MyDB.RawEvents", "what" : "moveChunk.start" }, "count" : 429 }
{ "_id" : { "ns" : "MyDB.RawEvents", "what" : "moveChunk.to" }, "count" : 6 }
{ "_id" : { "ns" : "MyDB.RawEvents", "what" : "multi-split" }, "count" : 38 }
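As a rough illustration of how counts like these could become a migration-failure metric, the sketch below divides "moveChunk.error" by "moveChunk.start" per namespace. The type `changelogCount` and the function `migrationErrorRatios` are hypothetical, and using error/start as a failure ratio is an assumption based on this sample, where commit and error counts add up to the start count for each namespace.

```go
package main

import (
	"fmt"
	"sort"
)

// changelogCount is a hypothetical row type mirroring one line of the
// aggregation output above.
type changelogCount struct {
	NS    string
	What  string
	Count int
}

// migrationErrorRatios returns, per namespace, the share of started chunk
// migrations that ended in an error. Namespaces with no "moveChunk.start"
// entries are skipped to avoid dividing by zero.
func migrationErrorRatios(rows []changelogCount) map[string]float64 {
	started := map[string]int{}
	failed := map[string]int{}
	for _, r := range rows {
		switch r.What {
		case "moveChunk.start":
			started[r.NS] += r.Count
		case "moveChunk.error":
			failed[r.NS] += r.Count
		}
	}
	ratios := map[string]float64{}
	for ns, n := range started {
		if n > 0 {
			ratios[ns] = float64(failed[ns]) / float64(n)
		}
	}
	return ratios
}

func main() {
	// A subset of the rows shown above.
	rows := []changelogCount{
		{"MyDB.BankTransactions", "multi-split", 3},
		{"MyDB.JournalEntries", "moveChunk.error", 405},
		{"MyDB.JournalEntries", "moveChunk.start", 429},
		{"MyDB.Pays", "moveChunk.error", 78},
		{"MyDB.Pays", "moveChunk.start", 398},
		{"MyDB.RawEvents", "moveChunk.error", 423},
		{"MyDB.RawEvents", "moveChunk.start", 429},
	}

	ratios := migrationErrorRatios(rows)
	names := make([]string, 0, len(ratios))
	for ns := range ratios {
		names = append(names, ns)
	}
	sort.Strings(names) // stable output order
	for _, ns := range names {
		fmt.Printf("%s error_ratio=%.3f\n", ns, ratios[ns])
	}
}
```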

However, MongoDB only lists one of the several collections in that database, this "changelog", but states we ought not depend on it.

Following their advice, plug-ins like this one ought not be querying the "config" database. As such, I'm not sure it makes sense for a plug-in recording shard level metrics; unless your team wants to be on the hook for reacting to changes MDB makes in this database.

What do you think?

danielnelson · 2019-08-28T21:27:57Z

Yeah that's a tricky one to answer. If there is no public API and the information is important to monitor, we may have no choice but to implement internal queries and deal with the fallout. It may make sense going forward to do a better job of making the distinction between public and internal APIs and segregating the code.

One thing that could reduce our need to do queries against the internal databases is if we had a plugin that allowed ad-hoc queries against MongoDB (#4252). However, this really just pushes the problem over to the users of the plugin.

danielnelson added the area/mongodb label Jan 23, 2019

danielnelson added this to the 1.12.0 milestone Aug 5, 2019

danielnelson self-assigned this Aug 20, 2019

danielnelson mentioned this issue Aug 23, 2019

Ignore error querying the oplog through a mongos #6307

Merged

3 tasks

danielnelson closed this as completed in #6307 Aug 27, 2019
