Problem with MongoDB plug-in #5326
Comments
Can you point the plugin at the mongod servers only? |
I believe you mean pointing telegraf at the shards (replica sets) themselves. Yes, I can. However, not only does the "mongodb_shard_stats" measurement not get populated, but even if it were, the JSON docs returned from the 'mongod' are empty and not interesting. For the "shardConnPoolStats" results to be useful, one would have to run the command from the 'mongos'. However, doing so produces an error when telegraf erroneously tries to get 'oplog' details from that 'mongos', which do not exist. |
Would it be possible to comment out this line and see if any errors remain?
|
Once I get a deployment running again with the updated MongoDB plug-in, I'll let you know if there are other errors. Since the "shardConnPoolStats" admin command is running against the shard members, I would have expected metrics to be sent to the "mongodb_shard_stats" measurement, but that isn't happening. It could be because the data returned from that command on the shards would be empty anyway. I have no doubt that the call to the "shardConnPoolStats" admin command will work with a mongos. As previously described, the code is simply not correct. |
@SteveH-US I opened a pull request which essentially just skips over this error and continues. |
Hi Daniel, Thanks for taking this on. However, it appears that you may still get an error at line 71 of "mongodb_server.go" when retrieving the "Timestamp". Even if the "op_first_time.Timestamp" property is initialized, the "stats" would be invalid. When reporting on these metrics, one would have to explicitly remove the mongos oplog metrics; otherwise, the stats from them would throw off the calculations. IMHO, this plug-in ought not to be reporting oplog metrics for mongos at all. |
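To illustrate the guard being asked for here, a minimal sketch follows. It is not the plug-in's actual code: `oplogWindowSeconds` is a hypothetical helper, and it assumes the first and last oplog timestamps arrive as raw 64-bit BSON timestamps (upper 32 bits are seconds since the Unix epoch). When either timestamp is missing, as it would be on a mongos, the caller gets an error and can omit the oplog fields instead of reporting a bogus window.

```go
package main

import (
	"errors"
	"fmt"
)

// oplogWindowSeconds returns the time span covered by the oplog, in seconds.
// Assumption: first and last are raw 64-bit BSON timestamps, whose upper
// 32 bits hold seconds since the Unix epoch and whose lower 32 bits hold
// an ordinal counter.
func oplogWindowSeconds(first, last int64) (int64, error) {
	// A zero timestamp means the oplog could not be read, for example
	// because the node is a mongos and has no local.oplog.rs collection.
	if first == 0 || last == 0 {
		return 0, errors.New("oplog timestamps unavailable")
	}
	firstSec := first >> 32
	lastSec := last >> 32
	if lastSec < firstSec {
		return 0, errors.New("oplog timestamps out of order")
	}
	return lastSec - firstSec, nil
}

func main() {
	// Illustrative values only: two timestamps one hour apart.
	first := int64(1566950000) << 32
	last := int64(1566953600)<<32 | 7

	if w, err := oplogWindowSeconds(first, last); err == nil {
		fmt.Println("oplog window (s):", w)
	}
	// Simulates a mongos: nothing was read, so the fields are skipped.
	if _, err := oplogWindowSeconds(0, 0); err != nil {
		fmt.Println("skipping oplog fields:", err)
	}
}
```

Returning an error rather than a zero window keeps a mongos from contributing a misleading `0` to any aggregate computed over the cluster.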
Thanks for taking a look, I see what you mean. We were already checking whether we are in a replica set, so I've updated the code to skip the oplog completely if we are not in a replica set. I also made it so the oplog field is not added if the oplog collection cannot be queried. Can you take another look? Also, following up on your original comment about chunks, do you think we should do the same for these: only look them up if we are connected to a replica set member? |
Actually, you can only look for the config.chunks if the cluster member you're running on is a mongos. |
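The rules discussed so far in this thread (oplog only on replica set members, `config.chunks` and useful `shardConnPoolStats` output only via a mongos) could be captured in one small decision table. This is only a sketch of that idea; `nodeType`, `gatherPlan`, and `planFor` are hypothetical names and do not exist in the plug-in.

```go
package main

import "fmt"

// nodeType is a hypothetical classification of the server telegraf is
// connected to; the real plug-in may detect and name this differently.
type nodeType string

const (
	nodeMongos        nodeType = "mongos"
	nodeReplSetMember nodeType = "replset"
	nodeStandalone    nodeType = "standalone"
)

// gatherPlan lists which optional metric groups make sense for a node.
type gatherPlan struct {
	Oplog              bool // read local.oplog.rs
	Chunks             bool // read config.chunks (e.g. jumbo chunk count)
	ShardConnPoolStats bool // run the shardConnPoolStats admin command
}

// planFor encodes the rules stated in this thread, not verified behavior:
// a mongos has no oplog, while chunk and connection pool details are
// described as only being useful when read through a mongos.
func planFor(n nodeType) gatherPlan {
	switch n {
	case nodeMongos:
		return gatherPlan{Chunks: true, ShardConnPoolStats: true}
	case nodeReplSetMember:
		return gatherPlan{Oplog: true}
	default:
		return gatherPlan{}
	}
}

func main() {
	for _, n := range []nodeType{nodeMongos, nodeReplSetMember, nodeStandalone} {
		fmt.Printf("%s: %+v\n", n, planFor(n))
	}
}
```

Keeping the decision in one place would make it easier to adjust if the assumptions above turn out to be incomplete.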
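Okay, right now we are still reporting jumbo_chunks=0i when connected to a mongos. I think I will leave it as is for now; it seems technically correct if not useful. If someone reading this has a system with jumbo chunks, I would love to see the output of this on a mongos and a shardsvr mongod when there are jumbo chunks.
> db.getSiblingDB("config").getCollection("chunks").find({"jumbo": true})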
|
The presence of jumbo chunks is a symptom, not a cause, and probably not
something to alert about.
IMHO, a more interesting stat to collect metrics on is chunk migration status
and failures.
…On Tue, Aug 27, 2019 at 9:07 PM Daniel Nelson ***@***.***> wrote:
Okay, right now we are still reporting jumbo_chunks=0i when connected to
a mongos. I think I will leave it as is for now, it seems technically
correct if not useful.
If someone reading this has a system with jumbo chunks but would love to
see the output of this on a mongos and a shardsvr mongod when there are
jumbo chunks.
> db.getSiblingDB("config").getCollection("chunks").find({"jumbo": true})
|
Yeah that makes a lot of sense based on my limited understanding. Do you think you would be able to research the queries we would need for this and create a new issue for this? The |
Sure, I could. As far as metrics go, there is a "changelog" collection in the config database that has details about what the balancer is doing. Monitoring the balancer is probably the most interesting thing to monitor in a sharded cluster, other than changes to the shard cluster configuration itself. Here's an example of the type of changes recorded in one of the config DB collections.
Following their advice, plug-ins like this one ought not be querying the "config" database. As such, I'm not sure it makes sense for a plug-in recording shard level metrics; unless your team wants to be on the hook for reacting to changes MDB makes in this database. What do you think? |
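If the plug-in did read the changelog despite the caveat above, the per-event counting could look roughly like the sketch below. Everything here is an assumption for illustration: `changelogEntry` and `countByEvent` are hypothetical, the struct covers only an event-type field and a namespace, and the event names in `main` are sample strings rather than a documented, stable list. No database is queried; the entries are hard-coded.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// changelogEntry is a hypothetical, reduced view of one document from the
// config database's changelog collection. Field names are assumptions.
type changelogEntry struct {
	What string // event type, e.g. "moveChunk.commit" (illustrative)
	NS   string // namespace the event applies to
}

// countByEvent counts entries whose event type starts with prefix,
// grouped by the full event type.
func countByEvent(entries []changelogEntry, prefix string) map[string]int {
	counts := make(map[string]int)
	for _, e := range entries {
		if strings.HasPrefix(e.What, prefix) {
			counts[e.What]++
		}
	}
	return counts
}

func main() {
	// Hard-coded sample data standing in for query results.
	entries := []changelogEntry{
		{What: "moveChunk.start", NS: "app.events"},
		{What: "moveChunk.commit", NS: "app.events"},
		{What: "split", NS: "app.events"},
		{What: "moveChunk.error", NS: "app.users"},
	}

	counts := countByEvent(entries, "moveChunk.")

	// Sort keys so the output order is stable.
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%d\n", k, counts[k])
	}
}
```

Grouping by the raw event string, rather than a fixed list of known events, limits how much the code depends on the internal schema, which is the concern raised above.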
Yeah that's a tricky one to answer. If there is no public API and the information is important to monitor, we may have no choice but to implement internal queries and deal with the fallout. It may make sense going forward to do a better job of making the distinction between public and internal APIs and segregating the code. One thing that could reduce our need to do queries against the internal databases is if we had a plugin that allowed ad-hoc queries against MongoDB (#4252). However, this really just pushes the problem over to the users of the plugin. |
Relevant telegraf.conf:
System info:
Telegraf 1.9.1
Ubuntu 14.04
Steps to reproduce:
Expected behavior:
"mongodb_shard_stats" measurement populated
Actual behavior:
The following error is produced
and no metrics are posted
Additional info:
I think I found a problem while reviewing the code. The same "gatherMetrics" method attempts to get oplog details from the "local.oplog.rs" collection and chunk details from the "chunks" collection. The only part of a sharded MongoDB cluster that has both replica set details and "chunks" details is the config replica set. However, the metrics produced by the config server are not representative of the load being put on the sharded cluster, which goes through the mongos and the shards.
Please advise. What am I missing?