Combined monitoring vpoller/zabbix agent does not work #39
Marin, I ran another test: I removed almost all of my double-monitored systems, keeping just 4 of them, and shut down the vCenter process in Windows, but this time the Zabbix Agents did not become unavailable. I will keep you up to date if I can reproduce it. Kurt |
Hey Kurt, By monitoring your VMware environment with vPoller + Zabbix Agent you are referring to the possibility to use an Agent interface for all the vPoller requests, right? If that is the case just wanted to let you know that I'm working on some Zabbix I'm almost half-way done with the As for the vPoller Worker crashes - it is not a bug, but I guess it needs improvement there. When a vCenter server goes down (service is stopped, server crashed, etc.) all active sessions to the vCenter get disconnected. At that point the vPoller Worker catches that as an exception and exits, since we cannot poll anything until the vCenter is back online. I suppose it would be okay to add a retry mechanism in the vPoller Worker which would wait for the vCenter to come back online. That way the vPoller Worker won't exit and will restore normal operations once the vCenter is online again. I will file a separate issue for this so we can follow it up there. Thanks, |
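The retry idea described above could look roughly like the following. This is only a minimal sketch, not vPoller's actual worker code: `connect` is a hypothetical callable standing in for whatever opens the vCenter session, and the attempt count and delays are made-up values.

```python
import time


def wait_for_vcenter(connect, max_attempts=5, delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    `connect` is a hypothetical callable: it raises ConnectionError while
    the vCenter is unreachable and returns a session object once it is back.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            sleep(delay)
            delay = min(delay * 2, 60.0)  # back off, capped at one minute
```

The point of the loop is that the worker process stays alive while the vCenter is down instead of exiting on the first failed session.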
Hello Marin, Actually, I did not mean the second issue you mention (the crash when vCenter is not available) as the issue itself, but as a cause of the real issue/new feature. What I noticed is the following: I am monitoring all my VMs using the Zabbix agent, so I can also retrieve information about processes like bind, nginx, mysql, postgresql, etc. Secondly, I also want to monitor my VMs using vPoller at the same time, so I can have information like which hypervisor the VM runs on, the memory information from the hypervisor's point of view, etc. When I implement both Zabbix agent monitoring and vPoller monitoring by manually adding the template (not auto-detect, because then the VMs are read-only), then when vCenter is not available all my VMs that had the vPoller template also lose their connection to the Zabbix agent. This means all monitoring for the VM stops, while in reality only vCenter is unavailable. NOTE: I am not sure, but when I went back from 30 VMs monitored with vPoller and Zabbix agent to 4 mixed VMs and shut down the vCenter processes, this issue did not come back. I will redo my test with all 30 VMs and the vCenter shutdown to see if it comes back. Concerning the issue with vpoller-worker crashing when vCenter goes offline, I have created a new template which monitors the vCenter process and the port that is used to connect to the VMware API. When one of them goes into status PROBLEM, Zabbix executes a stop for vpoller-worker and vpoller-proxy. When Zabbix discovers that the API port is back online, it executes a start for vpoller-proxy and vpoller-worker. I will document this soon and also add it on GitHub. I hope this makes it clearer. Best Regards, Kurt |
Hi Kurt,
That is strange, because the Zabbix Agent and vPoller stuff are not connected at all. They don't even communicate with each other. One thing that I'm thinking of could be related though is this:
Do you happen to use a Zabbix Proxy or just use the Server for monitoring? This is just a shot in the dark, but here are a few things you could check when this happens:
On a side note regarding the read-only hosts which are discovered - exactly because of this Zabbix "feature" I'm using the You might want to check it out, as it provides some benefits over the LLD discovery... at least for me :)
Still this is a small number of VMs, so I'd be curious of the results when you are ready :)
That's a clever one! :) Please submit a pull request when you have the time to document it! Thanks, |
Marin, I confirm that this is probably a bug. When I configure 27 VMs that are also monitored with the Zabbix agent and shut down vCenter access, then once vCenter has been unavailable for more than 10 minutes the Zabbix agents of all monitored objects go into status not responding, which stops all monitoring for those objects. By all monitored objects I also mean objects that have no vPoller template attached to them, like my physical backup server, my physical firewall devices and network switches. When I try this with only a few VMs (4 or 6 tested) this issue does not happen. Kurt |
Hi Kurt, Were you able to check these things when that happened?
Regards, |
I have redone the test and indeed: The queue gets high and it could be related to this. So it means we should find a way to stop doing non agent-related checks. For your information, I have also done this test with VmBix and there I did not have this issue. I am not 100% sure, but it looks like it is related to this message I have in my logs: This could mean that no valid result is received, which causes Zabbix to queue up the requests. Using zabbix_get just returns that agent ping is OK. Kurt |
That error here:
Even when vCenter is down a vPoller request should return a proper JSON message, so you shouldn't see this error. Can you see for which item this error is logged? If you could then run the That way we can narrow it down and further troubleshoot this one. Regards, |
Is there an easy way to check where this parse error comes from? |
I have found one of them that is failing: [root@zabbix: /var/lib/zabbix/externalscripts]$ vpoller-cclient -m vm.disk.get -V vSphere -n vSphere -k C:\ -p capacity It looks like the JSON message is not complete when there is no reply coming from the server, especially in the C client, which I am using for my tests. |
If the vCenter is down, all of your vPoller requests should fail.
The error is triggered by one of the items when using the Best way to re-produce this is to call the Btw, are you using the vPoller Python or C client? If you are using the vPoller C client, can you please try and switch to the Python client and see if that happens again? Thanks, |
I am using the C client, and when I test all kinds of different requests the result that comes back is always like this: { "success": 1, "msg": "Did not receive reply from server, aborting..." It looks like the C client does not close the JSON output: the closing "}" is missing. This happens when there is no real result coming back, i.e. when vCenter is down. With the Python client there is no issue. |
Concerning the issue that all my Zabbix agents become unreachable: that is related to the number of entries in the queue. |
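A quick way to confirm the missing brace is to feed the client output to a JSON parser. The sketch below uses only Python's standard `json` module; `parse_client_output` is a hypothetical helper name, and the sample string is the output quoted in the comment above.

```python
import json


def parse_client_output(raw):
    """Return the decoded JSON object, or None if the text is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


# Output as reported from the C client: the closing brace is missing.
truncated = '{ "success": 1, "msg": "Did not receive reply from server, aborting..."'
# The same message with the brace added.
closed = truncated + " }"

print(parse_client_output(truncated))  # not valid JSON
print(parse_client_output(closed))     # decodes to a dict
```

Anything that consumes the client output as JSON will reject the truncated form, which would be consistent with the parse error seen in the logs.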
Just pushed a fix for the C client: Can you try it out? Also what is the size of the queue when you shutdown the vCenter? How many Zabbix Pollers are you running with? I've made tests with more vSphere Objects in the past and didn't see an issue then. I will try to re-produce this from my side as well, but I'll need more info about your Zabbix setup. Thanks, |
Hello Marin, I have just checked the new C client. Now the JSON output is closed. At the moment I have lost RDP access to my vSphere, so it is not easy to turn off the processes. But I stopped the worker process and then the output looks correct. I have 10 Zabbix Pollers. The queue size is 0 during the first 60 seconds; after 60 seconds I get more than 1500 entries in the 1 minute column and then the issue starts. It is related to this for sure, because I had my vCenter down for 30 minutes in my last test. And I see that the Zabbix agent is flipping all the time between available and not available. Kurt |
Hey Kurt, Here are a couple of ideas for improvement. Once a vCenter goes down one way to solve the issue with the increasing queue is to make all monitored items by Zabbix In that case Zabbix will stop monitoring them for a specified period of time (default is 10 minutes); after that it will retry to see whether the items have become available again. The vPoller Client returns a text message when it cannot reach the vCenter or the Proxy/Workers are not replying, and most of the Zabbix items expect a text message as the output. That is okay, but the downside of this is that the time it takes for the message to reach Zabbix increases due to the retry and timeout parameters of the vPoller Client. This leads to increasing the Zabbix queue as a result and might put a delay on other monitored items and probably a load on the Zabbix Server/Proxy. In order to make an item A few settings that might be tuned in order to handle such situations are:
Another thing we can do is to have a mechanism in vPoller Workers that can detect when a vCenter is down and return an error message quickly without waiting for any result from the vSphere API, but that could be tricky to do right. A third option I'm thinking of could be to automatically disable any vPoller items in Zabbix once a vCenter goes down similar to what you are already doing with the Proxy/Worker shutdown and restart in case you notice a vCenter is down. Although I think that making the items So to summarize the options I see we have in order to handle the case when a vCenter goes down and avoid a Zabbix queue which could impact other monitored items:
I'll test things out tomorrow about the What do you think about this? Any suggestions? Thanks, |
My feeling is that going to unsupported items is the best option together with an automatic reconnect. Kurt |
Marin, Let me know when you have something available for testing then I will put it in my test environment. Kurt |
Hey Kurt, I just did a small test (with ~ 20 VMs) and here are the results. After some time I stopped the vPoller Proxy/Worker, thereby making it impossible for a vPoller Client request to succeed. Essentially this is the same as stopping the vCenter, as no vSphere API calls would succeed either. So, after doing that my vPoller items in Zabbix started to time out (as expected) and soon enough all my vPoller items became When that happened I was not building up a Zabbix queue at all, since my vPoller items are checked only once in a while (default 10 minutes) to see if they have become available again. So once I start my vPoller Proxy/Workers again the items become supported as well. What is the timeout period for external scripts that you use? What is the number of retries of vPoller Client and timeout that you have? Can you also check what happens with the item values when you stop the vCenter service? Do these items become unsupported or do they get the "Did not receive reply from server, aborting..." value? Thanks, |
Hello Marin, The Timeout for the Zabbix server is set to 30 (30 seconds). I will do a new test tomorrow. Today I am busy with some other urgent tasks for work. Would you prefer that I bring the timeout value down to the default 3 seconds? This will cause issues in my work environment, because we check HP iLO systems using IPMI and need a higher timeout; with the standard timeout these checks do not work correctly. I can do the test with a timeout of 15 seconds, which is the value we use in our production environment. Kurt |
Hey Kurt, Okay, that explains a lot now :) So, vPoller Client by default retries 3 times, waiting 3 seconds each time. That means that after 9 seconds vPoller Client will give up, as this should be more than enough for a reply from vCenter to arrive. What happens after the 9 seconds is that vPoller Client returns the standard message - "Did not receive reply..." and that value is actually being used by Zabbix as the result. So if you now check your vPoller item values in Zabbix what you should find is that their value is now "Did not receive reply..." Since you have a small number of Zabbix Pollers, that means that each vPoller check now takes ~ 9 seconds to complete and that makes the queue increase, as you actually got a result from vPoller :) Okay, if you cannot update the timeout value - don't do it. I will see what's the best way to return a value for making an item unsupported and we can use that. That should solve the issue permanently :) Will get back to you soon with a patch to try out. Thanks, |
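The arithmetic behind this explanation can be written out as a rough back-of-the-envelope model. It is a simplification, not a description of Zabbix internals: it assumes every poller spends all of its time on vPoller checks that each block for the full client wait. The 3 retries and 3 seconds come from the comment above, and the 10 pollers from Kurt's earlier comment.

```python
def client_wait_seconds(retries=3, timeout=3):
    """Seconds before the vPoller Client gives up (defaults as stated above)."""
    return retries * timeout


def checks_per_minute(pollers, seconds_per_check):
    """Upper bound on checks completed per minute if every check blocks a poller."""
    return pollers * 60 / seconds_per_check


wait = client_wait_seconds()          # 3 retries * 3 seconds
capacity = checks_per_minute(10, wait)
print(wait, round(capacity, 1))
```

Under these assumptions 10 pollers finish fewer than 70 such checks per minute, so any larger number of checks due in that minute ends up waiting in the queue.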
Hello Marin, I can do a small update of the timeout from 30 to 15 in my test environment. Let's see what happens tomorrow if I do that. Of course, a solution that does not depend on the timeout setting from Zabbix would be better. Kurt |
Hey Kurt, No need to change the timeout values as you would be in the same situation :) Please remember that vpoller-client will abort after 9 seconds, so with your timeout of 15 seconds Zabbix will still get a reply from vPoller. In order for this to work you need a low timeout value, lower than the time it takes vpoller-client to return its abort message. Let me first prepare a patch, which you could try. Don't update the timeout for now. Marin |
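The relationship between the two timeouts can be stated as a simple comparison. This is an illustrative sketch only: `item_outcome` is a hypothetical name, the defaults are the 3 retries of 3 seconds mentioned earlier in the thread, and treating the equal case as a timeout is an assumption.

```python
def item_outcome(zabbix_timeout, retries=3, retry_timeout=3):
    """Say whether Zabbix receives the client's abort message or times out first."""
    client_gives_up_after = retries * retry_timeout
    if client_gives_up_after < zabbix_timeout:
        # The client answers first, so Zabbix stores its text as a normal value.
        return "value"
    # Zabbix stops waiting before the client answers.
    return "timeout"


for zabbix_timeout in (30, 15, 3):
    print(zabbix_timeout, item_outcome(zabbix_timeout))
```

With a Zabbix timeout of 30 or 15 seconds the client's 9-second abort message arrives in time, which is why lowering the timeout from 30 to 15 would not change the behaviour.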
Hey Kurt, So, here's what I have after my tests. When you have a Zabbix item that expects a string (text or character item data) there is no easy way to return any data that would make the item unsupported. No matter what you return for a text/character item Zabbix will consider this as valid and thus will never make the item unsupported. So for now the best way to make an item unsupported would be to simply have the vPoller Client with a higher timeout than the Zabbix externalcheck timeout. For example if your Make sure to adjust the Can you please update the Here are some sample values you could use:
The Let me know how it goes when you have the time to test. Thanks, |
Hello Marin, I will make the modifications and do the test again. Kurt |
Hello Marin, Still the same. Once vCenter has been down for 10 minutes or more, my Zabbix agents, even those of unrelated items, start becoming unreachable. I will play a little with the timeout parameters now to see if something changes. I will keep you updated. Kurt |
Hi Kurt, Problem with too high timeouts is that it will take some time for the items to become unsupported. That's the bad thing about too high timeouts. On a side note, just to let you know that I'm planning to have native vPoller support for Zabbix soon, which means that Zabbix will be able to talk to vPoller without externalchecks. This would solve a number of performance issues with externalchecks and also will provide a way to make an item unsupported as soon as we spot an issue with the communication to vCenter. You can track progress on the loadable module in this issue here - #51 Regards, |
Hello Marin, Thank you. I will try to find an easy solution to fix this from the wrapper scripts, by validating that the API is reachable before running the vpoller-client/vpoller-cclient command. Kurt |
Hello Marin, I have solved this by adding some tests in the wrapper scripts. I have added the following:
If one of them fails, the wrapper ends with a ZBX_NOTSUPPORTED and the reason: No API, no curl, no vpoller proxy. If they do not fail, the vPoller client does its work. My vCenter has been down for 45 minutes with these checks in place and there was no queue and there were no Zabbix agent failures. I have not yet committed it. Are you interested in having this code? Kurt |
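The pre-check idea can be sketched as follows. This is not Kurt's wrapper code, which is not shown in the thread: the helper names, the output format after `ZBX_NOTSUPPORTED`, and the use of a plain TCP connect as the reachability test are all assumptions made for illustration.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def precheck(checks):
    """Run (reason, check) pairs in order and report the first failure.

    Returns a ZBX_NOTSUPPORTED line with the reason of the first failing
    check, or None when every check passes and the client may be called.
    """
    for reason, check in checks:
        if not check():
            return f"ZBX_NOTSUPPORTED: {reason}"
    return None
```

A wrapper could build the list from whatever it needs, for example `("No API", lambda: port_open(vcenter_host, 443))` with a hypothetical `vcenter_host`, print the returned line and exit when `precheck` reports a failure, and only otherwise run the vPoller client. Failing fast this way avoids holding a Zabbix poller for the full client retry period.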
Hey Kurt, Sure, please submit a pull request for it. If you could just use a different filename for it, e.g. I'd like to keep both versions of the wrapper scripts separate for now if possible. Hopefully I will find more spare time soon, so I can work on the loadable module of vPoller for Zabbix. Having this loadable module ready should solve lots of these issues. Thanks, |
Hello Marin,
I have found another bug/new feature.
Situation: I would like to monitor my VMs using both vPoller and the Zabbix agent. This works as long as vpoller-worker has a connection to vCenter. When it fails: if vpoller-worker crashes or vCenter is not available, all Zabbix agents become unreachable.
Can you have a look into it?
Thanks in advance.
Kurt