Combined monitoring vpoller/zabbix agent does not work #39

Closed
blackcobra1973 opened this issue Aug 25, 2014 · 30 comments

Comments

@blackcobra1973
Contributor

Hello Marin,

I have found another bug/new feature.

Situation: I would like to monitor my VMs using both vPoller and the Zabbix agent. This works as long as the vpoller-worker has a connection to vCenter. It fails when the vpoller-worker crashes or when vCenter is not available: then all Zabbix agents become unreachable.

Can you have a look at it?

Thanks in advance.

Kurt

@blackcobra1973
Contributor Author

Marin,

I have done another test: I removed almost all of my double-monitored systems and kept just 4 of them, then shut down the vCenter process in Windows. This time the Zabbix agents did not become unavailable.

I will keep you up to date if I can reproduce it.

Kurt

@dnaeon
Owner

dnaeon commented Aug 25, 2014

Hey Kurt,

By monitoring your VMware environment with vPoller + Zabbix Agent you are referring to the possibility of using an Agent interface for all the vPoller requests, right?

If that is the case, I just wanted to let you know that I'm working on some Zabbix UserParameters which would allow this to be done; using zabbix_get(8) from the command line would work as well.

I'm almost half-way done with the UserParameters, so I hope to finish them by the end of this week.

As for the vPoller Worker crashes - it is not a bug, but I guess it needs improvement there.

When a vCenter server goes down (service is stopped, server crashed, etc) then all active sessions to the vCenter get disconnected.

At that point the vPoller Worker catches that as an exception and exits, since we cannot poll anything until the vCenter is back online.

I suppose it would be okay to add a retry mechanism in the vPoller Worker which would wait for the vCenter to come back online. That way the vPoller Worker won't exit and will restore normal operations once the vCenter is online again.
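
For illustration, a retry mechanism of this kind could look roughly like the Python sketch below. It is not vPoller's actual code: `connect` is a hypothetical callable that returns a session or raises `ConnectionError` while the vCenter is unreachable, and the attempt count and delays are arbitrary assumptions.

```python
import time


def wait_for_vcenter(connect, max_attempts=5, initial_delay=1.0,
                     max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, backing off between attempts.

    connect is a hypothetical callable: it returns a session object or
    raises ConnectionError while the vCenter is unreachable.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

The `sleep` parameter exists only so the waiting can be replaced in tests; a long-running worker would probably retry indefinitely rather than give up after a fixed number of attempts.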

I will file a separate issue for this where we can follow up.

Thanks,
Marin

@blackcobra1973
Contributor Author

Hello Marin,

Actually, I did not mention the crash when vCenter is unavailable as an issue in itself, but as a cause of the real issue/new feature.

What I noticed is the following:

I am monitoring all my VMs using the Zabbix agent, so I can also retrieve information about processes like bind, nginx, mysql, postgresql, etc.

Secondly, I also want to monitor my VMs using vPoller at the same time, so I can get information such as which hypervisor the VM runs on, the memory information from the hypervisor's point of view, etc.

When I set up both Zabbix agent monitoring and vPoller monitoring by manually adding the template (not auto-discovery, because then the VMs are read-only) and vCenter becomes unavailable, all my VMs that have the vPoller template also lose their connection to the Zabbix agent. This means all monitoring for the VM stops while in reality only vCenter is unavailable.

NOTE: I am not sure, but when I went from 30 VMs monitored with both vPoller and the Zabbix agent down to 4 mixed VMs and shut down the vCenter processes, this issue did not come back. I will redo my test with all 30 VMs and the vCenter shutdown to see if it returns.

Concerning the vpoller-worker crashing when vCenter goes offline: I have created a new template which monitors the vCenter process and the port used to connect to the VMware API. When either of them goes into status PROBLEM, Zabbix stops vpoller-worker and vpoller-proxy. When Zabbix detects that the API port is back online, it starts vpoller-proxy and vpoller-worker. I will document this soon and add it to GitHub as well.

I hope this makes it more clear.

Best Regards,

Kurt

@dnaeon
Owner

dnaeon commented Aug 26, 2014

Hi Kurt,

When I set up both Zabbix agent monitoring and vPoller monitoring by manually adding the template (not auto-discovery, because then the VMs are read-only) and vCenter becomes unavailable, all my VMs that have the vPoller template also lose their connection to the Zabbix agent. This means all monitoring for the VM stops while in reality only vCenter is unavailable.

That is strange, because the Zabbix Agent and vPoller stuff are not connected at all. They don't even communicate with each other.

One thing that I'm thinking of could be related though is this:

  1. vCenter goes down
  2. Zabbix continues to run vPoller requests against the 'dead' vCenter and builds a queue of delayed items
  3. Somewhere along the line the Zabbix Agents become "unavailable" because there are many delayed items, while the Agents are reachable in fact.

Do you happen to use a Zabbix Proxy or just use the Server for monitoring?

This is just a shot in the dark, but here are a few things you could check when this happens:

  1. Check the Zabbix queue
  2. Use zabbix_get(8) to check if a Zabbix Agent is reachable
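
The usual manual check would be something like `zabbix_get -s <agent-host> -k agent.ping`. As a rough complement, the Python sketch below only tests whether the agent's TCP port accepts connections. It assumes the agent listens on the default port 10050, and it does not prove that the agent actually answers item requests.

```python
import socket


def agent_port_reachable(host, port=10050, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    10050 is the default Zabbix agent port; pass another value if the
    agent is configured differently.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

If this returns True while Zabbix shows the agent as unavailable, the problem is more likely on the Zabbix Server/Proxy side (for example busy pollers) than on the agent itself.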

On a side note regarding the read-only hosts which are discovered: exactly because of this Zabbix "feature" I'm using the zabbix-vsphere-import script, which allows you to import your vSphere Objects as regular Zabbix hosts and perform updates on them (grouping, attaching templates, etc.).

You might want to check it out, as it provides some benefits over the LLD discovery... at least for me :)

NOTE: I am not sure, but when I went from 30 VMs monitored with both vPoller and the Zabbix agent down to 4 mixed VMs and shut down the vCenter processes, this issue did not come back. I will redo my test with all 30 VMs and the vCenter shutdown to see if it returns.

Still, this is a small number of VMs, so I'd be curious about the results when you are ready :)

Concerning the vpoller-worker crashing when vCenter goes offline: I have created a new template which monitors the vCenter process and the port used to connect to the VMware API. When either of them goes into status PROBLEM, Zabbix stops vpoller-worker and vpoller-proxy. When Zabbix detects that the API port is back online, it starts vpoller-proxy and vpoller-worker. I will document this soon and add it to GitHub as well.

That's a clever one! :)

Please submit a pull request when you have the time to document it!

Thanks,
Marin

@blackcobra1973
Contributor Author

Marin,

I confirm that this is probably a bug.

When I configure 27 VMs which are also monitored with the Zabbix agent and I shut down vCenter access, then once vCenter has been unavailable for more than 10 minutes, the Zabbix agents of all monitored objects go into status "not responding", which stops all monitoring for those objects.

By all monitored objects I also mean objects which do not have a vPoller template attached, like my physical backup server, my physical firewall devices and network switches.

When I try this with only a few VMs (4 or 6 tested) this issue does not happen.

Kurt

@dnaeon
Owner

dnaeon commented Aug 26, 2014

Hi Kurt,

Were you able to check these things when that happened?

  1. Check the Zabbix queue
  2. Use zabbix_get(8) to check if a Zabbix Agent is reachable

Regards,
Marin

@blackcobra1973
Contributor Author

I have redone the test and indeed:

The queue gets high and it could be related to this. So it means we should find a way to stop doing non agent-related checks.

For your information, I have also done this test with VmBix and there I did not have this issue. I am not 100% sure, but it looks like it is related to this message I have in my logs:
"parse error: Unfinished JSON term"

This could mean that there is no valid result received and causes zabbix to queue up the requests.

Using zabbix_get just returns that agent ping is ok.

Kurt

@dnaeon
Owner

dnaeon commented Aug 26, 2014

That error here:

parse error: Unfinished JSON term

Even when vCenter is down a vPoller request should return a proper JSON message, so you shouldn't see this error.

Can you see for which item this error is logged? If so, could you then run the zabbix-vpoller command manually and see if it returns the same message?

That way we can narrow it down and further troubleshoot this one.

Regards,
Marin

@blackcobra1973
Contributor Author

Is there an easy way to check where this parse error comes from ?

@blackcobra1973
Contributor Author

I have found 1 of them that is failing:

[root@zabbix: /var/lib/zabbix/externalscripts]$ vpoller-cclient -m vm.disk.get -V vSphere -n vSphere -k C:\ -p capacity
{ "success": 1, "msg": "Did not receive reply from server, aborting..."

It looks like the JSON message is not complete when there is no reply coming from the server, especially in the c-client which I am using for my tests.
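
One quick way to confirm that a reply is truncated is to feed it to a JSON parser. The Python sketch below does this with the standard `json` module, using the output above as input; `parse_reply` is a hypothetical helper name, not part of vPoller.

```python
import json


def parse_reply(raw):
    """Return (True, decoded value) or (False, parse error text)."""
    try:
        return True, json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, str(exc)


# The reply as printed by the C client above: the closing brace is missing.
truncated = '{ "success": 1, "msg": "Did not receive reply from server, aborting..."'
# The same reply with the closing brace added.
complete = truncated + " }"

for label, raw in (("truncated", truncated), ("complete", complete)):
    ok, value = parse_reply(raw)
    print(label, "parses" if ok else "fails to parse")
```

Only the second string is valid JSON, which matches the "Unfinished JSON term" parse error seen in the logs.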

@dnaeon
Owner

dnaeon commented Aug 26, 2014

If the vCenter is down, all of your vPoller requests should fail.

Is there an easy way to check where this parse error comes from ?

The error is triggered by one of the items when using the zabbix-vpoller or zabbix-cvpoller wrapper scripts. The result returned is not a proper JSON message and that makes jq complain about it.

The best way to reproduce this is to call the zabbix-vpoller wrapper script manually and see which item returns that error.

Btw, are you using the vPoller Python or C client?

If you are using the vPoller C client, can you please try and switch to the Python client and see if that happens again?

Thanks,
Marin

@blackcobra1973
Contributor Author

I am using the c-client, and when I do tests with all kinds of different requests the result that comes back is always like this:

{ "success": 1, "msg": "Did not receive reply from server, aborting..."

It looks like the c-client does not close the JSON output: the closing "}" is missing. This happens when no real result comes back, i.e. when vCenter is down.

Using the python client there is no issue.

@blackcobra1973
Contributor Author

Concerning the issue that all my Zabbix agents become unreachable: that is related to the number of entries in the queue.

@dnaeon
Owner

dnaeon commented Aug 26, 2014

Just pushed a fix for the C client:

Can you try it out?

Also what is the size of the queue when you shutdown the vCenter? How many Zabbix Pollers are you running with?

I've run tests with more vSphere Objects in the past and didn't see an issue then. I will try to reproduce this on my side as well, but I'll need more info about your Zabbix setup.

Thanks,
Marin

@blackcobra1973
Contributor Author

Hello Marin,

I have just checked the new c-client. Now the JSON output is closed. At the moment I have lost RDP access to my vSphere, so it is not easy to turn off the processes. But I stopped the worker process and then the output looks correct.

I have 10 Zabbix Pollers. The queue size is 0 during the first 60 seconds; after 60 seconds I get more than 1500 entries in the 1 minute column, and then the issue starts. It is related to this for sure, because I had my vCenter down for 30 minutes in my last test. I also see that the Zabbix agent keeps flipping between available and not available.

Kurt

@dnaeon
Owner

dnaeon commented Aug 26, 2014

Hey Kurt,

Here are a couple of ideas for improvement.

Once a vCenter goes down, one way to solve the issue with the increasing queue is to make all items monitored by Zabbix unsupported.

In that case Zabbix will stop monitoring them for a specified period of time (default is 10 minutes); after that it will retry to see whether the items have become available again.

The vPoller Client returns a text message when it cannot reach the vCenter or the Proxy/Workers are not replying, and most of the Zabbix items expect a text message as the output. That is okay, but the downside is that the time it takes for the message to reach Zabbix increases due to the retry and timeout parameters of the vPoller Client.

This leads to increasing the Zabbix queue as a result and might put a delay on other monitored items and probably a load on the Zabbix Server/Proxy.

In order to make an item unsupported we could simply return a value that Zabbix is not expecting to receive and that's it. That should indicate an error, which would be enough to make the item unsupported.

A few settings that might be tuned in order to handle such situations are:

  • Lower the timeout of external scripts in Zabbix Server/Proxy
  • Lower the timeout of vPoller Client, so that result is sent to Zabbix sooner
  • Lower the number of retries of vPoller Client

Another thing we can do is to have a mechanism in vPoller Workers that can detect when a vCenter is down and return an error message quickly without waiting for any result from the vSphere API, but that could be tricky to do right.

A third option I'm thinking of could be to automatically disable any vPoller items in Zabbix once a vCenter goes down similar to what you are already doing with the Proxy/Worker shutdown and restart in case you notice a vCenter is down. Although I think that making the items unsupported might be a better way to do this.

So, to summarize, these are the options I see for handling the case when a vCenter goes down and avoiding a Zabbix queue which could impact other monitored items:

  • Make items unsupported when a vCenter goes down by returning a proper error result to Zabbix
  • Detect vCenter failures in vPoller and send error messages early
  • Tune some of the Zabbix & vPoller settings in order to fail early and quickly
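
As a rough idea of what "detect vCenter failures and send error messages early" could mean, here is a minimal Python sketch of a circuit-breaker style guard. It is an assumption about one possible design, not existing vPoller code: `VCenterBreaker`, the 60 second cool-down and the use of `ConnectionError` are all hypothetical.

```python
import time


class VCenterBreaker:
    """Fail fast for a cool-down period after a vCenter failure."""

    def __init__(self, cooldown=60.0, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock
        self.down_since = None  # time of the last observed failure

    def call(self, request):
        """Run request() unless the vCenter was recently seen down."""
        if self.down_since is not None:
            if self.clock() - self.down_since < self.cooldown:
                # Skip the slow API call and report the error immediately.
                raise ConnectionError("vCenter marked down, failing fast")
            self.down_since = None  # cool-down over, probe again
        try:
            return request()
        except ConnectionError:
            self.down_since = self.clock()
            raise
```

With a guard like this only the first request after a failure pays the full timeout; later requests fail immediately until the cool-down expires and a new probe is attempted.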

I'll test things out tomorrow about the unsupported items in Zabbix and will let you know how it goes.

What do you think about this? Any suggestions?

Thanks,
Marin

@blackcobra1973
Contributor Author

My feeling is that going to unsupported items is the best option together with an automatic reconnect.

Kurt

@blackcobra1973
Contributor Author

Marin,

Let me know when you have something available for testing then I will put it in my test environment.

Kurt

@dnaeon
Owner

dnaeon commented Aug 27, 2014

Hey Kurt,

I just did a small test (with ~ 20 VMs) and here are the results.

After some time I stopped the vPoller Proxy/Worker, making it impossible for any vPoller Client request to succeed. Essentially this is the same as stopping the vCenter, as no vSphere API calls would succeed either.

So, after doing that my vPoller items in Zabbix started to time out (as expected) and soon enough all my vPoller items became unsupported.

When that happened I did not build up a Zabbix queue at all, since my vPoller items are checked only once in a while (default 10 minutes) to see if they have become available again.

So once I started my vPoller Proxy/Workers again, the items became supported as well.

What is the timeout period for external scripts that you use? What is the number of retries of vPoller Client and timeout that you have?

Can you also check what happens with the item values when you stop the vCenter service? Do these items become unsupported, or do they get the "Did not receive reply from server, aborting..." value?

Thanks,
Marin

@blackcobra1973
Contributor Author

Hello Marin,

The Timeout for the Zabbix server is set to 30 (30 seconds).

I will do a new test tomorrow. Today I am busy with some other urgent tasks for work.

Would you prefer that I bring the timeout value down to the default of 3 seconds? This would cause problems in my work environment: we check HP iLO systems using IPMI and need a higher timeout, because with the standard timeout these checks do not work correctly. I can do the test with a timeout of 15 seconds, which is the value we use in our production environment.

Kurt

@dnaeon
Owner

dnaeon commented Aug 27, 2014

Hey Kurt,

Okay, that explains a lot now :)

So, by default vPoller Client retries 3 times, waiting 3 seconds each time.

That means that after 9 seconds vPoller Client will give up as this should be more than enough for a reply from vCenter to arrive.

What happens after the 9 seconds is that vPoller Client returns the standard message - "Did not receive reply..." and that value is actually being used by Zabbix as the result.

So if you now check your vPoller item values in Zabbix what you should find is that their value is now "Did not receive reply..."

Since you have a small number of Zabbix Pollers, that means that each vPoller check now takes ~ 9 seconds to complete and that makes the queue increase, as you actually got a result from vPoller :)
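
The arithmetic behind this can be sketched in a few lines of Python. The numbers come from this thread (3 retries of 3 seconds, 10 pollers); the function names are made up for illustration, and the model is deliberately simplified: it assumes every poller is busy with a failing vPoller check and ignores all other checks.

```python
def client_abort_time(retries, timeout_seconds):
    """Seconds before the client gives up: one timeout per retry."""
    return retries * timeout_seconds


def poller_throughput(pollers, seconds_per_check):
    """Checks per minute that a pool of pollers can complete."""
    return pollers * 60.0 / seconds_per_check


abort = client_abort_time(retries=3, timeout_seconds=3)          # 9 seconds
per_minute = poller_throughput(pollers=10, seconds_per_check=abort)
print(f"each failing check blocks a poller for {abort} s")
print(f"10 pollers complete about {per_minute:.1f} such checks per minute")
```

Under these assumptions the pollers complete roughly 67 failing checks per minute, so if more vPoller items than that fall due each minute the queue keeps growing and other checks are delayed as well.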

Okay, if you cannot update the timeout value - don't do it. I will see what's the best way to return a value for making an item unsupported and we can use that.

That should solve the issue permanently :)

Will get back to you soon with a patch to try out.

Thanks,
Marin

@blackcobra1973
Contributor Author

Hello Marin,

I can do a small update of the timeout from 30 to 15 in my test environment. Let's see what happens tomorrow if I do that.

Of course, if there is a solution which does not depend on the timeout setting of Zabbix, that would be better.

Kurt

@dnaeon
Owner

dnaeon commented Aug 27, 2014

Hey Kurt,

No need to change the timeout values as you would be in the same situation :)

Please remember that vpoller-client will abort after 9 seconds, so your timeout of 15 seconds will still get a reply from vPoller to Zabbix.

In order for this to work you need a low timeout value in Zabbix, lower than the time it takes vpoller-client to give up and print its abort message.

Let me first prepare a patch, which you could try.

Don't update the timeout for now.

Marin

@dnaeon
Owner

dnaeon commented Aug 28, 2014

Hey Kurt,

So, here's what I have after my tests.

When you have a Zabbix item that expects a string (text or character item data) there is no easy way to return any data that would make the item unsupported. No matter what you return for a text/character item Zabbix will consider this as valid and thus will never make the item unsupported.

So for now the best way to make an item unsupported would be to simply have the vPoller Client with a higher timeout than the Zabbix externalcheck timeout.

For example, if your Zabbix externalcheck timeout value is 30 seconds, then in order to "force" a timeout of vPoller items simply set the --timeout option of vpoller-client to something higher than 30 seconds, e.g. 31 seconds.

Make sure to set the --retries option of vpoller-client to 1 as well, since 30 seconds is more than enough, so that vpoller-client does not retry after the initial timeout.

Can you please update the --timeout and --retries settings in your zabbix-vpoller wrapper script and re-run your tests?

Here are some sample values you could use:

  • If externalcheck timeout value is 30 seconds then set the --timeout 31000 and --retries 1

The --timeout option expects the value in milliseconds, so make sure to use the proper value.
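
To avoid mixing up seconds and milliseconds, the values can be derived from the Zabbix timeout. The Python sketch below does that; `vpoller_client_options` and the one second margin are assumptions for illustration, while `--timeout` and `--retries` are the options mentioned above.

```python
def vpoller_client_options(zabbix_timeout_seconds, margin_seconds=1):
    """Build vpoller-client options that outlast the Zabbix timeout.

    The client timeout is set slightly above the Zabbix externalcheck
    timeout so that Zabbix gives up first, and retries are limited to 1.
    """
    timeout_ms = (zabbix_timeout_seconds + margin_seconds) * 1000
    return ["--timeout", str(timeout_ms), "--retries", "1"]


# Zabbix externalcheck timeout of 30 seconds, as in the example above.
print(" ".join(vpoller_client_options(30)))
```

For a 30 second Zabbix timeout this yields `--timeout 31000 --retries 1`, the same values as in the sample above.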

Let me know how it goes when you have the time to test.

Thanks,
Marin

@blackcobra1973
Contributor Author

Hello Marin,

I will make the modifications and do the test again.

Kurt

@blackcobra1973
Contributor Author

Hello Marin,

Still the same. Once vCenter has been down for 10 minutes or more, my Zabbix agents, even those of unrelated items, start becoming unreachable.

I will play a little with the timeout parameters now, to see if something happens.

I will keep you updated.

Kurt

@dnaeon
Owner

dnaeon commented Aug 28, 2014

Hi Kurt,

The problem with very high timeouts is that it takes a long time for the items to become unsupported.

On a side note, just to let you know that I'm planning to have native vPoller support for Zabbix soon, which means that Zabbix will be able to talk to vPoller without externalchecks.

This would solve a number of performance issues with externalchecks and also will provide a way to make an item unsupported as soon as we spot an issue with the communication to vCenter.

You can track progress on the loadable module in this issue here - #51

Regards,
Marin

@blackcobra1973
Contributor Author

Hello Marin,

Thank you. I will try to find an easy solution to fix this from the wrapper scripts, by validating that the API is reachable before running the vpoller-client/vpoller-cclient command.

Kurt

@blackcobra1973
Contributor Author

Hello Marin,

I have solved this by adding some tests in the wrapper scripts.

I have added the following:

  • Check if curl is available on the system (required to test if the vCenter SDK is available)
  • Added a curl check against the host given in parameter -V to see if it responds
  • Check the status of the vpoller-proxy

If one of them fails, the wrapper ends with ZBX_NOTSUPPORTED and the reason: no API, no curl, no vpoller proxy.

If none of them fail, the vPoller client does its work.
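
The control flow of such a wrapper can be sketched in Python as below. This is not Kurt's actual script (which is not shown in this thread): `run_with_prechecks` is a hypothetical helper, the checks are passed in as plain callables, and the exact format of the ZBX_NOTSUPPORTED line is an assumption.

```python
def run_with_prechecks(prechecks, run_client):
    """Return the client output, or a ZBX_NOTSUPPORTED line on failure.

    prechecks is a list of (reason, check) pairs, where check is a
    callable returning True when the precondition holds. run_client is
    only called when every check passes.
    """
    for reason, check in prechecks:
        if not check():
            # Stop at the first failed precondition and report why.
            return "ZBX_NOTSUPPORTED: " + reason
    return run_client()


# Example with stub checks; real checks would probe curl, the vCenter
# API and the vpoller-proxy.
prechecks = [
    ("no curl", lambda: True),
    ("no API", lambda: False),   # pretend the vCenter API is unreachable
    ("no vpoller proxy", lambda: True),
]
print(run_with_prechecks(prechecks, lambda: "client output"))
```

A real curl check could be as simple as `shutil.which("curl") is not None`. The point of the design is that every check returns quickly, so Zabbix gets an answer right away instead of waiting for the client to time out.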

My vCenter has been down for 45 minutes with these checks in place and there was no queue build-up and there were no Zabbix agent failures.

I have not yet committed it. Are you interested in having this code?

Kurt

@dnaeon
Owner

dnaeon commented Aug 28, 2014

Hey Kurt,

Sure, please submit a pull request for it.

If you could, please use a different filename for it, e.g. zabbix-vpoller-with-checks or something similar.

I'd like to keep both versions of the wrapper scripts separate for now if possible.

Hopefully I will find more spare time soon, so I can work on the loadable module of vPoller for Zabbix.

Having this loadable module ready should solve lots of these issues.

Thanks,
Marin

@blackcobra1973 blackcobra1973 mentioned this issue Aug 28, 2014
@dnaeon dnaeon closed this as completed Aug 28, 2014