gearmand hangs intermittently when timeouts are used #301
So we're reasonably confident now (ha) that the problem is https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L124-L131, however we can't see how it would ever break out of that loop, as nothing sets the
Thank you for your elaborate report, @infraweavers. Do you know how the bug could be reproduced?
@p-alik unfortunately not; our reproduction case currently is to upgrade one of our monitoring servers, deploy the entire configuration to it, and wait/restart the services a few times. With our config applied there are about 1200 checks per minute running, which should work out at about 2400 jobs being pushed onto gearman itself. We're trying to narrow it down to a reproducible case currently, however it's difficult for us to track the exact scenario in which it happens as it appears "random".
Does gearmand/libgearman-server/worker.cc Lines 127 to 130 in 8f58a8b
We don't believe so; we see no logs for that, nor have we seen it happen when "nexting" through. I'll put a breakpoint on it and confirm now.
@p-alik No, it doesn't.
Despite some "if nok"-branches there are only two: gearmand/libgearman-server/job.cc Line 340 in 8f58a8b
Wow, that's a doozy. It looks like gearmand is having trouble tearing down workers that disconnect while they have jobs assigned to them. That should work; code is present to handle the situation. But I will say, you may want to address this defensively while we work this out, by making sure your workers avoid quitting while they have a job assigned.
The state machine in gearmand is basically the hardest part to debug, so forgive me if I'm asking a bunch of random questions, but I'd like to trace through a few parts of the code with some ideas in mind:
- How often do your workers disconnect?
- How many threads are you running with?
- Do you run with retries configured? If so, do you get any logged failures to retry?
- If you use foreground jobs, do you get WORK_FAIL packets sent to your clients?
- Are you primarily using background or foreground jobs?
- Do you have a stateful store set up for background jobs? If so, which one?
Finally, if you've got a debugger attached in one of these situations, can you watch the job_list and let us know what is in it? One suspicion I have is that the job list is somehow getting corrupted and is no longer null-terminated. TBH, I've been out of the code long enough, and I have been avoiding revisiting those macros long enough, that I'm not even sure that's how they're supposed to work. I'd love to replace them with a nice C++ std::list, because zomg this is my nightmare, but for now, if you can just verify that the list does in fact only go in one direction, ends with the _next element set to NULL, and is not looped back onto itself, that would help.
Also, I wish I could offer some of my time to dig through a core file or something, but I don't have that kind of time outside my day job, and gearmand is not part of my day job. You may want to seek out a consultant with deep C/C++ skills who can take your money on this one. ;)
I am trying to give some answers for naemon's mod-gearman in general:
We do have a debugger attached in a VM snapshot whilst the infinite loop is ongoing; we'll try and get some useful output. Do you have a recommended way of "seeing" what you want to see?
Again, now looking at job.cc https://github.com/gearman/gearmand/blob/master/libgearman-server/job.cc#L424-L435, it looks like the next item in the job_list is the first item in the job_list, so it will just loop forever. I've tried to demonstrate it here, as you can see from the printed references:
Possibly related to issue #119?
It seems a worker registered with CAN_DO_TIMEOUT couldn't finish its job in time: gearmand/libgearman-server/connection.cc Lines 655 to 657 in 8f58a8b
gearmand/libgearman-server/server.cc Lines 730 to 731 in 8f58a8b
The errors could be reproduced with:
use v5.10;
use strict;
use warnings;
use Gearman::Worker;
my $timeout = 2;
my $worker = Gearman::Worker->new(debug => 1);
$worker->job_servers({ host => "localhost", port => 4730 },);
$worker->register_function(
'x', $timeout,
sub {
say "going to sleep";
sleep $timeout * 2;
}
);
$worker->work(
on_complete => sub {
my ($jobhandle, $result) = @_;
say "on complete $jobhandle";
},
on_fail => sub {
my ($jobhandle, $err) = @_;
say "on fail $jobhandle";
},
) while 1;
I haven’t done much with this part of the code, so I really don’t know what I’m talking about, but this line looks suspicious to me: gearmand/libgearman-server/worker.cc Line 126 in e2d76cf
It’s passing
We don't need a new job structure. What that code attempts to do is remove it from the currently assigned worker and push it back into the unassigned queue. The whole of gearmand works this way, adjusting references.
It feels like we may be zeroing in on it. I think the bug is exactly there, @esabol, but not for the reasons you might be thinking. My theory, which I'm testing just by reading code (and that's kinda hard), is that the timeout screws up the list, leaving it looping around on itself. I'll try to find some time to debug tonight or tomorrow. We can probably write a regression test around this scenario pretty easily.
@p-alik we've been playing with that client arrangement and it does seem that "sometimes" it can trigger the event that causes gearmand to lock up; however, we still have the other load on the gearman instance as well, so it's possible that it's just happening randomly and correlates with us using the perl worker.
@infraweavers, perl Gearman::Worker dies on the first "GEARMAN_JOB_NOT_FOUND: Job given in work result not found" because it gets an unexpected ERROR in response to GRAB_JOB. The worker implementation in your case seems to proceed to interact with
@p-alik Yeah, I wonder if that's what triggers the actual problem. We've not managed to use the perl worker in a loop on an isolated gearmand and reproduce the problem yet, so it feels like it may not be quite as clear-cut as that.
We've been watching it a lot today to try and get some handle on the situation where it happens; we're pretty sure that, before it breaks, we've always ended up with a -1 in the Jobs Running column of gearman_top. A few seconds later it seems to end up stuck in that loop again.
@infraweavers, I hope #302 solves the -1-issue. At least in my aforementioned test it did.
The next worker handled the job without a timeout and exited afterwards:
@p-alik we weren't able to consistently get the -1 issue using the aforementioned test however we have found that if we:
Then gearmand will crash out; however, it's in a completely different way from the original issue raised, as the process vanishes rather than getting stuck in an infinite loop. I've attached a gif of this problem; see also worker_that_doesnt_timeout.pl.txt. We'll apply the fix from #302 and see if that resolves it crashing in this situation.
That's odd running
An attempt to kill
For #302 second worker worker_that_doesnt_timeout.pl.txt dies with "unexpected packet type: error [JOB_NOT_FOUND -- Job given in work result not found]". But without
Interesting, so we've just built 8f58a8b and the behaviour is unchanged for us (i.e. the same as the gif above). We're building and running on Debian 9.13 x64, freshly installed to test this whole scenario out, so there's a minimum amount of stuff installed etc. It's very odd that we don't have any logs in
I strongly disagree with reverting cfa0585. All you’re doing is changing the timeout of your worker from 2 ms to 2 seconds. You could achieve the same result by setting your worker timeout to 2000 ms. A 2 ms timeout is ridiculously short and unrealistic, and that is presumably what is contributing to the race condition.
That agrees with the working theory. My theory being that less threads ==
more chance of the wrong line of execution winning the race and thus
double-mutating the list.
On Mon, Sep 7, 2020 at 11:49 AM, A Codeweavers Infrastructure Bod wrote:
Anyway, working theory is just a good old thread-unsafe race condition. If
you have more threads, and better latency as a result, then you make it
hard for the race to go the wrong way.
That *would* make sense, however I am able to reproduce the problem with
--threads=1 ?
It appears to me without cfa0585
Ah, OK. I was going to add a bunch of debug logging statements to prove it one way or another, but I don't have to now. Or maybe that would still be useful? One of the things I noticed in the code was that, if the timeout is less than 1000 ms, it sets it to 1000 ms. gearmand/libgearman-server/connection.cc Lines 687 to 693 in aa41156
So whether you specify a timeout of 2 ms or 999 ms, it should have the same result. Does that agree with what you're seeing?
It took me a while to locate. I'm now wondering if the code is even doing timeouts correctly.
The modern libevent2 API is:

struct event *ev;
struct timeval tv;

static void cb(int sock, short which, void *arg) {
    if (!evtimer_pending(ev, NULL)) {
        event_del(ev);
        evtimer_add(ev, &tv);
    }
}

int main(int argc, char **argv) {
    struct event_base *base = event_base_new();
    tv.tv_sec = 10;
    tv.tv_usec = 0;
    ev = evtimer_new(base, cb, NULL);
    evtimer_add(ev, &tv);
    event_base_loop(base, 0);
    return 0;
}

This code looks very different from the following: gearmand/libgearman-server/connection.cc Lines 699 to 723 in aa41156
Maybe this code needs to be modernized? Something like:

con->timeout_event= evtimer_new(dcon->thread->base, _server_job_timeout, job);
if (con->timeout_event == NULL)
{
  return gearmand_perror(errno, "evtimer_new() failed in gearman_server_con_add_job_timeout");
}

struct timeval timeout_tv = { 0 , 0 };
time_t milliseconds= worker->timeout;
timeout_tv.tv_sec= milliseconds / 1000;
timeout_tv.tv_usec= (suseconds_t)((milliseconds % 1000) * 1000);
evtimer_add(con->timeout_event, &timeout_tv);
@esabol presumably if the old version is actually causing this problem, it'd be firing the event twice or similar? We can add logging and confirm if that's the case.
Yes.
vscode + ms-vscode.cpptools shows it just in time.
Judging from the comment in event_compat.h it makes sense in any case, but I don't think that helps to solve this issue.
I can get this on 1.1.8. TL;DR: is there something to test with? I'm using a Perl system where client absence hangs Gearman. I'm currently building 1.1.9.1 to check if it's still present.
If it's the same problem, @illuusio, it will still be there. We've worked around it by removing the timeout handling from one of our gearman workers; that seems to have helped massively for us. Is your use case reproducible?
@infraweavers I can try to make it. It hangs when a client just drops randomly. I'll test it more to get a reproducible example.
@infraweavers This was a new way to do it (I just found it). Just supply a missing RPC function and it hangs. I have to check if this is a false positive, as I assume. I'd expect it to just come back and say there is no such function, but this could just be a me-not-reading-enough-documentation problem. But the freezing is the same. It just hangs.
This one stays in handle->wait forever and ever, waiting for the task to complete, as Gearmand seems to be happy that there are no jobs available. This one only seems to freeze the one client, not all of them, but I think the reason is the same. I'll make a very badly behaving client.
Do you mean 1.1.18 and 1.1.19.1?
Maybe I don’t have enough information, but it doesn’t seem the same to me. This issue is clearly related to jobs that time out. If you’re experiencing a hang and it doesn’t involve a task with a specified timeout, then I think what you are describing (and from the log snippet) is a separate issue.
@esabol I got Gearmand to freeze on 1.1.8. I've got to test 1.1.19.1 more to have a clue whether it's doing the same. I have a hunch it's something to do with bidirectional transfer, when the worker is sending state information back to the client. I'll try to make a test for this next week to prove or falsify it.
That’s possible. A long time ago, when I was looking into some SSL errors, I had the suspicion that gearmand needed to have exclusive pthreads locks whenever sending and receiving. That might have been partially alleviated by upgrades to OpenSSL, which incorporated better thread-safety at a lower level, but it might still be an issue. See, for example: https://github.com/LibVNC/libvncserver/pull/389/files But @infraweavers isn’t using SSL. Could be a similar thing with regards to handling of the timeout, though?
The issue reproduces with --threads=1, so I'm leaning more towards @p-alik's theory of double-free than thread unsafety.
Agreed. My previous comment pertained to @illuusio’s issue, which, as I wrote previously, sounded like a different thing.
FWIW: From our investigations we've not seen anything occur which is asynchronous within gearmand, however given that it doesn't happen every time, we think it's still a "concurrent" problem, but based on when things happen (i.e. timeouts, retries, reconnection, connection close etc) outside of gearmand rather than within, which causes the problem.
Thanks for the update. I still believe quite strongly that it is a
race condition that is simply triggered by the way your workers and
clients interact with Gearman.
Hopefully somebody with some time and context on the problem will
figure it out. :-P
this patch addresses some issues in gearman worker mode:
- when forking a child, the forked %children contains all created childs so far, so if that child receives a sigint, it will kill its siblings.
- check for exited childs in a loop in case multiple childs exited at once
- do not remove the pidfile when a children receives a sigint
- fix issue with gearman jobs having a timeout, for details see gearman/gearmand#301 and ConSol-Monitoring/omd#107
Hello!
We've recently upgraded to OMD 3.30, which has replaced the ancient gearmand 0.33 that was shipped previously with gearmand 1.1.19.
This includes mod-gearman (for naemon) and pnp_gearman_worker (for pnp4nagios). We find that periodically all our naemon checks stop responding. At the same time when gearman_top is executed, we get an error:
We replaced the build of gearmand with one with symbols in it, waited for the problem to occur (about 15-20 minutes) in our setup and found that gearmand seems to be in an infinite loop here:
https://github.com/gearman/gearmand/blob/master/libgearman-server/job.cc#L399-L417
It looks like noop_sent is never getting incremented, as it is 0, and then it's never getting into the if, so it's never sending the noop to any worker and is then stuck forever.

Edit: Running
so actually, it doesn't look like the noop situation is the problem; there must be something wrong with the cyclic list, so that head is now not the loop point? We'll try and confirm.
Edit: doesn't look like it's that. We've just stepped through the loop 200 times and the list seems to loop correctly:
gdb.txt
Edit: again, this is all wrong. It is leaving the loop, as we put a breakpoint after it; we think it's just that every time we gcore it, the code just so happens to be inside that loop. Presumably there's an outer loop that it's actually stuck in.
Actually! We think it's this loop that's looping forever: https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L124-L131, as we put a breakpoint on https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L133 and it's not getting hit despite the program continuing for a few seconds.