gearmand hangs intermittently when timeouts are used #301
So we're reasonably confident now (ha) that the problem is https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L124-L131, however we can't see how it would ever break out of that loop, as nothing sets the
Thank you for your elaborate report, @infraweavers. Do you know how the bug could be reproduced?
@p-alik unfortunately not; our reproduction case currently is to upgrade one of our monitoring servers, deploy the entire configuration to it, and wait/restart the services a few times. With our config applied there are about 1200 checks per minute running, which should work out at about 2400 jobs being pushed onto gearman itself. We're trying to narrow it down to a reproducible case currently, however it's difficult for us to track the exact scenario in which it happens as it appears "random".
Does gearmand/libgearman-server/worker.cc Lines 127 to 130 in 8f58a8b
We don't believe so; we see no logs for that, nor have we seen it happen when "nexting" through. I'll put a breakpoint on it and confirm now.
@p-alik No, it doesn't.
Despite some "if nok"-branches there are only two: gearmand/libgearman-server/job.cc Line 340 in 8f58a8b
Wow, that's a doozy. It looks like gearmand is having trouble tearing down workers that disconnect while they have jobs assigned to them. That should work; code is present to handle the situation. But I will say, you may want to address this defensively while we work this out, by making sure your workers avoid quitting while they have a job assigned.
The state machine in gearmand is basically the hardest part to debug, so forgive me if I'm asking a bunch of random questions, but I'd like to trace through a few parts of the code with some ideas in mind:
- How often do your workers disconnect?
- How many threads are you running with?
- Do you run with retries configured? If so, do you get any logged failures to retry?
- If you use foreground jobs, do you get WORK_FAIL packets sent to your clients?
- Are you primarily using background or foreground jobs?
- Do you have a stateful store set up for background jobs? If so, which one?
Finally, if you've got a debugger attached in one of these situations, can you watch the job_list and let us know what is in it? One suspicion I have is that the job list is somehow getting corrupted and is no longer null-terminated. TBH, I've been out of the code long enough, and I have been avoiding revisiting those macros long enough, that I'm not even sure that's how they're supposed to work. I'd love to replace them with a nice C++ std::list, because zomg this is my nightmare, but for now, if you can just verify that the list does in fact only go in one direction, ends with the _next element set to NULL, and is not looped back onto itself, that would help.
Also, I wish I could offer some of my time to dig through a core file or something, but I don't have that kind of time outside my day job, and gearmand is not part of my day job. You may want to seek out a consultant with deep C/C++ skills who can take your money on this one. ;)
I am trying to give some answers for naemon's mod-gearman in general:
We do have a debugger attached in a VM snapshot whilst the infinite loop is ongoing; we'll try and get some useful output. Do you have a recommended way of "seeing" what you want to see?
Again, now looking at job.cc https://github.com/gearman/gearmand/blob/master/libgearman-server/job.cc#L424-L435, it looks like the next item in the job_list is the first item in the job_list, so it will just loop forever. I've tried to demonstrate it here, as you can see from the printed references:
Possibly related to issue #119?
It seems a worker registered with CAN_DO_TIMEOUT couldn't finish its job in time: gearmand/libgearman-server/connection.cc Lines 655 to 657 in 8f58a8b
gearmand/libgearman-server/server.cc Lines 730 to 731 in 8f58a8b
The errors could be reproduced with:
use v5.10;
use strict;
use warnings;
use Gearman::Worker;
my $timeout = 2;
my $worker = Gearman::Worker->new(debug => 1);
$worker->job_servers({ host => "localhost", port => 4730 },);
$worker->register_function(
'x', $timeout,
sub {
say "going to sleep";
sleep $timeout * 2;
}
);
$worker->work(
on_complete => sub {
my ($jobhandle, $result) = @_;
say "on complete $jobhandle";
},
on_fail => sub {
my ($jobhandle, $err) = @_;
say "on fail $jobhandle";
},
) while 1;
I haven’t done much with this part of the code, so I really don’t know what I’m talking about, but this line looks suspicious to me: gearmand/libgearman-server/worker.cc Line 126 in e2d76cf
It’s passing
We don't need a new job structure. What that code attempts to do is remove it from the currently assigned worker and push it back into the unassigned queue. The whole of gearmand works this way, adjusting references.
It feels like we may be zeroing in on it. I think the bug is exactly there, @esabol, but not for the reasons you might be thinking. My theory, which I'm testing just by reading code (and that's kinda hard), is that the timeout screws up the list, leaving it looping around on itself. I'll try to find some time to debug tonight or tomorrow. We can probably write a regression test around this scenario pretty easily.
@p-alik we've been playing with that client arrangement and it does seem that "sometimes" it can trigger the event that causes gearmand to lock up; however, we still have the other load on the gearman instance as well, so it's possible that it's just happening randomly and correlates with us using the perl worker.
@infraweavers, perl Gearman::Worker dies on the first "GEARMAN_JOB_NOT_FOUND: Job given in work result not found" because it gets an unexpected ERROR in response to GRAB_JOB. The worker implementation in your case seems to proceed to interact with
@p-alik Yeah, I wonder if that's what triggers the actual problem. We've not managed to use the perl worker in a loop on an isolated gearmand and reproduce the problem yet, so it feels like it may not be quite as clear-cut as that.
We've been watching it a lot today to try and get some handle on the situation where it happens; we're pretty sure that, before it breaks, we've always ended up with a -1 in the Jobs Running column of gearman_top. A few seconds later it seems to end up stuck in that loop again.
@infraweavers, I hope #302 solves the -1-issue. At least in my aforementioned test it did.
The next worker handled the job without a timeout and exited afterwards:
@p-alik we weren't able to consistently get the -1 issue using the aforementioned test however we have found that if we:
Then gearmand will crash out; however, it's in a completely different way from the original issue raised, as the process vanishes rather than getting stuck in an infinite loop. I've attached a gif of this problem; see also worker_that_doesnt_timeout.pl.txt. We'll apply the fix from #302 and see if that resolves it crashing in this situation.
That's odd running
An attempt to kill
For #302 second worker worker_that_doesnt_timeout.pl.txt dies with "unexpected packet type: error [JOB_NOT_FOUND -- Job given in work result not found]". But without
Interesting, so we've just built 8f58a8b and the behaviour is unchanged for us (i.e. the same as the gif above). We're building and running on Debian 9.13 x64, freshly installed to test this whole scenario out, so there's a minimum amount of stuff installed etc. It's very odd that we don't have any logs in
I strongly disagree with reverting cfa0585. All you’re doing is changing the timeout of your worker from 2 ms to 2 seconds. You could achieve the same result by setting your worker timeout to 2000 ms. A 2 ms timeout is ridiculously short and unrealistic, and that is presumably what is contributing to the race condition.
That agrees with the working theory. My theory being that less threads ==
more chance of the wrong line of execution winning the race and thus
double-mutating the list.
On Mon, Sep 7, 2020 at 11:49 AM, A Codeweavers Infrastructure Bod wrote:
Anyway, working theory is just a good old thread-unsafe race condition. If
you have more threads, and better latency as a result, then you make it
hard for the race to go the wrong way.
That *would* make sense, however I am able to reproduce the problem with
--threads=1 ?
It appears to me without cfa0585
Ah, OK. I was going to add a bunch of debug logging statements to prove it one way or another, but I don't have to now. Or maybe that would still be useful? One of the things I noticed in the code was that, if the timeout is less than 1000 ms, it sets it to 1000 ms. gearmand/libgearman-server/connection.cc Lines 687 to 693 in aa41156
So whether you specify a timeout of 2 ms or 999 ms, it should have the same result. Does that agree with what you're seeing?
It took me a while to locate. I'm now wondering if the code is even doing timeouts correctly.
The modern libevent2 API is:

struct event *ev;
struct timeval tv;

static void cb(int sock, short which, void *arg) {
    if (!evtimer_pending(ev, NULL)) {
        event_del(ev);
        evtimer_add(ev, &tv);
    }
}

int main(int argc, char **argv) {
    struct event_base *base = event_base_new();
    tv.tv_sec = 10;
    tv.tv_usec = 0;
    ev = evtimer_new(base, cb, NULL);
    evtimer_add(ev, &tv);
    event_base_loop(base, 0);
    return 0;
}

This code looks very different from the following: gearmand/libgearman-server/connection.cc Lines 699 to 723 in aa41156
Maybe this code needs to be modernized? Something like:

con->timeout_event= evtimer_new(dcon->thread->base, _server_job_timeout, job);
if (con->timeout_event == NULL)
{
  return gearmand_perror(errno, "evtimer_new() failed in gearman_server_con_add_job_timeout");
}

struct timeval timeout_tv = { 0 , 0 };
time_t milliseconds= worker->timeout;
timeout_tv.tv_sec= milliseconds / 1000;
timeout_tv.tv_usec= (suseconds_t)((milliseconds % 1000) * 1000);
evtimer_add(con->timeout_event, &timeout_tv);
@esabol presumably if the old version is actually causing this problem, it'd be firing the event twice or similar? We can add logging and confirm if that's the case.
Yes.
vscode + ms-vscode.cpptools shows it just in time.
Judging from the comment in event_compat.h it makes sense in any case, but I don't think that helps to solve this issue.
I can get this on 1.1.8. TL;DR: is there something to test with? I'm using a Perl system where client absence hangs Gearman. I'm currently building 1.1.9.1 to check if it's still present.
If it's the same problem, @illuusio, it will still be there. We've worked around it by removing the timeout handling from one of our gearman workers; that seems to have helped massively for us. Is your use case reproducible?
@infraweavers I can try to make it. It hangs when a client just drops randomly. I'll test it more to get a reproducible example.
@infraweavers This was a new way to do it (I just found it). Just supply a missing RPC function and it hangs. I have to check if this is a false positive, as I assume. I'd expect it to just come back and say there is no such function, but this could just be a me-not-reading-enough-documentation problem. But the freezing is the same. It just hangs.
This one stays in handle->wait forever and ever, waiting for the task to complete, as Gearmand seems to be happy that there are no jobs available. This one only seems to freeze the one client, not all of them, but I think the reason is the same. I'll make a very badly behaving client.
Do you mean 1.1.18 and 1.1.19.1?
Maybe I don’t have enough information, but it doesn’t seem the same to me. This issue is clearly related to jobs that time out. If you’re experiencing a hang and it doesn’t involve a task with a specified timeout, then I think what you are describing (and from the log snippet) is a separate issue.
@esabol I got Gearmand to freeze on 1.1.8. I've got to test 1.1.19.1 more to have a clue whether it's doing the same. I have a hunch it's something to do with bidirectional transfer, when the worker is sending state information back to the client. I'll try to make a test for this next week to prove or falsify it.
That’s possible. A long time ago, when I was looking into some SSL errors, I had the suspicion that gearmand needed to have exclusive pthreads locks whenever sending and receiving. That might have been partially alleviated by upgrades to OpenSSL, which incorporated better thread-safety at a lower level, but it might still be an issue. See, for example: https://github.com/LibVNC/libvncserver/pull/389/files But @infraweavers isn’t using SSL. Could be a similar thing with regards to handling of the timeout, though?
The issue reproduces with --threads=1, so I'm leaning more towards @p-alik's theory of double-free than thread unsafety.
Agreed. My previous comment pertained to @illuusio’s issue, which, as I wrote previously, sounded like a different thing.
FWIW: From our investigations we've not seen anything occur which is asynchronous within gearmand, however given that it doesn't happen every time, we think it's still a "concurrent" problem, but based on when things happen (i.e. timeouts, retries, reconnection, connection close etc) outside of gearmand rather than within, which causes the problem.
Thanks for the update. I still believe quite strongly that it is a
race condition that is simply triggered by the way your workers and
clients interact with Gearman.
Hopefully somebody with some time and context on the problem will
figure it out. :-P
this patch addresses some issues in gearman worker mode:
- when forking a child, the forked %children contains all created childs so far, so if that child receives a sigint, it will kill its siblings.
- check for exited childs in a loop in case multiple childs exited at once
- do not remove the pidfile when a children receives a sigint
- fix issue with gearman jobs having a timeout, for details see gearman/gearmand#301 and ConSol-Monitoring/omd#107
Hello!
We've recently upgraded to OMD 3.30, which has replaced the ancient gearmand 0.33 that was shipped previously with gearmand 1.1.19.
This includes mod-gearman (for naemon) and pnp_gearman_worker (for pnp4nagios). We find that periodically all our naemon checks stop responding. At the same time when gearman_top is executed, we get an error:
We replaced the build of gearmand with one with symbols in it, waited for the problem to occur (about 15-20 minutes) in our setup and found that gearmand seems to be in an infinite loop here:
https://github.com/gearman/gearmand/blob/master/libgearman-server/job.cc#L399-L417
It looks like noop_sent is never getting incremented, as it is 0, and then it's never getting into the if, so it's never sending the noop to any worker and is then stuck forever.

Edit: Running
so actually, it doesn't look like the noop situation is the problem; there must be something wrong with the cyclic list, so that head is now not the loop point? We'll try and confirm.
Edit: doesn't look like it's that. We've just stepped through the loop 200 times and the list seems to loop correctly:
gdb.txt
Edit: again, this is all wrong. It is leaving the loop, as we put a breakpoint after it; we think it's just that every time we gcore it, the code just so happens to be inside that loop. Presumably there's an outer loop that it's actually stuck in.
Actually! We think it's this loop that's looping forever: https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L124-L131, as we put a breakpoint on https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L133 and it's not getting hit despite the program continuing for a few seconds.