
gearmand hangs intermittently when timeouts are used #301

Open · infraweavers opened this issue Aug 14, 2020 · 108 comments

Comments
@infraweavers

infraweavers commented Aug 14, 2020

Hello!

We've recently upgraded to OMD 3.30, which has replaced the ancient gearmand 0.33 that was shipped previously with 1.1.19.
This includes mod-gearman (for naemon) and pnp_gearman_worker (for pnp4nagios). We find that periodically all our naemon checks stop responding. At the same time, when gearman_top is executed, we get an error:

failed to connect to address localhost and port 4730: Interrupted system call

We replaced the build of gearmand with one with symbols in it, waited for the problem to occur (about 15-20 minutes in our setup), and found that gearmand seems to be stuck in an infinite loop here:
https://github.com/gearman/gearmand/blob/master/libgearman-server/job.cc#L399-L417

It looks like noop_sent never gets incremented: it stays at 0, so the condition in the if is never met, the NOOP is never sent to any worker, and it is stuck there forever.

[screenshot of the debugger session]

Edit:
Running

(gdb) print _global_gearmand->server->worker_wakeup
$4 = 0 '\000'

so actually, it doesn't look like the NOOP situation is the problem. There must be something wrong with the cyclic list, so that head is no longer the loop point? We'll try to confirm.

Edit: it doesn't look like it's that either. We've just stepped through the loop 200 times and the list seems to loop correctly:
gdb.txt

Edit: again, this is all wrong. It does leave the loop, as we put a breakpoint after it; we think that every time we gcore the process it just happens to be inside that loop. Presumably there's an outer loop that it's actually stuck in.

Actually! We think it's this loop that's looping forever: https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L124-L131, as we put a breakpoint on https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L133 and it's not getting hit despite the program continuing for a few seconds.

@infraweavers
Author

So we're reasonably confident now (ha) that the problem is https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L124-L131, however we can't see how it would ever break out of that loop, as nothing sets job_list to NULL once it's been set. Confused :/
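For reference, that loop is shaped roughly like this (a simplified sketch, not the actual source; gearman_server_job_queue() is expected to unlink the job from worker->job_list as a side effect):

// Rough shape of the drain loop in worker.cc (sketch only, names approximate).
// It only terminates if gearman_server_job_queue() actually removes the head
// job from worker->job_list; if the intrusive list ever points back at itself,
// the list never shrinks and the loop spins forever.
while (worker->job_list != NULL)
{
  gearmand_error_t ret= gearman_server_job_queue(worker->job_list);
  if (ret != GEARMAND_SUCCESS)
  {
    gearmand_gerror_warn("gearman_server_job_queue", ret);
  }
}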

@p-alik added the bug label Aug 14, 2020
@p-alik
Collaborator

p-alik commented Aug 14, 2020

Thank you for your elaborate report, @infraweavers. Do you know how the bug could be reproduced?

@infraweavers
Author

@p-alik unfortunately not; our reproduction case currently is to upgrade one of our monitoring servers, deploy the entire configuration to it, and wait/restart the services a few times. With our config applied there are about 1200 checks per minute running, which should work out to about 2400 jobs being pushed onto gearman itself.

We're trying to narrow it down to a reproducible case currently, however it's difficult for us to track the exact scenario in which it happens as it appears "random".

@p-alik
Collaborator

p-alik commented Aug 14, 2020

Actually! We think it's this loop that's looping forever: https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L124-L131, as we put a breakpoint on https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L133 and it's not getting hit despite the program continuing for a few seconds.

Does gearmand get in the if branch?

if (ret != GEARMAND_SUCCESS)
{
  gearmand_gerror_warn("gearman_server_job_queue", ret);
}

@infraweavers
Author

Does gearmand get in the if branch?

We don't believe so; we see no logs for that, nor have we seen it happen when "nexting" through. I'll put a breakpoint on it and confirm now.

@infraweavers
Author

@p-alik No, it doesn't.
[screenshot of the debugger session]

@p-alik
Collaborator

p-alik commented Aug 14, 2020

Despite some "if not ok" branches, there are only two return GEARMAND_SUCCESS statements in gearman_server_job_queue, hence that's the only possible return value:

gearmand_error_t gearman_server_job_queue(gearman_server_job_st *job)

@SpamapS
Member

SpamapS commented Aug 17, 2020

Wow, that's a doozy. It looks like gearmand is having trouble tearing down workers that disconnect while they have jobs assigned to them. That should work; this code is present to handle the situation. But I will say, while we work this out you may want to address this defensively by making sure your workers avoid quitting while they have a job assigned.

The state machine in gearmand is basically the hardest part to debug, so forgive me if I'm asking a bunch of random questions but I'd like to trace through a few parts of the code with some ideas in mind:

How often do your workers disconnect?

How many threads are you running with?

Do you run with retries configured? If so, do you get any logged failures to retry?

If you use foreground jobs, do you get WORK_FAIL packets sent to your clients?

Are you primarily using background or foreground jobs?

Do you have a stateful store set up for background jobs? If so which one?

Finally, if you've got a debugger attached in one of these situations, can you watch the job_list and let us know what is in it?

One suspicion I have is that the job list is somehow getting corrupted and is no longer null-terminated. TBH, I've been out of the code long enough, and I have been avoiding revisiting those macros long enough, that I'm not even sure that's how they're supposed to work. I'd love to replace them with a nice C++ std::list, because zomg this is my nightmare, but for now, if you can just verify that the list does in fact only go in one direction, ends with the _next element set to NULL, and is not looped back onto itself, that would help.
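If it helps, a bounded walk of the list is enough to tell a NULL-terminated list from one that loops back onto itself; something like this sketch, compiled into a debug build or typed out step by step in gdb (the link member name worker_next is an assumption here, substitute whatever the list macros really use):

// Sketch only: follow worker->job_list for at most max_steps hops and report
// whether we reach NULL (healthy) or revisit a node (the list is looped).
static bool job_list_is_terminated(gearman_server_job_st *head, size_t max_steps)
{
  gearman_server_job_st *slow= head;
  gearman_server_job_st *fast= head;
  for (size_t step= 0; step < max_steps && fast != NULL; ++step)
  {
    fast= fast->worker_next;            // hare moves two hops per iteration
    if (fast == NULL)
    {
      return true;                      // reached the end: properly NULL-terminated
    }
    fast= fast->worker_next;
    slow= slow->worker_next;            // tortoise moves one hop
    if (fast == slow)
    {
      return false;                     // tortoise met hare: the list loops on itself
    }
  }
  return fast == NULL;                  // NULL within the bound, otherwise undecided
}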

Also, I wish I could offer some of my time to dig through a core file or something, but I don't have that kind of time outside my day job, and gearmand is not part of the day job. You may want to seek out a consultant with deep C/C++ skills who can take your money on this one. ;)

@sni

sni commented Aug 17, 2020

I am trying to give some answers for naemon's mod-gearman in general:

  • workers will be spawned and stopped dynamically to respond to the workload
  • options for the gearmand are: --port=4735 --pid-file=... /gearmand.pid --daemon --threads=0 -q libsqlite3 --libsqlite3-db=.../gearmand.db --store-queue-on-shutdown --log-file=.../gearman/gearmand.log --verbose=ERROR --listen=localhost
  • mod-gearman uses background jobs only
  • payload is base64 only
  • job unique identifier might be arbitrary data

@infraweavers
Author

  • We have 100-500 workers per gearmand (normally around 110); each worker performs 1000 jobs before exiting and being replaced with a new one, unless it's idle, in which case it waits 30 seconds before killing itself. So we think this works out at about 100 disconnects/reconnects per 30 seconds or so.

  • We are reasonably sure we don't use retries, as we don't have them configured in any of the mod-gearman files and it looks like the default is 0, with omd3.30 using a naemon core and the mod_gearman_worker C version (maybe @sni knows for sure).

We do have a debugger attached to a VM snapshot whilst the infinite loop is ongoing; we'll try to get some useful output. Do you have a recommended way of "seeing" what you want to see?

@infraweavers
Author

infraweavers commented Aug 17, 2020

Looking at that loop, we've added 3 breakpoints (at the locations of the red dots in VS Code on the right) and run the loop several times; it looks like worker->job_list always points to the same reference:

[screenshot]

@infraweavers
Author

infraweavers commented Aug 17, 2020

Again, now looking at job.cc https://github.com/gearman/gearmand/blob/master/libgearman-server/job.cc#L424-L435, it looks like the next item in the job_list is the first item in the job_list, so it will just loop forever.

I've tried to demonstrate this here, as you can see by the printed references:
[animated gif: JobsNextIsItself]

@infraweavers
Author

Additionally, whilst trying to get a more easily reproducible example, we've found that before gearmand completely stops responding, we end up with a queue showing -1 jobs running:

[screenshot]

@infraweavers
Author

We managed to get this to occur whilst we had gearmand logging set to INFO. It seems a little suspicious that there's a unique log entry around that time, "Worker timeout reached on job, requeueing", just before gearmand stopped responding. It's perfectly possible that this is a coincidence, however:
[screenshot of the log]

@esabol
Member

esabol commented Aug 17, 2020

Possibly related to issue #119 ?

@p-alik
Collaborator

p-alik commented Aug 17, 2020

We managed to get this to occur whilst we had gearmand logging set to INFO. It seems a little suspicious that there's a unique log entry around that time, "Worker timeout reached on job, requeueing", just before gearmand stopped responding.

It seems a worker registered with CAN_DO_TIMEOUT couldn't finish its job in time:

gearmand_log_warning(GEARMAN_DEFAULT_LOG_PARAM,
                     "Worker timeout reached on job, requeueing: %s %s",
                     job->job_handle, job->unique);

return _server_error_packet(GEARMAN_DEFAULT_LOG_PARAM, server_con, GEARMAN_JOB_NOT_FOUND,
                            gearman_literal_param("Job given in work result not found"));

The errors could be reproduced with:

  • client: echo x| ./gearmand/bin/gearman -f x
  • worker.pl
use v5.10;
use strict;
use warnings;
use Gearman::Worker;

my $timeout = 2;
my $worker = Gearman::Worker->new(debug => 1);
$worker->job_servers({ host => "localhost", port => 4730 },);
# register 'x' with a worker-side timeout; the handler deliberately sleeps
# for twice the timeout so gearmand's timeout/requeue path always fires
$worker->register_function(
    'x', $timeout,
    sub {
        say "going to sleep";
        sleep $timeout * 2;
    }
);
$worker->work(
    on_complete => sub {
        my ($jobhandle, $result) = @_;
        say "on complete $jobhandle";
    },
    on_fail => sub {
        my ($jobhandle, $err) = @_;
        say "on fail $jobhandle";
    },
) while 1;
  • gearmand log
WARNING 2020-08-17 20:58:01.000000 [     3 ] Worker timeout reached on job, requeueing: H:varenik:1 917c2972-e0bb-11ea-b55c-7446a09167b1 -> libgearman-server/connection
WARNING 2020-08-17 20:58:01.000000 [     3 ] if worker -> libgearman-server/job.cc:344
WARNING 2020-08-17 20:58:01.000000 [     3 ] if worker_list -> libgearman-server/job.cc:397
WARNING 2020-08-17 20:58:01.000000 [     3 ] return GEARMAND_SUCCESS -> libgearman-server/job.cc:441
WARNING 2020-08-17 20:58:04.000000 [  proc ] GEARMAN_JOB_NOT_FOUND:Job given in work result not found -> libgearman-server/server.cc:779
   INFO 2020-08-17 20:58:04.000000 [     3 ] Peer connection has called close()
   INFO 2020-08-17 20:58:04.000000 [     3 ] Disconnected 127.0.0.1:53120
   INFO 2020-08-17 20:58:04.000000 [     3 ] Gear connection disconnected: -:-
  • status
$ ./bin/gearadmin --status
x       4294967295      0       1
.
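(For what it's worth, 4294967295 is exactly what an unsigned 32-bit job counter shows after being decremented below zero; gearman_top just renders the same value as -1. A trivial illustration:)

#include <cstdint>
#include <cstdio>

int main()
{
  uint32_t counter= 0;
  --counter;                                            // one decrement too many underflows the counter
  std::printf("%u %d\n", counter, (int32_t)counter);    // prints "4294967295 -1"
  return 0;
}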

@esabol
Member

esabol commented Aug 18, 2020

I haven’t done much with this part of the code, so I really don’t know what I’m talking about, but this line looks suspicious to me:

gearmand_error_t ret= gearman_server_job_queue(worker->job_list);

It’s passing worker->job_list to gearman_server_job_queue() to requeue. I would think you’d want to create a new job struct to requeue or at least nullify some stuff in the job struct before requeuing the same job struct?

@esabol changed the title from "GearmanD hangs intermittantly" to "GearmanD hangs intermittently" on Aug 18, 2020
@SpamapS
Member

SpamapS commented Aug 18, 2020

We don't need a new job structure. What that code attempts to do is remove it from the currently assigned worker and push it back onto the unassigned queue. The whole of gearmand works this way, adjusting references.

It feels like we may be zeroing in on it. I think that the bug is exactly there, @esabol, but not for the reasons you might be thinking. My theory, which I'm only testing by reading code (and that's kinda hard), is that the timeout screws up the list, leaving it looping back on itself.
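To make that concrete, here is a toy model of an intrusive singly linked list (not gearmand's macros): inserting a node into a list it is already the head of, without unlinking it first, leaves the node as its own successor, which is exactly the "next item is the first item" observation above:

#include <cassert>
#include <cstddef>

// Toy stand-in for an intrusively linked job (not the real structure).
struct Job
{
  Job *next= nullptr;
};

// Push onto the front of a list, the way intrusive list macros typically do.
static void list_push_front(Job *&head, Job *job)
{
  job->next= head;
  head= job;
}

int main()
{
  Job job;
  Job *worker_job_list= nullptr;

  list_push_front(worker_job_list, &job);   // job assigned to the worker
  list_push_front(worker_job_list, &job);   // a second insert without unlinking first

  assert(worker_job_list == &job);
  assert(job.next == &job);                 // the node now points at itself, so any
                                            // "walk until NULL" loop never terminates
  return 0;
}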

I'll try and find some time to debug tonight or tomorrow. We can probably write a regression test around this scenario pretty easily.

@infraweavers
Author

@p-alik we've been playing with that client arrangement and it does seem that "sometimes" it can trigger the event that causes gearmand to lock up. However, we still have the other load on the gearman instance as well, so it's possible that it's just happening randomly and correlates with us using the Perl worker.

@p-alik
Collaborator

p-alik commented Aug 18, 2020

@infraweavers, the Perl Gearman::Worker dies on the first "GEARMAN_JOB_NOT_FOUND: Job given in work result not found" because it gets an unexpected ERROR in response to GRAB_JOB. The worker implementation in your case seems to keep interacting with gearmand on the same connection, which leads to the "Job given in work result not found" loop. I guess that's the only difference.

@infraweavers
Author

@p-alik Yeah, I wonder if that's what triggers the actual problem. We've not managed to reproduce the problem with the Perl worker in a loop against an isolated gearmand yet, so it feels like it may not be quite as clear-cut as that.

@infraweavers
Author

We've been watching it a lot today to try to get a handle on the situation in which it happens. We're pretty sure that before it breaks, we always end up with a -1 in the Jobs Running column of gearman_top. A few seconds later it seems to end up stuck in that loop again.

@p-alik
Collaborator

p-alik commented Aug 18, 2020

@infraweavers, I hope #302 solves the -1 issue. At least it did in my aforementioned test.
After the first worker ran into the timeout, the status showed the proper result:

$ ./bin/gearadmin --status
x       1       0       0
.

The next worker handled the job without a timeout and exited afterwards:

$ ./bin/gearadmin --status
x       0       0       0
.

@infraweavers
Author

infraweavers commented Aug 19, 2020

@p-alik we weren't able to consistently reproduce the -1 issue using the aforementioned test, however we have found that if we:

  • run the worker
  • push an item onto the queue
  • wait for the worker to timeout
  • run another worker that doesn't timeout to consume the message (twice)

Then gearmand will crash out, however in a completely different way from the original issue raised, as the process vanishes rather than getting stuck in an infinite loop. I've attached a gif showing the problem:

[animated gif]

worker_that_doesnt_timeout.pl.txt
worker_that_timesout.pl.txt

We'll apply the fix from #302 and see if that resolves it crashing in this situation.

@p-alik
Collaborator

p-alik commented Aug 19, 2020

That's odd; gearmand (built from master branch 8f58a8b) doesn't die in my environment after worker_that_doesnt_timeout.pl.txt finishes the work. The status looks like:

$ ./bin/gearadmin --status
x       4294967295      4294967295      0

An attempt to kill gearmand with Ctrl-C doesn't work afterwards:

double free or corruption (out)

Begin stack trace, frames found: 13

For #302, the second worker worker_that_doesnt_timeout.pl.txt dies with "unexpected packet type: error [JOB_NOT_FOUND -- Job given in work result not found]". But without $timeout, worker_that_doesnt_timeout.pl.txt finishes the work faultlessly, the status looks good, and gearmand dies on Ctrl-C.

11c11
<     'x', $timeout,
---
>     'x', # $timeout,

@infraweavers
Author

Interesting. So we've just built 8f58a8b and the behaviour is unchanged for us (i.e. the same as the gif above).

We're building and running on a freshly installed Debian 9.13 x64 to test this whole scenario out, so there's a minimal amount of stuff installed etc.

It's also very odd that we don't have any logs in dmesg or /var/log/syslog, so it doesn't seem to be segfaulting or similar, as that would be visible in there.

@esabol
Member

esabol commented Sep 9, 2020

@infraweavers, may I ask you to test https://github.com/p-alik/gearmand/tree/issue-301-revert-cfa0585?
It's the same as #302 plus a revert of cfa0585. In my environment, gearmand built on top of that branch sustained both my and your test procedures.

I strongly disagree with reverting cfa0585. All you’re doing is changing the timeout of your worker from 2 ms to 2 seconds. You could achieve the same result by setting your worker timeout to 2000 ms. A 2 ms timeout is ridiculously short and unrealistic, and that is presumably what is contributing to the race condition.

@SpamapS
Member

SpamapS commented Sep 9, 2020 via email

@p-alik
Collaborator

p-alik commented Sep 10, 2020

I strongly disagree with reverting cfa0585. All you’re doing is changing the timeout of your worker from 2 ms to 2 seconds. You could achieve the same result by setting your worker timeout to 2000 ms. A 2 ms timeout is ridiculously short and unrealistic, and that is presumably what is contributing to the race condition.

It appears to me that without cfa0585, gearmand executes CAN_DO_TIMEOUT the same way as CAN_DO and never goes into _server_job_timeout. Hence reverting cfa0585 would solve this issue, but would consequently reopen #196.

@infraweavers
Author

@p-alik Is there something you're seeing there that implies it'll have a different effect from multiplying the timeout by 1000? To me it looks like the net effect of reverting that change is what @esabol suggested.

@p-alik
Collaborator

p-alik commented Sep 10, 2020

@esabol, I was completely wrong that reverting cfa0585 would solve this issue.

@esabol
Member

esabol commented Sep 10, 2020

Ah, OK. I was going to add a bunch of debug logging statements to prove it one way or another, but I don't have to now. Or maybe that would still be useful?

One of the things I noticed in the code was that, if the timeout is greater than 0 but less than 1000 ms, it is set to 1000 ms:

// We treat 0 and -1 as being the same (i.e. no timer)
if (worker->timeout > 0)
{
  if (worker->timeout < 1000)
  {
    worker->timeout= 1000;
  }

So whether you specify a timeout of 2 ms or 999 ms, it should have the same result. Does that agree with what you're seeing?
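Put another way, the clamp means anything in the open interval (0, 1000) ms is raised to a full second before the timer is armed (a hypothetical helper just restating the quoted code, for illustration):

// Restates the clamp above; not gearmand source.
static long effective_timeout_ms(long requested_ms)
{
  if (requested_ms > 0 && requested_ms < 1000)
  {
    return 1000;          // 2 ms, 500 ms and 999 ms all behave as 1000 ms
  }
  return requested_ms;    // 0 / -1 mean "no timer"; values >= 1000 are used as given
}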

@esabol
Member

esabol commented Sep 10, 2020

It took me a while to locate timeout_add(). It's part of the libevent API in case anyone else didn't know. I was curious as to whether the struct timeval should be an elapsed time or an absolute time. At least one other API I came across in my googling used absolute times, and you were supposed to add the current time to the struct timeval, but none of the libevent examples I googled did that, so I think that's a dead end.

I'm now wondering if the code is even doing timeouts correctly. timeout_add() is from very, very old versions of libevent:

Versions of Libevent before 2.0 used "signal_" as the prefix for the signal-based variants of event_set() and so on, rather than "evsignal_". (That is, they had signal_set(), signal_add(), signal_del(), signal_pending(), and signal_initialized().) Truly ancient versions of Libevent (before 0.6) used "timeout_" instead of "evtimer_". Thus, if you’re doing code archeology, you might see timeout_add(), timeout_del(), timeout_initialized(), timeout_set(), timeout_pending(), and so on.

The modern libevent2 API is evtimer_add(). Anyway, here is a reference example I found for using timeouts with libevent:

struct event *ev;
struct timeval tv;

static void cb(int sock, short which, void *arg) {
   if (!evtimer_pending(ev, NULL)) {
       event_del(ev);
       evtimer_add(ev, &tv);
   }
}

int main(int argc, char **argv) {
   struct event_base *base = event_base_new();

   tv.tv_sec = 10;
   tv.tv_usec = 0;

   ev = evtimer_new(base, cb, NULL);

   evtimer_add(ev, &tv);

   event_base_loop(base, 0);

   return 0;
}

This code looks very different from the following:

if (con->timeout_event == NULL)
{
  gearmand_con_st *dcon= con->con.context;
  con->timeout_event= (struct event *)malloc(sizeof(struct event)); // libevent POD
  if (con->timeout_event == NULL)
  {
    return gearmand_merror("malloc(sizeof(struct event)", struct event, 1);
  }
  timeout_set(con->timeout_event, _server_job_timeout, job);
  if (event_base_set(dcon->thread->base, con->timeout_event) == -1)
  {
    gearmand_perror(errno, "event_base_set");
  }
}
/* XXX Right now, if a worker has diff timeouts for functions I think
   this will overwrite any existing timeouts on that event. One
   solution to that would be to record the timeout from last time,
   and only set this one if it is longer than that one. */
struct timeval timeout_tv = { 0 , 0 };
time_t milliseconds= worker->timeout;
timeout_tv.tv_sec= milliseconds / 1000;
timeout_tv.tv_usec= (suseconds_t)((milliseconds % 1000) * 1000);
timeout_add(con->timeout_event, &timeout_tv);

Maybe this code needs to be modernized? Something like:

        con->timeout_event= evtimer_new(dcon->thread->base, _server_job_timeout, job);
        if (con->timeout_event == NULL) 
        { 
             return gearmand_perror(errno, "evtimer_new() failed in gearman_server_con_add_job_timeout"); 
        } 
        struct timeval timeout_tv = { 0 , 0 };
        time_t milliseconds= worker->timeout;
        timeout_tv.tv_sec= milliseconds / 1000;
        timeout_tv.tv_usec= (suseconds_t)((milliseconds % 1000) * 1000);
        evtimer_add(con->timeout_event, &timeout_tv);
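(If that route were taken, the matching teardown would presumably need to change as well: an event created with evtimer_new() is released with event_free() rather than plain free(). A sketch of the corresponding cleanup, as an assumption rather than a patch:)

if (con->timeout_event != NULL)
{
  evtimer_del(con->timeout_event);   // cancel the timer if it is still pending
  event_free(con->timeout_event);    // evtimer_new() allocations are freed with event_free()
  con->timeout_event= NULL;
}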

@infraweavers
Author

@esabol presumably if the old version is actually causing this problem, it'd be firing the event twice or similar? We can add logging and confirm if that's the case.

@p-alik
Collaborator

p-alik commented Sep 11, 2020

So whether you specify a timeout of 2 ms or 999 ms, it should have the same result. Does that agree with what you're seeing?

Yes.

It took me a while to locate timeout_add().

vscode + ms-vscode.cpptools shows it just in time.

Maybe this code needs to be modernized?

Judging from the comment in event_compat.h, modernizing makes sense in any case, but I don't think that helps to solve this issue.

@illuusio

I can get this on 1.1.8. TL;DR: is there something to test with? I'm using a Perl system where client absence hangs Gearman. I'm currently building 1.1.9.1 to check if it's still present.

@infraweavers
Author

If it's the same problem, @illuusio, it will still be there. We've worked around it by removing the timeout handling from one of our gearman workers; that seems to have helped massively for us. Is your use case reproducible?

@illuusio

@infraweavers I can try to make it reproducible. It hangs when a client just drops randomly. I'll test it more to get a reproducible example.

@illuusio

illuusio commented Oct 22, 2020

@infraweavers This is a new way to trigger it (just found it): just call a missing RPC function and it hangs. I have to check whether this is a false positive, as I assume it should just come back and say there is no such function, but that may just be a me-not-reading-enough-documentation problem. The freezing is the same though: it just hangs.


  NOTICE 2020-10-22 14:27:36.623382 [  proc ] accepted,missingFunction,,0 -> libgearman-server/server.cc:321
  DEBUG 2020-10-22 14:27:36.623404 [     2 ] Received RUN wakeup event -> libgearman-server/gearmand_thread.cc:633
  DEBUG 2020-10-22 14:27:36.623570 [     2 ] send() 23 bytes to peer -> libgearman-server/io.cc:407
  DEBUG 2020-10-22 14:27:36.623592 [     2 ] Sent JOB_CREATED -> libgearman-server/thread.cc:356

This one stays in handle->wait forever, waiting for the task to complete, as gearmand seems to be happy that there are no jobs available.

This one only seems to freeze the one client, not all of them, but I think the reason is the same. I'll make a very badly behaving client.

@esabol
Member

esabol commented Oct 23, 2020

I can get this on 1.1.8. TL;DR: is there something to test with? I'm using a Perl system where client absence hangs Gearman. I'm currently building 1.1.9.1 to check if it's still present.

Do you mean 1.1.18 and 1.1.19.1?

But the freezing is the same.

Maybe I don't have enough information, but it doesn't seem the same to me. This issue is clearly related to jobs that time out. If you're experiencing a hang and it doesn't involve a task with a specified timeout, then I think what you are describing (and what the log snippet shows) is a separate issue.

@illuusio

illuusio commented Oct 23, 2020

@esabol I got Gearmand to freeze on 1.1.8. I need to test 1.1.19.1 more to get a clue whether it's doing the same. I have a hunch it's something to do with the bidirectional transfer when the worker is sending state information back to the client. I'll try to make a test next week to prove or disprove this.

@esabol
Member

esabol commented Oct 23, 2020

@esabol I got Gearmand to freeze on 1.1.8. I need to test 1.1.19.1 more to get a clue whether it's doing the same. I have a hunch it's something to do with the bidirectional transfer when the worker is sending state information back to the client. I'll try to make a test next week to prove or disprove this.

That’s possible. A long time ago, when I was looking into some SSL errors, I had the suspicion that gearmand needed to have exclusive pthreads locks whenever sending and receiving. That might have been partially alleviated by upgrades to OpenSSL, which incorporated better thread-safety at a lower level, but it might still be an issue. See, for example:

https://github.com/LibVNC/libvncserver/pull/389/files

But @infraweavers isn’t using SSL. Could be a similar thing with regards to handling of the timeout though?

@SpamapS
Member

SpamapS commented Oct 31, 2020

The issue reproduces with --threads=1, so I'm leaning more towards @p-alik's theory of a double free than thread unsafety.

@esabol
Member

esabol commented Nov 1, 2020

Agreed. My previous comment pertained to @illuusio’s issue, which as I wrote previously, sounded like a different thing.

@infraweavers
Author

FWIW: From our investigations we've not seen anything occur asynchronously within gearmand. However, given that it doesn't happen every time, we think it's still a "concurrency" problem, but one driven by when things happen outside of gearmand (i.e. timeouts, retries, reconnection, connection close, etc.) rather than within it.

@SpamapS
Member

SpamapS commented Nov 2, 2020 via email

sni added a commit to sni/pnp4nagios that referenced this issue May 20, 2021
this patch addresses some issues in gearman worker mode:

    - when forking a child, the forked %children contains all created children so far, so if that child receives a SIGINT, it will kill its siblings.
    - check for exited children in a loop in case multiple children exited at once
    - do not remove the pidfile when a child receives a SIGINT
    - fix issue with gearman jobs having a timeout, for details see
        - gearman/gearmand#301
        - ConSol-Monitoring/omd#107
Bellardia added a commit to Jumbleberry/GearmanManager that referenced this issue Aug 1, 2021
@esabol changed the title from "GearmanD hangs intermittently" to "gearmand hangs intermittently when timeouts are used" on Aug 22, 2021
@steevhise mentioned this issue Jul 5, 2022