Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recurring Task getting stuck #115

Closed
RLHawk1 opened this issue Jun 23, 2020 · 18 comments
Closed

Recurring Task getting stuck #115

RLHawk1 opened this issue Jun 23, 2020 · 18 comments
Labels

Comments

@RLHawk1
Copy link

RLHawk1 commented Jun 23, 2020

I have a recurring task that's supposed to run more or less every 30 seconds. It's a polling service that handles tasks as they come in which can take longer to run if there's anything to process.

I keep having an issue though where it gets "stuck". The DB record shows it was picked and which machine it was picked by indefinitely and doesn't keep executing. The last_heartbeat column gets updated regularly. But the last_success column stays stale. The only way I've found to get it back running again is to set the picked_by column to null and set picked to 0.

Do you have any advice for how I can make this more robust? Some way of just resuming the usual schedule if it ever gets stuck for more than 10 minutes or something like that? I am also trying to figure out how/why it's getting stuck to begin with, obviously, but I'm assuming that's some issue on my end and not related to db-scheduler.

I'm currently using db-scheduler version 6.7.

@kagkarlsson
Copy link
Owner

It sounds to me as though the task (your code) is still running? Take a threaddump of the jvm to find out what it is doing.

@kagkarlsson
Copy link
Owner

last_success will be updated after the task has finished.

@kagkarlsson
Copy link
Owner

From https://github.com/kagkarlsson/db-scheduler#things-to-note--gotchas:

There are no guarantees that all instants in a schedule for a RecurringTask will be executed. The Schedule is consulted after the previous task-execution finishes, and the closest time in the future will be selected for next execution-time. A new type of task may be added in the future to provide such functionality.

@kagkarlsson
Copy link
Owner

If you want guarantees that all Instants in a schedule is executed, you need another type of task, for example a recurring task that "spawns" one-time executions for all upcoming Instants regularly

@cbarbosa2
Copy link

It just happened to me the exact same thing @RLHawk1 is reporting. A recurrent task got stuck 3 days ago with the last_heartbeat column still being updated regularly. My guess is that somehow db-scheduler wasn't able to update the db row when the task ended, maybe due to a network issue, and caused the task to never be freed. I'm using db-schedule version 6.8 with MySQL 5.6.

@kagkarlsson
Copy link
Owner

Hmm, curious. Are you able to locate any logging that might be related to this?

@kagkarlsson
Copy link
Owner

If someone encounters this, would you please take a threaddump?

@cbarbosa2
Copy link

When this happened the task being executed failed due to an apparent network issue, though the exception was captured and returned as having succeeded but db-scheduler didn't mark it as complete, and that's why I suggest there may have been a problem updating the scheduler db row due a short network glitch. Don't have any db-scheduler specific log message. As @RLHawk1 reported by setting the picked_by column to null and picked to 0 it came back to normal, no additional actions necessary.

@kagkarlsson
Copy link
Owner

If the last_heartbeat is still being update, it mean the executions is still considered to be under execution by the scheduler.

The code handling it looks like:

            currentlyProcessing.put(pickedExecution.get(), new CurrentlyExecuting(pickedExecution.get(), clock));
            try {
                statsRegistry.register(StatsRegistry.CandidateStatsEvent.EXECUTED);
                executePickedExecution(pickedExecution.get());
            } finally {
                if (currentlyProcessing.remove(pickedExecution.get()) == null) {
                    // May happen in rare circumstances (typically concurrency tests)
                    LOG.warn("Released execution was not found in collection of executions currently being processed. Should never happen.");
                }
                addedDueExecutionsBatch.oneExecutionDone(() -> triggerCheckForDueExecutions());
            }

So there is a finally block removing the execution from currentlyProcessing which will be called every time. So the thread is still "doing" something in there..

@kagkarlsson
Copy link
Owner

Do you have the jvm in which the error occurred running right now? (and have not been restarted?) @cbarbosa2

@cbarbosa2
Copy link

I do.

@kagkarlsson
Copy link
Owner

kagkarlsson commented Jun 25, 2020

Are you able to get a threaddump of it? And extract the state(stacktrace) of the db-scheduler threads?

Additionally, do you have a socket-timeout set on the DataSource?

@RLHawk1
Copy link
Author

RLHawk1 commented Jun 25, 2020

@cbarbosa2 Saying that your issue was after a network request also made me look into that code more as mine is also after a network request. It looks like this is an issue on my end. I did not realize that the default for a Http(s)URLConnection was to never timeout. So I think my request was simply hanging indefinitely leading to the execution never finishing as @kagkarlsson suggested. Thank you for your help tracking this down and for such an excellent library!

@cbarbosa2
Copy link

Indeed you nailed it @kagkarlsson ! I was mistaken by a log message in the same period that threw an HTTP error but by doing the thread dump as you suggested I realized it was indeed stuck in the HTTP call:
"db-scheduler-pool-7-thread-3" #73 prio=5 os_prio=0 tid=0x00007f8040010000 nid=0x5c22 runnable [0x00007f800fdd9000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593) at sun.security.ssl.InputRecord.read(InputRecord.java:532) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:990) - locked <0x00000000940975a8> (a java.lang.Object) at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:948) at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) - locked <0x0000000094098340> (a sun.security.ssl.AppInputStream) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) - locked <0x0000000094098318> (a java.io.BufferedInputStream) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1593) - locked <0x00000000940a9298> (a sun.net.www.protocol.https.DelegateHttpsURLConnection) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) - locked <0x00000000940a9298> (a sun.net.www.protocol.https.DelegateHttpsURLConnection) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352) at org.springframework.http.client.SimpleClientHttpResponse.getRawStatusCode(SimpleClientHttpResponse.java:55) at org.springframework.web.client.DefaultResponseErrorHandler.hasError(DefaultResponseErrorHandler.java:61) at org.springframework.web.client.RestTemplate.handleResponse(RestTemplate.java:770) at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:740) at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:714) at org.springframework.web.client.RestTemplate.postForObject(RestTemplate.java:440)

@cbarbosa2
Copy link

Thanks so much for your quick and eye-opening suggestions! Your lib rocks! I'm so glad I got rid of Quartz and replaced it by yours which is way easier to handle with!

@RLHawk1 RLHawk1 closed this as completed Jun 25, 2020
@RLHawk1
Copy link
Author

RLHawk1 commented Jun 25, 2020

Sorry for commenting here, but I can't find any way to contact you directly. @kagkarlsson, do you accept donations? This was such excellent support on an open source project and I see you've gone above and beyond by creating #116 to help catch and track down any similar issues in the future. I can't do much but I would love to do something to support your efforts!

@kagkarlsson
Copy link
Owner

Very good! I was worried there a bit when we had two similar issues :)

@cbarbosa2 Thanks for the feedback! Might I quote you on that if I were to put up some sort of "user testimonials" on the front page?

@RLHawk1 Thanks! I haven't thought about donations, maybe I should look into it and add a link. I mostly do it because I enjoy building something that is proving useful to others :). For now, spreading the word and if possible adding yourself as users to Who uses db-scheduler? would be helpful. It would also be helpful to have a "user testimonial" on how your experience with db-scheduler have been :)

@cbarbosa2
Copy link

Feel free to quote me at will. I work for Becker Professional Education, and we've replaced Quartz by db-scheduler a month ago in production on our main product and so far I'm more than glad we did. It's way simpler and provides all the capabilities we need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants