Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bubble-up exceptions from scheduler #38317

Merged

Conversation

henningandersen
Copy link
Contributor

Instead of logging warnings we now rethrow exceptions thrown inside
scheduled/submitted tasks. This will still log them as warnings in
production but has the added benefit that if they are thrown during
unit/integration test runs, the test will be flagged as an error.

This is a continuation of #38014

Primary review target is Scheduler.SafeScheduledThreadPoolExecutor.afterExecute

Instead of logging warnings we now rethrow exceptions thrown inside
scheduled/submitted tasks. This will still log them as warnings in
production but has the added benefit that if they are thrown during
unit/integration test runs, the test will be flagged as an error.

This is a continuation of elastic#38014
@henningandersen henningandersen added >bug :Core/Infra/Core Core issues without another label v7.0.0 v6.7.0 labels Feb 4, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

Fixed NPE that caused CCR tests (IndexFollowingIT and likely others)
to fail.
scheduleUnlessShuttingDown could bubble rejected exception to uncaught
exception handler when not using SAME executor. Now ignore rejected
exception if executor is shutdown.
scheduleUnlessShuttingDown could bubble rejected exception to uncaught
exception handler when not using SAME executor. Now ignore rejected
exception if executor is shutdown.
@henningandersen
Copy link
Contributor Author

@elasticmachine please run elasticsearch-ci/2

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left one question.

@@ -354,8 +354,11 @@ public ScheduledCancellable schedule(Runnable command, TimeValue delay, String e
}

public void scheduleUnlessShuttingDown(TimeValue delay, String executor, Runnable command) {
if (!Names.SAME.equals(executor)) {
command = new ThreadedRunnableAllowShutdown(command, executor(executor));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the reason for this change? It's unrelated to this PR, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is related to this PR. If you schedule something on a non-SAME executor and then shutdown the threadpool afterwards (before the delay is passed), the execute on the executor will fail, causing an exception. This exception is thrown on the scheduler thread. So far this caused no issues, since it was ignored (and recently logged). But with the changes in this PR, the exception was bubbled to the uncaught exception handler causing tests to fail.

Above fixes that such that scheduleUnlessShuttingDown allows both the schedule call itself to pass and the subsequent execute on the executor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the regular .schedule method not suffer from the same issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will potentially, though none of the CI runs ran into it.

I guess we could argue that before the warning/exception changes, this would go unnoticed for both schedule and scheduleUnlessShuttingDown and therefore we should silently ignore this in both cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the shutdown procedure is relevant to this question. ThreadPool.terminate does a regular shutdown first on both scheduler and pool executors. This allows any scheduled tasks on the ScheduledThreadPoolExecutor to run to completion but does not allow executing them on the pool executor.

So scheduling jobs using SAME executor with delay until after shutdown will complete, using any other executor will not.

Given that we have both methods, I think the right solution is to make sure that tasks scheduled to run within the terminate timeout will run. I think this belongs in a follow-up PR. Two options:

  1. During terminate, we can do awaitTermination of the scheduler before calling shutdown on the rest of the executors.
  2. If a scheduled job fails to call execute on the executor, we simply call it in current thread (ie. one of the scheduler threads).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this on another channel. Conclusion is to silently ignore scheduled tasks that fail to re-execute on the target executor if the target executor is shutdown (ie. like it was before all our changes).

scheduleUnlessShuttingDown was primarily added to avoid having to catch and handle the rejected exception everywhere. Down the road we should likely merge the two methods into one after analyzing all usages.

@henningandersen
Copy link
Contributor Author

Thanks @ywelsch, have responded to your question above.

@henningandersen
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/1
@elasticmachine run elasticsearch-ci/2

Like scheduleUnlessShuttingDown, we want to silently ignore exceptions
thrown during execute on target executor. Once ThreadPool.terminate()
has done shutdown of thread pools, all scheduled tasks not using SAME
executor will not run.
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@henningandersen
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/1
@elasticmachine run elasticsearch-ci/2
@elasticmachine run elasticsearch-ci/default-distro

@henningandersen
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/default-distro

@henningandersen
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/1

1 similar comment
@henningandersen
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/1

@henningandersen
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/packaging-sample

1 similar comment
@henningandersen
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/packaging-sample

henningandersen added a commit that referenced this pull request Feb 5, 2019
Instead of logging warnings we now rethrow exceptions thrown inside
scheduled/submitted tasks. This will still log them as warnings in
production but has the added benefit that if they are thrown during
unit/integration test runs, the test will be flagged as an error.

Fixed NPE in GlobalCheckPointListeners that caused CCR tests
(IndexFollowingIT and likely others) to fail.

This is a continuation of #38014

Backports #38317
@henningandersen henningandersen merged commit 20c66c5 into elastic:master Feb 5, 2019
@colings86 colings86 removed the v7.0.0 label Feb 7, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 11, 2019
* master:
  Add an authentication cache for API keys (elastic#38469)
  Fix exit code in certutil packaging test (elastic#38393)
  Enable logs for intermittent test failure (elastic#38426)
  Disable BWC to backport recovering retention leases (elastic#38477)
  Enable bwc tests now that elastic#38443 is backported. (elastic#38462)
  Fix Master Failover and DataNode Leave Blocking Snapshot (elastic#38460)
  Recover retention leases during peer recovery (elastic#38435)
  Set update mappings mater node timeout to 30 min (elastic#38439)
  Assert job is not null in FullClusterRestartIT (elastic#38218)
  Update ilm-api.asciidoc, point to REMOVE policy (elastic#38235) (elastic#38463)
  SQL: Fix esType for DATETIME/DATE and INTERVALS (elastic#38179)
  Handle deprecation header-AbstractUpgradeTestCase (elastic#38396)
  XPack: core/ccr/Security-cli migration to java-time (elastic#38415)
  Disable bwc tests for elastic#38443 (elastic#38456)
  Bubble-up exceptions from scheduler (elastic#38317)
  Re-enable TasksClientDocumentationIT.testCancelTasks (elastic#38234)
  Allow custom authorization with an authorization engine  (elastic#38358)
  CRUDDocumentationIT fix documentation references
  Remove support for internal versioning for concurrency control (elastic#38254)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Core/Infra/Core Core issues without another label v6.7.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants