Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adding graceful shutdown for druid process #23

Closed

Conversation

sydefz
Copy link
Contributor

@sydefz sydefz commented May 22, 2024

Description of changes:

Sometimes we can observe after stopping historical the host gets terminated immediately, this causes the in-flight requests to fail and results in 500s.

This PR adds a 30s sleep after the historical/query/master stop, so it has enough time to gracefully shutdown the process before terminating the host. Also updates the supervisord config to wait for 30s before sending the KILL signal.

I chose 30s because the default druid.server.http.gracefulShutdownTimeout is PT30s (https://druid.apache.org/docs/latest/configuration/#historical-query-configs)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sydefz sydefz changed the title feat: adding graceful shutdown for historical feat: adding graceful shutdown for druid process May 22, 2024
@dougtoppin
Copy link
Contributor

@sydefz Thank you for your PR. We will take a look at it and get back to you.

@sydefz
Copy link
Contributor Author

sydefz commented Jun 1, 2024

hi @dougtoppin any updates on this?

@dougtoppin
Copy link
Contributor

@sydefz Sorry about the delay. We have been working on some other tasks. We will get to this soon.
Thanks for your submission.

@@ -109,6 +109,8 @@ waitForProcess() {
else
echo "The new node is up. Stopping old node..."
$SUPERVISORCTL_CMD stop $process_name
# wait gracefulShutdownTimeout for 30 seconds
sleep 30

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pass in stopwaitsecs from node config into this script instead of hardcoding it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's a good idea, eventually we should avoid hard code, make it configurable and have the flexibility to handle shorter or longer graceful shutdown periods.

I guess the priority and purpose of this PR is to ensure we don't throw 500s during deployment, I'm happy to extend this to make it flexible later.

@@ -58,6 +58,7 @@ user=${USER_NAME}
autorestart=true
redirect_stderr=true
stdout_logfile=/var/log/supervisor/historical.log
stopwaitsecs=30

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put this in common_user_data so we don't have to repeat it for each individual service?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the supervisord config for each druid component is generated in individual user_data. They are slightly different, bringing these customisations into common_user_data would defeat the purpose of having common.

I'd keep it like this for now.

@van-vothanh van-vothanh self-requested a review June 6, 2024 08:45
@van-vothanh
Copy link
Member

thanks for the submission @sydefz
I've verified it with our solution pipeline and everything is working 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants