Configurable TS->Master Heartbeat timeout #2418

Closed
amitanandaiyer opened this issue Sep 24, 2019 · 4 comments
Labels: area/docdb (YugabyteDB core features), good first issue (This is a good issue to start contributing!)

Comments

@amitanandaiyer
Contributor

TS->master heartbeat is hardcoded with a 10 sec timeout.

For use cases with thousands of tablets, the master can sometimes be slow to respond when there is contention on the master side.

It would be useful to make this timeout configurable (through a gflag).

Contributor Author

The TS is only supposed to send a full report the very first time it connects to the master.

Contributor Author

However, in this case each node has about 5k tablets, so processing each of those requests at the master takes more than 10 seconds.

Contributor Author

This causes the RPC to time out at the TServer end. The TServer then keeps retrying the RPC to the master, and the master has to reprocess all the tablets each time.

Contributor Author

So the TS is continuously sending full tablet reports to the master, and the master is getting overwhelmed because processing a full tablet report is a lot of work.
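
To make the retry loop described above concrete, here is a minimal simulation sketch. It is not YugabyteDB code: the function name, the linear cost model (`num_tablets * ms_per_tablet`), and the 2.5 ms per-tablet figure are all hypothetical, and it assumes the master finishes processing a report even after the caller has timed out.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatOutcome:
    attempts: int          # heartbeat RPCs the TS sent
    succeeded: bool        # True if one RPC finished within the timeout
    master_work_ms: float  # total time the master spent processing reports


def simulate_full_report(num_tablets, ms_per_tablet, rpc_timeout_ms,
                         max_attempts=5):
    """Model a TS resending a full tablet report under a fixed RPC timeout.

    Assumption (hypothetical model): processing time grows linearly with
    the number of tablets, and the master does the full work for every
    report it receives, even if the TS has already given up on that RPC.
    """
    processing_ms = num_tablets * ms_per_tablet
    master_work_ms = 0.0
    for attempt in range(1, max_attempts + 1):
        # The master processes the whole report on every attempt.
        master_work_ms += processing_ms
        if processing_ms <= rpc_timeout_ms:
            # The response arrives in time, so the TS stops retrying.
            return HeartbeatOutcome(attempt, True, master_work_ms)
        # Otherwise the TS times out and resends the same full report.
    return HeartbeatOutcome(max_attempts, False, master_work_ms)


if __name__ == "__main__":
    for timeout_ms in (10_000, 30_000):
        print(timeout_ms, simulate_full_report(5000, 2.5, timeout_ms))
```

Under these assumed numbers, 5k tablets take 12.5 seconds to process. With a 10 second timeout every attempt fails and the master repeats the same 12.5 seconds of work each time. With a 30 second timeout the first attempt succeeds.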

@kmuthukk kmuthukk added the area/docdb YugabyteDB core features label Sep 25, 2019
@amitanandaiyer amitanandaiyer added the good first issue This is a good issue to start contributing! label Sep 25, 2019
amitanandaiyer added a commit that referenced this issue Sep 25, 2019
instead of hardcoding 10sec timeout #2418

Summary:
Currently, the TS uses a hard-coded 10 sec timeout for the heartbeat RPC.
If the TServer has a lot of tablets, the initial RPC reporting the full tablet report can take a long time.

Use FLAGS_heartbeat_rpc_timeout_ms to make this configurable for such large clusters.

Test Plan: eyeball

Reviewers: kannan, mihnea, hector

Reviewed By: hector

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D7279
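
As an illustration of the change described in the summary above, the following sketch shows the general pattern of replacing a hard-coded timeout with a command-line flag. It is written in Python with `argparse` rather than the C++ gflags code in the actual commit. The flag name `heartbeat_rpc_timeout_ms` comes from the commit message; the default of 10000 ms is an assumption based on the old hard-coded 10 sec value, and the function name is hypothetical.

```python
import argparse

# Assumption: the default mirrors the previous hard-coded 10 sec timeout.
DEFAULT_HEARTBEAT_RPC_TIMEOUT_MS = 10_000


def parse_heartbeat_timeout_ms(argv):
    """Return the heartbeat RPC timeout in ms from a list of CLI arguments."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument(
        "--heartbeat_rpc_timeout_ms",
        type=int,
        default=DEFAULT_HEARTBEAT_RPC_TIMEOUT_MS,
        help="Timeout for the TS->master heartbeat RPC, in milliseconds.",
    )
    # Ignore unrelated flags so the sketch works alongside other options.
    args, _unknown = parser.parse_known_args(argv)
    if args.heartbeat_rpc_timeout_ms <= 0:
        raise ValueError("heartbeat_rpc_timeout_ms must be positive")
    return args.heartbeat_rpc_timeout_ms


if __name__ == "__main__":
    # A large cluster might raise the timeout, for example to 30 seconds.
    print(parse_heartbeat_timeout_ms(["--heartbeat_rpc_timeout_ms=30000"]))
```

With no flag given, the sketch keeps the old 10 second behavior, so only clusters that need a longer timeout have to set it.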
m-iancu pushed a commit that referenced this issue Sep 30, 2019
instead of hardcoding 10sec timeout #2418
