
Pinot graceful node replacement for large scale production usage #14592

lnbest0707-uber opened this issue Dec 4, 2024
On real-world cloud-based stateful platforms, the hosts underlying the Pinot containers change dynamically, so host/node replacement is very frequent. Ideally, such operations should be fully transparent to users, without even requiring Pinot admins' attention.
However, Pinot today does not have a sufficiently graceful way to handle node replacement. Although tables usually run with multiple replicas, operating in an under-replicated state leaves the system stressed and at risk. For example, if a table has 2 replicas, during a node replacement we have to experience:

  • Many segments are served by only 1 replica, so the query load on that replica doubles.
  • For segments running with only 1 replica, there is a high risk of data loss or query failure if the remaining node goes down due to hardware or network issues.

Although the same issue arises during a node restart, node replacement is far slower than a restart, especially with a high data volume. For example, for a large node with 5+TB of data, a single node replacement might take 5+ hours to complete, far longer than a node restart, which might take only minutes. The longer the node replacement takes, the longer the downtime we have to endure and the greater the risk we take on.
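For intuition, rough arithmetic (assuming, as described below, that the replacement time is dominated by re-downloading segment data):

$$\frac{5\ \text{TB}}{5\ \text{h}} = \frac{5\times10^{12}\ \text{B}}{18{,}000\ \text{s}} \approx 280\ \text{MB/s}$$

That is, the quoted numbers already imply a sustained transfer rate of roughly 280 MB/s, so at multi-TB volumes the data movement itself dominates the replacement time and needs to be taken off the critical path.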

Hence, reducing node replacement downtime is crucial for smooth large-scale production maintenance.

During the downtime, we would observe that the replacement is far slower than a node restart because the node has to download the missing segment data from either the deep store or peers before loading it into memory.

Therefore, a straightforward and effective way to reduce the downtime is to pre-download all required segments onto the new node (NN) before bringing down the old node (ON), as sketched below. Afterwards, bringing down the ON and starting up the NN is as lightweight as a node restart.
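As a rough illustration of the pre-download idea, here is a minimal orchestration sketch in Python. It assumes a controller endpoint that maps servers to their segments and segment ZK metadata that exposes a deep-store download URI; the exact endpoint paths, response shapes, field names, instance id, and directory layout are all assumptions for illustration, not a committed Pinot API:

```python
import pathlib
import requests

CONTROLLER = "http://pinot-controller:9000"          # assumed controller URL
TABLE = "myTable_OFFLINE"                            # hypothetical table
OLD_NODE = "Server_old-host_8098"                    # hypothetical ON instance id
NN_STAGING_DIR = pathlib.Path("/var/pinot/server/segmentTar")  # hypothetical NN staging dir


def segments_on_old_node() -> list[str]:
    """Find segments hosted by the ON via the controller's
    server-to-segments map (response shape treated as an assumption)."""
    resp = requests.get(f"{CONTROLLER}/segments/{TABLE}/servers", timeout=30)
    resp.raise_for_status()
    for entry in resp.json():
        server_map = entry.get("serverToSegmentsMap", {})
        if OLD_NODE in server_map:
            return server_map[OLD_NODE]
    return []


def deep_store_uri(segment: str) -> str:
    """Look up the segment's deep-store download URI from its ZK metadata
    (field name is an assumption)."""
    resp = requests.get(
        f"{CONTROLLER}/segments/{TABLE}/{segment}/metadata", timeout=30)
    resp.raise_for_status()
    return resp.json()["segment.download.url"]


def predownload(segment: str) -> None:
    """Stream the segment tarball into the NN's staging dir so the server
    can load it locally instead of re-downloading after the swap."""
    target = NN_STAGING_DIR / f"{segment}.tar.gz"
    with requests.get(deep_store_uri(segment), stream=True, timeout=300) as resp:
        resp.raise_for_status()
        with open(target, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)


if __name__ == "__main__":
    NN_STAGING_DIR.mkdir(parents=True, exist_ok=True)
    for seg in segments_on_old_node():
        predownload(seg)
    # Once all segments are staged, stop the ON and start the NN with the
    # same Helix tags; with matching CRCs the NN should be able to load the
    # staged segments instead of downloading them (behavior assumed).
```

The key property of such a flow is that all the bulk data movement happens while the ON is still serving queries at full replication, so the under-replicated window shrinks from hours of downloading to minutes of restart-style loading.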
