Apply non-persistent log? #600
And what do you do if some nodes of the cluster are up, and some are down (or unreachable)? Should you continue to retry sending those updates to the downed nodes, waiting for them to come back? Should they pick up the messages after they come back online?

There are a bunch of details missing from your proposal.
Hey @bubbajoe,
I don't think we'd ever want to add arbitrary messaging features to raft, as that's not a core function of the library. All our tools either coordinate through logs written to the FSM (which by definition go to all other nodes in a consistent order) or by out-of-band RPC for things that are not appropriate to be "written" through raft. Note that your application is already the source of truth about which nodes exist in the cluster and their addresses, since you have to configure that in raft, so sending arbitrary messages to them shouldn't be hard. That's my recommendation for the initial request to "send data to all members".

Your specific use case does sound more raft-related, but I'm not super clear on what you are looking for. In general raft only guarantees linearizable reads and writes if you read and write from the leader. So if you ever have followers accept and handle reads, you are going to be serving inconsistent data by definition. If you want something either weaker or stronger than that model, you'll need to specify the behaviour a little more precisely, because what you are describing is no longer raft but an additional algorithm on top!

For that reason, and because the specific needs of every application are different with regard to consistency requirements, it doesn't seem likely to me that there is a single solution it makes sense to build into raft for this. Though if there is a reasonable design you have checked that would only be possible with additional help from the library, that might be an interesting feature request.
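A minimal sketch of that out-of-band approach, assuming an initialized `*raft.Raft` named `r`; `sendRPC` here is a hypothetical application-level helper, not part of this library:

```go
package main

import (
	"log"

	"github.com/hashicorp/raft"
)

// broadcast sends an application-level payload to every cluster member,
// using raft's own membership configuration as the address list.
func broadcast(r *raft.Raft, payload []byte, sendRPC func(addr string, p []byte) error) {
	future := r.GetConfiguration()
	if err := future.Error(); err != nil {
		log.Printf("failed to read raft configuration: %v", err)
		return
	}
	for _, srv := range future.Configuration().Servers {
		// Retries and handling of down nodes are entirely up to the
		// application here; raft makes no delivery guarantees for this.
		if err := sendRPC(string(srv.Address), payload); err != nil {
			log.Printf("send to %s failed: %v", srv.Address, err)
		}
	}
}
```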
I'm still not sure this is really what you mean but... the only time this matters in a traditional raft setup where all reads go to the leader is when a new leader takes over: it has to make sure that it has applied every previously committed write to its FSM before it accepts reads. We do already have a mechanism for that: the Barrier call.
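A minimal sketch of that call, assuming an initialized `*raft.Raft` named `r` on the node that has just become leader:

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/raft"
)

// waitForFSMCatchUp blocks until the FSM has applied every entry that
// was committed before the call; after it returns it is safe to serve
// linearizable reads from the local FSM. Only meaningful on the leader.
func waitForFSMCatchUp(r *raft.Raft) error {
	if err := r.Barrier(10 * time.Second).Error(); err != nil {
		// Typical failures: this node is not the leader, or the
		// barrier timed out before the FSM caught up.
		log.Printf("barrier failed: %v", err)
		return err
	}
	return nil
}
```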
Right, that's what I was hinting at in my comment. The whole point of Raft is (in a sense) to send messages to multiple nodes in a cluster in a very tightly defined way, in a specific order, such that it handles all the edge cases when some nodes are down, come back, etc.

Now there are some situations where nodes talking to other nodes doesn't require Raft. However, I have only ever found one situation in rqlite that ended up doing that -- it's when one node wants to know something about another node (the specific attribute doesn't matter). In that case the first node reaches out to the second node and asks it. If that other node doesn't respond, rqlite handles the failure there and then and moves on. But it's a very specific node talking to a very specific node, not a "one node sends a message to all other nodes" situation. You can read the full story here: https://www.philipotoole.com/rqlite-6-0-0-building-for-the-future/
Thank you for your responses. I'd like to provide additional context regarding my application. I am developing a distributed API gateway, an open-source project leveraging HashiCorp's Raft implementation. The gateway comprises an admin API for configuration updates and a proxy server that reflects those changes.

Upon startup, we observe a gradual stream of state updates, as expected. However, there is no mechanism to confirm that all logs have been applied and the server is now ready to wait for new logs. I need a way to ensure that both the proxy server and the admin API only begin accepting traffic once they are fully synchronized with the locally persisted logs. This is essential for optimizing node startup times: log entries may necessitate a proxy server restart, and performing that operation thousands of times during startup is time-consuming. Currently I use a maximum append-entries size of 1024 and restart only at the end of batches, but I am looking to optimize this further. While raft.Barrier() is beneficial for the leader, my scenario requires follower nodes to accept reads as well. Any suggestions or insights on achieving these optimizations would be greatly appreciated.
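For reference, a sketch of the batching setting mentioned above; as far as I know, 1024 is the maximum value the library's config validation accepts for MaxAppendEntries:

```go
package main

import "github.com/hashicorp/raft"

func newConfig() *raft.Config {
	conf := raft.DefaultConfig()
	// Batch up to 1024 entries per AppendEntries RPC.
	conf.MaxAppendEntries = 1024
	return conf
}
```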
By "new logs" I presume you mean "ready for writes". Since you can't have writes until a leader is elected, register for the LeaderObservation change. Once you see one go by with a non-empty leader, the node is ready. This works regardless of whether the node becomes the leader or a follower.
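A minimal sketch of that observer pattern, assuming an initialized `*raft.Raft` named `r`; note the `LeaderAddr` field name is from recent versions of hashicorp/raft and may differ in older releases:

```go
package main

import "github.com/hashicorp/raft"

// waitForLeader blocks until a leader has been observed, i.e. the
// cluster can accept writes. Works whether this node ends up as the
// leader or a follower.
func waitForLeader(r *raft.Raft) {
	obsCh := make(chan raft.Observation, 16)
	observer := raft.NewObserver(obsCh, false, func(o *raft.Observation) bool {
		_, ok := o.Data.(raft.LeaderObservation)
		return ok
	})
	r.RegisterObserver(observer)
	defer r.DeregisterObserver(observer)

	for o := range obsCh {
		if lo, ok := o.Data.(raft.LeaderObservation); ok && lo.LeaderAddr != "" {
			return // non-empty leader address: ready for writes
		}
	}
}
```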
"Persisted" is probably not what you mean. Just because a log is persisted i.e. written to disk, does not mean it's committed. See the Raft paper. In fact "persisted" logs could be removed on a node if depending on what the Leader does at election time. To learn what is actually committed requires a Leader election. See section 5.2 of the paper. I acknowledge this can be tricky to think about -- I often have to think about it myself multiple times.
You need to be a bit more precise. When you say "accept reads", do you mean "followers must be able to service reads from their local store"? That can be done, but what state do you want that local store to be in? See rqlite's docs to understand better what you're asking: https://rqlite.io/docs/api/read-consistency/ What type of read consistency do you want your system to offer? And what conditions do you want to be true before those reads can be served by a follower, and with what consistency?
Now, you can determine somewhat what is committed. For example, if you find snapshots when you launch, you know those entries were committed. While a node is running it also knows what has been committed (it stores that index in memory), so you could periodically persist that to, say, BoltDB. This would get you close to your goal, i.e. poll https://pkg.go.dev/github.com/hashicorp/raft#Raft.AppliedIndex. rqlite used to do this, but since I optimized start-up times another way I didn't need it anymore and removed it.
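A minimal sketch of that polling approach; `target` is a hypothetical applied index the application persisted (e.g. to BoltDB) during a previous run:

```go
package main

import (
	"time"

	"github.com/hashicorp/raft"
)

// waitForAppliedIndex polls until the local FSM has applied at least
// target. A brand-new node has no cached index, so this only helps on
// restart of an existing node.
func waitForAppliedIndex(r *raft.Raft, target uint64) {
	for r.AppliedIndex() < target {
		time.Sleep(100 * time.Millisecond)
	}
	// Reads served from the local FSM are now at least as fresh as the
	// state at the time target was recorded.
}
```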
@otoolep Thanks for the suggestions. I am using AppliedIndex for now and was able to get it working, but what if a new node is added to the cluster with no appliedIndex cached? And how does rqlite optimize start-up times if it doesn't use AppliedIndex?
If a brand new node joins there is no optimization that I know of. It has to (probably) get a snapshot from another node, and then apply all remaining log entries. |
Thanks for the discussion folks. It's great to see the community able to help each other out with these complex ideas.

This touched on another open issue so I'll mention it for completeness. I was surprised not that long ago to (re)learn how this library handles logs on startup. I'd assumed it replayed them back to the last commit index it knew of prior to restarting, but this isn't the case, since we never persist the commitIndex learned from the leader (except implicitly when we snapshot, but that can be infrequent). So the only safe thing to do is wait for the leader to send at least one appendEntries (data or heartbeat) and then replay logs up to the commitIndex in that message. But that can mean delays where the follower hasn't heard from the leader and so hasn't replayed its whole FSM. In this case, even some writes it previously had in its FSM before the restart aren't replayed yet. This can cause clients that accept "stale" reads from the local FSM to actually go backwards in time for a period until the message and replay occur. There is a separate issue to track improving that, as it's been a real problem for HashiCorp users in the past who often rely on "stale" reads for read scaling. See #549
Hello HashiCorp Team,
I would like to send arbitrary data to all members of the cluster, similar to how configuration logs work, but the API doesn't allow using this log type directly.
Use case: telling my cluster that the (snapshot and) persisted logs have been applied and the system is ready for use. If the system is marked ready prematurely, it could provide stale values on startup. Being able to apply a non-persistent log means the call will block until completed (or error). There doesn't seem to be any way to tell that the persisted logs are done being applied by the FSM.