
Apply non-persistent log? #600

Closed
bubbajoe opened this issue Jun 17, 2024 · 10 comments

Comments

@bubbajoe

Hello Hashicorp Team,

I would like to send arbitrary data to all members of the cluster. Similar to how configuration logs work, but the API doesn't allow this log type to be used directly.

Use case: Telling my cluster that the (snapshot and) persisted logs have been applied, and the system is ready for use. If the system is set to ready prematurely, it could provide stale values on startup. Being able to apply a non-persistent log means the call would block until it completes (or errors). There doesn't seem to be any way to tell that the persisted logs are done being applied by the FSM.

@otoolep
Contributor

otoolep commented Jun 17, 2024

And what do you do if some nodes of the cluster are up, and some are down (or unreachable)? Should you keep retrying those updates to the downed nodes, waiting for them to come back? Should they pick up the messages after they come back online?

@otoolep
Contributor

otoolep commented Jun 17, 2024

There are a bunch of details missing from your proposal.

@banks
Member

banks commented Jun 17, 2024

Hey @bubbajoe,

I would like to send arbitrary data to all members of the cluster.

You can already do this in a way - write your message to raft and handle it in your FSM.

I don't think we'd ever want to add arbitrary messaging features to raft as that's not a core function of the library. All our tools either coordinate through logs written to FSM (which by definition go to all other nodes in a consistent order) or by out-of-band RPC for things that are not appropriate to be "written" through raft. Note that your application is already the source of truth about which nodes exist in the cluster and their addresses since you have to configure that in raft so sending arbitrary messages to them shouldn't be hard.
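
For illustration, a rough sketch of that pattern - the `command` envelope and `appFSM` type are made-up names for this example, not part of the library:

```go
package main

import (
	"encoding/json"
	"time"

	"github.com/hashicorp/raft"
)

// command is a hypothetical envelope for application-level messages.
type command struct {
	Op      string `json:"op"`
	Payload []byte `json:"payload,omitempty"`
}

// broadcast writes the message through Raft on the leader. Apply blocks
// (up to the timeout) until the entry is committed and applied to the
// leader's FSM; followers apply it as they replicate the log.
func broadcast(r *raft.Raft, c command) error {
	buf, err := json.Marshal(c)
	if err != nil {
		return err
	}
	return r.Apply(buf, 10*time.Second).Error()
}

// appFSM stands in for your FSM; every node sees the entry, in log order.
type appFSM struct{ /* application state */ }

func (f *appFSM) Apply(l *raft.Log) interface{} {
	var c command
	if err := json.Unmarshal(l.Data, &c); err != nil {
		return err
	}
	switch c.Op {
	case "ready":
		// react to the "broadcast" message here
	}
	return nil
}
```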

So that's my recommendation for the initial request to "send data to all members".

Your specific use case does sound more raft-related, but I'm not super clear on what you are looking for. In general raft only guarantees linearizable reads and writes if you read and write from the leader. So if you ever have followers accept and handle reads, you are going to be serving inconsistent data by definition. If you want something either weaker or stronger than that model, you'll need to specify the behaviour a little more precisely, because what you are describing is no longer raft but an additional algorithm on top!

For that reason, and because every application has different consistency requirements, it doesn't seem likely to me that there is a single solution it makes sense to build into raft for this. That said, if you have worked out a design that would only be possible with additional help from the library, that might be an interesting feature request.

There doesn't seem to be any way to tell that the persisted logs are done being applied by the FSM.

I'm still not sure this is really what you mean, but... in a traditional raft setup where all reads go to the leader, the only time this matters is when a new leader takes over: it has to make sure it has applied every previously committed write to its FSM before it accepts reads. We already have a mechanism for that, called raft.Barrier(). In our products, the new leader, as part of its initialisation (and before it starts accepting either reads or writes from clients), calls that and waits for it to complete, to ensure that it has a completely up-to-date FSM. Does this help you?
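
A rough sketch of that pattern; the setReady callback that gates your read path is hypothetical, not part of the library:

```go
package main

import (
	"time"

	"github.com/hashicorp/raft"
)

// runLeaderLoop watches for leadership changes. On gaining leadership it
// calls Barrier, which returns once every previously committed entry has
// been applied to the local FSM, and only then marks the node ready.
func runLeaderLoop(r *raft.Raft, setReady func(bool)) {
	for isLeader := range r.LeaderCh() {
		if !isLeader {
			setReady(false) // stepped down: stop serving linearizable reads
			continue
		}
		if err := r.Barrier(30 * time.Second).Error(); err != nil {
			setReady(false) // lost leadership or timed out; wait for the next notification
			continue
		}
		setReady(true) // FSM is fully caught up; safe to serve reads and writes
	}
}
```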

@otoolep
Contributor

otoolep commented Jun 17, 2024

Right, that's what I was hinting at in my comment. The whole point of Raft is (in a sense) to send messages to multiple nodes in a cluster in a very tightly defined way, in a specific order, such that it handles all the edge cases when some nodes are down, come back, etc.

Now there are some situations where nodes talking to other nodes doesn't require Raft. However, I have only ever found one situation in rqlite that ended up doing that: when one node wants to know something about another node (the specific attribute doesn't matter). In that case the first node reaches out to the second node and asks it. If that other node doesn't respond, rqlite handles that failure there and then and moves on. But it's a very specific node talking to another very specific node; it's not a "one node sends a message to all other nodes" situation.

You can learn the full story here: https://www.philipotoole.com/rqlite-6-0-0-building-for-the-future/

@bubbajoe
Author

Thank you for your responses. I'd like to provide additional context regarding my application. I am developing a Distributed API Gateway, an open-source project leveraging HashiCorp's Raft implementation. This gateway comprises an admin API for configuration updates and a proxy server that reflects these changes.

Upon startup, we observe a gradual increase in state updates, as expected. However, there is no mechanism to confirm that all logs have been applied and the server is now ready to wait for new logs. I need a solution to ensure that both the proxy server and the admin API only begin accepting traffic once they are fully synchronized with the locally persisted logs.

This is essential for optimizing node startup times. Log entries may necessitate a proxy server restart, and performing this operation thousands of times during startup is time-consuming. Currently, I use a maximum append entries size of 1024 and restart only at the end of batches, but I am looking to further optimize this process.

While raft.Barrier() is beneficial for the leader, my scenario requires follower nodes to accept reads as well. Any suggestions or insights on achieving these optimizations would be greatly appreciated.

@otoolep
Contributor

otoolep commented Jun 18, 2024

However, there is no mechanism to confirm that all logs have been applied and the server is now ready to wait for new logs.

By "new logs" I presume you mean "ready for writes". Since you can't have writes until a Leader is elected, register for the LeaderObservation change. Once you see that go by with a non-empty leader, the node is ready. This works regardless of whether the node becomes the Leader or a Follower.

locally persisted logs

"Persisted" is probably not what you mean. Just because a log is persisted i.e. written to disk, does not mean it's committed. See the Raft paper. In fact "persisted" logs could be removed on a node if depending on what the Leader does at election time.

To learn what is actually committed requires a Leader election. See section 5.2 of the paper. I acknowledge this can be tricky to think about -- I often have to think about it myself multiple times.

my scenario requires follower nodes to accept reads as well. Any suggestions or insights on achieving these optimizations would be greatly appreciated.

You need to be a bit more precise. When you say "accept reads", do you mean "followers must be able to service reads from their local store"? That can be done, but what state do you want that local store to be in? See rqlite's docs to understand better what you're asking:

https://rqlite.io/docs/api/read-consistency/

What type of read consistency do you want your system to offer? And what kind of conditions do you want to be true before those reads can be served by a follower, and with what consistency?

@otoolep
Contributor

otoolep commented Jun 18, 2024

Now, you can determine, to some extent, what has been committed. For example, if you find snapshots when you launch, you know those entries were committed. While a node is running it also knows what has been committed (it stores that index in memory), so you could persist that to, say, BoltDB periodically. This would get you close to your goal.

I.e. poll https://pkg.go.dev/github.com/hashicorp/raft#Raft.AppliedIndex.
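
A rough sketch of that approach; loadSavedIndex stands in for however you persist the index (BoltDB or similar) and is not part of the library:

```go
package main

import (
	"time"

	"github.com/hashicorp/raft"
)

// loadSavedIndex is a placeholder: return the applied index you persisted
// (e.g. to BoltDB) during the previous run; 0 means "nothing saved".
func loadSavedIndex() uint64 { return 0 }

// waitForReplay blocks until the local FSM has re-applied logs at least
// up to the index recorded before the last shutdown.
func waitForReplay(r *raft.Raft) {
	target := loadSavedIndex()
	for r.AppliedIndex() < target {
		time.Sleep(100 * time.Millisecond)
	}
	// FSM has reached the previously recorded index; open for traffic.
}
```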

rqlite used to do this, but since I optimized start-up times, I didn't need it anymore and removed it.

@bubbajoe
Author

@otoolep Thanks for the suggestions.

I am using AppliedIndex for now and I was able to get it working, but what if a new node is added to the cluster with no appliedIndex cached?

How does rqlite optimize start-up times if it doesn't use AppliedIndex?

@otoolep
Contributor

otoolep commented Jun 22, 2024

If a brand new node joins there is no optimization that I know of. It has to (probably) get a snapshot from another node, and then apply all remaining log entries.

@banks
Member

banks commented Jun 24, 2024

Thanks for the discussion folks. It's great to see the community able to help each other out with these complex ideas.

This touched on another open issue so I'll mention it for completeness. I was surprised not that long ago to (re)learn how this library handles logs on startup. I'd assumed it replayed them back to the last commit index it knew of prior to restarting, but this isn't the case, since we never persist the commitIndex learned from the leader (except implicitly when we snapshot, but that can be infrequent). So the only safe thing to do is wait for the leader to send at least one appendEntries (data or heartbeat) and then replay logs up to the commitIndex in that message. But that can mean delays where the follower hasn't heard from the leader and so hasn't replayed its whole FSM. In this case, even some writes it previously did have in its FSM before the restart aren't replayed yet. This can cause clients that accept "stale" reads from the local FSM to actually go backwards in time for a period until the message and replay occur.

There is a separate issue to track improving that as it's been a real issue for HashiCorp users in the past who often rely on "stale" reads for read scaling. See #549
