Improvements around queries #260
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #260 +/- ##
==========================================
- Coverage 89.57% 89.38% -0.20%
==========================================
Files 21 21
Lines 3099 3080 -19
==========================================
- Hits 2776 2753 -23
- Misses 323 327 +4
Flags with carried forward coverage won't be shown.
933410c to 8fdefd7
Just some small typos, otherwise this is looking good to me!
8fdefd7 to 00298f3
I renamed the existing testcase in |
00298f3 to 3178667
3178667 to 809c627
91f9768 to 8313165
8bb85ac to b66fc4f
This pull request is ready for review again. @the-mikedavis, compared to the last time you looked at it, I reintroduced the |
Other than the typos this all looks good to me!
b66fc4f to 227ca2a
Thank you! That was fast :-) I fixed the typos in the commit messages.
227ca2a to 1fc15aa
The use of |
1fc15aa to 07ec4a6
IIRC |
Good idea. I will restore the direct send to the leader if it is known and use the I'm in the process of removing the leader ID caching we do and use |
…rver}`

[Why]
The default behavior of Khepri is that a reply to a synchronous command is sent to the caller once a quorum number of Ra members committed the command and the leader applied it. This list of Ra members might not include the Ra server local to the caller. If the caller then depends on the local state, for local queries or projections, it might not see the update that was successfully applied yet. This behavior might be surprising.

The default is now to send the reply to a synchronous command only once a quorum number of Ra members committed the command and the local Ra server applied it, regardless of whether it is the leader or a follower. This ensures the local state is in sync with whatever the caller might expect after a successful update.

[How]
The default value of the `reply_from` option is simply set to `{member, LocalRaServer}`. We continue to send the command to the leader though.

The caller might still decide to change the `reply_from` option for lower latency.

This change is also required by the changes to queries that follow.
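To make the difference between the two reply points concrete, here is a minimal toy model in Python. It is not Khepri or Ra code: `Member`, `catch_up` and `sync_put` are hypothetical names, replication is simulated in-process, and the local member is deliberately modelled as the one lagging behind the quorum.

```python
class Member:
    """Toy stand-in for one Ra member (hypothetical, not the Ra API)."""

    def __init__(self, name):
        self.name = name
        self.applied_index = 0
        self.state = {}

    def catch_up(self, log, up_to):
        # Apply log entries in order, up to and including index `up_to`.
        while self.applied_index < up_to:
            key, value = log[self.applied_index]
            self.state[key] = value
            self.applied_index += 1


def sync_put(log, members, leader, local, key, value, reply_from):
    """Append a command, apply it on a quorum, return once `reply_from` applied it."""
    log.append((key, value))
    index = len(log)
    quorum = len(members) // 2 + 1
    # The leader and the other members form the quorum first; `local` is
    # left out on purpose to model a follower that lags behind.
    fast = [leader] + [m for m in members if m is not leader and m is not local]
    for member in fast[:quorum]:
        member.catch_up(log, index)
    if reply_from is local:
        # Replying from the local member means waiting until it has applied
        # the command as well.
        local.catch_up(log, index)
    return index


log = []
a, b, c = Member("a"), Member("b"), Member("c")
members = [a, b, c]

# Reply from the leader: the call returns while local member `c` is stale.
sync_put(log, members, leader=a, local=c, key="k", value=1, reply_from=a)
stale_read = c.state.get("k")

# Reply from the local member: `c` has applied the update when the call returns.
sync_put(log, members, leader=a, local=c, key="k", value=2, reply_from=c)
fresh_read = c.state.get("k")
```

In this sketch the first local read misses the update and the second one sees it, which is the behavior the new default is meant to give callers that rely on local queries or projections.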
[Why] It can return `no_correlation` in addition to a valid correlation ID.
[Why]
The default priority is `low` in Ra. This means that the leader may batch all low priority commands and only commit them after some internal conditions.

For Khepri, we change that default to `normal` to have a behavior closer to synchronous commands. This brings a more logical behavior to queries, or more exactly a less surprising one. Indeed, we want to push for local queries by default and add a "fence" mechanism in subsequent commits for various reasons (see the commit messages to learn more). Having asynchronous commands that behave more like synchronous ones makes it easier to reason about the caveats around queries and the local Ra server state.
... if possible.

[Why]
This will be useful for the upcoming fence mechanism (see a later commit). This will ensure that commands, even async ones, and queries are managed in order.

[How]
For async commands without a correlation ID, we can send them to the local Ra server. It will handle redirection to the leader if the local server is a follower.

For those with a correlation ID, the local Ra server won't handle redirections. Therefore, we must query the leader and send the command to it. If we don't know the leader, it is sent to the local Ra server and if it is a follower, it will reply with an error.

Async commands with a correlation ID won't be covered by the fence mechanism.
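The routing rules described in this commit message can be summarized as a small decision function. This is only an illustrative Python sketch: `choose_target` and its parameters are hypothetical names, and servers are represented as plain strings rather than Ra server IDs.

```python
def choose_target(local_server, known_leader, has_correlation_id):
    """Return the server an async command is sent to, following the rules above."""
    if not has_correlation_id:
        # The local server redirects to the leader itself if it is a follower.
        return local_server
    if known_leader is not None:
        # With a correlation ID there is no redirection, so the command
        # goes to the leader directly.
        return known_leader
    # Leader unknown: fall back to the local server, which replies with an
    # error if it turns out to be a follower.
    return local_server


no_corr = choose_target("local", "leader", has_correlation_id=False)
with_corr = choose_target("local", "leader", has_correlation_id=True)
unknown_leader = choose_target("local", None, has_correlation_id=True)
```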
[Why]
With the previous commits, we ensured that when a synchronous update returns, the local machine state is up-to-date. The chance that it is the case for async commands without a correlation ID has also increased.

In a follow-up commit, Khepri will default to local queries, to eliminate the risks linked to remote executions and possibly incompatible code. Therefore, a query is local to make sure its execution is local and that it works on an up-to-date state.

There is still the following scenario where this isn't enough:
1. The caller makes several asynchronous updates without a correlation ID, or overrides the default of `reply_from => {member, LocalRaServer}`.
2. The caller then performs a query.

In this case, the query is unlikely to be executed after the asynchronous commands were applied. This is not a bug: the caller explicitly asked for asynchronous updates. The caller could use correlation IDs and wait for the replies. But without correlation IDs, it's not possible.

[How]
To help the caller in this case, this patch introduces `khepri:fence/{0,1,2}`. It is a blocking call that queries the Ra leader to learn its last index (the number of the last command it received), then performs an arbitrary local query, passing that index so that the query execution waits for that index to be committed locally.

This way, by putting a call to `khepri:fence()` between asynchronous updates and a query, the caller ensures that the query will see the result of those asynchronous updates.

To make this mechanism effective, we also send all synchronous commands and asynchronous commands without a correlation ID to the local Ra server, instead of the leader even if we know who it is. This way, all messages are "serialized".
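The fence idea can be illustrated with a single-process toy model in Python. This is an assumption-laden sketch, not Khepri's implementation: the `Store` class and its methods are hypothetical, and "blocking until the index is reached locally" is simulated by driving replication steps inside the wait loop.

```python
class Store:
    """Toy two-member store: a leader log and a lagging local follower."""

    def __init__(self):
        self.log = []            # commands received by the leader, in order
        self.local_applied = 0   # how many of them the local member applied
        self.local_state = {}

    def async_put(self, key, value):
        # Fire-and-forget: the leader has the command, the local member may not.
        self.log.append((key, value))

    def replicate_one(self):
        # One replication step: the local member applies its next entry, if any.
        if self.local_applied < len(self.log):
            key, value = self.log[self.local_applied]
            self.local_state[key] = value
            self.local_applied += 1

    def local_query(self, key, wait_for_index=0):
        # Stand-in for blocking: replicate until the local member reached the index.
        while self.local_applied < wait_for_index:
            self.replicate_one()
        return self.local_state.get(key)

    def fence(self):
        # Learn the leader's last index, then run an arbitrary local query
        # that waits for that index to be reached locally.
        last_index = len(self.log)
        self.local_query(None, wait_for_index=last_index)
        return "ok"


store = Store()
store.async_put("a", 1)
store.async_put("b", 2)

before_fence = store.local_query("b")   # the local member has not caught up
store.fence()
after_fence = store.local_query("b")    # the fence made the updates visible
```

The point of the sketch is the ordering: the fence captures the leader's last index at call time, so every update sent before the fence is visible to a local query issued after it.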
[Why]
Before, Khepri tried something smart by mixing consistent and leader queries by default. It exposed the `favor` option to let the caller influence the type of query.

Leader queries are a good compromise in terms of latency and freshness of the queried data. However, they come with a big caveat: the query function is executed on the leader, not on the local node (if the local node is a follower).

There are various potential issues with the remote execution of query functions:
* The function reference might be invalid on the remote node because of a difference in the version of Erlang/OTP.
* The function reference might be invalid on the remote node because the module hosting the function is not the same.
* The function reference might be invalid on the remote node because the module hosting the function is missing.
* Using an exported MFA instead of a function reference doesn't solve the problem because the remote copy of that MFA may still be missing or simply be incompatible (different code/behavior). This might be even more difficult to debug.

The problem is the same with consistent queries.

[How]
The only way to be sure that the query function behaves as the caller expects is to run the local copy of that function. Therefore, we must always use a local query.

The `favor` option is changed to accept `low_latency` or `consistency`. A local query will always be used behind the scenes. However, if the caller favors consistency, the query will leverage the fence mechanism introduced in a previous commit.

If the caller favors low latency, there is a risk that the query runs against out-of-date data. That is why a previous commit changed the default behavior of synchronous updates so that the call returns only when it was applied locally. This should increase the chance that the query works on fresh data.

Therefore, with the new default behaviors and options in this commit and the previous ones, we ensure that a query will work with the local query function and that it will be executed against the local up-to-date state.
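As a rough illustration of how the two `favor` values differ, here is a toy Python model. The names `LocalReplica`, `wait_for` and `query` are hypothetical, the option values are passed as plain strings, and the fence is simulated by catching up with the leader's log in-process; none of this is Khepri's actual code.

```python
class LocalReplica:
    """Toy local member that lags behind the leader's log."""

    def __init__(self, leader_log):
        self.leader_log = leader_log
        self.applied = 0
        self.state = {}

    def wait_for(self, index):
        # Stand-in for blocking until `index` is reached locally.
        while self.applied < index:
            key, value = self.leader_log[self.applied]
            self.state[key] = value
            self.applied += 1


def query(replica, key, favor):
    """Always query the local state; `favor` only decides whether to fence first."""
    if favor == "consistency":
        # Fence: wait for the leader's last index before reading locally.
        replica.wait_for(len(replica.leader_log))
    elif favor != "low_latency":
        raise ValueError(f"unknown favor value: {favor!r}")
    return replica.state.get(key)


leader_log = [("k", 1)]
replica = LocalReplica(leader_log)

fast_read = query(replica, "k", favor="low_latency")       # may be out of date
consistent_read = query(replica, "k", favor="consistency")  # fenced, then local
```

In both branches the read itself is local, so the caller's own copy of the query function is the one that runs; only the freshness guarantee changes.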
07ec4a6 to a379a73
... in `handle_async_ret/2`

[Why]
If the async command with a correlation ID is sent to a follower, that follower will reply with the `not_leader` error. The caller is responsible for resending the command to the leader that is specified in the `not_leader` error.

[How]
This piece of information was not passed to the caller by `handle_async_ret/2` so the caller couldn't do it. The API is fixed by this commit.
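The resend pattern this fix enables might look like the following toy Python sketch. The tuple shape of the error and the helper names `send_async` and `send_with_retry` are assumptions made for illustration; the commit message only states that the `not_leader` error specifies the leader.

```python
def send_async(server, leader, command, applied):
    """Toy server: only the leader accepts the command; others name the leader."""
    if server != leader:
        return ("error", ("not_leader", leader))
    applied.append(command)
    return "ok"


def send_with_retry(first_target, leader, command, applied):
    """Resend once to the leader named in a `not_leader` error."""
    ret = send_async(first_target, leader, command, applied)
    if isinstance(ret, tuple) and ret[0] == "error" and ret[1][0] == "not_leader":
        # The error carries the leader, so the caller can redirect the command.
        hinted_leader = ret[1][1]
        ret = send_async(hinted_leader, leader, command, applied)
    return ret


applied = []
first_reply = send_async("n2", "n1", "cmd", applied)   # n2 is a follower
final_reply = send_with_retry("n2", "n1", "cmd", applied)
```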
[Why] Ra exposes the leader ID through `ra_leaderboard:lookup_leader/1`. There is no need to duplicate that feature anymore.
@the-mikedavis: I restored the send to the leader, but with |
[Why] This option was removed in rabbitmq/khepri#260.
Currently, queries suffer from several issues or limitations:
This collection of patches aims at fixing this with the following changes:
`reply_from => local` option.
`local` (possible thanks to the change above).
`khepri:fence()` API is introduced.
See individual commits to learn more.
Fixes #238.