Improvements around queries #260

dumbbell · 2024-06-27T15:46:46Z

Currently, queries suffer several issues or limitations:

They are executed remotely on the leader. This is a problem if the remote node has an incompatible version of Erlang/OTP or the module hosting the function, or even lacks the function entirely.
Local queries solve this problem but they may be executed against ouf-of-date data when compared to updates that were previously submitted.

This collection of patches aims at fixing this with the following changes:

The default options of updates changes to use the reply_from => local option.
The default query type is now local (possible thanks to the change above).
A new khepri:fence() API is introduced.

See individual commits to learn more.

Fixes #238.

codecov · 2024-06-27T15:47:00Z

Codecov Report

Attention: Patch coverage is 91.66667% with 6 lines in your changes missing coverage. Please review.

Project coverage is 89.38%. Comparing base (0f56ea9) to head (a379a73).

Files	Patch %	Lines
src/khepri_machine.erl	92.18%	5 Missing ⚠️
src/khepri.erl	87.50%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #260      +/-   ##
==========================================
- Coverage   89.57%   89.38%   -0.20%     
==========================================
  Files          21       21              
  Lines        3099     3080      -19     
==========================================
- Hits         2776     2753      -23     
- Misses        323      327       +4

Flag	Coverage Δ
erlang-25	`88.63% <91.66%> (-0.14%)`	⬇️
erlang-26	`89.25% <91.66%> (-0.04%)`	⬇️
os-ubuntu-latest	`89.38% <91.66%> (+0.03%)`	⬆️
os-windows-latest	`88.60% <91.66%> (-0.17%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

the-mikedavis

Just some small typos, otherwise this is looking good to me!

src/khepri.erl

dumbbell · 2024-06-28T08:52:51Z

I renamed the existing testcase in test/fence.erl (too much copy-pasting :-) and added few more testcases.

dumbbell · 2024-07-13T14:02:09Z

This pull request is ready for review again. @the-mikedavis, compared to the last time you looked at it, I reintroduced the favor option but limited the values to low_latency and consistency. Regardless of the value, the query is always local. The difference is in the use of the fence mechanism of not.

the-mikedavis · 2024-07-13T14:46:16Z

I found a few typos in the commit messages:

c5ea616 "loader" => "leader"
653ec8e "ais" => "is"
b66fc4f "increaset" => "increase"
b66fc4f "beraviors" => "behaviors"

the-mikedavis

Other than the typos this all looks good to me!

dumbbell · 2024-07-14T07:35:01Z

Thank you! Thank was fast :-) I fixed the typos in the commit messages.

dumbbell · 2024-07-14T09:51:43Z

The use of reply_from => local for all commands is useless as it is in this patch because we try to send commands directly to the leader if we know it... I believe this explains the transient failure in cluster_SUITE we se in CI.

the-mikedavis · 2024-07-14T12:37:10Z

IIRC reply_from can also be member => RaServerId. We could pass the local node as that member

dumbbell · 2024-07-15T19:32:39Z

IIRC reply_from can also be member => RaServerId. We could pass the local node as that member

Good idea. I will restore the direct send to the leader if it is known and use the reply_from => {member, LocalRaServerId} option.

I'm in the process of removing the leader ID caching we do and use ra_leaderboard:lookup_leader/1 directly.

…rver}` [Why] The default behavior of Khepri is that a reply to a synchronous command is sent to the caller once a quorum number of Ra members committed the command and the leader applied it. This list of Ra members might not include the Ra server local to the caller. If the caller then depends on the local state, for local queries or projections, it might not see the update that was successfully applied yet. This behavior might be surprising. The default is now to send the reply to a synchronous command only once a quorum number of Ra members committed the command and the local Ra server applied it, regardless it is the leader or a follower. This ensures the local state is in sync with whatever the caller might expect after a successful update. [How] The default value of the `reply_from` option is simply set to `{member, LocalRaServer}`. We continue to send the command to the leader though. The caller might still decide to change the `reply_from` for a smaller latency. This change is also required by the changes to queries that follow.

[Why] It can return `no_correlation` in addition to a valid correlation ID.

[Why] The default priority is `low` in Ra. This means that the leader may batch all low priority commands and only commit them after some internal conditions. For Khepri, we change that default to `normal` to have a behavior closer to synchronous commands. This brings a more logicial behavior to queries, or more exactly a less surprising one. Indeed, we want to push for local queries by default and add a "fence" mechanism in subsequent commits for various reasons (see the commit messages to learn more). Having asynchronous commands that behave more closely to synchronous ones makes it easier to reason about the caveats around queries and the local Ra server state.

... if possible. [Why] This will be useful for the upcoming fence mechanism (see a later commit). This will ensure that commands, even async ones, and queries are managed in order. [How] For async commands without a correlation ID, we can send them to the local Ra server. It will handle redirection to the leader if the local server is a follower. For those with a correlation ID, the local Ra server won't handle redirections. Therefore, we must query the leader and send the command to it. If we don't know the leader, it is sent to the local Ra server and if it is a follower, it will reply with an error. Async commands with a correlation ID won't be covered by the fence mechanism.

[Why] With the previous commits, we ensured that when a synchronous update returns, the local machine state is up-to-date. The chance that it is the case for async commands without a correlation ID has also increased. In a follow-up commit, Khepri will default to local queries, to eliminate the risks linked to remote executions and possible incompatible code. Therefore, a query is local to make sure its execution is local and that it works on an up-to-date state. There is still the following scenario where this isn't enough: 1. The caller makes several asynchronous updates without a correlation ID, or overrides the default of `reply_from => {member, LocalRaServer}`. 2. The caller then performs a query. In this case, the query is unlikely to be executed after the asynchronous commands were applied. This is not a bug, the caller explicitly asked for asynchronous updates. The caller could use correlation IDs and wait for the replies. But without correlation IDs, it's not possible. [How] To help the caller in this case, this patch introduces `khepri:fence/{0,1,2}`. It is a blocking call that queries the Ra leader to learn its last index (the number of the last command it received), then performs an arbitrary local query, passing that index so that the query execution waits for that index to be committed locally. This way, by putting a call to `khepri:fence()` between asynchronous updates and a query, the caller ensures that the query will see the result of those asynchronous updates. To make this mechanism effective, we also send all synchronous commands and asynchronous commands without a correlation ID to the local Ra server, instead of the leader even if we know who it is. This way, all messages are "serialized".

[Why] Before, Khepri tried something smart by mixing consistent and leader queries by default. It exposed the `favor` option to let the caller influence the type of query. Leader queries are a good compromise in terms of latency and freshness of the queried data. However, they come with a big caveat: the query function is executed on the leader, not on the local node (if the local node is a follower). There are various potential issues with the remote execution of query functions: * The function reference might be invalid on the remote node because of a difference in the version of Erlang/OTP. * The function reference might be invalid on the remote node because the module hosting the function is not the same. * The function reference might be invalid on the remote node because the module hosting the function is missing. * Using an exported MFA instead of a function reference doesn't solve the problem because the remote copy of that MFA may still be missing or simply be incompatible (different code/behavior). This might be even more difficult to debug. The problem is the same with consistent queries. [How] The only way to be sure that the query function behaves as the caller expects is to run the local copy of that function. Therefore, we must always use a local query. The `favor` option is changed to accept `low_latency` or `consistency`. A local query will always be used behind the scene. However if the caller favors consistency, the query will leverage the fence mechanism introduced in a previous commit. If the caller favors the low latency, there is a risk that the query runs against out-of-date data. That is why a previous commit changed the default behavior of synchronous updates so that the call returns only when it was applied locally. This should increase the chance that the query works on fresh data. Therefore, with the new default behaviors and options in this commit and the previous ones, we ensure that a query will work with the local query function and that it will be executed against the local up-to-date state.

... in `handle_async_ret/2` [Why] If the async command with a correlation ID is sent to a follower, that follower will reply with the `not_leader` error. The caller is responsible for resending the command to the leader that is specified in the `not_leader` error. [How] This piece of information was not passed to the caller by `handle_async_ret/2` so the caller couldn't do it. The API is fixed by this commit.

[Why] Ra exposes the leader ID through `ra_leaderboard:lookup_leader/1`. There is no need to duplicate that feature anymore.

dumbbell · 2024-07-16T07:41:10Z

@the-mikedavis: I restored the send to the leader, but with reply_from => {member, LocalRaServer}. I also dropped the cached leader thing in favor of the use of ra_leaderboard:lookup_leader/1. I finally reorganized the commits a bit and reworded some commit messages. It's ready for review again and hopefully, it should be the last iteration :-)

[Why] The parameter was removed in commit 982f17d (as part of #260).

[Why] This option was removed in rabbitmq/khepri#260.

dumbbell added the enhancement New feature or request label Jun 27, 2024

dumbbell added this to the v0.15.0 milestone Jun 27, 2024

dumbbell requested a review from the-mikedavis June 27, 2024 15:46

dumbbell self-assigned this Jun 27, 2024

dumbbell force-pushed the default-to-local-queries-plus-related-improvements branch from 933410c to 8fdefd7 Compare June 27, 2024 15:47

dumbbell marked this pull request as ready for review June 27, 2024 16:34

the-mikedavis reviewed Jun 27, 2024

View reviewed changes

src/khepri.erl Outdated Show resolved Hide resolved

src/khepri.erl Outdated Show resolved Hide resolved

dumbbell force-pushed the default-to-local-queries-plus-related-improvements branch from 8fdefd7 to 00298f3 Compare June 28, 2024 08:51

dumbbell force-pushed the default-to-local-queries-plus-related-improvements branch from 00298f3 to 3178667 Compare June 28, 2024 09:03

the-mikedavis approved these changes Jun 28, 2024

View reviewed changes

dumbbell force-pushed the default-to-local-queries-plus-related-improvements branch from 3178667 to 809c627 Compare July 11, 2024 15:59

dumbbell marked this pull request as draft July 11, 2024 16:01

dumbbell force-pushed the default-to-local-queries-plus-related-improvements branch 3 times, most recently from 91f9768 to 8313165 Compare July 12, 2024 14:52

dumbbell marked this pull request as ready for review July 12, 2024 15:07

dumbbell force-pushed the default-to-local-queries-plus-related-improvements branch 2 times, most recently from 8bb85ac to b66fc4f Compare July 13, 2024 08:33

the-mikedavis self-requested a review July 13, 2024 14:20

the-mikedavis approved these changes Jul 13, 2024

View reviewed changes

dumbbell force-pushed the default-to-local-queries-plus-related-improvements branch from b66fc4f to 227ca2a Compare July 14, 2024 07:34

dumbbell force-pushed the default-to-local-queries-plus-related-improvements branch from 227ca2a to 1fc15aa Compare July 14, 2024 08:40

dumbbell force-pushed the default-to-local-queries-plus-related-improvements branch from 1fc15aa to 07ec4a6 Compare July 14, 2024 10:02

dumbbell marked this pull request as draft July 14, 2024 10:28

dumbbell added 6 commits July 15, 2024 22:32

khepri_machine: Fix select_command_type/3 type spec

73a412d

[Why] It can return `no_correlation` in addition to a valid correlation ID.

dumbbell force-pushed the default-to-local-queries-plus-related-improvements branch from 07ec4a6 to a379a73 Compare July 15, 2024 22:27

dumbbell added 2 commits July 16, 2024 00:28

Remove the leader ID caching

a379a73

[Why] Ra exposes the leader ID through `ra_leaderboard:lookup_leader/1`. There is no need to duplicate that feature anymore.

dumbbell marked this pull request as ready for review July 16, 2024 07:38

the-mikedavis self-requested a review July 16, 2024 13:21

the-mikedavis approved these changes Jul 16, 2024

View reviewed changes

dumbbell merged commit 25b405f into main Jul 16, 2024
8 checks passed

dumbbell deleted the default-to-local-queries-plus-related-improvements branch July 16, 2024 14:53

dumbbell mentioned this pull request Jul 17, 2024

Passing fun() to ra:*_query() may break when executed on a remote node #238

Closed

dumbbell added a commit that referenced this pull request Jul 22, 2024

test: Don't set consistent_query_interval_in_compromise

b2640e6

[Why] The parameter was removed in commit 982f17d (as part of #260).

This was referenced Jul 22, 2024

test: Don't set consistent_query_interval_in_compromise #276

Merged

Always pass exported functions to Ra queries #242

Closed

dumbbell added a commit to rabbitmq/khepri-benchmark that referenced this pull request Jul 25, 2024

Remove testing of the favor => compromise option

f769fe6

[Why] This option was removed in rabbitmq/khepri#260.

dumbbell mentioned this pull request Jul 25, 2024

Remove testing of the favor => compromise option rabbitmq/khepri-benchmark#10

Merged

dumbbell mentioned this pull request Aug 14, 2024

Improve the fence mechanism performances #283

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improvements around queries #260

Improvements around queries #260

dumbbell commented Jun 27, 2024 •

edited

Loading

codecov bot commented Jun 27, 2024 •

edited

Loading

the-mikedavis left a comment

dumbbell commented Jun 28, 2024 •

edited

Loading

dumbbell commented Jul 13, 2024

the-mikedavis commented Jul 13, 2024 •

edited

Loading

the-mikedavis left a comment

dumbbell commented Jul 14, 2024

dumbbell commented Jul 14, 2024

the-mikedavis commented Jul 14, 2024

dumbbell commented Jul 15, 2024

dumbbell commented Jul 16, 2024

Improvements around queries #260

Improvements around queries #260

Conversation

dumbbell commented Jun 27, 2024 • edited Loading

codecov bot commented Jun 27, 2024 • edited Loading

Codecov Report

the-mikedavis left a comment

Choose a reason for hiding this comment

dumbbell commented Jun 28, 2024 • edited Loading

dumbbell commented Jul 13, 2024

the-mikedavis commented Jul 13, 2024 • edited Loading

the-mikedavis left a comment

Choose a reason for hiding this comment

dumbbell commented Jul 14, 2024

dumbbell commented Jul 14, 2024

the-mikedavis commented Jul 14, 2024

dumbbell commented Jul 15, 2024

dumbbell commented Jul 16, 2024

dumbbell commented Jun 27, 2024 •

edited

Loading

codecov bot commented Jun 27, 2024 •

edited

Loading

dumbbell commented Jun 28, 2024 •

edited

Loading

the-mikedavis commented Jul 13, 2024 •

edited

Loading