LWT and rack-aware routing bugfixes #1037
Merged
+152
−28
More verbose test assertions aid in debugging.
A copy-paste bug caused local-rack non-replica nodes not to appear in the fallback iterator ahead of non-replica nodes from other racks. This was a fairly rare edge case, though, because for it to manifest, both of the following conditions had to be satisfied:
1. a non-replica node from the preferred rack had to be penalised by LatencyAwareness,
2. a non-replica node from the preferred DC, but not from the preferred rack, had to be penalised by LatencyAwareness, too.
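The intended ordering can be sketched as follows. This is a minimal model, not the driver's code: the `Node` struct and the `fallback_order` function are hypothetical names invented for illustration, only a single DC is modelled, and LatencyAwareness penalisation is left out entirely.

```rust
/// Hypothetical, simplified node model; not the driver's real node type.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    name: &'static str,
    rack: &'static str,
    is_replica: bool,
}

/// Orders the nodes of one DC for a fallback plan: replicas first, then
/// non-replicas, with preferred-rack nodes ahead of the rest in each group.
fn fallback_order(nodes: &[Node], preferred_rack: &str) -> Vec<&'static str> {
    let mut ordered: Vec<&Node> = nodes.iter().collect();
    // Stable sort on (is non-replica, is outside preferred rack);
    // `false` sorts before `true`, so replicas and local-rack nodes come first.
    ordered.sort_by_key(|n| (!n.is_replica, n.rack != preferred_rack));
    ordered.iter().map(|n| n.name).collect()
}
```

With this ordering, a non-replica node from the preferred rack always precedes a non-replica node from another rack, which is the property the bug violated.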
wprzytula requested review from Lorak-mmk and muzarski and removed request for Lorak-mmk, July 10, 2024 05:50
muzarski reviewed Jul 11, 2024
The default policy has special logic for the case when computing the first acceptable replica (i.e. a replica that satisfies `pick_predicate`) is expensive (i.e. requires allocation). That logic was meant to make `pick` return None, because `fallback` would then allocate and compute all acceptable replicas. Unfortunately, the logic contained a bug that made `pick` continue execution instead of returning None, so a node that is not necessarily a replica could be returned. This would break the LWT optimisation, because when the primary replica is filtered out by LB (e.g. it's down, or disabled by HostFilter), the second replica should be targeted instead, deterministically.

The fix involves creating an enum to distinguish between three scenarios:
1. No replica exists that could satisfy the pick predicate -> None;
2. The primary replica satisfies the pick predicate -> Some(Computed(primary_replica));
3. The primary replica does not satisfy the pick predicate, but it's possible that another replica does -> Some(ToBeComputedInFallback).

Before the fix, the third scenario would merge with the first, leading to the incorrect behaviour of `pick` not returning None.
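The three scenarios can be illustrated with a small sketch. The variant names `Computed` and `ToBeComputedInFallback` come from the description above; everything else (the `PickedReplica` enum name, `NodeId`, `pick_replica`, the `other_node` parameter, and the rule that "another replica may qualify" means "more than one replica is known") is an assumption made for illustration and not the driver's actual implementation.

```rust
/// Hypothetical stand-in for a node identifier.
type NodeId = &'static str;

/// Result of the cheap replica lookup. An enclosing `None` means that
/// no replica could satisfy the pick predicate (scenario 1).
#[derive(Debug, PartialEq)]
enum PickedReplica {
    /// Scenario 2: the primary replica satisfies the pick predicate.
    Computed(NodeId),
    /// Scenario 3: the primary replica was rejected, but another replica
    /// may qualify; finding it is left to `fallback`.
    ToBeComputedInFallback,
}

fn pick_replica(
    replicas: &[NodeId],
    pick_predicate: impl Fn(NodeId) -> bool,
) -> Option<PickedReplica> {
    let primary = *replicas.first()?; // no replicas at all -> None
    if pick_predicate(primary) {
        Some(PickedReplica::Computed(primary))
    } else if replicas.len() > 1 {
        // Assumption: any further replica counts as "possibly acceptable".
        Some(PickedReplica::ToBeComputedInFallback)
    } else {
        None
    }
}

/// Simplified `pick`: `other_node` stands for whatever non-replica node
/// the policy would choose when no replica can be used.
fn pick(
    replicas: &[NodeId],
    pick_predicate: impl Fn(NodeId) -> bool,
    other_node: NodeId,
) -> Option<NodeId> {
    match pick_replica(replicas, pick_predicate) {
        Some(PickedReplica::Computed(node)) => Some(node),
        // Returning None here hands the decision over to `fallback`.
        Some(PickedReplica::ToBeComputedInFallback) => None,
        // Only when no replica can qualify does `pick` continue to other nodes.
        None => Some(other_node),
    }
}
```

Keeping scenario 3 as a separate variant is what stops it from falling into the same branch as scenario 1.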
The test runs against ClusterData with node F disabled by HostFilter. In this arrangement, F should never be returned. As F is the primary replica for the executed statement and no DC is preferred, the second replica (`A`) cannot be computed cheaply, so `pick` should return None. Before the bug was fixed, `pick` would return an arbitrary node chosen by round-robin, e.g. `B` (not even a replica).
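What `fallback` is then expected to do for this scenario can be sketched as below. This is a hedged illustration only: `lwt_fallback_replicas` and `is_enabled` are hypothetical names, and the replica list passed in is assumed to already be in ring order.

```rust
/// Keeps the replicas in ring order and drops those rejected by the
/// predicate. Allocating a Vec is acceptable here, unlike in `pick`.
fn lwt_fallback_replicas(
    replicas_in_ring_order: &[&'static str],
    is_enabled: impl Fn(&str) -> bool,
) -> Vec<&'static str> {
    replicas_in_ring_order
        .iter()
        .copied()
        .filter(|node| is_enabled(*node))
        .collect()
}
```

With F as the primary replica and A as the second one, filtering out F leaves A as the first target, so the LWT statement is still routed deterministically.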
Lorak-mmk approved these changes Jul 11, 2024
wprzytula force-pushed the lwt-routing-bugfix branch from d29464d to 9407afb, July 11, 2024 11:51
muzarski approved these changes Jul 11, 2024
wprzytula added a commit to wprzytula/scylla-rust-driver that referenced this pull request, Jul 11, 2024: LWT and rack-aware routing bugfixes (cherry picked from commit c01ad2b)
Problem solved
I've noticed two bugs in the routing code in DefaultPolicy:

1. rack+latency-awareness bug.
A copy-paste kind of bug made local rack non-replica nodes not present in the fallback iterator earlier than non-local-rack non-replica nodes. This was a fairly edge case bug, though, because for it to manifest, the following conditions would have to be satisfied:
   1. a non-replica node from the preferred rack would have to be penalised by LatencyAwareness,
   2. a non-replica node from the preferred DC but not preferred rack would have to be penalised by LatencyAwareness, too.

2. LWT + no preferred DC + the primary replica down or disabled - bug.
There is a special logic in default policy that handles the case when computing the first acceptable replica (i.e. such replica that it satisfies `pick_predicate`) is expensive (i.e. requires allocation). That logic was to make `pick` return None, because then `fallback` would allocate and compute all acceptable replicas. Unfortunately, the logic contained a bug that made `pick` continue execution instead of returning None, leading to a non-necessarily-replica being returned. This would break the LWT optimisation, because in case that the primary replica is filtered out by LB (e.g. it's down, or disabled by HostFilter), the second replica should be targeted instead, deterministically.
The fix involves creating an enum to distinguish between three scenarios:
   1. No replica exists that could satisfy the pick predicate -> `None`;
   2. The primary replica satisfies the pick predicate -> `Some(Computed(primary_replica))`;
   3. The primary replica does not satisfy the pick predicate, but it's possible that another replica does -> `Some(ToBeComputedInFallback)`.

Before the fix, the third scenario would merge with the first, leading to incorrect behaviour of not returning None from `pick`.

Bonus
Additionally, this PR comes with typo fixes, comment amendments, and increased test assertions verbosity to aid debugging.
Pre-review checklist
[ ] I have provided docstrings for the public items that I want to introduce.
[ ] I have adjusted the documentation in `./docs/source/`.
[ ] I added appropriate `Fixes:` annotations to PR description.