[Proposal] Design for Reader/Writer separation #15237
Comments
This is very thorough! If I understand it correctly, one will be able to dynamically change …
@mch2: Thanks for this proposal. Liking the way we are building on top of the primitives built up recently. Are we also considering a notion of pull-based replicas? That could also make a case for reusing the remote store in cross-cluster replication.
Thanks @mch2 for a very comprehensive proposal - hope we can get feedback from @amberzsy. Looking forward to releasing scaling searchers/writers to 0 as experimental in 2.17. Separation of writer/reader should stay optional, and different deployments of OpenSearch (managed, serverless, partners) can decide whether to adopt it.
Thanks for the proposal, @mch2! Quick question please regarding …
This setting inherently assumes that the index (or all indices) is backed by remote store (please correct me if I am wrong), and we use a different set of settings …
Thanks @dblock, yep that's right. The adjustment relates to the number of search shards per primary and is configurable per index, same as normal replica config today. As for perf, roughly a 30-40% gain in write throughput for a workload where we concurrently index and search. More thorough results here - search perf had quite a bit of variance given segment counts fluctuate (no force merge before queries start), but the last run was with concurrent search enabled and showed more promise.
@gbbafna This is a good idea. For the first version I was thinking that read-only would disable polling entirely, but coming up with a more optimal polling strategy based on write activity would make sense. With the current cluster model, as long as we make this infrequent, maybe no writes within 15-20m marks the index as inactive, at which point we reduce poll frequency. Will think through this some more...
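To make the idea concrete, here is a minimal sketch of such a write-activity-based polling policy. The names and thresholds are illustrative only, not an OpenSearch API; the 15m idle threshold comes from the comment above.

```python
from datetime import datetime, timedelta

# Illustrative constants, not actual OpenSearch settings.
ACTIVE_POLL = timedelta(seconds=10)   # poll frequently while writes are recent
IDLE_POLL = timedelta(minutes=5)      # back off once the index is idle
IDLE_AFTER = timedelta(minutes=15)    # no writes within 15m marks the index inactive

def next_poll_interval(last_write: datetime, now: datetime) -> timedelta:
    """Return how long a search replica should wait before its next remote poll."""
    if now - last_write >= IDLE_AFTER:
        return IDLE_POLL
    return ACTIVE_POLL
```

A read-only index would simply skip scheduling the next poll altogether, which is the first-version behavior described above.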
Thanks @reta, I think that's a fair suggestion and is a bit more user friendly than throwing validation errors. It's true that with the first version I'm only thinking about remote store enabled, because node-node segrep doesn't scale well in its current form, though I think we could improve that in the future. Wdyt about prefixing with ".replication" instead? It does control replication behavior as well as search routing.
Thanks @mch2, you mean …
@reta, apologies, I've been away from my laptop for a few days. Yes, the setting is intended to configure an index with a special replica type. In the case it's not segrep/remote store, I think we'd have to fail creation and throw an error so it can be corrected. I think we'd need to do that irrespective of the setting name. We wouldn't want to accept the request and leave a user with replicas/searchers missing.
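A rough sketch of the creation-time validation described above. The setting names are hypothetical placeholders for illustration, not actual OpenSearch settings:

```python
def validate_search_replicas(settings: dict) -> None:
    """Fail index creation if search replicas are requested on an index
    that is not segrep/remote-store backed, so the request can be corrected."""
    # Hypothetical setting keys, used only for this sketch.
    if settings.get("number_of_search_only_shards", 0) > 0:
        if not settings.get("remote_store_enabled", False):
            raise ValueError(
                "search replicas require a remote-store-backed, "
                "segment-replication index"
            )
```

This mirrors the behavior in the comment: reject the request up front rather than accept it and leave searchers unassigned.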
Thanks @mch2 for the detailed proposal.
I do agree there is no need for a plugin at this point.
The work has started on in-place shard splitting without any downtime for write traffic here - #12918
my vote would be for …
Generally users have the expectation that new documents/segments are searchable after the refresh_interval. I agree primary shards need not wait for no-op replication on search shards. But either there should be a separate setting explicitly defined to guarantee data freshness for search shards, or we should honor refresh_interval; otherwise it will create confusion for users when searches return stale data for a random time period.
@shwetathareja thanks for reviewing!
I think a separate setting to control the searcher refresh interval for pulling new segments makes sense, because the refresh default (1s) would be pretty aggressive in polling the remote. It also gives the opportunity to differentiate and communicate to users that a searcher "refresh" is different from a writer refresh, and that searchers are inherently eventually consistent. I think we'll end up with three knobs here regarding freshness: 1. writer refresh interval, 2. writer upload interval, and 3. searcher refresh/polling interval (related).
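One way to reason about these three knobs: the worst-case staleness a searcher can observe is roughly the sum of the three intervals, since a document becomes searchable on a search replica only after it is refreshed on the writer, uploaded to the remote store, and pulled by the searcher's next poll. A back-of-envelope sketch (it ignores transfer time and is not actual implementation behavior):

```python
def worst_case_staleness(refresh_s: float, upload_s: float, poll_s: float) -> float:
    """Rough upper bound on how stale a search replica can be, in seconds:
    writer refresh interval + writer upload interval + searcher poll interval."""
    return refresh_s + upload_s + poll_s
```

For example, a 1s writer refresh, 5s upload interval, and 10s searcher poll gives roughly 16s of worst-case staleness, which is why the three knobs need to be communicated together.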
After working on the first couple of PRs, a case popped up that I didn't cover here around replica auto expansion. Given the new search replica is still a "replica", we need to figure out how to handle this setting. I am thinking to start we should block the addition of search replicas on an index with auto expansion and vice versa, or we honor it and regular replicas take priority, leaving search replicas unassigned without dedicated nodes. However, within dedicated search nodes it would be useful to have auto expansion as well. Likely we'd want this as a different setting that only takes effect within nodes that match the new allocation filter.
The auto-expand-replicas setting makes more sense for search replicas as opposed to write replicas. So ideally yes, it would be useful to support auto-expand for search replicas.
Is your feature request related to a problem? Please describe
Background
We have had a few great RFCs over the last year or so (#7258 and #14596) with the goals of separating indexing and search traffic within a cluster to achieve independent scalability and failure/workload isolation.
This issue sets out next steps for implementation using the ideas from these RFCs, with the goal of achieving full isolation. It is opinionated in that it focuses on clusters leveraging remote storage in order to achieve that full isolation. There are variants possible for local clusters (node-node segrep) that achieve varying levels of isolation, but they are not in initial scope.
Note - One point of contention on the original RFC was whether to make this feature pluggable if it is only geared toward cloud native use cases. At this time I don't believe it makes sense to start with a plugin, as the only functionality we could reasonably pull out is allocation with ClusterPlugin. There are building blocks here that will be beneficial to all clusters using segment replication.
Desired Architecture:
Requirements:
Out of scope (at least initially):
Describe the solution you'd like
Implementation
The first step in achieving our desired architecture is introducing the concept of a “search” shard, while existing primaries and replicas remain specialized for indexing and write availability. A search shard is internally treated as a separate replica type within an IndexShard’s routing table. By default, search shards need not be physically isolated to separate hardware, nor are they required to serve all search requests. However, we can use them to achieve the desired separation. Search replicas will have the following properties to make them lower overhead than normal replicas (related issue):
Even without physical separation this takes a step toward decoupling write paths from search paths.
Routing layer changes & disabling primary eligibility
As called out in @shwetathareja’s RFC, in order to introduce the search replica we need to differentiate at the shard routing layer. We can do this with a simple flag in ShardRouting to denote the shard as “searchOnly”, similar to how a routing entry is flagged as “primary”. Once we have this flag we can add additional ShardRouting entries based on a count configured in the API, similar to primary/replica counts. The create index flow would be updated as follows, with similar updates to restore and update:
Once the shard routing table is constructed with separate replica types, we can easily leverage this within allocation, primary selection, and request routing logic to filter out search shards appropriately.
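To illustrate, a minimal sketch of the routing-table filtering described above. The types and names are simplified stand-ins for the actual Java ShardRouting class:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ShardRouting:
    shard_id: int
    primary: bool
    search_only: bool  # mirrors the proposed "searchOnly" flag

def primary_eligible(shards):
    """Search-only shards are excluded from primary selection entirely."""
    return [s for s in shards if not s.search_only]

def search_shards(shards):
    """Filter the routing table down to search replicas for search routing."""
    return [s for s in shards if s.search_only]
```

With the flag in place, allocation, primary selection, and request routing all reduce to simple filters over the routing table, as the paragraph above describes.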
Recovery
In remote store clusters replicas recover by fetching segments from the remote store. However, recovery of these shards is still initiated as a peer recovery and still runs all the required steps to bring the shards in-sync. With search shards we do not need to perform these steps and can leverage the existing RemoteStoreRecoverySource to recover directly from the remote store.
Replication
To make replicas pull-based, we use scheduled tasks to trigger segment replication cycles through SegmentReplicationTargetService.forceReplication. At the beginning of a SegRep cycle we fetch the latest metadata from the remote store, then compute a diff and fetch the required segments. This scheduled task will run on a configurable interval based on freshness requirements and continue until the index is marked as read-only.
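The diff computation at the heart of a SegRep cycle can be sketched as follows. This is a simplified model where segment metadata is a map of file name to checksum; the real implementation compares richer metadata:

```python
def compute_segment_diff(local: dict, remote: dict) -> list:
    """Return the segment files a search replica must fetch: anything in the
    remote metadata that is missing locally or has a different checksum."""
    return sorted(
        name for name, checksum in remote.items()
        if local.get(name) != checksum
    )
```

Each poll then reduces to: fetch remote metadata, compute this diff, download only the listed files, and reopen the reader.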
Write & refresh cycle:
Segrep Stats & backpressure:
We will also need to make changes to how we track replica status to serve stats APIs and fail replicas if they fall too far behind. Today these stats and operations are performed from primary shards only.
Node Separation
Allocation
Up until this point search replicas can live side by side with other shard types. In order to make node separation opt-in, an additional setting is required to force allocation to a set of nodes either by node attribute or role. I propose we separate logic for search shards into a new setting for use with
FilterAllocationDecider
specific to shard types across the cluster. This would allow more flexibility to allocate based on any node attribute, and could be extended to optionally support node role. This would require two changes: first, honor role as a filter by adding _role as a special keyword within the filter, similar to _host, _name, etc.; second, a new require filter based on shard type. To enforce separation the user would then set this filter accordingly.
Alternative:
We can leverage the search node role directly and allocate any search shard to nodes with that role through the TargetPoolAllocationDecider, treating any search shard as REMOTE_ELIGIBLE and thereby allocating it to search nodes. While perhaps simpler out of the box, it doesn’t provide users much flexibility. Further, the search role is currently intended for searchable snapshot shards, and there is a bit of work pending to have hot/warm shards live on the same box.
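For illustration, the proposed require-filter decision could behave roughly like this sketch. The filter and attribute names are hypothetical; the real FilterAllocationDecider logic is more involved:

```python
def can_allocate(shard_search_only: bool, node_attrs: dict, require_filter: dict) -> bool:
    """Decide whether a shard may be placed on a node, given a require filter
    that applies only to search-only shards (e.g. {"_role": "search"})."""
    if not shard_search_only:
        # In this sketch the search-specific filter does not constrain
        # primaries or write replicas.
        return True
    return all(node_attrs.get(k) == v for k, v in require_filter.items())
```

With no filter configured, search replicas can live anywhere; once a filter like `{"_role": "search"}` is set, they are pinned to matching nodes, which is the opt-in separation described above.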
Request Routing:
Within a non-isolated cluster with search replicas: requests without any preference passed can hit any active shard.
Within a separated-node cluster (a filter is present): requests without any preference passed will hit only search replicas.
Scale to zero:
Search:
With the feature enabled we will be able to easily scale searchers to 0 as existing primary and replica groups take no dependency on searcher presence. However, we will need to update operation routing such that any search request that arrives without primary preference will be rejected with a notice that the index is not taking reads.
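The routing behavior described in the last two sections, including the rejection when searchers are scaled to zero, can be sketched as follows (shards modeled as plain dicts for brevity):

```python
def route(shards, separated, preference_primary=False):
    """Pick candidate shards for a search request.
    - primary preference: route to active primaries regardless of separation
    - separated cluster: only search replicas serve unpreferenced searches
    - non-separated cluster: any active shard may serve the search"""
    active = [s for s in shards if s["active"]]
    if preference_primary:
        return [s for s in active if s["primary"]]
    pool = [s for s in active if s["search_only"]] if separated else active
    if not pool:
        # Searchers scaled to zero: reject rather than silently fall back.
        raise RuntimeError("index is not taking reads")
    return pool
```

Note that with searchers scaled to zero, a request carrying primary preference still succeeds; only unpreferenced searches are rejected.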
Index:
TBD - will create a separate issue for these changes.
API Changes
Create/update/restore index will all need to support a separate parameter for the number of search only shards. Example:
number_of_replicas will only control the counts of writer replicas while number_of_search_only_shards will control search replicas.
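For example, the per-index shard math under the proposed setting works out as follows (a simple sketch; number_of_search_only_shards is the name proposed in this issue):

```python
def total_shards(primaries: int, replicas: int, search_replicas: int) -> int:
    """Total shard copies for an index: each primary carries its writer
    replicas (number_of_replicas) plus its search replicas
    (number_of_search_only_shards)."""
    return primaries * (1 + replicas + search_replicas)
```

So an index with 2 primaries, 1 writer replica, and 2 search replicas per primary yields 8 shard copies in total, and the two settings can be scaled independently.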
New Configuration
Additional API Changes:
Cluster health:
Cluster health today has limitations in that the color codes are misleading and only based on allocated shards. I propose we initially leave the definition of this status as is and adopt any new changes as part of this issue to improve cluster health.
The color-coded cluster status will mean:
green = all desired shards are active.
yellow = primaries are assigned, but not all replicas or search shards are active. Ex: 1 of 2 desired searchers per primary is active.
red = a primary is not active (no write availability).
This would mean that if a user has zero read availability the cluster would report as yellow. Users will need to use existing APIs (_cat/shards) and metrics etc to diagnose the issue.
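The proposed status semantics can be sketched as a simple decision function (illustrative only):

```python
def cluster_status(all_primaries_active: bool,
                   replicas_active: int, replicas_total: int,
                   search_active: int, search_total: int) -> str:
    """Map shard states to the proposed color codes."""
    if not all_primaries_active:
        return "red"     # a primary is unassigned: no write availability
    if replicas_active < replicas_total or search_active < search_total:
        return "yellow"  # includes the zero-read-availability case noted above
    return "green"
```

As the text above notes, a fully unavailable search side (0 of N searchers active) still maps to yellow under this definition, which is why more granular shard counts are worth considering as the alternative below.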
Alternative:
To further assist in debugging status, we can include more detailed information on search vs. indexing status. We could provide more granular “search” shard counts, but I think to start this will help us diagnose which side is having issues and quickly figure out unassigned shards with the cat API. Something like:
Cat shards:
To differentiate search replicas from regular replicas, we can mark them as [s] in cat shards output.
Task breakdown & Rollout (to convert to meta):
-- Experimental (target 2.17)
-- GA (target - pending feedback)
Related component
Other
Describe alternatives you've considered
See RFCs
Additional context
No response