[Proposal] Design for Reader/Writer separation #15237
Comments
This is very thorough! If I understand it correctly, one will be able to dynamically change …
@mch2: Thanks for this proposal. Liking the way we are building on top of the primitives built up recently. Are we also considering a notion of pull-based replicas? That could also make a case for reusing the remote store in cross-cluster replication.
Thanks @mch2 for a very comprehensive proposal - hope we can get feedback from @amberzsy. Looking forward to releasing scaling searchers/writers to 0 as experimental in 2.17. Separation of writer/reader should stay optional, and different deployments of OpenSearch (managed, serverless, partners) can decide whether to adopt it.
Thanks for the proposal, @mch2! Quick question please regarding …
This setting inherently assumes that the index (or all indices) is backed by remote store (please correct me if I am wrong), and we use a different set of settings …
Thanks @dblock, yep that's right. The adjustment relates to the number of search shards per primary and is configurable per index, same as normal replica config today. As for perf, roughly a 30-40% gain in write throughput for a workload where we concurrently index and search. More thorough results here - search perf had quite a bit of variance given segment counts fluctuate (no force merge before queries start), but the last run was with concurrent search enabled and showed more promise.
@gbbafna This is a good idea. For the first version I was thinking that read-only would disable polling entirely, but coming up with a more optimal polling strategy based on write activity would make sense. With the current cluster model, as long as we make this infrequent, maybe no writes within 15-20m marks the index as inactive, at which point we reduce poll frequency. Will think through this some more...
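To make the idea concrete, here is a minimal sketch of such a write-activity-based polling policy. The names and thresholds are illustrative only, not an OpenSearch API; the 15m idle threshold comes from the comment above.

```python
from datetime import datetime, timedelta

# Illustrative constants, not actual OpenSearch settings.
ACTIVE_POLL = timedelta(seconds=10)   # poll frequently while writes are recent
IDLE_POLL = timedelta(minutes=5)      # back off once the index is idle
IDLE_AFTER = timedelta(minutes=15)    # no writes within 15m marks the index inactive

def next_poll_interval(last_write: datetime, now: datetime) -> timedelta:
    """Return how long a search replica should wait before its next remote poll."""
    if now - last_write >= IDLE_AFTER:
        return IDLE_POLL
    return ACTIVE_POLL
```

A read-only index would simply skip scheduling the next poll altogether, which is the first-version behavior described above.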
Thanks @reta, I think that's a fair suggestion and is a bit more user friendly than throwing validation errors. It's true that with the first version I'm only thinking about remote store enabled, because node-node segrep doesn't scale well in its current form, though I think we could improve that in the future. Wdyt about prefixing with ".replication" instead? It does control replication behavior as well as search routing.
Thanks @mch2, you mean …
@reta, apologies, I've been away from my laptop for a few days. Yes, the setting is intended to configure an index with a special replica type. In the case it's not segrep/remote store, I think we'd have to fail creation and throw an error so it can be corrected. I think we'd need to do that irrespective of the setting name. We wouldn't want to accept the request and leave a user with replicas/searchers missing.
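A rough sketch of the creation-time validation described above. The setting names are hypothetical placeholders for illustration, not actual OpenSearch settings:

```python
def validate_search_replicas(settings: dict) -> None:
    """Fail index creation if search replicas are requested on an index
    that is not segrep/remote-store backed, so the request can be corrected."""
    # Hypothetical setting keys, used only for this sketch.
    if settings.get("number_of_search_only_shards", 0) > 0:
        if not settings.get("remote_store_enabled", False):
            raise ValueError(
                "search replicas require a remote-store-backed, "
                "segment-replication index"
            )
```

This mirrors the behavior in the comment: reject the request up front rather than accept it and leave searchers unassigned.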
Thanks @mch2 for the detailed proposal.
I do agree there is no need for a plugin at this point.
The work has started on in-place shard splitting without any downtime for write traffic here - #12918
my vote would be for …
Generally users have the expectation that new documents/segments are searchable after the refresh_interval. I agree primary shards need not wait for no-op replication on search shards. But either there should be a separate setting explicitly defined to guarantee data freshness for search shards, or we should honor refresh_interval; otherwise it will create confusion for users when searches return stale data for a random time period.
@shwetathareja thanks for reviewing!
I think a separate setting to control the searcher refresh interval for pulling new segments makes sense, because the refresh default (1s) would be pretty aggressive in polling the remote. It also gives the opportunity to differentiate and communicate to users that a searcher "refresh" is different from a writer refresh, and that searchers are inherently eventually consistent. I think we'll end up with three knobs here regarding freshness: 1. writer refresh interval, 2. writer upload interval, and 3. searcher refresh/polling interval (related).
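One way to reason about these three knobs: the worst-case staleness a searcher can observe is roughly the sum of the three intervals, since a document becomes searchable on a search replica only after it is refreshed on the writer, uploaded to the remote store, and pulled by the searcher's next poll. A back-of-envelope sketch (it ignores transfer time and is not actual implementation behavior):

```python
def worst_case_staleness(refresh_s: float, upload_s: float, poll_s: float) -> float:
    """Rough upper bound on how stale a search replica can be, in seconds:
    writer refresh interval + writer upload interval + searcher poll interval."""
    return refresh_s + upload_s + poll_s
```

For example, a 1s writer refresh, 5s upload interval, and 10s searcher poll gives roughly 16s of worst-case staleness, which is why the three knobs need to be communicated together.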
After working on the first couple of PRs, a case popped up that I didn't cover here around replica auto expansion. Given the new search replica is still a "replica", we need to figure out how to handle this setting. I am thinking to start we should block the addition of search replicas on an index with auto expansion and vice versa, or we honor it and regular replicas take priority, leaving search replicas unassigned without dedicated nodes. However, within dedicated search nodes it would be useful to have auto expansion as well. Likely we'd want this as a different setting that only takes effect within nodes that match the new allocation filter.
The auto-expand-replicas setting makes more sense for search replicas as opposed to write replicas. So ideally yes, it would be useful to support auto-expand for search replicas.
Is your feature request related to a problem? Please describe
Background
We have had a few great RFCs over the last year or so (#7258 and #14596) with the goals of separating indexing and search traffic within a cluster to achieve independent scalability and failure/workload isolation.
This issue sets out next steps for implementation using the ideas from these RFCs, with the goal of achieving full isolation. It is opinionated in that it focuses on clusters leveraging remote storage in order to achieve that full isolation. There are variants possible for local clusters (node-node segrep) that achieve varying levels of isolation, but they are not in initial scope.
Note - One point of contention on the original RFC was whether to make this feature pluggable if it is only geared toward cloud native use cases. At this time I don't believe it makes sense to start with a plugin, as the only functionality we could reasonably pull out is allocation with ClusterPlugin. There are building blocks here that will be beneficial to all clusters using segment replication.
Desired Architecture:
Requirements:
Out of scope (at least initially):
Describe the solution you'd like
Implementation
The first step in achieving our desired architecture is introducing the concept of a “search” shard, while existing primaries and replicas remain specialized for indexing and write availability. A search shard is internally treated as a separate replica type within an IndexShard’s routing table. By default, search shards need not be physically isolated to separate hardware, nor are they required to serve all search requests. However, we can use them to achieve the desired separation. Search replicas will have the following properties to make them lower overhead than normal replicas (related issue):
Even without physical separation this takes a step toward decoupling write paths from search paths.
Routing layer changes & disabling primary eligibility
As called out in @shwetathareja’s RFC, in order to introduce the search replica we need to differentiate at the shard routing layer. We can do this with a simple flag in ShardRouting to denote the shard as “searchOnly”, similar to how a routing entry is flagged as “primary”. Once we have this flag we can add additional ShardRouting entries based on a count configured in the API, similar to primary/replica counts. The create index flow would be updated as follows, with similar updates to restore and update:
Once the shard routing table is constructed with separate replica types, we can easily leverage this within allocation, primary selection, and request routing logic to filter out search shards appropriately.
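To illustrate, a minimal sketch of the routing-table filtering described above. The types and names are simplified stand-ins for the actual Java ShardRouting class:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ShardRouting:
    shard_id: int
    primary: bool
    search_only: bool  # mirrors the proposed "searchOnly" flag

def primary_eligible(shards):
    """Search-only shards are excluded from primary selection entirely."""
    return [s for s in shards if not s.search_only]

def search_shards(shards):
    """Filter the routing table down to search replicas for search routing."""
    return [s for s in shards if s.search_only]
```

With the flag in place, allocation, primary selection, and request routing all reduce to simple filters over the routing table, as the paragraph above describes.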
Recovery
In remote store clusters replicas recover by fetching segments from the remote store. However, recovery of these shards is still initiated as a peer recovery and still runs all the required steps to bring the shards in-sync. With search shards we do not need to perform these steps and can leverage the existing RemoteStoreRecoverySource to recover directly from the remote store.
Replication
To make replicas pull-based, we use scheduled tasks to trigger segment replication cycles through SegmentReplicationTargetService.forceReplication. At the beginning of a SegRep cycle we fetch the latest metadata from the remote store, then compute a diff and fetch the required segments. This scheduled task will run on a configurable interval based on freshness requirements and continue until the index is marked as read-only.
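The diff computation at the heart of a SegRep cycle can be sketched as follows. This is a simplified model where segment metadata is a map of file name to checksum; the real implementation compares richer metadata:

```python
def compute_segment_diff(local: dict, remote: dict) -> list:
    """Return the segment files a search replica must fetch: anything in the
    remote metadata that is missing locally or has a different checksum."""
    return sorted(
        name for name, checksum in remote.items()
        if local.get(name) != checksum
    )
```

Each poll then reduces to: fetch remote metadata, compute this diff, download only the listed files, and reopen the reader.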
Write & refresh cycle:
Segrep Stats & backpressure:
We will also need to make changes to how we track replica status to serve stats APIs and fail replicas if they fall too far behind. Today these stats and operations are performed from primary shards only.
Node Separation
Allocation
Up until this point search replicas can live side by side with other shard types. In order to make node separation opt-in, an additional setting is required to force allocation to a set of nodes either by node attribute or role. I propose we separate logic for search shards into a new setting for use with
FilterAllocationDecider
specific to shard types across the cluster. This would allow more flexibility to allocate based on any node attribute, and could be extended to optionally support node role. This would require two changes: first, honor role as a filter by adding _role as a special keyword within the filter, similar to _host, _name, etc.; second, a new require filter based on shard type. To enforce separation the user would then set this filter accordingly.
Alternative:
We can leverage the search node role directly and allocate any search shard to nodes with that role through the TargetPoolAllocationDecider, treating any search shard as REMOTE_ELIGIBLE and thereby allocating it to search nodes. While perhaps simpler out of the box, it doesn’t provide users much flexibility. Further, the search role is currently intended for searchable snapshot shards, and there is a bit of work pending to have hot/warm shards live on the same box.
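For illustration, the proposed require-filter decision could behave roughly like this sketch. The filter and attribute names are hypothetical; the real FilterAllocationDecider logic is more involved:

```python
def can_allocate(shard_search_only: bool, node_attrs: dict, require_filter: dict) -> bool:
    """Decide whether a shard may be placed on a node, given a require filter
    that applies only to search-only shards (e.g. {"_role": "search"})."""
    if not shard_search_only:
        # In this sketch the search-specific filter does not constrain
        # primaries or write replicas.
        return True
    return all(node_attrs.get(k) == v for k, v in require_filter.items())
```

With no filter configured, search replicas can live anywhere; once a filter like `{"_role": "search"}` is set, they are pinned to matching nodes, which is the opt-in separation described above.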
Request Routing:
Within a non-isolated cluster with search replicas: requests without any preference passed can hit any active shard.
Within a separated-node cluster (a filter is present): requests without any preference passed will hit only search replicas.
Scale to zero:
Search:
With the feature enabled we will be able to easily scale searchers to 0 as existing primary and replica groups take no dependency on searcher presence. However, we will need to update operation routing such that any search request that arrives without primary preference will be rejected with a notice that the index is not taking reads.
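The routing behavior described in the last two sections, including the rejection when searchers are scaled to zero, can be sketched as follows (shards modeled as plain dicts for brevity):

```python
def route(shards, separated, preference_primary=False):
    """Pick candidate shards for a search request.
    - primary preference: route to active primaries regardless of separation
    - separated cluster: only search replicas serve unpreferenced searches
    - non-separated cluster: any active shard may serve the search"""
    active = [s for s in shards if s["active"]]
    if preference_primary:
        return [s for s in active if s["primary"]]
    pool = [s for s in active if s["search_only"]] if separated else active
    if not pool:
        # Searchers scaled to zero: reject rather than silently fall back.
        raise RuntimeError("index is not taking reads")
    return pool
```

Note that with searchers scaled to zero, a request carrying primary preference still succeeds; only unpreferenced searches are rejected.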
Index:
TBD - will create a separate issue for these changes.
API Changes
Create/update/restore index will all need to support a separate parameter for the number of search only shards. Example:
number_of_replicas will only control the counts of writer replicas while number_of_search_only_shards will control search replicas.
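For example, the per-index shard math under the proposed setting works out as follows (a simple sketch; number_of_search_only_shards is the name proposed in this issue):

```python
def total_shards(primaries: int, replicas: int, search_replicas: int) -> int:
    """Total shard copies for an index: each primary carries its writer
    replicas (number_of_replicas) plus its search replicas
    (number_of_search_only_shards)."""
    return primaries * (1 + replicas + search_replicas)
```

So an index with 2 primaries, 1 writer replica, and 2 search replicas per primary yields 8 shard copies in total, and the two settings can be scaled independently.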
New Configuration
Additional API Changes:
Cluster health:
Cluster health today has limitations in that the color codes are misleading and only based on allocated shards. I propose we initially leave the definition of this status as is and adopt any new changes as part of this issue to improve cluster health.
The color-coded cluster status will mean:
green = all desired shards are active.
yellow = primaries are assigned, but not all replicas or search shards are active. Ex: 1 of 2 desired searchers per primary is active.
red = a primary is not active (no write availability).
This would mean that if a user has zero read availability the cluster would report as yellow. Users will need to use existing APIs (_cat/shards) and metrics etc to diagnose the issue.
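The proposed status semantics can be sketched as a simple decision function (illustrative only):

```python
def cluster_status(all_primaries_active: bool,
                   replicas_active: int, replicas_total: int,
                   search_active: int, search_total: int) -> str:
    """Map shard states to the proposed color codes."""
    if not all_primaries_active:
        return "red"     # a primary is unassigned: no write availability
    if replicas_active < replicas_total or search_active < search_total:
        return "yellow"  # includes the zero-read-availability case noted above
    return "green"
```

As the text above notes, a fully unavailable search side (0 of N searchers active) still maps to yellow under this definition, which is why more granular shard counts are worth considering as the alternative below.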
Alternative:
To further assist in debugging status, we can include more detailed information on search vs. indexing status. We could provide more granular “search” shard counts, but I think to start this will help us diagnose which side is having issues and quickly figure out unassigned shards with the cat API. Something like:
Cat shards:
To differentiate search replicas from regular replicas, we can mark them as [s] in cat shards output.
Task breakdown & Rollout (to convert to meta):
-- Experimental (target 2.17)
-- GA (target - pending feedback)
Related component
Other
Describe alternatives you've considered
See RFCs
Additional context
No response