-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add setting to disable diff application in remote publication #15232
Add setting to disable diff application in remote publication #15232
Conversation
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15232 +/- ##
============================================
- Coverage 71.84% 71.77% -0.08%
- Complexity 62953 62964 +11
============================================
Files 5178 5185 +7
Lines 295204 295196 -8
Branches 42682 42665 -17
============================================
- Hits 212099 211871 -228
- Misses 65710 65935 +225
+ Partials 17395 17390 -5 ☔ View full report in Codecov by Sentry. |
server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java
Show resolved
Hide resolved
server/src/main/java/org/opensearch/cluster/coordination/PublicationTransportHandler.java
Show resolved
Hide resolved
* This is to be used as an andon cord in case we want to rebuild the full cluster state on nodes from remote. | ||
* If disabled, we will use the diff manifest shared by cluster manager to build the cluster state on nodes. | ||
*/ | ||
public static final Setting<Boolean> REMOTE_PUBLICATION_APPLY_FULL_STATE = Setting.boolSetting( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the andon cord not disabling remote state altogether instead of moving to full state download? What if the state is few 100s of MBs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disabling the remote state will require us to change the setting in opensearch.yml
file and nodes to be bounced, where as this setting provides us an option to dynamically disable the diff publication.
This setting is added specifically for the case if we identify any issue with diff publication or application which causes the cluster state to go stale, we can toggle this setting so that nodes can sync the cluster state to latest state with master.
For dealing with issues like throttling while downloading a huge cluster state, we can add some jitter while downloading the state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only requires cluster manager nodes to be restarted and that can be done in rolling restart fashion as well. If we adding this new dynamic setting, I am thinking then why not make remote publication as dynamic setting itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I don't have any argument of not keeping this setting as Dynamic
, but the way currently the publication flow has been written just turning the setting to Dynamic won't work. We will need to re-access the whole publication flow as it was written on assumption of the setting being Final
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case where we get into a state of diffs corrupting the cluster state a good way would be to ensure we do a full download. Rolling restart for masters might be a temporary arrangement to have data nodes download the full state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remote publication is currently gated using the static feature flag for experimental release. This has to be changed to a new dynamic setting for gating in GA release. This is a more important andon cord IMO.
The remote state setting is final setting and it only impacts the uploads. So this should be fine for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think we should introduce more settings for the sake of it. If there is bug in publication flow, you disable publication flow. why do you need another andon cord?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shwetathareja I think we are talking about the same thing - Remote Publication setting. We need to make it dynamic which is not the case currently.
server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java
Show resolved
Hide resolved
* This is to be used as an andon cord in case we want to rebuild the full cluster state on nodes from remote. | ||
* If disabled, we will use the diff manifest shared by cluster manager to build the cluster state on nodes. | ||
*/ | ||
public static final Setting<Boolean> REMOTE_PUBLICATION_APPLY_FULL_STATE = Setting.boolSetting( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remote publication is currently gated using the static feature flag for experimental release. This has to be changed to a new dynamic setting for gating in GA release. This is a more important andon cord IMO.
The remote state setting is final setting and it only impacts the uploads. So this should be fine for now.
Created a new backlog issue #15852 to create a dynamic setting for enabling or disabling the remote publication. Closing this pull request as a dynamic setting is a better way to control the publication rather than this change. |
Description
Added a setting
cluster.remote_publication.apply_full_state
to control remote publication flows. If enabled, the remote publication flow will always apply full state and not use the diff. This is to be used as an andon cord in case we want to rebuild the full cluster state on nodes from remote. If disabled, we will use the diff manifest shared by cluster manager to build the cluster state on nodes.Related Issues
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.