Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add setting to disable diff application in remote publication #15232

Conversation

shiv0408
Copy link
Member

Description

Added a setting cluster.remote_publication.apply_full_state to control remote publication flows. If enabled, the remote publication flow will always apply full state and not use the diff. This is to be used as an andon cord in case we want to rebuild the full cluster state on nodes from remote. If disabled, we will use the diff manifest shared by cluster manager to build the cluster state on nodes.

Related Issues

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Copy link
Contributor

✅ Gradle check result for 513badb: SUCCESS

Copy link

codecov bot commented Aug 13, 2024

Codecov Report

Attention: Patch coverage is 89.47368% with 2 lines in your changes missing coverage. Please review.

Project coverage is 71.77%. Comparing base (978d14e) to head (513badb).
Report is 184 commits behind head on main.

Files with missing lines Patch % Lines
...ster/coordination/PublicationTransportHandler.java 60.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #15232      +/-   ##
============================================
- Coverage     71.84%   71.77%   -0.08%     
- Complexity    62953    62964      +11     
============================================
  Files          5178     5185       +7     
  Lines        295204   295196       -8     
  Branches      42682    42665      -17     
============================================
- Hits         212099   211871     -228     
- Misses        65710    65935     +225     
+ Partials      17395    17390       -5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@shiv0408 shiv0408 added enhancement Enhancement or improvement to existing feature or request skip-changelog ClusterManager:RemoteState labels Aug 14, 2024
@shiv0408 shiv0408 self-assigned this Aug 14, 2024
* This is to be used as an andon cord in case we want to rebuild the full cluster state on nodes from remote.
* If disabled, we will use the diff manifest shared by cluster manager to build the cluster state on nodes.
*/
public static final Setting<Boolean> REMOTE_PUBLICATION_APPLY_FULL_STATE = Setting.boolSetting(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the andon cord not disabling remote state altogether instead of moving to full state download? What if the state is few 100s of MBs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabling the remote state will require us to change the setting in opensearch.yml file and nodes to be bounced, where as this setting provides us an option to dynamically disable the diff publication.
This setting is added specifically for the case if we identify any issue with diff publication or application which causes the cluster state to go stale, we can toggle this setting so that nodes can sync the cluster state to latest state with master.
For dealing with issues like throttling while downloading a huge cluster state, we can add some jitter while downloading the state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only requires cluster manager nodes to be restarted and that can be done in rolling restart fashion as well. If we adding this new dynamic setting, I am thinking then why not make remote publication as dynamic setting itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't have any argument of not keeping this setting as Dynamic, but the way currently the publication flow has been written just turning the setting to Dynamic won't work. We will need to re-access the whole publication flow as it was written on assumption of the setting being Final.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case where we get into a state of diffs corrupting the cluster state a good way would be to ensure we do a full download. Rolling restart for masters might be a temporary arrangement to have data nodes download the full state?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remote publication is currently gated using the static feature flag for experimental release. This has to be changed to a new dynamic setting for gating in GA release. This is a more important andon cord IMO.
The remote state setting is final setting and it only impacts the uploads. So this should be fine for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think we should introduce more settings for the sake of it. If there is bug in publication flow, you disable publication flow. why do you need another andon cord?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shwetathareja I think we are talking about the same thing - Remote Publication setting. We need to make it dynamic which is not the case currently.

* This is to be used as an andon cord in case we want to rebuild the full cluster state on nodes from remote.
* If disabled, we will use the diff manifest shared by cluster manager to build the cluster state on nodes.
*/
public static final Setting<Boolean> REMOTE_PUBLICATION_APPLY_FULL_STATE = Setting.boolSetting(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remote publication is currently gated using the static feature flag for experimental release. This has to be changed to a new dynamic setting for gating in GA release. This is a more important andon cord IMO.
The remote state setting is final setting and it only impacts the uploads. So this should be fine for now.

@shiv0408
Copy link
Member Author

shiv0408 commented Sep 9, 2024

Created a new backlog issue #15852 to create a dynamic setting for enabling or disabling the remote publication. Closing this pull request as a dynamic setting is a better way to control the publication rather than this change.

@shiv0408 shiv0408 closed this Sep 9, 2024
@shiv0408 shiv0408 deleted the remote-publication-diff-setting branch September 9, 2024 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ClusterManager:RemoteState enhancement Enhancement or improvement to existing feature or request skip-changelog
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

4 participants