Add logic in master service to optimize performance and retain detailed logging for critical cluster operations. #16421

Merged: shwetathareja merged 3 commits into opensearch-project:main on Oct 26, 2024.


@sumitasr (Member) commented on Oct 22, 2024

Description

Add logic in master service to optimize performance and retain detailed logging for critical cluster operations.

Related Issues

Resolves #14795 (review)

Testing

To test the logging, the threshold in the check that triggers short-summary generation was lowered from 1000 to 1 in a local environment, producing output such as:

[2024-10-22T13:44:32,280][DEBUG][o.o.c.s.MasterService ] [runTask-0] took [0s] to notify listeners on successful publication of cluster state (version: 6, uuid: _4hvlyf0RA2ulV2jtjIsKg) for [Tasks batched with key: org.opensearch.cluster.action.shard.ShardStateAction, count:2 and sample tasks: shard-started StartedShardEntry{shardId [[sample-index2][1]], allocationId [wZdnPxXmTQ-87FSRVNA8CQ], primary term [1], message [master {runTask-0}{d12XZzhcSiG1aGyfi-eUpw}{s36FgDsCS6Git2YUlAXRzw}{127.0.0.1}{127.0.0.1:9300}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]}[StartedShardEntry{shardId [[sample-index2][1]], allocationId [wZdnPxXmTQ-87FSRVNA8CQ], primary term [1], message [master {runTask-0}{d12XZzhcSiG1aGyfi-eUpw}{s36FgDsCS6Git2YUlAXRzw}{127.0.0.1}{127.0.0.1:9300}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]}], shard-started StartedShardEntry{shardId [[sample-index2][1]], allocationId [wZdnPxXmTQ-87FSRVNA8CQ], primary term [1], message [after new shard recovery]}[StartedShardEntry{shardId [[sample-index2][1]], allocationId [wZdnPxXmTQ-87FSRVNA8CQ], primary term [1], message [after new shard recovery]}]]

[2024-10-22T13:44:32,057][DEBUG][o.o.c.s.MasterService ] [runTask-0] took [0s] to notify listeners on successful publication of cluster state (version: 4, uuid: FeOWukKcR5qoPMRZMIlqKA) for [Tasks batched with key: org.opensearch.cluster.metadata.MetadataCreateIndexService, count:1 and sample tasks: create-index [sample-index2], cause [api]]

[2024-10-22T13:44:32,282][DEBUG][o.o.c.s.MasterService ] [runTask-0] took [2ms] to compute cluster state update for [Tasks batched with key: org.opensearch.cluster.routing.BatchedRerouteService, count:1 and sample tasks: cluster_reroute(reroute after starting shards)]
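The Testing note implies a size check that switches the logged task description to a short summary once a batch reaches a threshold (1000 originally, lowered to 1 for the run above). The Java sketch below illustrates that idea under stated assumptions: the class and method names, the `>=` comparison (inferred from the `count:1` lines being summarized with the threshold at 1), and the `maxSamples` cap are all hypothetical, and this is not the actual `MasterService` code.

```java
import java.util.List;

/**
 * Hypothetical sketch, not the actual MasterService code: chooses between a full
 * and a short description for a batch of cluster-state update tasks.
 */
final class BatchSummarySketch {

    // Assumed default threshold, taken from the "1000" mentioned in the Testing note.
    static final int DEFAULT_SHORT_SUMMARY_THRESHOLD = 1000;

    private BatchSummarySketch() {}

    /**
     * @param batchingKey      key the tasks were batched under (e.g. an executor class name)
     * @param taskDescriptions one description per task in the batch
     * @param threshold        batch size at or above which the short form is used (assumed)
     * @param maxSamples       hypothetical cap on how many descriptions the short form includes
     */
    static String summarize(String batchingKey, List<String> taskDescriptions, int threshold, int maxSamples) {
        if (taskDescriptions.size() < threshold) {
            // Small batch: keep every task description, comma-separated.
            return String.join(", ", taskDescriptions);
        }
        // Large batch: log the key, the total count, and at most maxSamples descriptions,
        // so the logged string no longer grows with the number of batched tasks.
        List<String> samples =
            taskDescriptions.subList(0, Math.min(maxSamples, taskDescriptions.size()));
        return "Tasks batched with key: " + batchingKey
            + ", count:" + taskDescriptions.size()
            + " and sample tasks: " + String.join(", ", samples);
    }

    public static void main(String[] args) {
        // Threshold lowered to 1, as in the local test run described above.
        System.out.println(summarize(
            "org.opensearch.cluster.metadata.MetadataCreateIndexService",
            List.of("create-index [sample-index2], cause [api]"),
            1,
            3));
    }
}
```

Running `main` prints a message in the same shape as the `MetadataCreateIndexService` debug sample. Capping the sampled descriptions is one plausible way to keep large batches cheap to log; the real change may bound the summary differently.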

Node Join

[2024-10-25T12:51:37,680][INFO ][o.o.c.s.MasterService ] [data1] Tasks batched with key: org.opensearch.cluster.coordination.JoinHelper, count:1 and sample tasks: node-join[{data1}{bUKIJ8aDR6yJftPHtNrMVg}{-chi7PdPRjS2TwqxUnTkLQ}{127.0.0.1}{127.0.0.1:9301}{dmr}{shard_indexing_pressure_enabled=true} join existing leader], term: 2, version: 7, delta: added {{data1}{bUKIJ8aDR6yJftPHtNrMVg}{-chi7PdPRjS2TwqxUnTkLQ}{127.0.0.1}{127.0.0.1:9301}{dmr}{shard_indexing_pressure_enabled=true}}

Node Left

[2024-10-25T12:52:05,089][INFO ][o.o.c.s.MasterService ] [data1] Tasks batched with key: org.opensearch.cluster.coordination.NodeRemovalClusterStateTaskExecutor@78a30062, count:1 and sample tasks: node-left[{data1}{bUKIJ8aDR6yJftPHtNrMVg}{-chi7PdPRjS2TwqxUnTkLQ}{127.0.0.1}{127.0.0.1:9301}{dmr}{shard_indexing_pressure_enabled=true} reason: disconnected], term: 2, version: 8, delta: removed {{data1}{bUKIJ8aDR6yJftPHtNrMVg}{-chi7PdPRjS2TwqxUnTkLQ}{127.0.0.1}{127.0.0.1:9301}{dmr}{shard_indexing_pressure_enabled=true}}
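A test for the node-join and node-left output can assert on the message's structure instead of comparing the whole string. Below is a minimal sketch of a hypothetical parsing helper (`NodeChangeLogLine` is not an OpenSearch class) that extracts the batching key, task count, sample tasks, term, version, and delta from lines shaped like the Node Join and Node Left samples; it assumes the message text keeps exactly that shape.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical test helper, not an OpenSearch class: parses the INFO-level
 * "Tasks batched with key ..." message logged for node-join and node-left.
 */
final class NodeChangeLogLine {

    // Shape copied from the Node Join / Node Left samples. The greedy sample-tasks
    // group backtracks to the last ", term: " so commas inside task text are kept.
    private static final Pattern MESSAGE = Pattern.compile(
        "Tasks batched with key: ([^,]+), count:(\\d+) and sample tasks: (.*), "
            + "term: (\\d+), version: (\\d+), delta: (.*)");

    final String batchingKey;
    final int count;
    final String sampleTasks;
    final long term;
    final long version;
    final String delta;

    private NodeChangeLogLine(String batchingKey, int count, String sampleTasks,
                              long term, long version, String delta) {
        this.batchingKey = batchingKey;
        this.count = count;
        this.sampleTasks = sampleTasks;
        this.term = term;
        this.version = version;
        this.delta = delta;
    }

    /** Returns the parsed fields, or empty if the line does not have the expected shape. */
    static Optional<NodeChangeLogLine> parse(String logLine) {
        // find() rather than matches(), so the timestamp/level/node prefix is ignored.
        Matcher m = MESSAGE.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new NodeChangeLogLine(
            m.group(1),
            Integer.parseInt(m.group(2)),
            m.group(3),
            Long.parseLong(m.group(4)),
            Long.parseLong(m.group(5)),
            m.group(6)));
    }
}
```

A node-left test could then assert, for example, that `count` is 1, `delta` starts with `removed`, and `sampleTasks` ends with `reason: disconnected]`. Capturing the log event inside an actual OpenSearch test would go through the project's own test logging utilities, which this sketch does not cover.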

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

❕ Gradle check result for 348d542: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov bot commented on Oct 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.04%. Comparing base (4ad1be3) to head (cbaac5d).
Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16421      +/-   ##
============================================
+ Coverage     72.03%   72.04%   +0.01%     
- Complexity    65003    65026      +23     
============================================
  Files          5313     5313              
  Lines        303375   303397      +22     
  Branches      43902    43902              
============================================
+ Hits         218544   218593      +49     
+ Misses        66915    66857      -58     
- Partials      17916    17947      +31     


@Bukhtawar (Collaborator) left a comment

Thanks! Can we add a test for node-left logs, please?

@sumitasr (Member, Author) replied


Node Join

[2024-10-25T12:51:37,680][INFO ][o.o.c.s.MasterService ] [data1] Tasks batched with key: org.opensearch.cluster.coordination.JoinHelper, count:1 and sample tasks: node-join[{data1}{bUKIJ8aDR6yJftPHtNrMVg}{-chi7PdPRjS2TwqxUnTkLQ}{127.0.0.1}{127.0.0.1:9301}{dmr}{shard_indexing_pressure_enabled=true} join existing leader], term: 2, version: 7, delta: added {{data1}{bUKIJ8aDR6yJftPHtNrMVg}{-chi7PdPRjS2TwqxUnTkLQ}{127.0.0.1}{127.0.0.1:9301}{dmr}{shard_indexing_pressure_enabled=true}}

Node Left

[2024-10-25T12:52:05,089][INFO ][o.o.c.s.MasterService ] [data1] Tasks batched with key: org.opensearch.cluster.coordination.NodeRemovalClusterStateTaskExecutor@78a30062, count:1 and sample tasks: node-left[{data1}{bUKIJ8aDR6yJftPHtNrMVg}{-chi7PdPRjS2TwqxUnTkLQ}{127.0.0.1}{127.0.0.1:9301}{dmr}{shard_indexing_pressure_enabled=true} reason: disconnected], term: 2, version: 8, delta: removed {{data1}{bUKIJ8aDR6yJftPHtNrMVg}{-chi7PdPRjS2TwqxUnTkLQ}{127.0.0.1}{127.0.0.1:9301}{dmr}{shard_indexing_pressure_enabled=true}}

…ed logging for critical cluster operations.

Signed-off-by: Sumit Bansal <sumitsb@amazon.com>
@shwetathareja (Member) commented

Quoting: "Tasks batched with key: org.opensearch.cluster.coordination.JoinHelper, count:1 and"
@sumitasr, should we append this test at the end?

@sumitasr (Member, Author) commented on Oct 25, 2024


Ignore previous comment, updated description.

Contributor

✅ Gradle check result for bcfbece: SUCCESS

Signed-off-by: shwetathareja <shwetathareja@live.com>
Contributor

❕ Gradle check result for cbaac5d: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@shwetathareja shwetathareja merged commit 6f1b59e into opensearch-project:main Oct 26, 2024
41 of 42 checks passed
@shwetathareja shwetathareja added the backport 2.x Backport to 2.x branch label Oct 26, 2024
@opensearch-trigger-bot (Contributor) commented

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-16421-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 6f1b59e54bec41d40772f8571c7b65d4b523f8b1
# Push it to GitHub
git push --set-upstream origin backport/backport-16421-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-16421-to-2.x.

sumitasr added a commit to sumitasr/OpenSearch that referenced this pull request Oct 27, 2024
…ed logging for critical cluster operations. (opensearch-project#16421)

Signed-off-by: Sumit Bansal <sumitsb@amazon.com>
@sumitasr (Member, Author) commented


Raised backport PR #16493

shwetathareja pushed a commit that referenced this pull request Oct 28, 2024
…ed logging for critical cluster operations. (#16421) (#16493)

Signed-off-by: Sumit Bansal <sumitsb@amazon.com>
3 participants