Add TopNSearchTasksLogger settings to Cluster Settings #6716
Conversation
Signed-off-by: PritLadani <pritkladani@gmail.com>
Gradle Check (Jenkins) Run Completed with:
…el for Search Backpressure Signed-off-by: PritLadani <pritkladani@gmail.com>
Changes look good to me.
Gradle Check (Jenkins) Run Completed with:
LGTM!
server/src/main/java/org/opensearch/tasks/consumer/TopNSearchTasksLogger.java
Signed-off-by: PritLadani <pritkladani@gmail.com>
Gradle Check (Jenkins) Run Completed with:
Codecov Report
@@ Coverage Diff @@
## main #6716 +/- ##
============================================
- Coverage 70.74% 70.69% -0.05%
+ Complexity 59215 59209 -6
============================================
Files 4809 4809
Lines 283455 283466 +11
Branches 40868 40867 -1
============================================
- Hits 200520 200387 -133
- Misses 66511 66663 +152
+ Partials 16424 16416 -8
... and 478 files with indirect coverage changes
public TaskManager(Settings settings, ThreadPool threadPool, Set<String> taskHeaders) {
    this(settings, threadPool, taskHeaders, null);
}
I believe this constructor is preserved to keep the existing tests unmodified. If the effort is small, can we refactor them and remove this constructor completely?
There are 10+ tests using this constructor; passing a null argument for each of these constructor usages may create confusion. Thoughts?
Passing null or a mocked object is perfectly fine and even desirable in unit-tests. But letting unit-tests dictate the design choices, especially when it introduces invalid code flows, should be avoided.
For this change though, creating a register method avoids further refactoring.
  this.threadPool = threadPool;
  this.taskHeaders = new ArrayList<>(taskHeaders);
  this.maxHeaderSize = SETTING_HTTP_MAX_HEADER_SIZE.get(settings);
  this.taskResourceConsumersEnabled = TASK_RESOURCE_CONSUMERS_ENABLED.get(settings);
- this.taskResourceConsumer = Set.of(new TopNSearchTasksLogger(settings));
+ this.taskResourceConsumer = Set.of(new TopNSearchTasksLogger(settings, clusterSettings));
Can we pass a collection of task consumers as a dependency to the TaskManager? Or expose a method to register a task consumer in order to decouple dependencies.
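As a rough illustration of the registration idea — the registerTaskResourceConsumer name, the Consumer<Task> element type, and the notify hook below are assumptions for the sketch, not the actual TaskManager API:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

import org.opensearch.tasks.Task;

// Hypothetical sketch: TaskManager exposes a hook so callers (e.g. Node or a plugin)
// register consumers such as TopNSearchTasksLogger after construction, instead of
// TaskManager constructing its consumers itself.
public class TaskManager {
    private final Set<Consumer<Task>> taskResourceConsumers = ConcurrentHashMap.newKeySet();

    public void registerTaskResourceConsumer(Consumer<Task> consumer) {
        taskResourceConsumers.add(consumer);
    }

    // Called when a task completes; every registered consumer gets a chance to record it.
    void notifyTaskResourceConsumers(Task task) {
        taskResourceConsumers.forEach(consumer -> consumer.accept(task));
    }
}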
LOG_TOP_QUERIES_SIZE,
10,
1,
25,
How did we arrive at this max threshold?
Sruti has already done this testing earlier, so after discussion with her we decided on these thresholds. Increasing the top N queries size or the logging frequency may affect performance, which is why we have these thresholds.
Can you add links/references for the testing done? I understand there should be a max cap to avoid verbose logging, but is 25 the right number?
I will let Sruti answer this one. @sruti1312
private int topQueriesSize;
private long topQueriesLogFrequencyInNanos;
These need to be volatile.
Ack. Making them volatile in the next commit.
LOG_TOP_QUERIES_FREQUENCY,
TimeValue.timeValueSeconds(60L),
TimeValue.timeValueSeconds(60L),
How did we arrive at this max threshold?
Sruti has already done this testing earlier, so after discussion with her we decided on these thresholds. Increasing the top N queries size or the logging frequency may affect performance, which is why we have these thresholds.
  private long lastReportedTimeInNanos = System.nanoTime();

- public TopNSearchTasksLogger(Settings settings) {
+ public TopNSearchTasksLogger(Settings settings, @Nullable ClusterSettings clusterSettings) {
Cluster settings should always be non-null. Let's refactor unit-tests as needed.
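For reference, a sketch of how the constructor could register the dynamic updates once clusterSettings is guaranteed non-null; the setting and setter names are the ones appearing in this PR, but the exact constructor body is assumed rather than quoted:

public TopNSearchTasksLogger(Settings settings, ClusterSettings clusterSettings) {
    this.topQueriesSize = LOG_TOP_QUERIES_SIZE_SETTING.get(settings);
    this.topQueriesLogFrequencyInNanos = LOG_TOP_QUERIES_FREQUENCY_SETTING.get(settings).getNanos();
    this.topQueries = new PriorityQueue<>(topQueriesSize, Comparator.comparingLong(Tuple::v1));
    // With a non-null ClusterSettings, the update consumers can be registered unconditionally,
    // so changes to the two dynamic settings take effect without a restart.
    clusterSettings.addSettingsUpdateConsumer(LOG_TOP_QUERIES_SIZE_SETTING, this::setLogTopQueriesSize);
    clusterSettings.addSettingsUpdateConsumer(LOG_TOP_QUERIES_FREQUENCY_SETTING, this::setTopQueriesLogFrequencyInNanos);
}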
  this.topQueriesSize = LOG_TOP_QUERIES_SIZE_SETTING.get(settings);
  this.topQueriesLogFrequencyInNanos = LOG_TOP_QUERIES_FREQUENCY_SETTING.get(settings).getNanos();
- this.topQueries = new PriorityQueue<>(topQueriesSize, Comparator.comparingLong(Tuple::v1));
+ this.topQueries = new PriorityQueue<>(Comparator.comparingLong(Tuple::v1));
Why are we removing topQueriesSize (the initial capacity) from the PriorityQueue?
No need to remove the initial capacity. Reverting it back.
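A standalone illustration of the point under discussion (simplified Long elements rather than the PR's Tuple-based queue): the two constructors differ only in the initial-capacity hint, not in ordering behaviour.

import java.util.Comparator;
import java.util.PriorityQueue;

public class PriorityQueueCapacityDemo {
    public static void main(String[] args) {
        // Pre-sizes the backing array to 10, avoiding a few early resizes when the
        // expected size (top N) is known up front.
        PriorityQueue<Long> withCapacity = new PriorityQueue<>(10, Comparator.naturalOrder());
        // Uses the default initial capacity (11); element ordering is identical.
        PriorityQueue<Long> withoutCapacity = new PriorityQueue<>(Comparator.naturalOrder());
        withCapacity.add(5L);
        withoutCapacity.add(5L);
        System.out.println(withCapacity.peek() + " " + withoutCapacity.peek()); // prints "5 5"
    }
}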
@@ -97,4 +106,12 @@ private void logTopResourceConsumingQueries() {
            SEARCH_TASK_DETAILS_LOGGER.info(new SearchShardTaskDetailsLogMessage(topQuery.v2()));
        }
    }

    void setLogTopQueriesSize(int topQueriesSize) {
        this.topQueriesSize = topQueriesSize;
This is used in two places in the recordSearchTask method. Now that we are making it dynamic, it is prone to race conditions. Can we please fix?
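One way to address it, shown as a self-contained sketch with simplified types (the real recordSearchTask signature and queue element type differ): snapshot the volatile setting once per call so both uses inside the method observe the same value even if a settings update lands mid-call.

import java.util.Comparator;
import java.util.PriorityQueue;

class TopNRecorderSketch {
    // Dynamic setting; written by the settings-update consumer thread.
    private volatile int topQueriesSize = 10;
    private final PriorityQueue<Long> topQueries = new PriorityQueue<>(Comparator.naturalOrder());

    synchronized void record(long resourceUsage) {
        final int size = topQueriesSize;   // single read of the volatile field
        topQueries.add(resourceUsage);
        if (topQueries.size() > size) {
            topQueries.poll();             // evict the smallest entry to stay within the cap
        }
    }
}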
public void testDynamicSettings() {
    Settings settings = Settings.builder()
        .put(LOG_TOP_QUERIES_SIZE_SETTING.getKey(), 1)
        .put(LOG_TOP_QUERIES_FREQUENCY_SETTING.getKey(), "60s")
        .build();
    ClusterSettings clusterSettings = new ClusterSettings(Settings.EMPTY, ClusterSettings.BUILT_IN_CLUSTER_SETTINGS);
    topNSearchTasksLogger = new TopNSearchTasksLogger(settings, clusterSettings);
    settings = Settings.builder().put(LOG_TOP_QUERIES_SIZE_SETTING.getKey(), "10").build();
    ClusterUpdateSettingsResponse response = client().admin().cluster().prepareUpdateSettings().setPersistentSettings(settings).get();
    assertEquals(response.getPersistentSettings().get(LOG_TOP_QUERIES_SIZE_SETTING.getKey()), "10");

    settings = Settings.builder().put(LOG_TOP_QUERIES_FREQUENCY_SETTING.getKey(), "100s").build();
    response = client().admin().cluster().prepareUpdateSettings().setPersistentSettings(settings).get();
    assertEquals(response.getPersistentSettings().get(LOG_TOP_QUERIES_FREQUENCY_SETTING.getKey()), "100s");
}

public void testMinMaxSettingValues() {
    final Settings settings = Settings.builder()
        .put(LOG_TOP_QUERIES_SIZE_SETTING.getKey(), 1)
        .put(LOG_TOP_QUERIES_FREQUENCY_SETTING.getKey(), "60s")
        .build();
    topNSearchTasksLogger = new TopNSearchTasksLogger(settings, null);
    final Settings sizeSetting = Settings.builder().put(LOG_TOP_QUERIES_SIZE_SETTING.getKey(), "50").build();
    // request should fail as it crosses the maximum defined value (25)
    expectThrows(
        IllegalArgumentException.class,
        () -> client().admin().cluster().prepareUpdateSettings().setPersistentSettings(sizeSetting).get()
    );

    final Settings freqSetting = Settings.builder().put(LOG_TOP_QUERIES_FREQUENCY_SETTING.getKey(), "30s").build();
    // request should fail as it crosses the minimum defined value (60s)
    expectThrows(
        IllegalArgumentException.class,
        () -> client().admin().cluster().prepareUpdateSettings().setPersistentSettings(freqSetting).get()
    );
}
I feel that these tests don't belong here. They're not testing anything about the feature in this PR; instead, they're testing the functionality of the common Settings framework.
Makes sense, removing them from here.
[Unrelated to this PR]
The recordSearchTask method is synchronized for thread-safety. Since it's called on every (search) task completion, the performance penalty might be too much on highly concurrent search workloads. Do we need a performance test to know the impact? If needed, we should document it as well.
@sruti1312 Have we done any performance test around this?
We had done initial perf tests and noticed performance degradation when logging is enabled. Results are as follows.
When search IOPS is low, the performance impact was negligible (less than 10 ms). After increasing the search IOPS by running parallel search workloads against the same cluster, the latency degradation is close to 20%. Throughput is about the same.
Currently, logging occurs during unregister and can block the request until all top N messages are logged.
Performance results:
Latency (ms)        Operation   P50     P90     P99     P100
Consumer disabled   query       107     141     242.5   432.7
Consumer enabled    query       138.1   189.6   282.3   402.4
Thanks for the details @sruti1312.
@PritLadani, I have the following concerns with the current implementation of this feature.
- Clusters with high-throughput searches, especially on nodes with a large number of cores and thread pools, will suffer the most from synchronization. We should test specifically for this scenario.
- As logging happens in the hot-path of searches, it will introduce HOL blocking. I'm expecting the tail latencies to suffer due to this. Let's also confirm that we use async logging here.
- Less concerning – As logging is triggered by a search request, it may not honor the LOG_TOP_QUERIES_FREQUENCY accurately when the search traffic is low and irregularly spaced apart. For example, if a few hundred requests arrive at time t = 0 seconds but the subsequent requests arrive at a later time t = 1000 seconds, the logging will be deferred for 16+ minutes.
Just to clarify, I'm not suggesting a change of implementation. I'm interested to know the expected impact, document it, and convey it to our end users.
Created an issue to address this and to reduce performance degradation: #7146
Updated review.
Signed-off-by: PritLadani <pritkladani@gmail.com>
Gradle Check (Jenkins) Run Completed with:
@@ -97,4 +104,12 @@ private void logTopResourceConsumingQueries() {
            SEARCH_TASK_DETAILS_LOGGER.info(new SearchShardTaskDetailsLogMessage(topQuery.v2()));
        }
    }

    void setLogTopQueriesSize(int topQueriesSize) {
Please make this method private.
Ack.
        this.topQueriesSize = topQueriesSize;
    }

    void setTopQueriesLogFrequencyInNanos(TimeValue timeValue) {
Please make this method private.
Will need this to be package-private since we need to modify the frequency for testing; otherwise UTs will take 60+ seconds to finish. Thoughts?
Code reference: https://github.com/opensearch-project/OpenSearch/pull/6716/files#diff-7f1abfc2317c2c29386a05b19737625a551e7f89d9c1da1ad81065e2bb61bf17R76
Signed-off-by: PritLadani <pritkladani@gmail.com>
Gradle Check (Jenkins) Run Completed with:
The backport to 2.x failed.
To backport manually, run these commands in your terminal:
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-6716-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 9f6c06775eb088bf86181f3cace885f409c25e61
# Push it to GitHub
git push --set-upstream origin backport/backport-6716-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x
Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-6716-to-2.x.
The backport to 2.7 failed.
To backport manually, run these commands in your terminal:
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.7 2.7
# Navigate to the new working tree
pushd ../.worktrees/backport-2.7
# Create a new branch
git switch --create backport/backport-6716-to-2.7
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 9f6c06775eb088bf86181f3cace885f409c25e61
# Push it to GitHub
git push --set-upstream origin backport/backport-6716-to-2.7
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.7
Then, create a pull request where the base branch is 2.7 and the compare/head branch is backport/backport-6716-to-2.7.
…oject#6716) Signed-off-by: PritLadani <pritkladani@gmail.com> (cherry picked from commit 9f6c067)
…oject#6716) Signed-off-by: PritLadani <pritkladani@gmail.com> (cherry picked from commit 9f6c067)
…oject#6716) Signed-off-by: PritLadani <pritkladani@gmail.com>
…oject#6716) Signed-off-by: PritLadani <pritkladani@gmail.com> Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Description
We introduced Task Resource Consumer to log task resource information at a specified interval as part of #2293. However, the newly introduced dynamic settings cluster.task.consumers.top_n.size and cluster.task.consumers.top_n.frequency are not registered under ClusterSettings.BUILT_IN_CLUSTER_SETTINGS and hence are not recognised when trying to change their values. We need to register them in ClusterSettings.BUILT_IN_CLUSTER_SETTINGS to make them dynamic. Making this change to fix the issue.
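In terms of the change itself, the fix boils down to adding the two settings to the built-in registry. A sketch of the shape of the change in ClusterSettings.java (the long list of existing entries is elided, and the exact surrounding declaration is assumed rather than quoted):

// org.opensearch.common.settings.ClusterSettings
public static final Set<Setting<?>> BUILT_IN_CLUSTER_SETTINGS = Collections.unmodifiableSet(
    new HashSet<>(
        Arrays.asList(
            // ... existing built-in settings ...
            TopNSearchTasksLogger.LOG_TOP_QUERIES_SIZE_SETTING,
            TopNSearchTasksLogger.LOG_TOP_QUERIES_FREQUENCY_SETTING
        )
    )
);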
Issues Resolved
#6708
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.