-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up MappingStats Computation on Coordinating Node #82830
Speed up MappingStats Computation on Coordinating Node #82830
Conversation
Pinging @elastic/es-search (Team:Search) |
Pinging @elastic/es-data-management (Team:Data Management) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @original-brownbear this is a nice optimization. I think our testing may be a bit thin towards this though, perhaps you can add a bit of randomized unittests for cases of shared and non-shared mappings for both analysis-stats and mapping-stats (or maybe I missed it, happy to be pointed to it instead).
@@ -51,30 +52,14 @@ public static AnalysisStats of(Metadata metadata, Runnable ensureNotCancelled) { | |||
final Map<String, IndexFeatureStats> usedBuiltInTokenFilters = new HashMap<>(); | |||
final Map<String, IndexFeatureStats> usedBuiltInAnalyzers = new HashMap<>(); | |||
|
|||
final Map<String, Integer> mappingCounts = Maps.newMapWithExpectedSize(metadata.getMappingsByHash().size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would find it more intuitive to use an IdentityHashMp
or just a hash-map (since hash-code/equals compare on the hash anyway). Is there a reason to use an explicit hash-value as key here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right ... IdentityHashMap
sounds nice and simplifies the loop a little as well saving another round of lookup :)
} | ||
for (Map.Entry<String, Integer> mappingAndCount : mappingCounts.entrySet()) { | ||
Set<String> indexAnalyzers = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also do ensureNotCancelled.run()
here to ensure we can cancel here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ adding it back
server/src/main/java/org/elasticsearch/action/admin/cluster/stats/MappingStats.java
Show resolved
Hide resolved
if (scriptSourceObject != null) { | ||
String scriptSource = scriptSourceObject.toString(); | ||
int chars = scriptSource.length(); | ||
long lines = scriptSource.lines().count(); | ||
int docUsages = countOccurrences(scriptSource, DOC_PATTERN); | ||
int sourceUsages = countOccurrences(scriptSource, SOURCE_PATTERN); | ||
scriptStats.update(chars, lines, sourceUsages, docUsages); | ||
for (int i = 0; i < multiplier; i++) { | ||
scriptStats.update(chars, lines, sourceUsages, docUsages); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add a multiplier
param to FieldScriptStats.update
too to avoid the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ that's cuter :)
server/src/test/java/org/elasticsearch/action/admin/cluster/stats/MappingStatsTests.java
Show resolved
Hide resolved
Thanks for taking a look Henning!
Right it was quite thin indeed. We had a pretty extensive test case for the mapping stats that I reused for a non-shard test (not the most beautiful solution but I figured it was a reasonable cost-return tradeoff). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the extra work on the tests.
server/src/main/java/org/elasticsearch/action/admin/cluster/stats/MappingStats.java
Show resolved
Hide resolved
@@ -949,7 +949,7 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws | |||
return builder; | |||
} | |||
|
|||
Map<String, MappingMetadata> getMappingsByHash() { | |||
public Map<String, MappingMetadata> getMappingsByHash() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should just expose the size of the map, since that is all we need? Small point, but would be nice to keep this internal to this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm we could here but I have another PR inbound that needs this map. I left it as is for now, hope that's ok.
server/src/test/java/org/elasticsearch/action/admin/cluster/stats/AnalysisStatsTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/action/admin/cluster/stats/MappingStatsTests.java
Outdated
Show resolved
Hide resolved
Hi @original-brownbear, I've created a changelog YAML for you. |
@elasticmachine update branch |
773d3f3
to
9b7c828
Compare
Hi @original-brownbear, I've created a changelog YAML for you. |
@elasticmachine update branch (sorry some changelog madness here) |
We can exploit the mapping deduplication logic to save deserializing the same mapping repeatedly here. This should fix extremly long running computations when the cache needs to be refreshed for these stats in the common case of many duplicate mappings in a cluster. In a follow-up we can probably do the same for `AnalysisStats` as well.
74e9208
to
b4ea9c9
Compare
Hi @original-brownbear, I've created a changelog YAML for you. |
Jenkins run elasticsearch-ci/part-2 |
Thanks Henning! |
We can exploit the mapping deduplication logic to save deserializing the
same mapping repeatedly here. This should fix extremely long running
computations when the cache needs to be refreshed for these stats
in the common case of many duplicate mappings in a cluster.
Also, removed some confusing and needless set creation in the constructor here.
We could go even further here probably and merge the logic for analysis and mapping stats parsing into one, but it doesn't matter much. With this fix the time to get a response in a 10k indices cluster with many repeated but very large (Beats) mappings (as you would expect them to be in the real world) goes from ~10s down to sub-second in the uncached case.
This removes a very long running task from the management pool and also ensures we don't burn endless CPU responding to monitoring in clusters that go through frequent metadata updates and thus won't see all that much benefit from the caching in
TransportClusterStatsAction
.relates #77466