More debugging info for significant_text #72727
Conversation
Pinging @elastic/es-analytics-geo (Team:Analytics)
Adds some extra debugging information to make it clear that you are running `significant_text`. Also adds some timing information around the `_source` fetch and the `terms` accumulation. This lets you calculate a third useful timing number: the analysis time. It is `collect_ns - fetch_ns - accumulation_ns`. This also adds a half dozen extra REST tests to get a *fairly* comprehensive set of the operations this supports. It doesn't cover all of the significance heuristic parsing, but it's certainly much better than what we had.
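The arithmetic above can be sketched in a few lines. This is an illustrative helper, not part of the PR: the key names (`collect_ns`, `fetch_ns`, `accumulation_ns`) come from the description above, and the sample values are made up.

```java
import java.util.Map;

public class AnalysisTime {
    // collect_ns covers fetching _source, re-analyzing the text, and
    // accumulating terms; subtracting the two measured parts leaves the
    // (unmeasured) analysis time.
    static long analysisNanos(Map<String, Long> debug) {
        return debug.get("collect_ns") - debug.get("fetch_ns") - debug.get("accumulation_ns");
    }

    public static void main(String[] args) {
        // Hypothetical numbers in nanoseconds, just to show the calculation.
        Map<String, Long> debug = Map.of(
            "collect_ns", 16_000_000_000L,
            "fetch_ns", 5_000_000_000L,
            "accumulation_ns", 5_000_000_000L
        );
        System.out.println(analysisNanos(debug)); // prints 6000000000
    }
}
```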
@@ -374,7 +374,7 @@ Chi square behaves like mutual information and can be configured with the same p
 ===== Google normalized distance
-Google normalized distance as described in "The Google Similarity Distance", Cilibrasi and Vitanyi, 2007 (https://arxiv.org/pdf/cs/0412098v3.pdf) can be used as significance score by adding the parameter
+Google normalized distance as described in https://arxiv.org/pdf/cs/0412098v3.pdf["The Google Similarity Distance", Cilibrasi and Vitanyi, 2007] can be used as significance score by adding the parameter
This was invalid syntax I bumped into reading the docs so I could write the tests. I can move this into its own change if you'd like, but it's pretty small and isolated as is.
@@ -1,151 +0,0 @@
---
I renamed this file, replaced `index` with `_bulk`, and added a ton more tests. It doesn't seem to detect the rename because it's barely the same file any more.
    (int) termsAggResult.getDebugInfo().get(SEGMENTS_WITH_SINGLE) + (int) termsAggResult.getDebugInfo().get(SEGMENTS_WITH_MULTI),
    greaterThan(0)
);
assertThat(termsAggResult.getDebugInfo().toString(), (int) termsAggResult.getDebugInfo().get(SEGMENTS_WITH_SINGLE), greaterThan(0));
I bumped into this while double checking that I hadn't broken this test. This was fixed in #71241.
@@ -117,6 +118,8 @@ public InternalAggregation buildEmptyAggregation() {
 public void collectDebugInfo(BiConsumer<String, Object> add) {
     super.collectDebugInfo(add);
     add.accept("total_buckets", bucketOrds.size());
+    add.accept("collection_strategy", collectorSource.describe());
+    collectorSource.collectDebugInfo(add);
Without describing the collection strategy it isn't obvious from reading the profile that you are looking at `significant_text` rather than `significant_terms`. The extra debugging information should be useful in figuring out what makes a particular `significant_text` execution slow: maybe it's fetching from source, maybe it's analysis, maybe it's just generating a zillion terms. That's all there in the debug output.
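To make that concrete, the debug section of a profile response might look roughly like this. This is a hypothetical sketch assembled from the keys mentioned in this PR (`collection_strategy`, `total_buckets`, and the timing counters); the exact names, nesting, and values in a real response may differ.

```json
{
  "debug": {
    "collection_strategy": "significant_text(ValuesSourceCollectorSource)",
    "total_buckets": 1234,
    "values_fetched": 100000,
    "chars_fetched": 25000000,
    "collect_ns": 16000000000,
    "fetch_ns": 5000000000,
    "accumulation_ns": 5000000000
  }
}
```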
    CollectConsumer consumer
) throws IOException {
    valuesFetched++;
    charsFetched += text.length();
With these two counters and the `Timer` around extract we can tell whether we are extracting many fields or a few. We can see if we're extracting many fields and filtering them away. And we can see how long the fields are in UTF-16 `char`s. Which is better than nothing. Not as good as bytes, but such is life.
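The two counters compose into one more derived number: the average field length per extracted value. A minimal sketch, with made-up counter values (the method and numbers are illustrative, not part of the PR):

```java
public class FetchStats {
    // charsFetched counts UTF-16 chars, so this approximates field length
    // in chars, not bytes.
    static double avgCharsPerValue(long valuesFetched, long charsFetched) {
        return (double) charsFetched / valuesFetched;
    }

    public static void main(String[] args) {
        // Hypothetical: 1,000 values totaling 250,000 chars.
        System.out.println(avgCharsPerValue(1_000, 250_000)); // prints 250.0
    }
}
```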
Runtime fields won't have them and that claim isn't central to the test we're running.
I've used elastic/rally-tracks#176 to seed a bunch of data into a local cluster and used this to learn some things. Running

Those are all in nanos from the profile API. That means that we spent about 5 seconds building the results and picking the best buckets. Yikes! Worse, we spent six seconds re-analyzing the text. And five seconds reading the

On the other extreme, running the

The first thing to know is that this whole thing takes about 43ms. Much faster! Good? Maybe. The performance is dominated by extracting the data from
LGTM, left a couple of suggestions.
@@ -148,15 +170,23 @@ LeafBucketCollector getLeafCollector(
  * Fetch values from a {@link ValuesSource}.
  */
 public static class ValuesSourceCollectorSource implements CollectorSource {
-    private final ValuesSource valuesSource;
+    private final ValuesSourceConfig valuesSource;
Why not rename the variable as well?
Because I wasn't being careful. I'll do it.
    throw new IllegalArgumentException("No analyzer configured for field " + f);
});
if (context.profiling()) {
    return new SignificantTextCollectorSource(
There are so many overloaded methods here that I feel like extracting it into an old-fashioned static class would greatly increase readability.
Sure!
Now that #72727 has landed in 7.x we can run the bwc tests against its changes.