
More debugging info for significant_text #72727

Merged 8 commits into elastic:master May 10, 2021

Conversation

nik9000 (Member) commented May 4, 2021

Adds some extra debugging information to make it clear that you are
running `significant_text`. Also adds some timing information around the
`_source` fetch and the `terms` accumulation. This lets you calculate a
third useful timing number: the analysis time. It is
`collect_ns - fetch_ns - accumulation_ns`.

This also adds a half dozen extra REST tests to get a *fairly*
comprehensive set of the operations this supports. It doesn't cover all
of the significance heuristic parsing, but it's certainly much better
than what we had.
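
For example, pulling the derived number out of a profiled response might look like this sketch (the key names follow the description above, and `profileResultDebug()` is a hypothetical accessor, not an API in this PR):

```java
import java.util.Map;

// Sketch only: key names follow the PR description; real output may differ.
Map<String, Object> debug = profileResultDebug();  // hypothetical accessor
long analysisNs = (long) debug.get("collect_ns")
    - (long) debug.get("fetch_ns")
    - (long) debug.get("accumulation_ns");         // time spent re-analyzing text
```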

elasticmachine added the Team:Analytics label May 4, 2021

elasticmachine (Collaborator):
Pinging @elastic/es-analytics-geo (Team:Analytics)

@@ -374,7 +374,7 @@ Chi square behaves like mutual information and can be configured with the same p


===== Google normalized distance
-Google normalized distance as described in "The Google Similarity Distance", Cilibrasi and Vitanyi, 2007 (https://arxiv.org/pdf/cs/0412098v3.pdf) can be used as significance score by adding the parameter
+Google normalized distance as described in https://arxiv.org/pdf/cs/0412098v3.pdf["The Google Similarity Distance", Cilibrasi and Vitanyi, 2007] can be used as significance score by adding the parameter
nik9000 (Member, Author):
This was invalid syntax that I bumped into while reading the docs so I could write the tests. I can move this into its own change if you'd like, but it's pretty small and isolated as is.

@@ -1,151 +0,0 @@
---
nik9000 (Member, Author):
I renamed this file, replaced `index` with `_bulk`, and added a ton more tests. It doesn't seem to detect the rename because it's barely the same file any more.

-            (int) termsAggResult.getDebugInfo().get(SEGMENTS_WITH_SINGLE) + (int) termsAggResult.getDebugInfo().get(SEGMENTS_WITH_MULTI),
-            greaterThan(0)
-        );
+        assertThat(termsAggResult.getDebugInfo().toString(), (int) termsAggResult.getDebugInfo().get(SEGMENTS_WITH_SINGLE), greaterThan(0));
nik9000 (Member, Author):
I bumped into this while double-checking that I hadn't broken this test. It was fixed in #71241.

@@ -117,6 +118,8 @@ public InternalAggregation buildEmptyAggregation() {
     public void collectDebugInfo(BiConsumer<String, Object> add) {
         super.collectDebugInfo(add);
         add.accept("total_buckets", bucketOrds.size());
+        add.accept("collection_strategy", collectorSource.describe());
+        collectorSource.collectDebugInfo(add);
nik9000 (Member, Author):
Without describing the collection strategy it isn't obvious from reading the profile that you are looking at `significant_text` rather than `significant_terms`. The extra debugging information should be useful in figuring out what makes a particular `significant_text` execution slow: maybe it's fetching from `_source`, maybe it's analysis, maybe it's just generating a zillion terms. That's all there in the debug output.
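
As a hedged sketch of that wiring, a collector source can publish its own counters through the same hook (the counter names match the fragment just below; the key names and timer field are illustrative, not necessarily the PR's exact output):

```java
// Sketch only: a CollectorSource contributing its own debug entries via
// the same BiConsumer hook. Key names are illustrative.
@Override
public void collectDebugInfo(BiConsumer<String, Object> add) {
    add.accept("values_fetched", valuesFetched);  // values pulled from _source
    add.accept("chars_fetched", charsFetched);    // their total UTF-16 length
    add.accept("extract_ns", extractNs);          // time spent fetching, in nanos
}
```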

CollectConsumer consumer
) throws IOException {
valuesFetched++;
charsFetched += text.length();
nik9000 (Member, Author):
With these two counters and the Timer around extract we can tell whether we are extracting many fields or a few. We can see if we're extracting many fields and filtering them away. And we can see how long the fields are in UTF-16 chars, which is better than nothing. Not as good as bytes, but such is life.
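
For the curious, the UTF-16 caveat is just how `String.length()` works (a small illustration, not code from this PR):

```java
import java.nio.charset.StandardCharsets;

// String.length() counts UTF-16 code units, not bytes or code points:
String rocket = "🚀";                                            // one code point outside the BMP
int utf16Units = rocket.length();                                // 2
int utf8Bytes = rocket.getBytes(StandardCharsets.UTF_8).length;  // 4
```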

nik9000 added 4 commits May 5, 2021 08:25
Runtime fields won't have them and that claim isn't central to the test
we're running.
nik9000 (Member, Author) commented May 5, 2021

I've used elastic/rally-tracks#176 to seed a bunch of data into a local cluster and used this PR's debug output to learn some things.

Running `significant_text` on an unselective query without the `sampler` agg is very, very slow, but it got faster in ES 7.9, mostly due to #62509.

  "build_aggregation" :  4913715588
            "collect" : 11255019266

collect breaks down into three parts:
         "extract_ns" :  4772881326
"collect_analyzed_ns" :   497302572
            "analyze" :  5984835368

Those are all in nanos from the profile API. That means we spent about 5 seconds building the results and picking the best buckets. Yikes! Worse, we spent six seconds re-analyzing the text. And five seconds reading the `_source`. That's a lot of seconds! This is fine if you expect it to be slow, but it is slow.
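
Those numbers are consistent with the `collect_ns - fetch_ns - accumulation_ns` formula from the description; a quick check, with values copied from above:

```java
// All nanoseconds, copied from the profile output above.
long collect = 11_255_019_266L;
long extract = 4_772_881_326L;        // reading _source
long collectAnalyzed = 497_302_572L;  // accumulating terms
long analyze = collect - extract - collectAnalyzed;  // 5_984_835_368, about six seconds
```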

At the other extreme, running `significant_text` in a `sampler` agg on a selective query yields:

   "build_aggregation" : 3595633
            "collect" : 38938326

collect breaks down into three parts:
         "extract_ns" : 37522252
"collect_analyzed_ns" :   339014
            "analyze" :  1077060

The first thing to know is that this whole thing takes about 43ms. Much faster! Good? Maybe. The performance is dominated by extracting the data from `_source`. With slower disks or slower compression or a larger sample it'll get worse. This is something we can improve, I think.
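
For reference, the fast shape measured here looks roughly like this sketch using the Java builders (the `message` field and the shard size are made-up stand-ins):

```java
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Sketch: a selective query with significant_text nested in a sampler
// agg, profiled so the timings above show up in the response.
SearchSourceBuilder source = new SearchSourceBuilder()
    .query(QueryBuilders.matchQuery("message", "a selective query"))
    .profile(true)
    .aggregation(
        AggregationBuilders.sampler("sample")         // only the top hits...
            .shardSize(200)
            .subAggregation(
                AggregationBuilders.significantText("keywords", "message")));  // ...get re-analyzed
```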

imotov (Contributor) left a comment:
LGTM, left a couple of suggestions.

@@ -148,15 +170,23 @@ LeafBucketCollector getLeafCollector(
* Fetch values from a {@link ValuesSource}.
*/
public static class ValuesSourceCollectorSource implements CollectorSource {
-        private final ValuesSource valuesSource;
+        private final ValuesSourceConfig valuesSource;
imotov (Contributor):
Why not rename the variable as well?

nik9000 (Member, Author):
Because I wasn't being careful. I'll do it.

throw new IllegalArgumentException("No analyzer configured for field " + f);
});
if (context.profiling()) {
return new SignificantTextCollectorSource(
imotov (Contributor):
There are so many overloaded methods here that I feel like extracting this into an old-fashioned static class could greatly improve readability.
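
For readers following along, that suggestion amounts to something like this sketch (all names illustrative, not the PR's final code):

```java
// Illustrative only: one plainly named static factory instead of a pile
// of overloads, so the profiling branch reads linearly.
interface CollectorSource { /* fetches the values to analyze */ }

final class ProfilingCollectorSource implements CollectorSource {
    private final CollectorSource delegate;   // wrapped source, timed and counted
    ProfilingCollectorSource(CollectorSource delegate) { this.delegate = delegate; }
}

final class CollectorSources {
    private CollectorSources() {}

    static CollectorSource create(CollectorSource raw, boolean profiling) {
        // Only pay for the bookkeeping wrapper when profiling is on.
        return profiling ? new ProfilingCollectorSource(raw) : raw;
    }
}
```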

nik9000 (Member, Author):
Sure!

nik9000 merged commit a43b166 into elastic:master May 10, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 10, 2021
nik9000 added a commit that referenced this pull request May 11, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 11, 2021
Now that elastic#72727 has landed in 7.x we can run the bwc tests against its
changes.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 11, 2021
nik9000 added a commit that referenced this pull request May 11, 2021
nik9000 added a commit that referenced this pull request May 11, 2021
Labels: :Analytics/Aggregations, >non-issue, Team:Analytics, v7.14.0, v8.0.0-alpha1