
More debugging info for significant_text #72727

Merged 8 commits into elastic:master May 10, 2021

Conversation

nik9000 (Member) commented May 4, 2021

Adds some extra debugging information to make it clear that you are
running `significant_text`. Also adds some timing information around the
`_source` fetch and the `terms` accumulation. This lets you calculate a
third useful timing number: the analysis time. It is
`collect_ns - fetch_ns - accumulation_ns`.

This also adds a half dozen extra REST tests to get a *fairly*
comprehensive set of the operations this supports. It doesn't cover all
of the significance heuristic parsing, but it's certainly much better
than what we had.
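
For example, pulling the derived number out of a profiled response might look like this sketch (the key names follow the description above, and `profileResultDebug()` is a hypothetical accessor, not an API in this PR):

```java
import java.util.Map;

// Sketch only: key names follow the PR description; real output may differ.
Map<String, Object> debug = profileResultDebug();  // hypothetical accessor
long analysisNs = (long) debug.get("collect_ns")
    - (long) debug.get("fetch_ns")
    - (long) debug.get("accumulation_ns");         // time spent re-analyzing text
```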

elasticmachine added the Team:Analytics label May 4, 2021

elasticmachine (Collaborator):
Pinging @elastic/es-analytics-geo (Team:Analytics)

@@ -374,7 +374,7 @@ Chi square behaves like mutual information and can be configured with the same p


===== Google normalized distance
-Google normalized distance as described in "The Google Similarity Distance", Cilibrasi and Vitanyi, 2007 (https://arxiv.org/pdf/cs/0412098v3.pdf) can be used as significance score by adding the parameter
+Google normalized distance as described in https://arxiv.org/pdf/cs/0412098v3.pdf["The Google Similarity Distance", Cilibrasi and Vitanyi, 2007] can be used as significance score by adding the parameter
nik9000 (Member, Author):
This was invalid syntax that I bumped into while reading the docs so I could write the tests. I can move this into its own change if you'd like, but it's pretty small and isolated as is.

@@ -1,151 +0,0 @@
---
nik9000 (Member, Author):
I renamed this file, replaced `index` with `_bulk`, and added a ton more tests. It doesn't seem to detect the rename because it's barely the same file any more.

-            (int) termsAggResult.getDebugInfo().get(SEGMENTS_WITH_SINGLE) + (int) termsAggResult.getDebugInfo().get(SEGMENTS_WITH_MULTI),
-            greaterThan(0)
-        );
+        assertThat(termsAggResult.getDebugInfo().toString(), (int) termsAggResult.getDebugInfo().get(SEGMENTS_WITH_SINGLE), greaterThan(0));
nik9000 (Member, Author):
I bumped into this while double-checking that I hadn't broken this test. It was fixed in #71241.

@@ -117,6 +118,8 @@ public InternalAggregation buildEmptyAggregation() {
     public void collectDebugInfo(BiConsumer<String, Object> add) {
         super.collectDebugInfo(add);
         add.accept("total_buckets", bucketOrds.size());
+        add.accept("collection_strategy", collectorSource.describe());
+        collectorSource.collectDebugInfo(add);
nik9000 (Member, Author):
Without describing the collection strategy it isn't obvious from reading the profile that you are looking at `significant_text` rather than `significant_terms`. The extra debugging information should be useful in figuring out what makes a particular `significant_text` execution slow: maybe it's fetching from `_source`, maybe it's analysis, maybe it's just generating a zillion terms. That's all there in the debug output.
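
As a hedged sketch of that wiring, a collector source can publish its own counters through the same hook (the counter names match the fragment just below; the key names and timer field are illustrative, not necessarily the PR's exact output):

```java
// Sketch only: a CollectorSource contributing its own debug entries via
// the same BiConsumer hook. Key names are illustrative.
@Override
public void collectDebugInfo(BiConsumer<String, Object> add) {
    add.accept("values_fetched", valuesFetched);  // values pulled from _source
    add.accept("chars_fetched", charsFetched);    // their total UTF-16 length
    add.accept("extract_ns", extractNs);          // time spent fetching, in nanos
}
```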

CollectConsumer consumer
) throws IOException {
valuesFetched++;
charsFetched += text.length();
nik9000 (Member, Author):
With these two counters and the Timer around extract we can tell whether we are extracting many fields or a few. We can see if we're extracting many fields and filtering them away. And we can see how long the fields are in UTF-16 chars, which is better than nothing. Not as good as bytes, but such is life.
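
For the curious, the UTF-16 caveat is just how `String.length()` works (a small illustration, not code from this PR):

```java
import java.nio.charset.StandardCharsets;

// String.length() counts UTF-16 code units, not bytes or code points:
String rocket = "🚀";                                            // one code point outside the BMP
int utf16Units = rocket.length();                                // 2
int utf8Bytes = rocket.getBytes(StandardCharsets.UTF_8).length;  // 4
```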

nik9000 added 4 commits May 5, 2021 08:25
Runtime fields won't have them and that claim isn't central to the test
we're running.
nik9000 (Member, Author) commented May 5, 2021

I've used elastic/rally-tracks#176 to seed a bunch of data into a local cluster and used this PR's debug output to learn some things.

Running `significant_text` on an unselective query without the `sampler` agg is very, very slow, but it got faster in ES 7.9, mostly due to #62509.

  "build_aggregation" :  4913715588
            "collect" : 11255019266

collect breaks down into three parts:
         "extract_ns" :  4772881326
"collect_analyzed_ns" :   497302572
            "analyze" :  5984835368

Those are all in nanos from the profile API. That means we spent about 5 seconds building the results and picking the best buckets. Yikes! Worse, we spent six seconds re-analyzing the text. And five seconds reading the `_source`. That's a lot of seconds! This is fine if you expect it to be slow, but it is slow.
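
Those numbers are consistent with the `collect_ns - fetch_ns - accumulation_ns` formula from the description; a quick check, with values copied from above:

```java
// All nanoseconds, copied from the profile output above.
long collect = 11_255_019_266L;
long extract = 4_772_881_326L;        // reading _source
long collectAnalyzed = 497_302_572L;  // accumulating terms
long analyze = collect - extract - collectAnalyzed;  // 5_984_835_368, about six seconds
```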

At the other extreme, running `significant_text` in a `sampler` agg on a selective query yields:

   "build_aggregation" : 3595633
            "collect" : 38938326

collect breaks down into three parts:
         "extract_ns" : 37522252
"collect_analyzed_ns" :   339014
            "analyze" :  1077060

The first thing to know is that this whole thing takes about 43ms. Much faster! Good? Maybe. The performance is dominated by extracting the data from `_source`. With slower disks or slower compression or a larger sample it'll get worse. This is something we can improve, I think.
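
For reference, the fast shape measured here looks roughly like this sketch using the Java builders (the `message` field and the shard size are made-up stand-ins):

```java
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Sketch: a selective query with significant_text nested in a sampler
// agg, profiled so the timings above show up in the response.
SearchSourceBuilder source = new SearchSourceBuilder()
    .query(QueryBuilders.matchQuery("message", "a selective query"))
    .profile(true)
    .aggregation(
        AggregationBuilders.sampler("sample")         // only the top hits...
            .shardSize(200)
            .subAggregation(
                AggregationBuilders.significantText("keywords", "message")));  // ...get re-analyzed
```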

imotov (Contributor) left a comment:
LGTM, left a couple of suggestions.

@@ -148,15 +170,23 @@ LeafBucketCollector getLeafCollector(
* Fetch values from a {@link ValuesSource}.
*/
public static class ValuesSourceCollectorSource implements CollectorSource {
-        private final ValuesSource valuesSource;
+        private final ValuesSourceConfig valuesSource;
imotov (Contributor):
Why not rename the variable as well?

nik9000 (Member, Author):
Because I wasn't being careful. I'll do it.

throw new IllegalArgumentException("No analyzer configured for field " + f);
});
if (context.profiling()) {
return new SignificantTextCollectorSource(
imotov (Contributor):
There are so many overloaded methods here that I feel like extracting this into an old-fashioned static class could greatly improve readability.
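
For readers following along, that suggestion amounts to something like this sketch (all names illustrative, not the PR's final code):

```java
// Illustrative only: one plainly named static factory instead of a pile
// of overloads, so the profiling branch reads linearly.
interface CollectorSource { /* fetches the values to analyze */ }

final class ProfilingCollectorSource implements CollectorSource {
    private final CollectorSource delegate;   // wrapped source, timed and counted
    ProfilingCollectorSource(CollectorSource delegate) { this.delegate = delegate; }
}

final class CollectorSources {
    private CollectorSources() {}

    static CollectorSource create(CollectorSource raw, boolean profiling) {
        // Only pay for the bookkeeping wrapper when profiling is on.
        return profiling ? new ProfilingCollectorSource(raw) : raw;
    }
}
```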

nik9000 (Member, Author):
Sure!

nik9000 merged commit a43b166 into elastic:master May 10, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 10, 2021
nik9000 added a commit that referenced this pull request May 11, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 11, 2021
Now that elastic#72727 has landed in 7.x we can run the bwc tests against its
changes.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 11, 2021
nik9000 added a commit that referenced this pull request May 11, 2021
nik9000 added a commit that referenced this pull request May 11, 2021
Labels: :Analytics/Aggregations, >non-issue, Team:Analytics, v7.14.0, v8.0.0-alpha1