Slow date_histogram after upgrading to 7.3.0 #45702

Closed

brenuart opened this issue Aug 19, 2019 · 28 comments · Fixed by #46670

@brenuart

Elasticsearch version (bin/elasticsearch --version):
Version: 7.3.0, Build: default/rpm/de777fa/2019-07-24T18:30:11.767338Z, JVM: 12.0.1

Plugins installed: repository-s3

JVM version (java -version):
JVM shipped with Elasticsearch (rpm)

OS version (uname -a if on a Unix-like system):
Linux es-hot-03.aegaeon-it.com 4.9.120-xxxx-std-ipv6-64 #327490 SMP Thu Aug 16 10:11:35 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
We noticed a severe degradation of date_histogram performance after upgrading from Elasticsearch 6.8.2 to the latest 7.3.0.

The table below shows the execution time of the same query on the same cluster before and after the upgrade. The first execution happens after caches are cleared as follows:

  • ES caches are cleared with POST _cache/clear
  • Filesystem caches are invalidated by doing a cat of all files found in /var/lib/elasticsearch to /dev/null (that's the best approach we found; a minimal sketch of both steps follows this list)
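For reference, a rough sketch of both steps (assuming a single node reachable on localhost:9200 and the default RPM data path):

# Clear Elasticsearch's own query/request/fielddata caches
curl -XPOST 'http://localhost:9200/_cache/clear'

# "Invalidate" the filesystem cache the way described above:
# read every data file once, discarding the output
find /var/lib/elasticsearch -type f -exec cat {} + > /dev/null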

Timings are as follows:

ES Version   Timings (ms)
------------------------------------
6.8.2        10388,  5797,  7319
7.3.0        21553, 15138, 14833

As you can see, 7.3.0 is at least twice as slow as 6.8.2.

The dataset consists of about 650M documents spread evenly across 15 indices. Each index has 3 shards and no replicas. The cluster is made of three nodes, each with 7GB RAM (of which 4GB is allocated to the heap) and 2 vCPUs at 3.1GHz. There is no activity on the cluster besides this test query.

The query targets only half of the documents (using a date range filter) and builds a date_histogram with 3h buckets. This query is essentially what Kibana's Discover panel issues...

{
  "size": 500,
  "sort": [
    {
      "@timestamp": {
        "order": "desc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "3h",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      }
    }
  },
  "docvalue_fields": [
    {
      "field": "@timestamp",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2019-01-09T00:00:00.000Z",
              "lte": "2019-01-15T00:00:00.000Z"
            }
          }
        }
      ]
    }
  }
}

We are very surprised by this drop in performance...
Did we forget to change some configuration parameters when doing the upgrade, or is it a regression in ES itself?

@dliappis
Contributor

Thanks very much for your interest in Elasticsearch.

This appears to be a user question, and we'd like to direct these kinds of things to the forums. If you can stop by there, we'd appreciate it. This allows us to use GitHub for verified bug reports, feature requests, and pull requests.

There's an active community in the forums that should be able to help get an answer to your question. As such, I hope you don't mind that I close this.

Finally, note that we track performance across releases at https://elasticsearch-benchmarks.elastic.co, and you can see results between 6.8 and 7.3 for date_histogram using the nyc_taxis track at https://elasticsearch-benchmarks.elastic.co/#tracks/nyc-taxis/release.

@brenuart
Author

I'm linking here to the forum post I created: https://discuss.elastic.co/t/slow-date-histogram-after-upgrading-to-7-3-0-on-dense-indexes/196475.

Please note that we could not reproduce the issue with the nyc_taxis dataset, only with a custom one. The reason (as described in the post) is that the dates in the nyc_taxis dataset are not close enough to each other to trigger the issue.

@spinscale
Contributor

reopening, as a performance issue between 6.8 and 7.3 seems to exist and needs to be evaluated.

Thanks for opening @brenuart!

@spinscale spinscale reopened this Aug 26, 2019
@dliappis dliappis self-assigned this Aug 26, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo

@dliappis dliappis assigned csoulios and unassigned dliappis Aug 28, 2019
@vboulaye
Contributor

Hi,

We are in the process of upgrading our ELK stack from 6.8 to 7.3 and we see exactly the same behaviour.

In our logs the @timestamps are very close to each other, and we see a huge degradation of the response time of Kibana's default query.
It is 4 to 5 times slower to run the default query (>30s for the last 7 days).

I do not have a nice dataset like the one @brenuart created to reproduce the issue, but I ran some test queries that show the problem (839ms in ES 6.8, 4683ms in ES 7.3):

In ES 6.8

GET log-appli-2019.09.06/_search
{ 
  "size": 1,
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "3h",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      }
    }
  }, 
  "timeout": "30000ms"
}
=>
{
  "took" : 839,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 53079313,
    "max_score" : 1.0,
    "hits" : [
...

In ES 7.3

GET /log-appli-2019.09.06/_search
{
  "size": 1,
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "3h",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      }
    }
  },
  "timeout": "30000ms"
}


=>

{
  "took" : 4683,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.0,
    "hits" : [
...

@polyfractal
Contributor

polyfractal commented Sep 10, 2019

Thanks for the extra details @vboulaye! We're actively looking into this issue right now; we'll update as we diagnose.

Out of curiosity, would you mind trying those queries again without the time_zone parameter? We've seen performance issues with time zone conversion in the past (#28727). I don't think that's the case here but it would be good to rule out.
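For reference, the retest amounts to dropping time_zone from the request above, along these lines (a sketch only; the host is assumed and the index name is taken from the earlier example):

curl -XPOST 'http://localhost:9200/log-appli-2019.09.06/_search' -H 'Content-Type: application/json' -d '
{
  "size": 1,
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "3h",
        "min_doc_count": 1
      }
    }
  },
  "timeout": "30000ms"
}'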

@brenuart
Author

brenuart commented Sep 10, 2019

@polyfractal We did the test with and without a timezone on both versions. 7.3 is indeed a few seconds slower when a timezone is set.
But if you look at my report posted in the forum (https://discuss.elastic.co/t/slow-date-histogram-after-upgrading-to-7-3-0-on-dense-indexes/196475), the queries we used for testing don't have this parameter set. This was on purpose, to reduce the number of variables in the test case...
The timezone is apparently not the (principal) culprit here.

@polyfractal
Contributor

Noted, thanks for the clarification!

@vboulaye
Contributor

Hi,
I just ran my test query without the time_zone parameter and it is a bit faster, but still above 4s.

@vboulaye
Contributor

Do you think it is worth trying with version 7.2 or is this more likely to be related to the major changes in v7?

@dliappis
Contributor

dliappis commented Sep 11, 2019

@brenuart

We've attempted to reproduce the issue without success.

We used the data sets you provided, and @csoulios kindly created a Rally track to help us benchmark the queries. We used a bare-metal environment (a separate machine for the load driver and one dedicated node for Elasticsearch). Elasticsearch runs with a 1G heap on Java 12, and between every experiment we cleaned up caches and slab objects, compacted memory, and trimmed the SSD disks as shown in [1].

In this gist I have the comparisons for the gaussian-sameday data between ES 7.3.1 (contender) and 6.8.2 (baseline); in particular, for the date histogram query we see:

| Metric                        | Task                     | Baseline | Contender |     Diff | Unit |
|-------------------------------+--------------------------+----------+-----------+----------+------|
| 50th percentile latency       | query-agg-date_histogram |  2788.58 |   2690.07 | -98.5142 | ms   |
| 90th percentile latency       | query-agg-date_histogram |  2809.82 |   2741.53 | -68.2962 | ms   |
| 100th percentile latency      | query-agg-date_histogram |  2884.77 |   2792.32 | -92.4499 | ms   |
| 50th percentile service time  | query-agg-date_histogram |  2786.19 |    2687.6 | -98.5906 | ms   |
| 90th percentile service time  | query-agg-date_histogram |  2807.42 |   2739.02 | -68.3978 | ms   |
| 100th percentile service time | query-agg-date_histogram |  2882.37 |   2789.84 | -92.5238 | ms   |

The results from another identical experiment are here, and again we see comparable performance between 6.8 and 7.3; for example, the 50th percentile latency is 2847ms on 6.8 vs 2690ms on 7.3 (so 7.3 was slightly faster):

| Metric                        | Task                     | Baseline | Contender |     Diff | Unit |
|-------------------------------+--------------------------+----------+-----------+----------+------|
| 50th percentile latency       | query-agg-date_histogram |  2847.59 |   2690.07 | -157.524 | ms   |
| 90th percentile latency       | query-agg-date_histogram |  2884.79 |   2741.53 | -143.257 | ms   |
| 100th percentile latency      | query-agg-date_histogram |  2951.17 |   2792.32 | -158.846 | ms   |
| 50th percentile service time  | query-agg-date_histogram |  2845.24 |    2687.6 | -157.635 | ms   |
| 90th percentile service time  | query-agg-date_histogram |  2882.45 |   2739.02 | -143.427 | ms   |
| 100th percentile service time | query-agg-date_histogram |  2948.81 |   2789.84 | -158.968 | ms   |

Do you think you could give it a go yourself and see if it reproduces for you?

Running it should be fairly easy: just follow the instructions in the README file of the custom Rally track.
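For anyone wanting to reproduce this, a typical benchmark-only invocation against an already-running cluster looks roughly like the following (the track path is a placeholder and exact flags may vary with the Rally version; the track's README is authoritative):

# Run the custom track against an existing cluster instead of letting Rally provision one
esrally --track-path=/path/to/custom-track --pipeline=benchmark-only --target-hosts=localhost:9200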

Looking forward to your feedback.

[1]:

sudo /sbin/fstrim --all                              # trim unused blocks on the SSDs
sync                                                 # flush dirty pages before dropping caches
sleep 3
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"       # drop page cache, dentries and inodes
sudo sh -c "echo 1 > /proc/sys/vm/compact_memory"    # compact memory to reduce fragmentation

@brenuart
Author

I'll have a look this evening. Stay tuned.

@brenuart
Author

brenuart commented Sep 11, 2019

I'm back with the results... and of course, I could NOT reproduce the issue with your rally challenge... Same conclusion as yours: 7.3 is faster!

I looked at the differences between your test and ours and noticed that our query for 7.3 is slightly different: it makes use of fixed_interval instead of interval. I missed that point earlier - sorry about that. To make it clear, the query for 7.3 looked as follows:

{
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1d",
        "min_doc_count": 1
      }
    }
  },
  "query": {
    "match_all": {}
  }
}

To make sure this had an impact, I quickly ran the following tests through Kibana's dev console, targeting two different test clusters running 6.8.2 and 7.3.0 respectively. The timings were:

query type           6.8.2    7.3.0
-------------------------------------
interval               3873    4030
fixed_interval          n/a   17845
calendar_interval       n/a    4089

The important thing here is not the difference between 6.8 and 7.3, as I executed the queries only a couple of times manually and took only the best result.
However, using fixed_interval instead of interval (which is now deprecated) makes a huge difference!
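For anyone who wants to repeat the comparison outside the Kibana console, a rough curl-based sketch of the same manual test (placeholder host and index name; the server-side "took" value in the response is what the table above reports, and on 6.8 only interval is accepted):

# Run the same 1d histogram with each interval flavour and print the "took" time
for param in interval fixed_interval calendar_interval; do
  echo -n "$param: "
  curl -s -XPOST 'http://localhost:9200/test-index/_search' -H 'Content-Type: application/json' -d "
  {
    \"size\": 0,
    \"aggs\": {
      \"2\": {
        \"date_histogram\": {
          \"field\": \"@timestamp\",
          \"$param\": \"1d\",
          \"min_doc_count\": 1
        }
      }
    }
  }" | grep -o '"took":[0-9]*'
done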

I made the test with the 4 datasets (gaussian and uniform distributions). I got the same ranking as in my forum post: gaussian is faster, followed by uniform-sameday, 1s and 10s.

However, I no longer believe that the date distribution in the dataset is the issue (although it obviously has some impact on performance). fixed_interval is definitely the thing to look at...

I'm really sorry I missed that initially. I pushed you in the wrong direction with an incomplete and inaccurate report. Sorry about that...

For your information, Kibana uses a fixed_interval in its aggregation query when it builds the histogram... That's how we noticed the slowdown initially...

@polyfractal
Contributor

Hm, interesting: the fixed/calendar interval change should theoretically have been equivalent to the now-deprecated interval (it uses the same parsing as before, just more explicitly). I wonder if we're doing extra/unnecessary parsing somewhere that's leading to a slowdown.

One question though, @brenuart: the interval "3h" shouldn't be valid for calendar intervals. E.g. calendar_interval: "3h" should return this error:

{
  "error": {
    "root_cause": [
      {
        "type": "x_content_parse_exception",
        "reason": "[7:30] [date_histogram] failed to parse field [calendar_interval]"
      }
    ],
    "type": "x_content_parse_exception",
    "reason": "[7:30] [date_histogram] failed to parse field [calendar_interval]",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "The supplied interval [3h] could not be parsed as a calendar interval."
    }
  },
  "status": 400
}

When you did the fixed vs calendar test, did you use an interval like 1h instead? Just want to make sure we're looking at the same thing you are :)

@brenuart
Author

Oh god - sorry once again. Wrong copy/paste in the example query above.
I used 1d for interval - just like your own tests.

PS: I updated the query above to avoid confusion.

@vboulaye
Contributor

I made some tests on my side with the various types of interval.

on the 7.3:

for a 1d interval:

  • interval (calendar_interval according to the doc) => 900ms
  • calendar_interval => 900ms
  • fixed_interval => 4000ms

for a 3h interval:

  • interval (fixed_interval according to the doc) => 4600ms
  • calendar_interval => error
  • fixed_interval => 4600ms

for a 24h interval:

  • interval (fixed_interval according to the doc) => 4600ms
  • calendar_interval => error
  • fixed_interval => 4600ms

on ES6.8
for a 1d interval:

  • interval (calendar_interval according to the doc) => 900ms

for a 3h interval:

  • interval (fixed_interval according to the doc) => 900ms

for a 24h interval:

  • interval (fixed_interval according to the doc) => 900ms

So there really seems to be a difference in the way the date histogram handles fixed_interval in 7.3.

@jimczi
Contributor

jimczi commented Sep 12, 2019

We have an optimization for calendar_interval when the timezone is UTC. It was added because we spotted this slowdown in our benchmarks:
https://github.com/elastic/elasticsearch/blob/7.x/server/src/main/java/org/elasticsearch/common/Rounding.java#L283
#38221

However, we don't benchmark fixed_interval, and it suffers from the same issue. We don't special-case UTC when using a fixed interval, whereas Joda uses a fast path that doesn't require any conversion. @spinscale, could we just change the fixed interval rounding to do:

        @Override
        public long round(final long utcMillis) {
            if (isUtc) {
                // fast path: rounding in UTC requires no timezone conversion
                return roundKey(utcMillis, interval) * interval;
            }
            // ... otherwise fall back to the existing java.time-based rounding

?
Regarding benchmarks, I can reproduce the slowdown in micro-benchmarks using the Java Time Rounding and the Joda Rounding with random dates and any interval. On my machine the Joda rounding is 15-20 times faster than the Java time rounding when using UTC.

@brenuart
Author

Glad you could reproduce the issue!
Please note that Kibana, in our setup at least, builds the aggregation with the timezone set to the one used by the browser, which is not UTC in our case. Does this mean that the fix you are talking about won't have any positive effect for us?

@jimczi
Contributor

jimczi commented Sep 12, 2019

Does it mean that the fix you are talking about won't have any positive effect for us?

Yes, my proposal is too simplistic; we should be able to handle all fixed timezones (UTC, +1, -3, ...).
We introduced a change in 6.x that allows rewriting a non-fixed timezone (Europe/Berlin) into a fixed timezone (+1, +2, ...) if all documents in a shard fall in the same offset (#30534), but this optimization doesn't work well with the current implementation of the Java time rounding. I think that explains why you see different behavior depending on the distribution of dates in the test shard. It shouldn't be too hard to restore the 6.x behavior, though. Instead of handling only UTC, we could check that the time zone is fixed and apply something like:

// Precompute the fixed offset in milliseconds, or -1 if the zone is not a fixed offset
long fixedOffset = timeZone.getRules().isFixedOffset()
    ? timeZone.getRules().getOffset(Instant.EPOCH).getTotalSeconds() * 1000L
    : -1;

        @Override
        public long round(final long utcMillis) {
            if (fixedOffset != -1) {
                // shift to local time, round on the interval, then shift back to UTC
                long localMillis = utcMillis + fixedOffset;
                return (roundKey(localMillis, interval) * interval) - fixedOffset;
            }

This should handle the case where a non-fixed timezone is rewritten into a fixed one?

@jimczi
Contributor

jimczi commented Sep 12, 2019

I also wonder if we should apply the same logic to the calendar interval, since only the UTC timezone is optimized there, so the same slowdown would happen with fixed time zones.

@spinscale
Contributor

@jimczi after a quick look through the code that sounds good to me

csoulios added a commit to csoulios/elasticsearch that referenced this issue Sep 12, 2019
csoulios added a commit that referenced this issue Sep 16, 2019
Fixes #45702 with date_histogram aggregation when using fixed_interval.
Optimization has been implemented for both fixed and calendar intervals
@csoulios
Contributor

@brenuart and @vboulaye thank you for reporting this issue. It has been fixed in our master branch and will be backported to v7.4.1 and v7.5.0 in the following days.

@vboulaye
Contributor

Thanks for the fix.
We will test it as soon as the patched version is released!

@brhardwick

Is there any way to work around the issue temporarily while we wait for the 7.4.1 release? We are looking to do an upgrade this weekend, and this bug makes the Kibana Discover tab almost unusable for long date ranges. Would love any suggestions here.

@csoulios
Contributor

csoulios commented Sep 18, 2019

@brhardwick, try using calendar_interval instead of fixed_interval in your query. Keep in mind that the results will not be exactly the same, as there are differences between calendar_interval and fixed_interval.

You can find more information here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
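As a concrete (hedged) example of the workaround, a daily histogram would switch from fixed_interval to calendar_interval as below. Note that calendar_interval only accepts single calendar units (1m, 1h, 1d, 1w, 1M, 1q, 1y), so intervals like 3h cannot be expressed this way, and bucket boundaries can differ from the fixed variant when a time_zone is set (host and index name are placeholders):

# Slow on 7.3.0: fixed interval
curl -XPOST 'http://localhost:9200/my-index/_search' -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1d",
        "min_doc_count": 1
      }
    }
  }
}'

# Temporary workaround until 7.4.1: same request with
#   "calendar_interval": "1d"
# in place of
#   "fixed_interval": "1d"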

csoulios added commits to csoulios/elasticsearch and to this repository that referenced this issue Sep 18, 2019 (backports of #46670):

Fixes #45702 with date_histogram aggregation when using fixed_interval.
Optimization has been implemented for both fixed and calendar intervals
@vboulaye
Contributor

vboulaye commented Oct 7, 2019

We deployed 7.4 and the problem is fixed: the query now takes the same time with fixed_interval as with calendar_interval.
Thanks again!

@brenuart
Author

brenuart commented Oct 7, 2019

Same here. Thx a lot for the quick response.
