Slow date_histogram after upgrading to 7.3.0 #45702

Closed

brenuart opened this issue Aug 19, 2019 · 28 comments · Fixed by #46670

@brenuart

Elasticsearch version (bin/elasticsearch --version):
Version: 7.3.0, Build: default/rpm/de777fa/2019-07-24T18:30:11.767338Z, JVM: 12.0.1

Plugins installed: repository-s3

JVM version (java -version):
JVM shipped with Elasticsearch (rpm)

OS version (uname -a if on a Unix-like system):
Linux es-hot-03.aegaeon-it.com 4.9.120-xxxx-std-ipv6-64 #327490 SMP Thu Aug 16 10:11:35 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
We noticed a severe degradation of date_histogram performance after upgrading from Elasticsearch 6.8.2 to the latest 7.3.0.

The table below shows the execution time of the same query on the same cluster before and after the upgrade. The first execution happens after caches are cleared as follows:

  • ES caches are cleared with POST _cache/clear
  • Filesystem caches are invalidated by doing a cat of all files found in /var/lib/elasticsearch to /dev/null (that's the best approach we found; a minimal sketch of both steps follows this list)
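For reference, a rough sketch of both steps (assuming a single node reachable on localhost:9200 and the default RPM data path):

# Clear Elasticsearch's own query/request/fielddata caches
curl -XPOST 'http://localhost:9200/_cache/clear'

# "Invalidate" the filesystem cache the way described above:
# read every data file once, discarding the output
find /var/lib/elasticsearch -type f -exec cat {} + > /dev/null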

Timings are as follows:

ES Version   Timings (ms)
------------------------------------
6.8.2        10388,  5797,  7319
7.3.0        21553, 15138, 14833

As you can see, 7.3.0 is at least twice as slow as 6.8.2.

The dataset consists of about 650M documents spread evenly across 15 indices. Each index has 3 shards and no replicas. The cluster is made of three nodes, each with 7GB RAM (of which 4GB is allocated to the heap) and 2 vCPUs at 3.1GHz. There is no activity on the cluster besides this test query.

The query targets only half of the documents (using a date range filter) and builds a date_histogram with 3h buckets. This query is essentially what Kibana's Discover panel issues...

{
  "size": 500,
  "sort": [
    {
      "@timestamp": {
        "order": "desc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "3h",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      }
    }
  },
  "docvalue_fields": [
    {
      "field": "@timestamp",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "2019-01-09T00:00:00.000Z",
              "lte": "2019-01-15T00:00:00.000Z"
            }
          }
        }
      ]
    }
  }
}

We are very surprised by this drop in performance...
Did we forget to change some configuration parameters when doing the upgrade, or is it a regression in ES itself?

@dliappis
Contributor

Thanks very much for your interest in Elasticsearch.

This appears to be a user question, and we'd like to direct these kinds of things to the forums. If you can stop by there, we'd appreciate it. This allows us to use GitHub for verified bug reports, feature requests, and pull requests.

There's an active community in the forums that should be able to help get an answer to your question. As such, I hope you don't mind that I close this.

Finally, note that we track performance across releases at https://elasticsearch-benchmarks.elastic.co, and you can see results between 6.8 and 7.3 for date_histogram using the nyc_taxis track at https://elasticsearch-benchmarks.elastic.co/#tracks/nyc-taxis/release.

@brenuart
Author

I'm linking here to the forum post I created: https://discuss.elastic.co/t/slow-date-histogram-after-upgrading-to-7-3-0-on-dense-indexes/196475.

Please note that we could not reproduce the issue with the nyc_taxis dataset, only with a custom one. The reason (as described in the post) is that the dates in the nyc_taxis dataset are not close enough to each other to trigger the issue.

@spinscale
Contributor

reopening, as a performance issue between 6.8 and 7.3 seems to exist and needs to be evaluated.

Thanks for opening @brenuart!

@spinscale spinscale reopened this Aug 26, 2019
@dliappis dliappis self-assigned this Aug 26, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo

@dliappis dliappis assigned csoulios and unassigned dliappis Aug 28, 2019
@vboulaye
Contributor

Hi,

We are in the process of upgrading our ELK stack from 6.8 to 7.3 and we see exactly the same behaviour.

In our logs the @timestamps are very close to each other, and we see a huge degradation of the response time of Kibana's default query.
It is 4 to 5 times slower to run the default query (>30s for the last 7 days).

I do not have a nice dataset like the one @brenuart created to reproduce the issue, but I ran some test queries that show the problem (839ms in ES 6.8, 4683ms in ES 7.3):

In ES 6.8

GET log-appli-2019.09.06/_search
{ 
  "size": 1,
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "3h",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      }
    }
  }, 
  "timeout": "30000ms"
}
=>
{
  "took" : 839,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 53079313,
    "max_score" : 1.0,
    "hits" : [
...

In ES 7.3

GET /log-appli-2019.09.06/_search
{
  "size": 1,
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "3h",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      }
    }
  },
  "timeout": "30000ms"
}


=>

{
  "took" : 4683,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.0,
    "hits" : [
...

@polyfractal
Contributor

polyfractal commented Sep 10, 2019

Thanks for the extra details @vboulaye! We're actively looking into this issue right now; we'll update as we diagnose.

Out of curiosity, would you mind trying those queries again without the time_zone parameter? We've seen performance issues with time zone conversion in the past (#28727). I don't think that's the case here but it would be good to rule out.
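For reference, the retest amounts to dropping time_zone from the request above, along these lines (a sketch only; the host is assumed and the index name is taken from the earlier example):

curl -XPOST 'http://localhost:9200/log-appli-2019.09.06/_search' -H 'Content-Type: application/json' -d '
{
  "size": 1,
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "3h",
        "min_doc_count": 1
      }
    }
  },
  "timeout": "30000ms"
}'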

@brenuart
Author

brenuart commented Sep 10, 2019

@polyfractal We did the test with and without a timezone on both versions. 7.3 is indeed a few seconds slower when a timezone is set.
But if you look at my report posted in the forum (https://discuss.elastic.co/t/slow-date-histogram-after-upgrading-to-7-3-0-on-dense-indexes/196475), the queries we used for testing don't have this parameter set. This was on purpose, to reduce the number of variables in the test case...
The timezone is apparently not the (principal) culprit here.

@polyfractal
Contributor

Noted, thanks for the clarification!

@vboulaye
Contributor

Hi,
I just ran my test query without the time_zone parameter and it is a bit faster, but still above 4s.

@vboulaye
Contributor

Do you think it is worth trying with version 7.2 or is this more likely to be related to the major changes in v7?

@dliappis
Contributor

dliappis commented Sep 11, 2019

@brenuart

We've attempted to reproduce the issue without success.

We used the data sets you provided, and @csoulios kindly created a Rally track to help us benchmark the queries. We used a bare-metal environment (a separate machine for the load driver and one dedicated node for Elasticsearch). Elasticsearch runs with a 1G heap on Java 12, and between every experiment we cleaned up caches and slab objects, compacted memory, and trimmed the SSD disks as shown in [1].

In this gist I have the comparisons for the gaussian-sameday data between ES 7.3.1 (contender) and 6.8.2 (baseline); in particular, for the date histogram query we see:

| Metric                        | Task                     | Baseline | Contender |     Diff | Unit |
|-------------------------------+--------------------------+----------+-----------+----------+------|
| 50th percentile latency       | query-agg-date_histogram |  2788.58 |   2690.07 | -98.5142 | ms   |
| 90th percentile latency       | query-agg-date_histogram |  2809.82 |   2741.53 | -68.2962 | ms   |
| 100th percentile latency      | query-agg-date_histogram |  2884.77 |   2792.32 | -92.4499 | ms   |
| 50th percentile service time  | query-agg-date_histogram |  2786.19 |    2687.6 | -98.5906 | ms   |
| 90th percentile service time  | query-agg-date_histogram |  2807.42 |   2739.02 | -68.3978 | ms   |
| 100th percentile service time | query-agg-date_histogram |  2882.37 |   2789.84 | -92.5238 | ms   |

The results from another identical experiment are here, and again we see comparable performance between 6.8 and 7.3; for example, the 50th percentile latency is 2847ms on 6.8 vs 2690ms on 7.3 (so 7.3 was slightly faster):

| Metric                        | Task                     | Baseline | Contender |     Diff | Unit |
|-------------------------------+--------------------------+----------+-----------+----------+------|
| 50th percentile latency       | query-agg-date_histogram |  2847.59 |   2690.07 | -157.524 | ms   |
| 90th percentile latency       | query-agg-date_histogram |  2884.79 |   2741.53 | -143.257 | ms   |
| 100th percentile latency      | query-agg-date_histogram |  2951.17 |   2792.32 | -158.846 | ms   |
| 50th percentile service time  | query-agg-date_histogram |  2845.24 |    2687.6 | -157.635 | ms   |
| 90th percentile service time  | query-agg-date_histogram |  2882.45 |   2739.02 | -143.427 | ms   |
| 100th percentile service time | query-agg-date_histogram |  2948.81 |   2789.84 | -158.968 | ms   |

Do you think you could give it a go yourself and see if it reproduces for you?

Running it should be fairly easy: just follow the instructions in the README file of the custom Rally track.
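For anyone wanting to reproduce this, a typical benchmark-only invocation against an already-running cluster looks roughly like the following (the track path is a placeholder and exact flags may vary with the Rally version; the track's README is authoritative):

# Run the custom track against an existing cluster instead of letting Rally provision one
esrally --track-path=/path/to/custom-track --pipeline=benchmark-only --target-hosts=localhost:9200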

Looking forward to your feedback.

[1]:

sudo /sbin/fstrim --all                              # trim unused blocks on the SSDs
sync                                                 # flush dirty pages before dropping caches
sleep 3
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"       # drop page cache, dentries and inodes
sudo sh -c "echo 1 > /proc/sys/vm/compact_memory"    # compact memory to reduce fragmentation

@brenuart
Author

I'll have a look this evening. Stay tuned.

@brenuart
Author

brenuart commented Sep 11, 2019

I'm back with the results... and of course, I could NOT reproduce the issue with your rally challenge... Same conclusion as yours: 7.3 is faster!

I looked at the differences between your test and ours and noticed that our query for 7.3 is slightly different: it makes use of fixed_interval instead of interval. I missed that point earlier - sorry about that. To make it clear, the query for 7.3 looked as follows:

{
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1d",
        "min_doc_count": 1
      }
    }
  },
  "query": {
    "match_all": {}
  }
}

To make sure this had an impact, I quickly ran the following tests through Kibana's dev console, targeting two different test clusters running 6.8.2 and 7.3.0 respectively. The timings were:

query type           6.8.2    7.3.0
-------------------------------------
interval               3873    4030
fixed_interval          n/a   17845
calendar_interval       n/a    4089

The important thing here is not the difference between 6.8 and 7.3, as I executed the queries only a couple of times manually and took only the best result.
However, using fixed_interval instead of interval (which is now deprecated) makes a huge difference!
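For anyone who wants to repeat the comparison outside the Kibana console, a rough curl-based sketch of the same manual test (placeholder host and index name; the server-side "took" value in the response is what the table above reports, and on 6.8 only interval is accepted):

# Run the same 1d histogram with each interval flavour and print the "took" time
for param in interval fixed_interval calendar_interval; do
  echo -n "$param: "
  curl -s -XPOST 'http://localhost:9200/test-index/_search' -H 'Content-Type: application/json' -d "
  {
    \"size\": 0,
    \"aggs\": {
      \"2\": {
        \"date_histogram\": {
          \"field\": \"@timestamp\",
          \"$param\": \"1d\",
          \"min_doc_count\": 1
        }
      }
    }
  }" | grep -o '"took":[0-9]*'
done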

I made the test with the 4 datasets (gaussian and uniform distributions). I got the same ranking as in my forum post: gaussian is faster, followed by uniform-sameday, 1s and 10s.

However, I no longer believe that the date distribution in the dataset is the issue (although it obviously has some impact on performance). fixed_interval is definitely the thing to look at...

I'm really sorry I missed that initially. I pushed you in the wrong direction with an incomplete and inaccurate report. Sorry about that...

For your information, Kibana uses a fixed_interval in its aggregation query when it builds the histogram... That's how we noticed the slowdown initially...

@polyfractal
Contributor

Hm, interesting: the fixed/calendar interval change should theoretically have been equivalent to the now-deprecated interval (it uses the same parsing as before, just more explicitly). I wonder if we're doing extra/unnecessary parsing somewhere that's leading to a slowdown.

One question though, @brenuart: the interval "3h" shouldn't be valid for calendar intervals. E.g. calendar_interval: "3h" should return this error:

{
  "error": {
    "root_cause": [
      {
        "type": "x_content_parse_exception",
        "reason": "[7:30] [date_histogram] failed to parse field [calendar_interval]"
      }
    ],
    "type": "x_content_parse_exception",
    "reason": "[7:30] [date_histogram] failed to parse field [calendar_interval]",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "The supplied interval [3h] could not be parsed as a calendar interval."
    }
  },
  "status": 400
}

When you did the fixed vs calendar test, did you use an interval like 1h instead? Just want to make sure we're looking at the same thing you are :)

@brenuart
Author

Oh god - sorry once again. Wrong copy/paste in the example query above.
I used 1d for interval - just like your own tests.

PS: I updated the query above to avoid confusion.

@vboulaye
Contributor

I made some tests on my side with the various types of interval.

on the 7.3:

for a 1d interval:

  • interval (calendar_interval according to the doc) => 900ms
  • calendar_interval => 900ms
  • fixed_interval => 4000ms

for a 3h interval:

  • interval (fixed_interval according to the doc) => 4600ms
  • calendar_interval => error
  • fixed_interval => 4600ms

for a 24h interval:

  • interval (fixed_interval according to the doc) => 4600ms
  • calendar_interval => error
  • fixed_interval => 4600ms

on ES6.8
for a 1d interval:

  • interval (calendar_interval according to the doc) => 900ms

for a 3h interval:

  • interval (fixed_interval according to the doc) => 900ms

for a 24h interval:

  • interval (fixed_interval according to the doc) => 900ms

So there really seems to be a difference in the way the date histogram handles fixed_interval in 7.3.

@jimczi
Contributor

jimczi commented Sep 12, 2019

We have an optimization for calendar_interval when the timezone is UTC. It was added because we spotted this slowdown in our benchmarks:
https://github.com/elastic/elasticsearch/blob/7.x/server/src/main/java/org/elasticsearch/common/Rounding.java#L283
#38221

However, we don't benchmark fixed_interval, and it suffers from the same issue. We don't special-case UTC when using a fixed interval, whereas Joda uses a fast path that doesn't require any conversion. @spinscale, could we just change the fixed interval rounding to do:

        @Override
        public long round(final long utcMillis) {
            if (isUtc) {
                // fast path: rounding in UTC requires no timezone conversion
                return roundKey(utcMillis, interval) * interval;
            }
            // ... otherwise fall back to the existing java.time-based rounding

?
Regarding benchmarks, I can reproduce the slowdown in micro-benchmarks using the Java Time Rounding and the Joda Rounding with random dates and any interval. On my machine the Joda rounding is 15-20 times faster than the Java time rounding when using UTC.

@brenuart
Author

Glad you could reproduce the issue!
Please note that Kibana, in our setup at least, builds the aggregation with the timezone set to the one used by the browser, which is not UTC in our case. Does this mean that the fix you are talking about won't have any positive effect for us?

@jimczi
Contributor

jimczi commented Sep 12, 2019

Does it mean that the fix you are talking about won't have any positive effect for us?

Yes, my proposal is too simplistic; we should be able to handle all fixed timezones (UTC, +1, -3, ...).
We introduced a change in 6.x that allows rewriting a non-fixed timezone (Europe/Berlin) into a fixed timezone (+1, +2, ...) if all documents in a shard fall in the same offset (#30534), but this optimization doesn't work well with the current implementation of the Java time rounding. I think that explains why you see different behavior depending on the distribution of dates in the test shard. It shouldn't be too hard to restore the 6.x behavior, though. Instead of handling only UTC, we could check that the time zone is fixed and apply something like:

// Precompute the fixed offset in milliseconds, or -1 if the zone is not a fixed offset
long fixedOffset = timeZone.getRules().isFixedOffset()
    ? timeZone.getRules().getOffset(Instant.EPOCH).getTotalSeconds() * 1000L
    : -1;

        @Override
        public long round(final long utcMillis) {
            if (fixedOffset != -1) {
                // shift to local time, round on the interval, then shift back to UTC
                long localMillis = utcMillis + fixedOffset;
                return (roundKey(localMillis, interval) * interval) - fixedOffset;
            }

This should handle the case where a non-fixed timezone is rewritten into a fixed one?

@jimczi
Contributor

jimczi commented Sep 12, 2019

I also wonder if we should apply the same logic to the calendar interval, since only the UTC timezone is optimized there, so the same slowdown would happen with fixed time zones.

@spinscale
Contributor

@jimczi after a quick look through the code that sounds good to me

csoulios added a commit to csoulios/elasticsearch that referenced this issue Sep 12, 2019
csoulios added a commit that referenced this issue Sep 16, 2019
Fixes #45702 with date_histogram aggregation when using fixed_interval.
Optimization has been implemented for both fixed and calendar intervals
@csoulios
Contributor

@brenuart and @vboulaye thank you for reporting this issue. It has been fixed in our master branch and will be backported to v7.4.1 and v7.5.0 in the following days.

@vboulaye
Contributor

Thanks for the fix.
We will test it as soon as the patched version is released!

@brhardwick

Is there any way to work around the issue temporarily while we wait for the 7.4.1 release? We are looking to do an upgrade this weekend, and this bug makes the Kibana Discover tab almost unusable for long date ranges. Would love any suggestions here.

@csoulios
Contributor

csoulios commented Sep 18, 2019

@brhardwick, try using calendar_interval instead of fixed_interval in your query. Keep in mind that the results will not be exactly the same, as there are differences between calendar_interval and fixed_interval.

You can find more information here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
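As a concrete (hedged) example of the workaround, a daily histogram would switch from fixed_interval to calendar_interval as below. Note that calendar_interval only accepts single calendar units (1m, 1h, 1d, 1w, 1M, 1q, 1y), so intervals like 3h cannot be expressed this way, and bucket boundaries can differ from the fixed variant when a time_zone is set (host and index name are placeholders):

# Slow on 7.3.0: fixed interval
curl -XPOST 'http://localhost:9200/my-index/_search' -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1d",
        "min_doc_count": 1
      }
    }
  }
}'

# Temporary workaround until 7.4.1: same request with
#   "calendar_interval": "1d"
# in place of
#   "fixed_interval": "1d"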

csoulios added commits to csoulios/elasticsearch and to this repository that referenced this issue Sep 18, 2019 (backports of #46670):

Fixes #45702 with date_histogram aggregation when using fixed_interval.
Optimization has been implemented for both fixed and calendar intervals
@vboulaye
Contributor

vboulaye commented Oct 7, 2019

We deployed 7.4 and the problem is fixed: the query now takes the same time with fixed_interval as with calendar_interval.
Thanks again!

@brenuart
Author

brenuart commented Oct 7, 2019

Same here. Thx a lot for the quick response.
