
Aborting aggregation because memory limit was exceeded #3837

Closed
yangshike opened this issue Sep 15, 2023 · 18 comments · Fixed by quickwit-oss/tantivy#2183
Labels
bug Something isn't working enhancement New feature or request high-priority

Comments

@yangshike
Contributor

yangshike commented Sep 15, 2023

The Quickwit Grafana plugin looks great, but there are some issues with its use:

Through the Grafana plugin I searched only one hour of data, around a few million documents, but the log volume query reported an error:

Failed to load log volume for this query
Internal error: Aborting aggregation because memory limit was exceeded. Limit: 500.00 MB, Current: 1.36 PB.

This happens because I added a condition to the Lucene query. Without the condition it works fine.

Other questions:
1. Can the level field and its values in the data source configuration be customized, so that I can distinguish other kinds of data rather than just generic log levels?

2. Can I use Quickwit as a data source to create reports such as line and pie charts?

3. Could Grafana support aggregations such as count and group by,
as well as comparison operators: >, >=, <, != and so on?

@yangshike yangshike added the enhancement New feature or request label Sep 15, 2023
@PSeitz
Contributor

PSeitz commented Sep 15, 2023

Can you share the aggregation query that is sent to the backend?

What condition did you add?

@yangshike
Contributor Author

yangshike commented Sep 15, 2023

In addition to the general logs, we also have some other data without these fields (error, info).
Another thing: the current plugin has a Message field name setting that can only be configured once. Could it support configuring more than one?

@yangshike
Contributor Author

yangshike commented Sep 15, 2023

A certain index has been running for some time, and I want to change one field to fast: true. Do I need to delete the index and rebuild it?

I ask because I got this error and then wanted to change this field to fast: true:
Internal error: (Internal error: An invalid argument was passed: 'Field "name" is not configured as fast field'.

@PSeitz
Contributor

PSeitz commented Sep 15, 2023

A certain index has been running for some time, and I want to change one field to fast: true. Do I need to delete the index and rebuild it?

I ask because I got this error and then wanted to change this field to fast: true: Internal error: (Internal error: An invalid argument was passed: 'Field "name" is not configured as fast field'.

Yes, currently you need to re-index everything.

2. Can I use Quickwit as a data source to create reports such as line and pie charts?
Yes, that's possible with the aggregations API.

On the error Aborting aggregation because memory limit was exceeded. Limit: 500.00 MB, Current: 1.36 PB, can you check the logs to see what query is sent to the backend (or check what the UI is sending)?

@yangshike
Contributor Author

yangshike commented Sep 18, 2023

On the error (log volume):
This query is sent to the backend:
"POST /api/v1/_elastic/_msearch?max_concurrent_shard_requests=256"

Aborting aggregation because memory limit was exceeded. Limit: 500.00 MB, Current: 1.36 PB
This error only occurs when using Grafana; in the Quickwit UI it is fine!

@PSeitz
Contributor

PSeitz commented Sep 19, 2023

On the error (log volume): this query is sent to the backend: "POST /api/v1/_elastic/_msearch?max_concurrent_shard_requests=256"

Can you provide the payload of the POST request and your index configuration?

@yangshike
Contributor Author

[19/Sep/2023:14:35:39 +0800] "POST /api/v1/_elastic/_msearch?max_concurrent_shard_requests=256 HTTP/1.1" 200 302 "" "Go-http-client/1.1" "" {"ignore_unavailable":true,"index":"qiniu_crm","search_type":"query_then_fetch"}\n{"aggs":{"2":{"aggs":{"3":{"date_histogram":{"field":"msg_time","fixed_interval":"60000ms","min_doc_count":0,"extended_bounds":{"min":1695094538615,"max":1695105338615}}}},"terms":{"field":"level","size":100,"order":{"_count":"desc"},"min_doc_count":0}}},"query":{"bool":{"filter":{"range":{"msg_time":{"gte":"2023-09-19T03:35:38.615Z","lte":"2023-09-19T06:35:38.615Z"}}}}},"size":0}\n

@yangshike
Contributor Author

doc_mapping:
  field_mappings:
    - name: msg_time
      type: datetime
      input_formats:
        - unix_timestamp
      output_format: unix_timestamp_millis
      stored: true
      indexed: true
      fast: true
      precision: milliseconds
    - name: content
      type: text
      tokenizer: chinese_compatible
      record: position
      stored: true
      indexed: true
      fast: true
    - name: level
      type: text
      stored: true
      indexed: true
      fast: true
    - name: server_ip
      type: text
      stored: true
      indexed: true
      fast: true
      tokenizer: raw
    - name: service_name
      type: text
      stored: true
      indexed: true
      fast: true
      tokenizer: raw
    - name: host_name
      type: text
      stored: true
      indexed: true
      fast: true
      tokenizer: raw
    - name: time
      type: datetime
      input_formats:
        - rfc3339
        - "%Y-%m-%d %H:%M:%S.%f"
      output_format: "%Y-%m-%d %H:%M:%S.%f"
  tag_fields: ["service_name"]
  timestamp_field: msg_time

search_settings:
  default_search_fields: [content]

indexing_settings:
  commit_timeout_secs: 10

retention:
  period: 180 days
  schedule: daily

@yangshike
Contributor Author

yangshike commented Sep 19, 2023

Searching only 5 minutes of data also triggers this error!

[19/Sep/2023:14:39:26 +0800] "POST /api/v1/_elastic/_msearch?max_concurrent_shard_requests=256 HTTP/1.1" 200 303 "" "Go-http-client/1.1" "" {"ignore_unavailable":true,"index":"qiniu_crm","search_type":"query_then_fetch"}\n{"aggs":{"2":{"aggs":{"3":{"date_histogram":{"field":"msg_time","fixed_interval":"1000ms","min_doc_count":0,"extended_bounds":{"min":1695105266331,"max":1695105566331}}}},"terms":{"field":"level","size":100,"order":{"_count":"desc"},"min_doc_count":0}}},"query":{"bool":{"filter":{"range":{"msg_time":{"gte":"2023-09-19T06:34:26.331Z","lte":"2023-09-19T06:39:26.331Z"}}}}},"size":0}\n

Failed to load log volume for this query
Internal error: Aborting aggregation because memory limit was exceeded. Limit: 500.00 MB, Current: 81.36 PB.
It's an occasional error, not every time.

Sometimes I get this error:

Internal error: Aborting aggregation because bucket limit was exceeded. Limit: 65000, Current: 102714.

@fulmicoton
Contributor

Here is the second one, indented:

{
  "aggs": {
    "2": {
      "aggs": {
        "3": {
          "date_histogram": {
            "field": "msg_time",
            "fixed_interval": "1000ms",
            "min_doc_count": 0,
            "extended_bounds": {
              "min": 1695105266331,
              "max": 1695105566331
            }
          }
        }
      },
      "terms": {
        "field": "level",
        "size": 100,
        "order": {
          "_count": "desc"
        },
        "min_doc_count": 0
      }
    }
  },
  "query": {
    "bool": {
      "filter": {
        "range": {
          "msg_time": {
            "gte": "2023-09-19T06:34:26.331Z",
            "lte": "2023-09-19T06:39:26.331Z"
          }
        }
      }
    }
  },
  "size": 0
}

@fulmicoton fulmicoton changed the title about grafana plugins Aborting aggregation because memory limit was exceeded Sep 19, 2023
@fulmicoton
Contributor

That should be 300 buckets, in a terms subaggregation...
This sounds like a bug? @PSeitz

@PSeitz
Contributor

PSeitz commented Sep 19, 2023

Searching only 5 minutes of data also triggers this error!

[19/Sep/2023:14:39:26 +0800] "POST /api/v1/_elastic/_msearch?max_concurrent_shard_requests=256 HTTP/1.1" 200 303 "" "Go-http-client/1.1" "" {"ignore_unavailable":true,"index":"qiniu_crm","search_type":"query_then_fetch"}\n{"aggs":{"2":{"aggs":{"3":{"date_histogram":{"field":"msg_time","fixed_interval":"1000ms","min_doc_count":0,"extended_bounds":{"min":1695105266331,"max":1695105566331}}}},"terms":{"field":"level","size":100,"order":{"_count":"desc"},"min_doc_count":0}}},"query":{"bool":{"filter":{"range":{"msg_time":{"gte":"2023-09-19T06:34:26.331Z","lte":"2023-09-19T06:39:26.331Z"}}}}},"size":0}\n

Failed to load log volume for this query: Internal error: Aborting aggregation because memory limit was exceeded. Limit: 500.00 MB, Current: 81.36 PB. It's an occasional error, not every time.

Thanks, that's helpful. I couldn't reproduce it locally so far. Can you edit the JSON payload and resend the request, setting "min_doc_count": 1 in the date_histogram, and post what's returned?

If you get bucket limit was exceeded, you can set size: 3 in terms.

"date_histogram": {
	"field": "msg_time",
	"fixed_interval": "1000ms",
	"min_doc_count": 1,
	"extended_bounds": {
		"min": 1695105266331,
		"max": 1695105566331
	}
}
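For reference, both suggested edits can be applied to the payload programmatically before resending it (a sketch in Python; the JSON below is the aggregation body from the earlier comment, and the key names "2" and "3" are the panel IDs Grafana generated):

```python
import json

# The aggregation part of the _msearch body posted above.
body = json.loads(
    '{"aggs":{"2":{"aggs":{"3":{"date_histogram":{"field":"msg_time",'
    '"fixed_interval":"1000ms","min_doc_count":0,'
    '"extended_bounds":{"min":1695105266331,"max":1695105566331}}}},'
    '"terms":{"field":"level","size":100,"order":{"_count":"desc"},'
    '"min_doc_count":0}}},"size":0}'
)

# Skip empty histogram buckets, as suggested above.
body["aggs"]["2"]["aggs"]["3"]["date_histogram"]["min_doc_count"] = 1

# If "bucket limit was exceeded" shows up, shrink the terms aggregation too.
body["aggs"]["2"]["terms"]["size"] = 3

print(json.dumps(body, indent=2))
```

The printed JSON can then be pasted back into the _msearch request body.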

Sometimes I get this error:

Internal error: Aborting aggregation because bucket limit was exceeded. Limit: 65000, Current: 102714.

It seems like the intermediate result after merging has too many buckets; we probably need to prune before counting when converting to the final result. I created a separate issue here: quickwit-oss/tantivy#2182

@fulmicoton fulmicoton added bug Something isn't working high-priority labels Sep 19, 2023
@PSeitz
Contributor

PSeitz commented Sep 20, 2023

Sometimes I get this error:

Internal error: Aborting aggregation because bucket limit was exceeded. Limit: 65000, Current: 102714.

Do you get this for the same request you posted?

@yangshike
Contributor Author

Yes, same query. Both of these occur occasionally.

@yangshike
Contributor Author

yangshike commented Sep 20, 2023

These are Grafana's logs.

I reinstalled Grafana and captured the logs again. When the error occurred, the status of this request was 400:

[20/Sep/2023:15:19:39 +0800] "POST /api/ds/query?ds_type=quickwit-quickwit-datasource&requestId=explore_left_logs_volume_0 HTTP/1.1" 400 176 "http://test123.cn/explore?left=%7B%22datasource%22:%22eb289c78-48e7-453b-9458-faadd67f9157%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22quickwit-quickwit-datasource%22,%22uid%22:%22eb289c78-48e7-453b-9458-faadd67f9157%22%7D,%22query%22:%22%22,%22alias%22:%22%22,%22metrics%22:%5B%7B%22id%22:%223%22,%22type%22:%22logs%22,%22settings%22:%7B%22limit%22:%22100%22%7D%7D%5D,%22bucketAggs%22:%5B%5D,%22timeField%22:%22msg_time%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D&orgId=1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36" "" {"queries":[{"refId":"log-volume-A","query":"","metrics":[{"type":"count","id":"1"}],"timeField":"msg_time","bucketAggs":[{"id":"2","type":"terms","settings":{"min_doc_count":"0","size":"0","order":"desc","orderBy":"_count"},"field":"level"},{"id":"3","type":"date_histogram","settings":{"interval":"auto","min_doc_count":"0","trimEdges":"0"},"field":"msg_time"}],"datasource":{"type":"quickwit-quickwit-datasource","uid":"eb289c78-48e7-453b-9458-faadd67f9157"},"datasourceId":1,"intervalMs":60000,"maxDataPoints":1512}],"range":{"from":"2023-09-20T06:19:38.241Z","to":"2023-09-20T07:19:38.241Z","raw":{"from":"now-1h","to":"now"}},"from":"1695190778241","to":"1695194378241"}

@yangshike
Contributor Author

When no error is reported, the status is 200:

[20/Sep/2023:15:23:31 +0800] "POST /api/ds/query?ds_type=quickwit-quickwit-datasource&requestId=explore_left_logs_volume_0 HTTP/1.1" 200 137649 "http://test123.cn/explore?left=%7B%22datasource%22:%22eb289c78-48e7-453b-9458-faadd67f9157%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22quickwit-quickwit-datasource%22,%22uid%22:%22eb289c78-48e7-453b-9458-faadd67f9157%22%7D,%22query%22:%22%22,%22alias%22:%22%22,%22metrics%22:%5B%7B%22id%22:%223%22,%22type%22:%22logs%22,%22settings%22:%7B%22limit%22:%22100%22%7D%7D%5D,%22bucketAggs%22:%5B%5D,%22timeField%22:%22msg_time%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D&orgId=1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36" "" {"queries":[{"refId":"log-volume-A","query":"","metrics":[{"type":"count","id":"1"}],"timeField":"msg_time","bucketAggs":[{"id":"2","type":"terms","settings":{"min_doc_count":"0","size":"0","order":"desc","orderBy":"_count"},"field":"level"},{"id":"3","type":"date_histogram","settings":{"interval":"auto","min_doc_count":"0","trimEdges":"0"},"field":"msg_time"}],"datasource":{"type":"quickwit-quickwit-datasource","uid":"eb289c78-48e7-453b-9458-faadd67f9157"},"datasourceId":1,"intervalMs":60000,"maxDataPoints":1512}],"range":{"from":"2023-09-20T06:23:30.565Z","to":"2023-09-20T07:23:30.565Z","raw":{"from":"now-1h","to":"now"}},"from":"1695191010565","to":"1695194610565"}

@yangshike
Contributor Author

By the way, a feature I'd very much like: highlighting of matches in log search results.

@PSeitz
Contributor

PSeitz commented Sep 20, 2023

Thanks, I could reproduce it locally.

The issue is a missing normalization from request values (ms) to fast field values (ns) when converting an intermediate result to the final result. This results in a computation that is wrong by a factor of 1_000_000.
The Histogram normalizes values to nanoseconds, to make user input like extended_bounds (ms precision) compatible with the values from the fast field (ns precision for the date type).
This normalization happens only for date type fields, as other field types don't have precision settings. In the query above the column_type parameter was empty, therefore normalization did not happen.

One field can have multiple field types associated with it, therefore we delay setting the column_type until working on the tantivy segment level.
In the case of empty results, as in the terms aggregation with min_doc_count: 0, there may be no column type set.
The actual root cause is a missing propagation of the column_type when merging two intermediate results where one has no column_type and the other has one.
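The factor-of-1_000_000 blowup can be sketched numerically (a standalone illustration in Python, not tantivy code; the bounds and interval are the values from the payload above):

```python
# Illustration: why a missing ms -> ns normalization inflates the
# histogram's bucket estimate by a factor of 1_000_000.

MS_TO_NS = 1_000_000

def bucket_count(min_val, max_val, interval):
    """Number of histogram buckets covering [min_val, max_val]."""
    return (max_val - min_val) // interval + 1

# Request values from the query above (milliseconds).
min_ms, max_ms, interval_ms = 1695105266331, 1695105566331, 1000

# Correct: bounds, fast-field values, and interval all normalized to ns.
ok = bucket_count(min_ms * MS_TO_NS, max_ms * MS_TO_NS, interval_ms * MS_TO_NS)

# Buggy: fast-field values stay in ns while the interval stays in ms,
# so the range looks a million times wider than it is.
bad = bucket_count(min_ms * MS_TO_NS, max_ms * MS_TO_NS, interval_ms)

print(ok)   # 301 buckets for the 5-minute window, as expected
print(bad)  # 300_000_001 buckets, tripping the memory/bucket limits
```

This matches the observation above that the 5-minute query should need about 300 buckets per level term, yet occasionally aborts with absurd memory estimates.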

PSeitz added a commit to quickwit-oss/tantivy that referenced this issue Sep 21, 2023
Fixes a computation issue of the number of buckets needed in the
DateHistogram.

This is due to a missing normalization from request values (ms) to fast field
values (ns), when converting an intermediate result to the final result.
This results in a wrong computation by a factor 1_000_000.
The Histogram normalizes values to nanoseconds, to make the user input like
extended_bounds (ms precision) and the values from the fast field (ns precision for date type) compatible.
This normalization happens only for date type fields, as other field types don't have precision settings.
The normalization does not happen due to a missing `column_type`, which is not
correctly passed after merging an empty aggregation (which does not have a `column_type` set) with a regular aggregation.

Another related issue: an empty aggregation, which will not have
`column_type` set, will not convert the result to a human-readable format.

This PR fixes the issue by:
- Limit the allowed field types of DateHistogram to DateType
- Instead of passing the column_type, which is only available on the segment level, we flag the aggregation as `is_date_agg`.
- Fix the merge logic

Add a flag so that normalization runs only once. This is not an issue
currently, but it could easily become one.

closes quickwit-oss/quickwit#3837
PSeitz added a commit to quickwit-oss/tantivy that referenced this issue Sep 21, 2023
* Fix DateHistogram bucket gap

(same commit message as above)

* use older nightly for time crate (breaks build)