# Add high cardinality requirements doc #7175

Status: Closed

## Overview

This document describes the "high cardinality" problem as it relates to the in-memory index. It explains the current problems and the requirements a solution must satisfy.

## Problem Statement

The database maintains an in-memory, inverted index that maps measurements, tags, and tag values to series. This was initially implemented as part of previous storage engines to facilitate faster query planning and meta queries.

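To make the structure concrete, here is a minimal sketch of such an inverted index. This is purely illustrative Python (the class and method names are hypothetical, and the real engine's layout differs):

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: measurement -> series keys, and
    (measurement, tag key, tag value) -> series keys."""

    def __init__(self):
        self.measurement_to_series = defaultdict(set)
        # Keyed by (measurement, tag_key, tag_value).
        self.tag_to_series = defaultdict(set)

    def add_series(self, measurement, tags, series_key):
        self.measurement_to_series[measurement].add(series_key)
        for key, value in tags.items():
            self.tag_to_series[(measurement, key, value)].add(series_key)

    def series_by_measurement(self, measurement):
        return self.measurement_to_series[measurement]

    def series_by_tag(self, measurement, key, value):
        return self.tag_to_series[(measurement, key, value)]

idx = InvertedIndex()
idx.add_series("cpu", {"host": "server-01"}, "cpu,host=server-01")
idx.add_series("cpu", {"host": "server-02"}, "cpu,host=server-02")
print(idx.series_by_tag("cpu", "host", "server-01"))  # {'cpu,host=server-01'}
```

Every measurement name, tag key, tag value, and series key is held in memory, which is why cardinality translates directly into RAM usage.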
The in-memory aspect of the index presents problems when series cardinality is large. This can happen when writing to many distinct measurements or tags, as well as when using tag keys and values that are large in size. For example, writing a Docker container ID as a tag value creates high cardinality, and the large tag values consume large amounts of memory. Storing a counter as a tag can create many distinct series, causing memory problems over time.

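The problem is multiplicative: total series cardinality is roughly the product of the distinct values per tag key. A back-of-the-envelope illustration, using hypothetical deployment numbers:

```python
# Hypothetical deployment: each tag key multiplies series cardinality.
hosts = 1_000          # distinct "host" tag values
container_ids = 10_000 # e.g. Docker container IDs used as a tag value
regions = 10           # distinct "region" tag values

# Worst case: every combination of tag values appears as its own series.
series_cardinality = hosts * container_ids * regions
print(series_cardinality)  # 100,000,000 series for a single measurement
```

Even at a modest per-series memory cost, a count like this exhausts RAM on typical hardware.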
Another issue that arises with high cardinality is slower startup times. Startup times increase because the index needs to be reloaded from TSM and WAL files. Each series key must be re-parsed from the TSM file to reconstruct the measurement and tag-to-value index.

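A simplified sketch of what that re-parsing involves, assuming a line-protocol-style series key of the form `measurement,key=value,...` (real keys also handle character escaping, which this sketch ignores):

```python
def parse_series_key(series_key):
    """Split a series key into (measurement, tags). Ignores escaping for brevity."""
    parts = series_key.split(",")
    measurement = parts[0]
    tags = dict(p.split("=", 1) for p in parts[1:])
    return measurement, tags

measurement, tags = parse_series_key("cpu,host=server-01,location=us-east1")
print(measurement, tags)  # cpu {'host': 'server-01', 'location': 'us-east1'}
```

Doing this once per series is cheap; doing it a billion times at startup is where the one-minute startup budget below becomes hard to meet.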
## Requirements

### Functional

1. The index must support finding all series by measurement
2. The index must support finding all series by tag key and value
3. The index must support retrieving all tag keys
4. The index must support retrieving all tag keys by measurement
5. The index must support retrieving all tag values by tag key
6. The index must support retrieving all tag values by tag key and measurement
7. The index must support the removal of measurements
8. The index must support the removal of series
9. The index must support the removal of tag keys and values
10. The index must support finding all series by regex
11. Updating the index for new series must not cause writes to block
12. Queries and writes must not block each other
13. The index must support point-in-time snapshots using the current shard backup and snapshotting mechanism (hard links)

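The lookup and removal requirements above can be summarized as an interface. The sketch below is illustrative Python; the method names are hypothetical and not the actual engine API:

```python
from abc import ABC, abstractmethod

class SeriesIndex(ABC):
    """Hypothetical interface covering the functional requirements above."""

    @abstractmethod
    def series_by_measurement(self, measurement): ...   # requirement 1

    @abstractmethod
    def series_by_tag(self, key, value): ...            # requirement 2

    @abstractmethod
    def tag_keys(self, measurement=None): ...           # requirements 3, 4

    @abstractmethod
    def tag_values(self, key, measurement=None): ...    # requirements 5, 6

    @abstractmethod
    def drop_measurement(self, measurement): ...        # requirement 7

    @abstractmethod
    def drop_series(self, series_keys): ...             # requirement 8

    @abstractmethod
    def series_by_regex(self, pattern): ...             # requirement 10
```

Requirements 11-13 (non-blocking updates, query/write concurrency, hard-link snapshots) constrain the implementation rather than the interface, so they do not appear as methods here.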
### Performance

1. The index must be able to support 1B+ series without exhausting RAM
2. Startup time must not exceed 1 minute
3. Query planning must take less than 10 ms
4. The index should not significantly increase the total storage size of a shard

### Reliability

1. The index should be able to be re-created from existing TSM and WAL files

### Backward Compatibility

1. The server must be able to operate with shards that do not have an index

## Use Cases

The following use cases should be applied to any proposed design to evaluate its feasibility. For each use case, evaluate the design against the requirements above to better understand its trade-offs.

1. `SHOW MEASUREMENTS`
2. `SHOW TAG KEYS`
3. `SHOW TAG VALUES WITH KEY = foo`
4. `SELECT * FROM cpu`
5. `SELECT * FROM cpu WHERE host = 'server-01'`
6. `SELECT count(value) FROM cpu WHERE host = 'server-01' AND location = 'us-east1'`
7. `SELECT count(value) FROM cpu WHERE host = 'server-01' AND location = 'us-east1' GROUP BY host`
8. `DROP MEASUREMENT cpu`
9. `DROP SERIES cpu WHERE time > now() - 1h`
10. `DROP SERIES cpu WHERE host = 'server-01'`

For each use case, the proposed design should be evaluated to understand how the index will be queried to return results for a single shard and across shards. What are the performance characteristics? Is the operation natively supported by the index, or does it require post-processing?
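As an example of walking one use case through the index, `SELECT count(value) FROM cpu WHERE host = 'server-01' AND location = 'us-east1'` reduces to two tag lookups and a set intersection. In this illustrative sketch, a plain dict stands in for the index's tag-to-series lookups:

```python
# Hypothetical index state for measurement "cpu":
# (tag key, tag value) -> set of matching series keys.
tag_to_series = {
    ("host", "server-01"): {"cpu,host=server-01,location=us-east1",
                            "cpu,host=server-01,location=us-west1"},
    ("location", "us-east1"): {"cpu,host=server-01,location=us-east1",
                               "cpu,host=server-02,location=us-east1"},
}

# AND in the WHERE clause becomes an intersection of the matching series sets.
matching = tag_to_series[("host", "server-01")] & tag_to_series[("location", "us-east1")]
print(matching)  # {'cpu,host=server-01,location=us-east1'}
```

Only after the index produces this series set does the query engine read the actual points to compute `count(value)`.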
---

**Review comment:**

Can we add some other queries here, which have suffered from performance problems in the past? The execution time for the two queries below, for example, should be fast, and get slower gracefully with the amount of data we store. Whether we can achieve `O(1)` or something worse like `~O(n log n)` will probably depend on the TSI implementation.

**Reply:**

These queries are similar to the ones already listed in how they would interact with the index. I was trying to come up with scenarios where the index would be used and stressed in different ways.

For example, `SELECT first(value) FROM cpu` needs to access the index in essentially the same way as `DROP MEASUREMENT cpu`, in that the index would need to be queried to determine all series keys for `cpu` and then process those series. The `first`, `last`, vs `ORDER BY desc` question is more of a query engine thing than an index issue, because all four would hit the index to return all series for `cpu` and then let the query engine figure out the `first`, `last`, etc.

Looking at the current ones, I think a regex scenario is missing and should be added. Also different boolean logic for tags, as opposed to just `AND`, which would stress how we merge series sets in the index.

**Reply:**

Without meaning to jump too far into implementation details in the requirements doc, would it not be possible to maintain the first/last value for `value` within the index, so we don't need to scan any series keys at all? I guess it's hard to go down that path and still provide a drop-in replacement for the current index.
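The series-set merging mentioned above generalizes beyond `AND`: under set semantics, `AND` between tag predicates is an intersection of the per-predicate series sets, while `OR` is a union. An illustrative sketch with hypothetical series keys:

```python
# Series sets produced by two hypothetical tag predicates on "cpu".
series_a = {"cpu,host=server-01", "cpu,host=server-01,location=us-east1"}
series_b = {"cpu,host=server-02", "cpu,host=server-01,location=us-east1"}

# AND between predicates intersects the series sets; OR unions them.
and_result = series_a & series_b
or_result = series_a | series_b
print(len(and_result), len(or_result))  # 1 3
```

How efficiently an index can intersect and union very large series sets (e.g. via sorted lists or compressed bitmaps) is exactly the kind of stress these extra scenarios would exercise.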