feat: split detected fields queries #12491

Merged
merged 17 commits into from
Apr 17, 2024

trevorwhitney
Collaborator

@trevorwhitney trevorwhitney commented Apr 5, 2024

What this PR does / why we need it:

This PR splits detected fields queries using the existing split by time logic used for log filter queries.

Which issue(s) this PR fixes:
Re #12339

@trevorwhitney trevorwhitney requested a review from a team as a code owner April 5, 2024 19:03
// this is an estimation, as the true cardinality could be greater
// than either of the seen values, but will never be less
if ok {
	curCard, newCard := f.Cardinality, field.Cardinality
Contributor

@cyriltovena cyriltovena Apr 11, 2024

If you merge multiple time ranges, why isn't the cardinality additive?

Collaborator Author

@trevorwhitney trevorwhitney Apr 11, 2024

Because you don't know whether you've seen different values in each time range. For example:

timerange 1 (cardinality 3)

flavor=sweet
flavor=sour
flavor=sweet
flavor=spicy

timerange 2 (cardinality 2)

flavor=sweet
flavor=sweet
flavor=bland

Since we're discarding the actual values, there's no way to be 100% accurate here. The safest bet is to take the highest cardinality seen. In this example, the true cardinality is actually 4, even though we'll report it as 3. Adding them to get 5 would be incorrect. Also, imagine if we saw the same 3 values in each time range: adding them would produce 6, which would be even further from the truth.
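
To make the arithmetic concrete, here is a small sketch using the flavor values from the example. `distinct` and `mergeMax` are hypothetical helpers written for this comment, not functions from the PR.

```go
package main

import "fmt"

// distinct returns the number of unique values in vals.
func distinct(vals []string) int {
	seen := make(map[string]struct{}, len(vals))
	for _, v := range vals {
		seen[v] = struct{}{}
	}
	return len(seen)
}

// mergeMax keeps the larger of two per-range cardinalities. The result is
// a lower bound on the true cardinality of the combined range.
func mergeMax(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func main() {
	r1 := []string{"sweet", "sour", "sweet", "spicy"}
	r2 := []string{"sweet", "sweet", "bland"}
	c1, c2 := distinct(r1), distinct(r2)

	// The true value is only computable here because the raw values are kept.
	all := append(append([]string{}, r1...), r2...)

	fmt.Println("max: ", mergeMax(c1, c2)) // 3
	fmt.Println("sum: ", c1+c2)            // 5
	fmt.Println("true:", distinct(all))    // 4
}
```

The maximum never overshoots, while the sum overshoots whenever the ranges share any value.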

Contributor

Sounds like you might want to run the sketch in the frontend, maybe. Happy to start like this, though.

But if you can stream 100k logs back to the frontend and do the sketch there, it will be more accurate. Maybe the querier could do the parsing of fields to absorb most of the CPU work.

Collaborator Author

@trevorwhitney trevorwhitney Apr 15, 2024

Cyril also suggested (in a separate conversation) looking into some ways to merge the sketches in the frontend, so I'm going to investigate that a bit.

Collaborator Author

I've changed this code to merge sketches and get more accurate cardinality estimates.
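
For readers unfamiliar with sketch merging: one common mergeable cardinality sketch is HyperLogLog, where merging two sketches is a register-wise maximum and gives the same registers as a single sketch built over the union of the inputs. The following is a toy, standard-library-only illustration of that idea, not the code in this PR. `Sketch`, `Add`, `Merge`, `Estimate`, `sketchOf`, and the precision `p` are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/bits"
)

const (
	p = 10     // precision bits (assumed value for illustration)
	m = 1 << p // number of registers
)

// Sketch is a toy HyperLogLog: one small register per bucket.
type Sketch struct{ reg [m]uint8 }

// hash64 hashes s with FNV-1a, then applies a 64-bit finalizer so the
// high bits used for bucket selection are well mixed.
func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	x := h.Sum64()
	x ^= x >> 33
	x *= 0xff51afd7ed558ccd
	x ^= x >> 33
	x *= 0xc4ceb9fe1a85ec53
	x ^= x >> 33
	return x
}

// Add records v: the top p bits pick a register, and the register keeps
// the largest "leading zeros + 1" seen in the remaining bits.
func (s *Sketch) Add(v string) {
	x := hash64(v)
	idx := x >> (64 - p)
	rank := uint8(bits.LeadingZeros64(x<<p)) + 1
	if rank > 64-p+1 {
		rank = 64 - p + 1
	}
	if rank > s.reg[idx] {
		s.reg[idx] = rank
	}
}

// Merge folds o into s by taking the register-wise maximum.
func (s *Sketch) Merge(o *Sketch) {
	for i := range s.reg {
		if o.reg[i] > s.reg[i] {
			s.reg[i] = o.reg[i]
		}
	}
}

// Estimate returns the approximate number of distinct values added,
// using linear counting when many registers are still empty.
func (s *Sketch) Estimate() float64 {
	sum, zeros := 0.0, 0
	for _, r := range s.reg {
		sum += math.Pow(2, -float64(r))
		if r == 0 {
			zeros++
		}
	}
	alpha := 0.7213 / (1 + 1.079/float64(m))
	e := alpha * m * m / sum
	if e <= 2.5*m && zeros > 0 {
		e = m * math.Log(float64(m)/float64(zeros))
	}
	return e
}

// sketchOf builds a sketch from the given values.
func sketchOf(vals ...string) *Sketch {
	s := &Sketch{}
	for _, v := range vals {
		s.Add(v)
	}
	return s
}

func main() {
	a := sketchOf("sweet", "sour", "sweet", "spicy") // time range 1
	b := sketchOf("sweet", "sweet", "bland")         // time range 2
	a.Merge(b)
	fmt.Printf("merged estimate: %.1f\n", a.Estimate())
}
```

Because the merged registers match those of a sketch built over both ranges at once, the merged estimate reflects the union rather than the maximum or the sum of the per-range cardinalities.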

Contributor

@cyriltovena cyriltovena left a comment

LGTM.

I really think we should look at merging sketches for better results in the future.

Contributor

@shantanualsi shantanualsi left a comment

LGTM

@trevorwhitney trevorwhitney merged commit 6c33809 into main Apr 17, 2024
12 checks passed
@trevorwhitney trevorwhitney deleted the detected-fields-splitting branch April 17, 2024 17:24