Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(data-exploration): countIf and other aggregations #13854

Merged
merged 3 commits into from
Jan 20, 2023
Merged

Conversation

mariusandra
Copy link
Collaborator

Changes

Adds support for the following aggregations (with number of allowed arguments)

# Permitted HogQL aggregations
HOGQL_AGGREGATIONS = {
    "count": 0,
    "countIf": 1,
    "countDistinct": 1,
    "countDistinctIf": 2,
    "min": 1,
    "minIf": 2,
    "max": 1,
    "maxIf": 2,
    "sum": 1,
    "sumIf": 2,
    "avg": 1,
    "avgIf": 2,
    "any": 1,
    "anyIf": 2,
}

These two translate as follows

  • countDistinct(x) -> count(distinct x) (in hogql) -> uniqExact(x) (internally in clickhouse)
  • countDistinctIf(x) -> countIf(distinct x) (in hogql) -> uniqExactIf(x) (internally in clickhouse)

I'm softly biased towards not exposing uniq and uniqExact and uniqIf and uniqExactIf to the users.

  • uniq and uniqExact are non-standard SQL terms (compared to "count distinct" which I've seen in probably every SQL dialect)
  • The default uniq returns non-precise results (uniqExtract is precise but slower), so we'd need to explain this to users at every turn. We need to talk about the difference between the two often.
  • The uniqExact aggregation just rolls off your tongue (uniq? unique? uniquerque?)
  • Thus I propose we abstract all of this away into our own countDistinct, which is the natural thing you'd try if you have used any SQL before. This way we can later optimize slow queries to use uniq if needed, while keeping everyone using the "data integrity is important" uniqExact instead.

How did you test this code?

Updated the test

@mariusandra mariusandra changed the title feat(data-exploration) countIf and other aggregations feat(data-exploration): countIf and other aggregations Jan 20, 2023
@@ -61,7 +61,9 @@ def test_hogql_materialized_fields_and_properties(self):

def test_hogql_methods(self):
self.assertEqual(self._translate("count()"), "count(*)")
self.assertEqual(self._translate("count(event)"), "count(distinct event)")
self.assertEqual(self._translate("countDistinct(event)"), "count(distinct event)")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if one does count(properties.value == 2)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error because it expects just 0 arguments. I'd add the test, but the tests pass right now so it'll come in another PR 😅

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, sorry, I meant countDistinct

Copy link
Collaborator Author

@mariusandra mariusandra Jan 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, give it a try :)

I guess the answer is you'll get a 0, 1 or 2 depending on the data.

For me it crashes, but adding toInt around the property gives 0. Doing coalesce(toInt(properties.value), 0) == 2 would likely work.

Copy link
Collaborator

@Twixes Twixes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I completely agree with the thinking behind only exposing intuitive count functions – anyone with just a bit of SQL experience should be able to pick this up immediately

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants