Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What are "Data Exploration" and "HogQL", and what can you already do with them? #86

Closed
mariusandra opened this issue Jan 12, 2023 · 13 comments

Comments

@mariusandra
Copy link
Contributor

mariusandra commented Jan 12, 2023

If you're reading this, you're either a curious character, or have been added to the data exploration beta. Read on to know what you can already do, and what's coming soon.

What is data exploration?

"Data exploration" is an umbrella term that includes:

  • Formulas in the "live events" table
  • HogQL to power those formulas
  • Support for similar formulas in insight filters
  • Customisable columns for person tables
  • An "edit source" button next to most insights and tables, to give powerusers superpowers.
  • All of the above is defined through a data structure of nested query nodes

I'll cover all of this below.

How do I know if I'm in the beta?

If you see the <> button next to the "live events" table, or the persons list table, you're using the new data exploration interface.

Screenshot 2023-01-12 at 15 37 21

You don't need to click this button under normal usage.

HogQL and data tables

When you open the "live events" table and click "configure columns", you're now presented with a new filter type: "HogQL".

image

image

You're free to write any supported expression in this field, including conditionals.

HogQL is our translation layer that parses this expression, and transpiles it to ClickHouse SQL.

The list of HogQL functions is ever expanding. See the full reference in code here.

Here are some fun HogQL expressions to try:

  • 1 + 2
  • concat(properties.$browser, ' ', properties.$geoip_city_name)
  • ifElse(toInt(properties.$screen_width) < toInt(properties.$screen_height), 'portrait', 'landscape')
  • coalesce(properties.$current_url, properties.$screen_name)
  • toStartOfMonth(timestamp)
  • person.properties.$initial_browser
  • leftPad(properties.$screen_width, 10, '0')

There are a few peculiarities to keep in mind at this stage:

  • Currently all properties are parsed as strings, so you'll have to wrap numeric property access with toInt or toFloat.
  • We don't surface SQL errors, yet. Thus if you make a mistake, you'll only see "query failed", not exactly what syntax caused the problem.
  • Unlike regular SQL, regular "equals" checks are performed with double equals a == b, not single equals a = b

We're working on resolving these points

In case the expression gets too long, append an alias with a # comment, such as:

  • ifElse(toInt(properties.$screen_width) < toInt(properties.$screen_height), 'portrait', 'landscape') # Orientation

HogQL aggregations

HogQL also supports aggregations. When using those, make sure to remove any "not at all unique" columns (like * or person) from the table for meaningful results.

Currently we support the following aggregations: total, min, max, sum, avg, any. Except for total that takes no arguments, you'll have to cast the properties you're aggregating to integers with toInt total changed to count

Currently we support the following aggregations: count, min, max, sum, avg, any.

  • count can either take no arguments and return the count of rows matching the rest of the columns
  • count(properties.bla) with an argument will return the count of distinct properties.bla values
  • The other aggregations require numeric inputs. Because all properties are currently returned as strings, you'll have to cast the properties you're aggregating with toInt.

Here's an example using several aggregations and expressions:

2023-01-12 16 05 20

(note that total() in the screencast above changed to count())

Edit source

This "edit source" button lets you edit the query powering the page at hand. You do not need to use this button, ever. But you can. Clicking it can take you to the query editor, which shows a bunch of example queries you can run.

2023-01-12 15 39 10

Feel free to play around! The idea is to make every part of posthog hackable, copy-pasteable and embeddable :).

Coming soon

Work will continue on improving this interface further, and adding the following:

@grouville
Copy link

@mariusandra 👋

It seems to be awesome. I'm actually interested by these features to be able to filter a series with multiple property value. How can we be added to the beta ?

@mariusandra
Copy link
Contributor Author

Hey @grouville, find me in the PostHog users slack, send me your PostHog cloud e-mail along with a promise to share all the feedback you have on this feature... and the access is yours :).

@andrewm4894
Copy link

@mariusandra will I be able to "Save As" in some way - so ability to use the new HogQL stuff to save different "views" of the live events stream? (and ability to share with everyone in the project etc too maybe just like insights are visible to everyone by default).

@andrewm4894
Copy link

looks cool - will share some quick feedback or examples in here as i use it.

first thing i did was try to filter my event stream via a HogQL expression - i think that was wrong on my end as really it should just be trying to use HogQL to make "derived columns" or something it seems?

So i think i assumed i could use HogQL to in some way filter too from the above below.

This is just me not reading stuff properly but just wanted to share as feedback.

image

@andrewm4894
Copy link

just trying to get a sum or total count of events on these 3 dims to get a sort of breakout of our crash stats by OS name and version over last 24 hours:

image

having some sort of autocomplete here would be magical

image

im not sure how to just count the number of events i tried sum(1) but obviously i need something different

@mariusandra
Copy link
Contributor Author

Hey @andrewm4894 , great to see you test this out!

will I be able to "Save As" in some way - so ability to use the new HogQL stuff to save different "views" of the live events stream?

You can save the URL, which contains the entire query behind the table in the #q= part. If you give this (admittedly long) URL to someone else in the same project, they should see exactly what you're seeing.

Saved views are an obvious next step, but not there yet :).

event = 'META'

Regarding this expression, you'll need to use == instead of a single =. It's a limitation we'll get rid of once we switch the AST parser from python to something custom.

You can use HogQL to filter as well, but for that you'll need to add a manual where clause in the query editor:

image

Actually using HogQL in the filters in the interface is the next thing I'm working on: PostHog/posthog#13264

just trying to get a sum or total count of events on these 3 dims to get a sort of breakout of our crash stats by OS name and version over last 24 hours:

Try total() without any arguments.

@andrewm4894
Copy link

@mariusandra sweet - yep that works and being able to share the url works for now and defo will be useful for sharing with colleagues etc.

Any plans for something like count distinct or unique() in my case i might want total events, count of machines, and maybe then a further derived "crashes per machine" col that's just ratio of other too.

Also would be really cool then to be able to build an insight on this too or visualize in some way.

Being able to maybe use this to just show a date or hour and then drive two ratios for example and then just plot those two ratios on same chart could be cool. Might be better done direct in insights (ability to go beyond 1 formula) but this seems like could be another way to at least get them into a tables (if i can wrangle Time into a date or hour):

image

@andrewm4894
Copy link

would be cool if i can name my expression something too eg crash_events=total() or 'crash_events' as total() or some sort of syntax that makes sense

@mariusandra
Copy link
Contributor Author

Most of what you described is planned, and coming as soon as we can get it out. Aliasing fields, just for display in the table, is possible already. Just separate the alias from the sql exprrssion with a # symbol, like you'd add a line comment in python.

@grouville
Copy link

grouville commented Jan 16, 2023

Hi, we started to play with the HogQL API.

It seems promising, but there's still a use-case we didn't manage to implement: relying on that to filter a serie, based on several property values, as we cannot do it for now:
image

When using above filtering, with below events (that are duplicates), we have noise. We want to un-duplicate by time / mirror and os:
image

The idea is to rely on your beta feature to create a new column that only part of these events will have, and on which we could filter inside the insights. But, even after checking the clickhouse / https://github.com/PostHog/posthog/blob/master/posthog/hogql/expr_parser.py#L67 reference, I'm still not sure how to implement that.

Do you have any hints ? 🙏

Btw, the fact that we often need to rely on extensions to dedup / fix a use-case is very frustrating for now 😇 + the dedup extension seems to only work for new incoming events, so not really useful in our case

@mariusandra
Copy link
Contributor Author

Hey, without seeing your query, here's what I think: the time column currently resolves to timestamp, which means any day. Change it to toStartOfMonth(timestamp). Then also change the column properties.downloads to sum(toInt(properties.downloads)). Finally, remove the * column.

Thanks for testing it out this clunky interface :). Improvements to many of the points here are coming...

@mariusandra
Copy link
Contributor Author

If you're following along, then the latest updates (as soon as this PR is merged) will bring:

  • New table column headers that let you manipulate columns without the "column configurator"
  • The total() aggregation is renamed to a more standard count()
  • We added support for counting unique values: count(value), which translates to something like count(distinct value)

image

@mariusandra
Copy link
Contributor Author

I've just merged in a few new aggregations, and made a "breaking change" - PostHog/posthog#13854

The count(thing) aggregation, which returned the number of distinct thing-s in the database, has been renamed to countDistinct(thing). That's to make it more in line with actual SQL, and to distinguish it from other count-s like countIf. So please rename count(bla) to countDistinct(bla) if you already started to use it!

That means you can now do the following:

  • count() # number of events
  • countDistinct(properties.$browser) # number of unique browsers
  • countIf(properties.$browser == 'Chrome') # number of Chromes
  • countDistinctIf(properties.$browser_version, properties.$browser == 'Chrome') # number of unique Chrome versions
  • avg(toInt(properties.$browser_version)) # average browser version
  • avgIf(toInt(properties.$browser_version), properties.$browser == 'Chrome') # average Chrome version

There's also sum, sumIf, min, minIf, max, maxIf, any, anyIf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants