Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(hogql): HogQLQuery node and data table support #14265

Merged
merged 90 commits into from
Feb 22, 2023
Merged

Conversation

mariusandra
Copy link
Collaborator

@mariusandra mariusandra commented Feb 16, 2023

Problem

HogQL is cool, but nobody can use it yet.

Extracted from #14042

Part of PostHog/meta#81

Stacked on top of #14185

Changes

Exposes a HogQL query editor, together with pretty printing and a table to show the results, under /query.

image

This HogQL editor is not exposed under its own node because it should likely always be created as a "new insight". Will wait for insights to natively be query nodes for that.

Security concerns

Before putting this live, I'd like to be 100% sure there's no way to leak data from other team. What was not resolved in #14185 will need to be discussed now.

  • Is the team_id check secure as is? Can someone override it somehow? What must we look out for.
  • Should put the backed query node behind a feature flag and enable just for our team out of paranoia?
  • For joined tables, should this team_id guard be on the ON or in the WHERE clause?
  • What about PREWHERE vs WHERE? Which should we actually use?

How did you test this code?

In the browser, plus added a test for the new API endpoint.

@mariusandra mariusandra marked this pull request as ready for review February 20, 2023 23:32
@mariusandra
Copy link
Collaborator Author

Other than a failing storybook build (sql formatter + webpack 4 = ❤️), this is ready for a review as well.

Copy link
Contributor

@thmsobrmlr thmsobrmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

SELECT ("posthog_person"."properties" -> '$some_prop_1') = '"something_1"' AS "flag_X_condition_0"
SELECT (("posthog_person"."properties" -> '$some_prop_1') = '"something_1"'
AND "posthog_person"."properties" ? '$some_prop_1'
AND NOT (("posthog_person"."properties" -> '$some_prop_1') = 'null')) AS "flag_X_condition_0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why these tests would change in this PR, but going to assume it's ok.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two do look like unrelated changes. They're definitely referencing SQL that's not generated by HogQL. I remember hearing about some changes to /decide, so that's probably this plus a bad merge somewhere by someone?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah me 🙈 , old PR got merged after a snapshot change went in, which didn't update snapshots again

SELECT ("posthog_person"."properties" -> 'email') = '"tim@posthog.com"' AS "flag_X_condition_0",
SELECT (("posthog_person"."properties" -> 'email') = '"tim@posthog.com"'
AND "posthog_person"."properties" ? 'email'
AND NOT (("posthog_person"."properties" -> 'email') = 'null')) AS "flag_X_condition_0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: I don't understand why these tests would change in this PR, but going to assume it's ok.

Base automatically changed from hogql-symbol-resolution to master February 21, 2023 15:55
@neilkakkar
Copy link
Collaborator

neilkakkar commented Feb 21, 2023

Is the team_id check secure as is? Can someone override it somehow? What must we look out for.

It seems so. Potential thing to look out for: can I make conditions in such a way that the extra_where ends up as an OR condition? (Doesn't seem likely to me right now, as we explicitly always AND it with all the other expressions).

Should put the backed query node behind a feature flag and enable just for our team out of paranoia?

🤷‍♂️

For joined tables, should this team_id guard be on the ON or in the WHERE clause?

Definitely WHERE, faster and safer

Rows are joined if the whole complex condition is met. If the conditions are not met, still rows may be included in the result depending on the JOIN type. Note that if the same conditions are placed in a WHERE section and they are not met, then rows are always filtered out from the result.

What about PREWHERE vs WHERE? Which should we actually use?

It's fine to stick to WHERE, the only thing that would ever really go into PREWHERE is team_id check, but this is more of a perf optimisation.

@mariusandra
Copy link
Collaborator Author

mariusandra commented Feb 21, 2023

Is the team_id check secure as is? Can someone override it somehow? What must we look out for.

It seems so. Potential thing to look out for: can I make conditions in such a way that the extra_where ends up as an OR condition? (Doesn't seem likely to me right now, as we explicitly always AND it with all the other expressions).

I can't imagine a case either. The only other case I can imagine is if somebody manually creates an alias with the name team_id for a random field to trick the where clause into accepting that, but since we do symbol resolution that's effectively impossible.

Out of paranoia I have anyway blocked xyx as team_id from working (ValueError: Alias 'team_id' is a reserved keyword.), but even removing the block, the query SELECT event, 1 as team_id from events translates to SELECT event, 1 AS team_id FROM events WHERE equals(events.team_id, 42) LIMIT 65535 thanks to symbol resolution. So I think we're good.

Should put the backed query node behind a feature flag and enable just for our team out of paranoia?

🤷‍♂️

I think if we implement some of the suggested query safeguards (max memory limit, timeouts, etc) discussed here, we're probably good

For joined tables, should this team_id guard be on the ON or in the WHERE clause?

Definitely WHERE, faster and safer

👍

Rows are joined if the whole complex condition is met. If the conditions are not met, still rows may be included in the result depending on the JOIN type. Note that if the same conditions are placed in a WHERE section and they are not met, then rows are always filtered out from the result.

What about PREWHERE vs WHERE? Which should we actually use?

It's fine to stick to WHERE, the only thing that would ever really go into PREWHERE is team_id check, but this is more of a perf optimisation.

Worth doing then, though I propose it should be outside the scope of this PR.

@mariusandra
Copy link
Collaborator Author

mariusandra commented Feb 22, 2023

All right, to make everything simple and to give us the maximum amount of playroom, I put the backend in this PR behind a new flag hogql-queries. Locally (not is_cloud()) it's always enabled. On cloud, we check the flag before giving access to the HogQL backend.

image

The "frontend" is still as it is, since there's no real frontend, only an obscure /query debug URL where you'll find HogQL as the 21st link in a list of links. So I think that's fine for now. This flag can be reused when we build a real frontend, or get HogQL insights running.

I'll be happy to add all the query/byte/time/etc limits in a followup.

@mariusandra mariusandra merged commit 1456301 into master Feb 22, 2023
@mariusandra mariusandra deleted the hogql-node branch February 22, 2023 09:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants