chore(recordings): Optimize recordings list query #14458

Merged
EDsCODE merged 40 commits into master from optimize-recordings-list-query
Mar 22, 2023

@EDsCODE EDsCODE commented Feb 28, 2023

Problem

Changes

How did you test this code?

@posthog-bot
Contributor

Hey @EDsCODE! 👋
This pull request seems to contain no description. Please add useful context, rationale, and/or any other information that will help make sense of this change now and in the distant Mars-based future.

@EDsCODE EDsCODE added the feature/recordings and performance labels Feb 28, 2023
@EDsCODE
Member Author

EDsCODE commented Feb 28, 2023

TODO:

  • handle person properties pushdown

    )

    def run(self, *args, **kwargs) -> SessionRecordingQueryResult:
        self._filter.hogql_context.using_person_on_events = True
Member Author

This connection to hogql restricts the improvement to person_on_events enabled orgs for now.

Collaborator

I don't get this. Aren't there only about 4 PoE (person on events) orgs right now?

Member Author

Yeah. HogQL allows person properties to be filtered in the recordings event query, which was usually not the case before. So either we use person on events and have this optimization work only for that tiny subset until everything works, or I do a refactor that splits the filters up and pushes person property filtering down to the subquery whenever we detect a HogQL-based person property.
[Screenshot attached: 2023-03-14 at 9:52:03 AM]

Collaborator

Ugh I see, okay, good enough

@EDsCODE EDsCODE requested a review from neilkakkar March 2, 2023 18:30
@posthog-bot
Contributor

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week.

HAVING full_snapshots > 0
AND start_time >= '2021-01-14 00:00:00'
AND start_time <= '2021-01-21 20:00:00') AS session_recordings ON session_recordings.session_id = events.session_id
WHERE session_recordings.distinct_id == events.distinct_id
Collaborator

doesn't matching by distinctID lead to problems as we're randomly selecting one?

Don't think sessionIDs change between login steps, but distinctIDs can & do?

Contributor

I also wonder this... Seems like a copy from the v1 query but I'm not sure why it matters... We only need one distinctId to load the Person anyways and we have that from the recording. We anyways group by session_id...

Perhaps this is to account for weird cases where there is a clash of session_id but I don't see that happening...
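
To make the concern in this thread concrete, here is a minimal sketch using an in-memory SQLite database rather than ClickHouse. The table layouts and sample rows are assumptions made up for illustration; only the join condition and the `distinct_id` filter mirror the snippet under review. It shows how the extra `distinct_id` match can drop events from a session whose `distinct_id` changes after login.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, heavily simplified tables (not the real PostHog schema).
    CREATE TABLE events (session_id TEXT, distinct_id TEXT, event TEXT);
    CREATE TABLE session_recordings (session_id TEXT, distinct_id TEXT);

    -- One session whose distinct_id changes after the user logs in.
    INSERT INTO events VALUES
        ('s1', 'anon-123', '$pageview'),
        ('s1', 'user@example.com', 'signed_up');

    -- The grouped recording row carries only one distinct_id per session.
    INSERT INTO session_recordings VALUES ('s1', 'anon-123');
""")

JOIN_SQL = """
    SELECT events.event
    FROM events
    JOIN session_recordings
      ON session_recordings.session_id = events.session_id
"""


def matched_events(extra_where: str = "") -> list[str]:
    """Return the events that survive the join, optionally with an extra filter."""
    rows = conn.execute(JOIN_SQL + extra_where + " ORDER BY events.event").fetchall()
    return [row[0] for row in rows]


# Joining on session_id alone keeps both events of the session.
by_session_only = matched_events()

# Additionally requiring equal distinct_ids drops the post-login event.
with_distinct_id = matched_events(
    " WHERE session_recordings.distinct_id = events.distinct_id"
)

print(by_session_only)
print(with_distinct_id)
```

If an event filter targets `signed_up` in this toy data, the recording only matches when the join is on `session_id` alone.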

Collaborator

@neilkakkar neilkakkar left a comment

Looks good! The code here is going to get more complicated as we support multiple ways of doing things, but given that it will all go away sometime with the schema changes, I'm not too bothered.

Would like it if session-recording folks have a look too though!

@@ -217,6 +217,26 @@ def _person_on_events_querying_enabled(self) -> bool:
        # on self-hosted, use the instance setting
        return get_instance_setting("PERSON_ON_EVENTS_ENABLED")

    @property
    def recordings_list_v2_query_enabled(self) -> bool:
Contributor

Could we do this on the frontend instead and control it via a query param to the API? Makes it much easier to test back and forth in a live situation to compare results etc.

Member Author

@EDsCODE EDsCODE Mar 14, 2023

Hmm, on second thought, I don't want to add filters to the object and muddy our caching for insights. Do you think it's worth it for this?

Contributor

I wasn't suggesting adding it to filters (not everything has to go in that mega object 😅), just a standard query param. I pushed the change the way I meant it, so now we just check a query param. That makes it super easy to flip back and forth without having to change the feature flag all the time.

Member Author

Oh yep, we can do that. Too focused on the filter object 😬
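
As a rough sketch of the approach agreed on in this thread, the toggle can be read from a plain query parameter instead of the filter object. The parameter name `version`, the URL path, and the helper name are all assumptions for illustration; the thread does not say what the merged code uses.

```python
from urllib.parse import parse_qs, urlparse


def requested_query_version(url: str, default: int = 1) -> int:
    """Pick the recordings list query version from a query parameter.

    The parameter name ``version`` is hypothetical. Unknown or malformed
    values fall back to ``default`` so a bad link never breaks the list.
    """
    params = parse_qs(urlparse(url).query)
    raw = params.get("version", [str(default)])[0]
    try:
        version = int(raw)
    except ValueError:
        return default
    return version if version in (1, 2) else default


# Flipping between implementations is just a matter of changing the URL.
print(requested_query_version("/api/recordings?version=2"))
print(requested_query_version("/api/recordings"))
```

Because the parameter lives outside the filter object, it does not have to take part in any filter-based caching, which was the concern raised above.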

Contributor

Overall I think this is worth trying but reading it still makes me think "there must be a simpler way" :D

I keep thinking, why is it not just:

select * from recording_events_grouped where session_id in (BuildStandardEventsQuery(filters).select("session_id"))

Definitely don't want to derail and at this point I'll take any improvements to the speed of this thing but these queries still confuse the hell out of me

Collaborator

It should be that easy... why isn't it? :D

What would this look like if you rebuilt it the same way the events query is built?

Member Author

The bigger issue to wait on, before spending time unnecessarily, is the person on events deployment. Once we're there, we can probably move this to @mariusandra's suggestion with less headache and a lot fewer joins to negotiate.
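
For reference, the shape of the simpler query floated in this thread can be sketched as an `IN` subquery. This runs against in-memory SQLite, not ClickHouse; the table contents are invented, and `events_session_ids_sql` is a hypothetical stand-in for the `BuildStandardEventsQuery(filters).select("session_id")` pseudocode, not a real PostHog API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified tables for illustration only.
    CREATE TABLE events (session_id TEXT, event TEXT);
    CREATE TABLE recording_events_grouped (session_id TEXT, duration INTEGER);

    INSERT INTO events VALUES
        ('s1', '$pageview'), ('s2', 'signed_up'), ('s2', '$pageview');
    INSERT INTO recording_events_grouped VALUES
        ('s1', 30), ('s2', 95), ('s3', 12);
""")


def events_session_ids_sql() -> str:
    """Stand-in for the standard events query, selecting only session_id."""
    return "SELECT DISTINCT session_id FROM events WHERE event = ?"


def recordings_matching(event: str) -> list[tuple[str, int]]:
    """Return recordings whose session contains at least one matching event."""
    sql = f"""
        SELECT session_id, duration
        FROM recording_events_grouped
        WHERE session_id IN ({events_session_ids_sql()})
        ORDER BY session_id
    """
    return conn.execute(sql, (event,)).fetchall()


print(recordings_matching("signed_up"))
print(recordings_matching("$pageview"))
```

The sketch deliberately ignores person property filters, which are the part the thread identifies as blocking this simpler form until person on events is deployed.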

@EDsCODE
Member Author

EDsCODE commented Mar 14, 2023

Will need this to wait on person on events because of this dependency: #14458 (comment)

@mariusandra
Collaborator

The point of HogQL is that it'll deal with the persons on events issue for you. I'm actively working on making this actually true. Once that PR is wrapped up (tomorrow? 🤞), person joins will and won't happen as needed, for both the events and session_recording_events tables. You can just use person_id and person.properties.foo as if it's on the same table.

I also wrote up a quick guide on how to use HogQL with Python in the backend. Hopefully that can provide some inspiration. You will be free to do any joins or non-joins as you'd like, or let HogQL handle it for you. It's just a layer around ClickHouse SQL after all... as long as you stick to the implemented syntax (no arrays yet :/).

@EDsCODE
Member Author

EDsCODE commented Mar 21, 2023

I've updated the logic to not use V2 if there are HogQL properties. This will be in effect until person on events is fully deployed. The HogQL updates mentioned above do not solve this particular issue, because the recordings query does ad hoc property parsing that does not follow the rest of the query patterns.
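
The fallback described here could look roughly like the sketch below. The class names, the `type == "hogql"` convention, and the `should_use_v2_query` helper are assumptions for illustration; the merged code may structure this differently.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyFilter:
    """Hypothetical property filter; ``type`` might be "event", "person" or "hogql"."""
    key: str
    type: str


@dataclass
class RecordingsFilter:
    """Hypothetical stand-in for the recordings filter object."""
    properties: list[PropertyFilter] = field(default_factory=list)


def should_use_v2_query(recordings_filter: RecordingsFilter, v2_requested: bool) -> bool:
    """Use the optimized (V2) query only when requested and no HogQL properties are present."""
    has_hogql = any(prop.type == "hogql" for prop in recordings_filter.properties)
    return v2_requested and not has_hogql


plain = RecordingsFilter([PropertyFilter("$browser", "event")])
with_hogql = RecordingsFilter([PropertyFilter("person.properties.email", "hogql")])

print(should_use_v2_query(plain, v2_requested=True))
print(should_use_v2_query(with_hogql, v2_requested=True))
```

Keeping the check in one small predicate makes it easy to delete once the HogQL restriction no longer applies.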

@posthog-bot
Contributor

📸 UI snapshots have been updated

1 snapshot changes in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 0 modified, 0 deleted
  • webkit: 0 added, 1 modified, 0 deleted (diff for shard 2)
  • firefox: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

Contributor

@benjackwhite benjackwhite left a comment

Overall I'm happy for this to go in, but it would be great if this doesn't hang around for 3 months as a half-done thing 😅 Ideally we test it heavily and immediately, and then follow up with a review to swap over fully ASAP.

In the not too distant future, we're going to be rebuilding the whole thing (hopefully using the new query stuff) in order to support Notebooks more effectively, so it would be good to have as clean a base to start from as possible.

@EDsCODE EDsCODE merged commit a1a8fdd into master Mar 22, 2023
@EDsCODE EDsCODE deleted the optimize-recordings-list-query branch March 22, 2023 14:27
alexkim205 pushed a commit that referenced this pull request Mar 22, 2023
* derivative class

* derivative class

* tests

* Update query snapshots

* add flag

* typing

* Update query snapshots

* Update query snapshots

* Update query snapshots

* Update query snapshots

* use person on events with new query

* Update query snapshots

* Update query snapshots

* Update query snapshots

* Update query snapshots

* Update query snapshots

* Update query snapshots

* Update query snapshots

* Update query snapshots

* Update query snapshots

* Update query snapshots

* snapshots

* Update query snapshots

* Update query snapshots

* Update query snapshots

* merge master

* merge master

* remove unnecessary joins with poe is active

* try turning off the test unless poe

* revert poe changes

* don't use optimized recording list on hogql queries for now

* Moved flag to frontend

* Update query snapshots

* Update UI snapshots for `chromium` (1)

* Update UI snapshots for `webkit` (2)

* Update UI snapshots for `webkit` (2)

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ben White <ben@posthog.com>
Labels
performance Has to do with performance. For PRs, runs the clickhouse query performance suite
5 participants