Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(recordings): Optimize recordings list query #14458

Merged
merged 40 commits into from
Mar 22, 2023
Merged
Show file tree
Hide file tree
Changes from 23 commits
Commits
Show all changes
40 commits
Select commit Hold shift + click to select a range
695b787
derivative class
EDsCODE Feb 28, 2023
e2add4e
derivative class
EDsCODE Feb 28, 2023
c7eff77
tests
EDsCODE Feb 28, 2023
9253921
Update query snapshots
github-actions[bot] Feb 28, 2023
bce115e
add flag
EDsCODE Feb 28, 2023
5ce263f
typing
EDsCODE Feb 28, 2023
68b86fc
Update query snapshots
github-actions[bot] Feb 28, 2023
eed5f89
Update query snapshots
github-actions[bot] Feb 28, 2023
f5edc76
Update query snapshots
github-actions[bot] Mar 1, 2023
d68ac85
Update query snapshots
github-actions[bot] Mar 1, 2023
034221a
use person on events with new query
EDsCODE Mar 1, 2023
f25e745
Update query snapshots
github-actions[bot] Mar 1, 2023
e9c559a
Update query snapshots
github-actions[bot] Mar 1, 2023
fc831d4
Update query snapshots
github-actions[bot] Mar 1, 2023
10efc7d
Update query snapshots
github-actions[bot] Mar 1, 2023
51893d6
Update query snapshots
github-actions[bot] Mar 1, 2023
a0c966e
Update query snapshots
github-actions[bot] Mar 1, 2023
5b1435d
Update query snapshots
github-actions[bot] Mar 2, 2023
36c0e01
Update query snapshots
github-actions[bot] Mar 2, 2023
da42388
Update query snapshots
github-actions[bot] Mar 2, 2023
24135c2
Update query snapshots
github-actions[bot] Mar 2, 2023
4bd7535
snapshots
EDsCODE Mar 2, 2023
a22cfb1
Update query snapshots
github-actions[bot] Mar 2, 2023
b72a7f1
Update query snapshots
github-actions[bot] Mar 2, 2023
37238f8
merge master
EDsCODE Mar 2, 2023
e679735
Update query snapshots
github-actions[bot] Mar 2, 2023
6f8f06e
merge master
EDsCODE Mar 13, 2023
e9efea6
merge master
EDsCODE Mar 13, 2023
9562ebf
merge master
EDsCODE Mar 13, 2023
16d4cf5
Merge branch 'master' into optimize-recordings-list-query
EDsCODE Mar 14, 2023
fd7a7fd
remove unnecessary joins with poe is active
EDsCODE Mar 14, 2023
6bd28d9
try turning off the test unless poe
EDsCODE Mar 14, 2023
c9d18a2
revert poe changes
EDsCODE Mar 14, 2023
9d746ea
merge master
EDsCODE Mar 21, 2023
3e5dbb4
don't use optimized recording list on hogql queries for now
EDsCODE Mar 21, 2023
695f328
Moved flag to frontend
benjackwhite Mar 22, 2023
4467504
Update query snapshots
github-actions[bot] Mar 22, 2023
5234c22
Update UI snapshots for `chromium` (1)
github-actions[bot] Mar 22, 2023
c52d00b
Update UI snapshots for `webkit` (2)
github-actions[bot] Mar 22, 2023
1d6e97f
Update UI snapshots for `webkit` (2)
github-actions[bot] Mar 22, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 8 additions & 3 deletions posthog/api/session_recording.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
import json
from typing import Any, List, cast
from typing import Any, List, Type, cast

import structlog
from dateutil import parser
Expand All @@ -22,7 +22,7 @@
from posthog.models.session_recording_event import SessionRecordingViewed
from posthog.models.team.team import Team
from posthog.permissions import ProjectMembershipNecessaryPermissions, TeamMemberAccessPermission
from posthog.queries.session_recordings.session_recording_list import SessionRecordingList
from posthog.queries.session_recordings.session_recording_list import SessionRecordingList, SessionRecordingListV2
from posthog.queries.session_recordings.session_recording_properties import SessionRecordingProperties
from posthog.rate_limit import ClickHouseBurstRateThrottle, ClickHouseSustainedRateThrottle
from posthog.utils import format_query_params_absolute_url
Expand Down Expand Up @@ -238,7 +238,12 @@ def list_recordings(filter: SessionRecordingsFilter, request: request.Request, t

if (all_session_ids and filter.session_ids) or not all_session_ids:
# Only go to clickhouse if we still have remaining specified IDs or we are not specifying IDs
(ch_session_recordings, more_recordings_available) = SessionRecordingList(filter=filter, team=team).run()
session_recording_list_instance: Type[SessionRecordingList] = (
SessionRecordingListV2 if team.recordings_list_v2_query_enabled else SessionRecordingList
)
(ch_session_recordings, more_recordings_available) = session_recording_list_instance(
filter=filter, team=team
).run()
recordings_from_clickhouse = SessionRecording.get_or_build_from_clickhouse(team, ch_session_recordings)
recordings = recordings + recordings_from_clickhouse

Expand Down
169 changes: 94 additions & 75 deletions posthog/api/test/__snapshots__/test_session_recordings.ambr

Large diffs are not rendered by default.

25 changes: 13 additions & 12 deletions posthog/api/test/test_session_recordings.py
Original file line number Diff line number Diff line change
Expand Up @@ -142,22 +142,23 @@ def test_get_session_recordings(self):
@snapshot_postgres_queries
def test_listing_recordings_is_not_nplus1_for_persons(self):
# request once without counting queries to cache an ee.license lookup that makes results vary otherwise
self.client.get(f"/api/projects/{self.team.id}/session_recordings")
with freeze_time("2022-06-03T12:00:00.000Z"):
self.client.get(f"/api/projects/{self.team.id}/session_recordings")

base_time = (now() - relativedelta(days=1)).replace(microsecond=0)
num_queries = 10
base_time = (now() - relativedelta(days=1)).replace(microsecond=0)
num_queries = 9

self._person_with_snapshots(base_time=base_time, distinct_id="user", session_id="1")
with self.assertNumQueries(num_queries):
self.client.get(f"/api/projects/{self.team.id}/session_recordings")
self._person_with_snapshots(base_time=base_time, distinct_id="user", session_id="1")
with self.assertNumQueries(num_queries):
self.client.get(f"/api/projects/{self.team.id}/session_recordings")

self._person_with_snapshots(base_time=base_time, distinct_id="user2", session_id="2")
with self.assertNumQueries(num_queries):
self.client.get(f"/api/projects/{self.team.id}/session_recordings")
self._person_with_snapshots(base_time=base_time, distinct_id="user2", session_id="2")
with self.assertNumQueries(num_queries):
self.client.get(f"/api/projects/{self.team.id}/session_recordings")

self._person_with_snapshots(base_time=base_time, distinct_id="user3", session_id="3")
with self.assertNumQueries(num_queries):
self.client.get(f"/api/projects/{self.team.id}/session_recordings")
self._person_with_snapshots(base_time=base_time, distinct_id="user3", session_id="3")
with self.assertNumQueries(num_queries):
self.client.get(f"/api/projects/{self.team.id}/session_recordings")

def _person_with_snapshots(self, base_time: datetime, distinct_id: str = "user", session_id: str = "1") -> None:
Person.objects.create(
Expand Down
20 changes: 20 additions & 0 deletions posthog/models/team/team.py
Original file line number Diff line number Diff line change
Expand Up @@ -216,6 +216,26 @@ def _person_on_events_querying_enabled(self) -> bool:
# on self-hosted, use the instance setting
return get_instance_setting("PERSON_ON_EVENTS_ENABLED")

@property
def recordings_list_v2_query_enabled(self) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do this on the frontend instead and control it via a query param to the API? Makes it much easier to test back and forth in a live situation to compare results etc.

Copy link
Member Author

@EDsCODE EDsCODE Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, on second thought, I don't want to add filters to the object and muddy our caching for insights. Do you think it's worth it for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't suggesting adding it to filters (not everything has to go in that mega object 😅 ) - just a standard query param. I pushed the change in the way I meant so now we just check a query param which makes it super easy to flip back and forth without having to change the feature flag all the time

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yep, we can do that. Too focused on the filter object 😬

if is_cloud():
return posthoganalytics.feature_enabled(
"recordings-list-v2-enabled",
str(self.uuid),
groups={"organization": str(self.organization_id)},
group_properties={
"organization": {
"id": str(self.organization_id),
"created_at": self.organization.created_at,
"name": self.organization.name,
}
},
only_evaluate_locally=True,
send_feature_flag_events=False,
)

return False

@property
def strict_caching_enabled(self) -> bool:
enabled_teams = get_list(get_instance_setting("STRICT_CACHING_TEAMS"))
Expand Down
4 changes: 2 additions & 2 deletions posthog/queries/funnels/test/__snapshots__/test_funnel.ambr
Original file line number Diff line number Diff line change
Expand Up @@ -684,8 +684,8 @@
HAVING argMax(is_deleted, version) = 0) AS pdi ON e.distinct_id = pdi.distinct_id
WHERE team_id = 2
AND event IN ['$autocapture', 'user signed up', '$autocapture']
AND toTimeZone(timestamp, 'UTC') >= toDateTime('2023-02-20 00:00:00', 'UTC')
AND toTimeZone(timestamp, 'UTC') <= toDateTime('2023-02-27 23:59:59', 'UTC')
AND toTimeZone(timestamp, 'UTC') >= toDateTime('2023-02-21 00:00:00', 'UTC')
AND toTimeZone(timestamp, 'UTC') <= toDateTime('2023-02-28 23:59:59', 'UTC')
AND (step_0 = 1
OR step_1 = 1) ))
WHERE step_0 = 1 ))
Expand Down
195 changes: 182 additions & 13 deletions posthog/queries/session_recordings/session_recording_list.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this is worth trying but reading it still makes me think "there must be a simpler way" :D

I keep thinking why is not just

select * from recording_events_grouped where session_id in (BuildStandardEventsQuery(filters).select("session_id")

Definitely don't want to derail and at this point I'll take any improvements to the speed of this thing but these queries still confuse the hell out of me

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be that easy... why isn't it? :D

What would this look like if you rebuilt it the same way the events query is built?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bigger issue to wait on before spending time unnecessarily is person on events deployment. Once we're there, we can probably move this to @mariusandra's suggestion with less headache and a lot less joins to negotiate

Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
from posthog.constants import TREND_FILTER_TYPE_ACTIONS
from posthog.models import Entity
from posthog.models.action.util import format_entity_filter
from posthog.models.filters.mixins.utils import cached_property
from posthog.models.filters.session_recordings_filter import SessionRecordingsFilter
from posthog.models.property.util import parse_prop_grouped_clauses
from posthog.models.utils import PersonPropertiesMode
Expand All @@ -13,6 +14,8 @@


class EventFiltersSQL(NamedTuple):
non_aggregate_select_condition_clause: str
aggregate_event_select_clause: str
aggregate_select_clause: str
aggregate_having_clause: str
where_conditions: str
Expand Down Expand Up @@ -159,6 +162,7 @@ def _determine_should_join_persons(self) -> None:
def _determine_should_join_events(self):
return self._filter.entities and len(self._filter.entities) > 0

@cached_property
def _get_properties_select_clause(self) -> str:
clause = (
f", events.elements_chain as elements_chain"
Expand All @@ -170,6 +174,7 @@ def _get_properties_select_clause(self) -> str:
)
return clause

@cached_property
def _get_person_id_clause(self) -> Tuple[str, Dict[str, Any]]:
person_id_clause = ""
person_id_params = {}
Expand All @@ -180,6 +185,7 @@ def _get_person_id_clause(self) -> Tuple[str, Dict[str, Any]]:

# We want to select events beyond the range of the recording to handle the case where
# a recording spans the time boundaries
@cached_property
def _get_events_timestamp_clause(self) -> Tuple[str, Dict[str, Any]]:
timestamp_clause = ""
timestamp_params = {}
Expand All @@ -191,6 +197,7 @@ def _get_events_timestamp_clause(self) -> Tuple[str, Dict[str, Any]]:
timestamp_params["event_end_time"] = self._filter.date_to + timedelta(hours=12)
return timestamp_clause, timestamp_params

@cached_property
def _get_recording_start_time_clause(self) -> Tuple[str, Dict[str, Any]]:
start_time_clause = ""
start_time_params = {}
Expand All @@ -202,12 +209,14 @@ def _get_recording_start_time_clause(self) -> Tuple[str, Dict[str, Any]]:
start_time_params["end_time"] = self._filter.date_to
return start_time_clause, start_time_params

@cached_property
def session_ids_clause(self) -> Tuple[str, Dict[str, Any]]:
if self._filter.session_ids is None:
return "", {}

return "AND session_id in %(session_ids)s", {"session_ids": self._filter.session_ids}

@cached_property
def _get_duration_clause(self) -> Tuple[str, Dict[str, Any]]:
duration_clause = ""
duration_params = {}
Expand Down Expand Up @@ -260,7 +269,10 @@ def format_event_filter(self, entity: Entity, prepend: str, team_id: int) -> Tup

return filter_sql, params

@cached_property
def format_event_filters(self) -> EventFiltersSQL:
non_aggregate_select_condition_clause = ""
aggregate_event_select_clause = ""
aggregate_select_clause = ""
aggregate_having_clause = ""
where_conditions = "AND event IN %(event_names)s"
Expand Down Expand Up @@ -289,7 +301,14 @@ def format_event_filters(self) -> EventFiltersSQL:

params = {**params, "event_names": list(event_names_to_filter)}

return EventFiltersSQL(aggregate_select_clause, aggregate_having_clause, where_conditions, params)
return EventFiltersSQL(
non_aggregate_select_condition_clause,
aggregate_event_select_clause,
aggregate_select_clause,
aggregate_having_clause,
where_conditions,
params,
)

def get_query(self) -> Tuple[str, Dict[str, Any]]:
offset = self._filter.offset or 0
Expand All @@ -300,12 +319,11 @@ def get_query(self) -> Tuple[str, Dict[str, Any]]:
self._filter.property_groups, person_id_joined_alias=f"{self.DISTINCT_ID_TABLE_ALIAS}.person_id"
)

events_timestamp_clause, events_timestamp_params = self._get_events_timestamp_clause()
recording_start_time_clause, recording_start_time_params = self._get_recording_start_time_clause()
session_ids_clause, session_ids_params = self.session_ids_clause()
person_id_clause, person_id_params = self._get_person_id_clause()
duration_clause, duration_params = self._get_duration_clause()
properties_select_clause = self._get_properties_select_clause()
events_timestamp_clause, events_timestamp_params = self._get_events_timestamp_clause
recording_start_time_clause, recording_start_time_params = self._get_recording_start_time_clause
session_ids_clause, session_ids_params = self.session_ids_clause
person_id_clause, person_id_params = self._get_person_id_clause
duration_clause, duration_params = self._get_duration_clause

core_recordings_query = self._core_session_recordings_query.format(
recording_start_time_clause=recording_start_time_clause,
Expand Down Expand Up @@ -334,13 +352,9 @@ def get_query(self) -> Tuple[str, Dict[str, Any]]:
},
)

event_filters = self.format_event_filters()
event_filters = self.format_event_filters

core_events_query = self._core_events_query.format(
properties_select_clause=properties_select_clause,
event_filter_where_conditions=event_filters.where_conditions,
events_timestamp_clause=events_timestamp_clause,
)
core_events_query, core_events_query_params = self._get_core_events_query()

return (
self._session_recordings_query_with_events.format(
Expand All @@ -363,9 +377,24 @@ def get_query(self) -> Tuple[str, Dict[str, Any]]:
**recording_start_time_params,
**event_filters.params,
**session_ids_params,
**core_events_query_params,
},
)

def _get_core_events_query(self) -> Tuple[str, Dict[str, Any]]:
params: Dict[str, Any] = {}
event_filters = self.format_event_filters
properties_select_clause = self._get_properties_select_clause
events_timestamp_clause, events_timestamp_params = self._get_events_timestamp_clause

core_events_query = self._core_events_query.format(
properties_select_clause=properties_select_clause,
event_filter_where_conditions=event_filters.where_conditions,
events_timestamp_clause=events_timestamp_clause,
)

return core_events_query, {**params, **events_timestamp_params}

def _paginate_results(self, session_recordings) -> SessionRecordingQueryResult:
more_recordings_available = False
if len(session_recordings) > self.limit:
Expand Down Expand Up @@ -406,3 +435,143 @@ def run(self, *args, **kwargs) -> SessionRecordingQueryResult:
query_results = sync_execute(query, {**query_params, **self._filter.hogql_context.values})
session_recordings = self._data_to_return(query_results)
return self._paginate_results(session_recordings)


class SessionRecordingListV2(SessionRecordingList):

_core_events_query = """
SELECT
uuid,
distinct_id,
event,
team_id,
timestamp,
$session_id as session_id,
$window_id as window_id
{properties_select_clause}
{non_aggregate_select_condition_clause}

FROM events
WHERE
team_id = %(team_id)s
{event_filter_where_conditions}
{events_timestamp_clause}
AND notEmpty(session_id)
"""

_core_events_query_grouped = """
SELECT
session_id,
distinct_id
{aggregate_event_select_clause}
FROM (
{ungrouped_core_events_query}
) GROUP BY session_id, distinct_id
HAVING 1 = 1
{event_filter_aggregate_having_clause}
"""

_session_recordings_query_with_events = """
SELECT
session_recordings.session_id,
start_time,
end_time,
click_count,
keypress_count,
urls,
duration,
session_recordings.distinct_id as distinct_id
{event_filter_aggregate_select_clause}
FROM (
{core_events_query}
) AS events
JOIN (
{core_recordings_query}
) AS session_recordings
ON session_recordings.session_id = events.session_id
{recording_person_query}
WHERE
session_recordings.distinct_id == events.distinct_id
{prop_filter_clause}
{person_id_clause}
ORDER BY start_time DESC
LIMIT %(limit)s OFFSET %(offset)s
"""

def _get_core_events_query(self) -> Tuple[str, Dict[str, Any]]:
params: Dict[str, Any] = {}
event_filters = self.format_event_filters
properties_select_clause = self._get_properties_select_clause
events_timestamp_clause, events_timestamp_params = self._get_events_timestamp_clause

core_events_query = self._core_events_query.format(
properties_select_clause=properties_select_clause,
non_aggregate_select_condition_clause=event_filters.non_aggregate_select_condition_clause,
event_filter_where_conditions=event_filters.where_conditions,
events_timestamp_clause=events_timestamp_clause,
)

grouped_events_query = self._core_events_query_grouped.format(
aggregate_event_select_clause=event_filters.aggregate_event_select_clause,
ungrouped_core_events_query=core_events_query,
event_filter_aggregate_having_clause=event_filters.aggregate_having_clause,
)

return grouped_events_query, {**params, **events_timestamp_params}

@cached_property
def format_event_filters(self) -> EventFiltersSQL:
non_aggregate_select_condition_clause = ""
aggregate_event_select_clause = ""
aggregate_select_clause = ""
aggregate_having_clause = ""
where_conditions = "AND event IN %(event_names)s"
# Always include $pageview events so the start_url and end_url can be extracted
event_names_to_filter: List[Union[int, str]] = []

params: Dict = {}

for index, entity in enumerate(self._filter.entities):
if entity.type == TREND_FILTER_TYPE_ACTIONS:
action = entity.get_action()
event_names_to_filter.extend([ae for ae in action.get_step_events() if ae not in event_names_to_filter])
else:
if entity.id not in event_names_to_filter:
event_names_to_filter.append(entity.id)

condition_sql, filter_params = self.format_event_filter(
entity, prepend=f"event_matcher_{index}", team_id=self._team_id
)
aggregate_event_select_clause += f"""
, groupUniqArrayIf(100)((timestamp, uuid, session_id, window_id), event_match_{index} = 1) AS matching_events_{index}
, sum(event_match_{index}) AS matches_{index}
"""

aggregate_select_clause += f"""
, matches_{index}
, matching_events_{index}
"""

non_aggregate_select_condition_clause += f"""
, if({condition_sql}, 1, 0) as event_match_{index}
"""
aggregate_having_clause += f"\nAND matches_{index} > 0"
params = {**params, **filter_params}

params = {**params, "event_names": list(event_names_to_filter)}

return EventFiltersSQL(
non_aggregate_select_condition_clause,
aggregate_event_select_clause,
aggregate_select_clause,
aggregate_having_clause,
where_conditions,
params,
)

def run(self, *args, **kwargs) -> SessionRecordingQueryResult:
self._filter.hogql_context.using_person_on_events = True
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This connection to hogql restricts the improvement to person_on_events enabled orgs for now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't get this, aren't there like only 4 poe orgs right now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. hogql allows for person properties to be filtered in the recordings event query which usually was not the case. Now, either we use person on events and only have this optimization working for the tiny subset until everything works, or I go through a refactor that splits up and does the right pushdown of person property filtering to the subquery when we detect a hogql based person property
Screenshot 2023-03-14 at 9 52 03 AM

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh I see, okay, good enough

query, query_params = self.get_query()
query_results = sync_execute(query, {**query_params, **self._filter.hogql_context.values})
session_recordings = self._data_to_return(query_results)
return self._paginate_results(session_recordings)
Loading