-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(recordings): Optimize recordings list query #14458
Changes from 23 commits
695b787
e2add4e
c7eff77
9253921
bce115e
5ce263f
68b86fc
eed5f89
f5edc76
d68ac85
034221a
f25e745
e9c559a
fc831d4
10efc7d
51893d6
a0c966e
5b1435d
36c0e01
da42388
24135c2
4bd7535
a22cfb1
b72a7f1
37238f8
e679735
6f8f06e
e9efea6
9562ebf
16d4cf5
fd7a7fd
6bd28d9
c9d18a2
9d746ea
3e5dbb4
695f328
4467504
5234c22
c52d00b
1d6e97f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Large diffs are not rendered by default.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Overall I think this is worth trying but reading it still makes me think "there must be a simpler way" :D I keep thinking why is not just
Definitely don't want to derail and at this point I'll take any improvements to the speed of this thing but these queries still confuse the hell out of me There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It should be that easy... why isn't it? :D What would this look like if you rebuilt it the same way the events query is built? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Bigger issue to wait on before spending time unnecessarily is person on events deployment. Once we're there, we can probably move this to @mariusandra's suggestion with less headache and a lot less joins to negotiate |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,7 @@ | |
from posthog.constants import TREND_FILTER_TYPE_ACTIONS | ||
from posthog.models import Entity | ||
from posthog.models.action.util import format_entity_filter | ||
from posthog.models.filters.mixins.utils import cached_property | ||
from posthog.models.filters.session_recordings_filter import SessionRecordingsFilter | ||
from posthog.models.property.util import parse_prop_grouped_clauses | ||
from posthog.models.utils import PersonPropertiesMode | ||
|
@@ -13,6 +14,8 @@ | |
|
||
|
||
class EventFiltersSQL(NamedTuple): | ||
non_aggregate_select_condition_clause: str | ||
aggregate_event_select_clause: str | ||
aggregate_select_clause: str | ||
aggregate_having_clause: str | ||
where_conditions: str | ||
|
@@ -159,6 +162,7 @@ def _determine_should_join_persons(self) -> None: | |
def _determine_should_join_events(self): | ||
return self._filter.entities and len(self._filter.entities) > 0 | ||
|
||
@cached_property | ||
def _get_properties_select_clause(self) -> str: | ||
clause = ( | ||
f", events.elements_chain as elements_chain" | ||
|
@@ -170,6 +174,7 @@ def _get_properties_select_clause(self) -> str: | |
) | ||
return clause | ||
|
||
@cached_property | ||
def _get_person_id_clause(self) -> Tuple[str, Dict[str, Any]]: | ||
person_id_clause = "" | ||
person_id_params = {} | ||
|
@@ -180,6 +185,7 @@ def _get_person_id_clause(self) -> Tuple[str, Dict[str, Any]]: | |
|
||
# We want to select events beyond the range of the recording to handle the case where | ||
# a recording spans the time boundaries | ||
@cached_property | ||
def _get_events_timestamp_clause(self) -> Tuple[str, Dict[str, Any]]: | ||
timestamp_clause = "" | ||
timestamp_params = {} | ||
|
@@ -191,6 +197,7 @@ def _get_events_timestamp_clause(self) -> Tuple[str, Dict[str, Any]]: | |
timestamp_params["event_end_time"] = self._filter.date_to + timedelta(hours=12) | ||
return timestamp_clause, timestamp_params | ||
|
||
@cached_property | ||
def _get_recording_start_time_clause(self) -> Tuple[str, Dict[str, Any]]: | ||
start_time_clause = "" | ||
start_time_params = {} | ||
|
@@ -202,12 +209,14 @@ def _get_recording_start_time_clause(self) -> Tuple[str, Dict[str, Any]]: | |
start_time_params["end_time"] = self._filter.date_to | ||
return start_time_clause, start_time_params | ||
|
||
@cached_property | ||
def session_ids_clause(self) -> Tuple[str, Dict[str, Any]]: | ||
if self._filter.session_ids is None: | ||
return "", {} | ||
|
||
return "AND session_id in %(session_ids)s", {"session_ids": self._filter.session_ids} | ||
|
||
@cached_property | ||
def _get_duration_clause(self) -> Tuple[str, Dict[str, Any]]: | ||
duration_clause = "" | ||
duration_params = {} | ||
|
@@ -260,7 +269,10 @@ def format_event_filter(self, entity: Entity, prepend: str, team_id: int) -> Tup | |
|
||
return filter_sql, params | ||
|
||
@cached_property | ||
def format_event_filters(self) -> EventFiltersSQL: | ||
non_aggregate_select_condition_clause = "" | ||
aggregate_event_select_clause = "" | ||
aggregate_select_clause = "" | ||
aggregate_having_clause = "" | ||
where_conditions = "AND event IN %(event_names)s" | ||
|
@@ -289,7 +301,14 @@ def format_event_filters(self) -> EventFiltersSQL: | |
|
||
params = {**params, "event_names": list(event_names_to_filter)} | ||
|
||
return EventFiltersSQL(aggregate_select_clause, aggregate_having_clause, where_conditions, params) | ||
return EventFiltersSQL( | ||
non_aggregate_select_condition_clause, | ||
aggregate_event_select_clause, | ||
aggregate_select_clause, | ||
aggregate_having_clause, | ||
where_conditions, | ||
params, | ||
) | ||
|
||
def get_query(self) -> Tuple[str, Dict[str, Any]]: | ||
offset = self._filter.offset or 0 | ||
|
@@ -300,12 +319,11 @@ def get_query(self) -> Tuple[str, Dict[str, Any]]: | |
self._filter.property_groups, person_id_joined_alias=f"{self.DISTINCT_ID_TABLE_ALIAS}.person_id" | ||
) | ||
|
||
events_timestamp_clause, events_timestamp_params = self._get_events_timestamp_clause() | ||
recording_start_time_clause, recording_start_time_params = self._get_recording_start_time_clause() | ||
session_ids_clause, session_ids_params = self.session_ids_clause() | ||
person_id_clause, person_id_params = self._get_person_id_clause() | ||
duration_clause, duration_params = self._get_duration_clause() | ||
properties_select_clause = self._get_properties_select_clause() | ||
events_timestamp_clause, events_timestamp_params = self._get_events_timestamp_clause | ||
recording_start_time_clause, recording_start_time_params = self._get_recording_start_time_clause | ||
session_ids_clause, session_ids_params = self.session_ids_clause | ||
person_id_clause, person_id_params = self._get_person_id_clause | ||
duration_clause, duration_params = self._get_duration_clause | ||
|
||
core_recordings_query = self._core_session_recordings_query.format( | ||
recording_start_time_clause=recording_start_time_clause, | ||
|
@@ -334,13 +352,9 @@ def get_query(self) -> Tuple[str, Dict[str, Any]]: | |
}, | ||
) | ||
|
||
event_filters = self.format_event_filters() | ||
event_filters = self.format_event_filters | ||
|
||
core_events_query = self._core_events_query.format( | ||
properties_select_clause=properties_select_clause, | ||
event_filter_where_conditions=event_filters.where_conditions, | ||
events_timestamp_clause=events_timestamp_clause, | ||
) | ||
core_events_query, core_events_query_params = self._get_core_events_query() | ||
|
||
return ( | ||
self._session_recordings_query_with_events.format( | ||
|
@@ -363,9 +377,24 @@ def get_query(self) -> Tuple[str, Dict[str, Any]]: | |
**recording_start_time_params, | ||
**event_filters.params, | ||
**session_ids_params, | ||
**core_events_query_params, | ||
}, | ||
) | ||
|
||
def _get_core_events_query(self) -> Tuple[str, Dict[str, Any]]: | ||
params: Dict[str, Any] = {} | ||
event_filters = self.format_event_filters | ||
properties_select_clause = self._get_properties_select_clause | ||
events_timestamp_clause, events_timestamp_params = self._get_events_timestamp_clause | ||
|
||
core_events_query = self._core_events_query.format( | ||
properties_select_clause=properties_select_clause, | ||
event_filter_where_conditions=event_filters.where_conditions, | ||
events_timestamp_clause=events_timestamp_clause, | ||
) | ||
|
||
return core_events_query, {**params, **events_timestamp_params} | ||
|
||
def _paginate_results(self, session_recordings) -> SessionRecordingQueryResult: | ||
more_recordings_available = False | ||
if len(session_recordings) > self.limit: | ||
|
@@ -406,3 +435,143 @@ def run(self, *args, **kwargs) -> SessionRecordingQueryResult: | |
query_results = sync_execute(query, {**query_params, **self._filter.hogql_context.values}) | ||
session_recordings = self._data_to_return(query_results) | ||
return self._paginate_results(session_recordings) | ||
|
||
|
||
class SessionRecordingListV2(SessionRecordingList): | ||
|
||
_core_events_query = """ | ||
SELECT | ||
uuid, | ||
distinct_id, | ||
event, | ||
team_id, | ||
timestamp, | ||
$session_id as session_id, | ||
$window_id as window_id | ||
{properties_select_clause} | ||
{non_aggregate_select_condition_clause} | ||
|
||
FROM events | ||
WHERE | ||
team_id = %(team_id)s | ||
{event_filter_where_conditions} | ||
{events_timestamp_clause} | ||
AND notEmpty(session_id) | ||
""" | ||
|
||
_core_events_query_grouped = """ | ||
SELECT | ||
session_id, | ||
distinct_id | ||
{aggregate_event_select_clause} | ||
FROM ( | ||
{ungrouped_core_events_query} | ||
) GROUP BY session_id, distinct_id | ||
HAVING 1 = 1 | ||
{event_filter_aggregate_having_clause} | ||
""" | ||
|
||
_session_recordings_query_with_events = """ | ||
SELECT | ||
session_recordings.session_id, | ||
start_time, | ||
end_time, | ||
click_count, | ||
keypress_count, | ||
urls, | ||
duration, | ||
session_recordings.distinct_id as distinct_id | ||
{event_filter_aggregate_select_clause} | ||
FROM ( | ||
{core_events_query} | ||
) AS events | ||
JOIN ( | ||
{core_recordings_query} | ||
) AS session_recordings | ||
ON session_recordings.session_id = events.session_id | ||
{recording_person_query} | ||
WHERE | ||
session_recordings.distinct_id == events.distinct_id | ||
{prop_filter_clause} | ||
{person_id_clause} | ||
ORDER BY start_time DESC | ||
LIMIT %(limit)s OFFSET %(offset)s | ||
""" | ||
|
||
def _get_core_events_query(self) -> Tuple[str, Dict[str, Any]]: | ||
params: Dict[str, Any] = {} | ||
event_filters = self.format_event_filters | ||
properties_select_clause = self._get_properties_select_clause | ||
events_timestamp_clause, events_timestamp_params = self._get_events_timestamp_clause | ||
|
||
core_events_query = self._core_events_query.format( | ||
properties_select_clause=properties_select_clause, | ||
non_aggregate_select_condition_clause=event_filters.non_aggregate_select_condition_clause, | ||
event_filter_where_conditions=event_filters.where_conditions, | ||
events_timestamp_clause=events_timestamp_clause, | ||
) | ||
|
||
grouped_events_query = self._core_events_query_grouped.format( | ||
aggregate_event_select_clause=event_filters.aggregate_event_select_clause, | ||
ungrouped_core_events_query=core_events_query, | ||
event_filter_aggregate_having_clause=event_filters.aggregate_having_clause, | ||
) | ||
|
||
return grouped_events_query, {**params, **events_timestamp_params} | ||
|
||
@cached_property | ||
def format_event_filters(self) -> EventFiltersSQL: | ||
non_aggregate_select_condition_clause = "" | ||
aggregate_event_select_clause = "" | ||
aggregate_select_clause = "" | ||
aggregate_having_clause = "" | ||
where_conditions = "AND event IN %(event_names)s" | ||
# Always include $pageview events so the start_url and end_url can be extracted | ||
event_names_to_filter: List[Union[int, str]] = [] | ||
|
||
params: Dict = {} | ||
|
||
for index, entity in enumerate(self._filter.entities): | ||
if entity.type == TREND_FILTER_TYPE_ACTIONS: | ||
action = entity.get_action() | ||
event_names_to_filter.extend([ae for ae in action.get_step_events() if ae not in event_names_to_filter]) | ||
else: | ||
if entity.id not in event_names_to_filter: | ||
event_names_to_filter.append(entity.id) | ||
|
||
condition_sql, filter_params = self.format_event_filter( | ||
entity, prepend=f"event_matcher_{index}", team_id=self._team_id | ||
) | ||
aggregate_event_select_clause += f""" | ||
, groupUniqArrayIf(100)((timestamp, uuid, session_id, window_id), event_match_{index} = 1) AS matching_events_{index} | ||
, sum(event_match_{index}) AS matches_{index} | ||
""" | ||
|
||
aggregate_select_clause += f""" | ||
, matches_{index} | ||
, matching_events_{index} | ||
""" | ||
|
||
non_aggregate_select_condition_clause += f""" | ||
, if({condition_sql}, 1, 0) as event_match_{index} | ||
""" | ||
aggregate_having_clause += f"\nAND matches_{index} > 0" | ||
params = {**params, **filter_params} | ||
|
||
params = {**params, "event_names": list(event_names_to_filter)} | ||
|
||
return EventFiltersSQL( | ||
non_aggregate_select_condition_clause, | ||
aggregate_event_select_clause, | ||
aggregate_select_clause, | ||
aggregate_having_clause, | ||
where_conditions, | ||
params, | ||
) | ||
|
||
def run(self, *args, **kwargs) -> SessionRecordingQueryResult: | ||
self._filter.hogql_context.using_person_on_events = True | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This connection to hogql restricts the improvement to person_on_events enabled orgs for now. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. don't get this, aren't there like only 4 poe orgs right now? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah. hogql allows for person properties to be filtered in the recordings event query which usually was not the case. Now, either we use person on events and only have this optimization working for the tiny subset until everything works, or I go through a refactor that splits up and does the right pushdown of person property filtering to the subquery when we detect a hogql based person property There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ugh I see, okay, good enough |
||
query, query_params = self.get_query() | ||
query_results = sync_execute(query, {**query_params, **self._filter.hogql_context.values}) | ||
session_recordings = self._data_to_return(query_results) | ||
return self._paginate_results(session_recordings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we do this on the frontend instead and control it via a query param to the API? Makes it much easier to test back and forth in a live situation to compare results etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, on second thought, I don't want to add filters to the object and muddy our caching for insights. Do you think it's worth it for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't suggesting adding it to filters (not everything has to go in that mega object 😅 ) - just a standard query param. I pushed the change in the way I meant so now we just check a query param which makes it super easy to flip back and forth without having to change the feature flag all the time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yep, we can do that. Too focused on the filter object 😬