
feat: write recording summary events #15245

Merged May 9, 2023 (46 commits)
Conversation

pauldambra (Member) commented Apr 25, 2023

Problem

see #15200 (comment)

When we store session recording events we materialize a lot of information from the snapshot data column.

We'll soon stop storing snapshot data, so we won't be able to materialize that information from it; we need to capture it earlier in the pipeline. Since this information is only used for searching for and summarizing recordings, we don't need to store every event.

Changes

We'll push a summary event to a new Kafka topic during ingestion. ClickHouse can ingest from that topic into an AggregatingMergeTree table, so that we store (in theory, although not in practice) only one row per session.

  • add config to turn this on and off by team in plugin server
  • add code behind that to write session recording summary events to a new topic in kafka
  • add ClickHouse tables to ingest and aggregate those summary events
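A rough sketch of the ingestion-side summarisation described above. Everything here is a hypothetical illustration — the event shape, field names, and `summarise` helper are assumptions for this sketch, not the actual plugin-server code:

```typescript
// Sketch only: shape and field names are assumptions, not PostHog's real code.
// The resulting row would be produced to the new Kafka summary topic.
interface SummaryEvent {
    session_id: string
    team_id: number
    distinct_id: string
    first_timestamp: string
    last_timestamp: string
    first_url: string | null
    click_count: number
    keypress_count: number
    mouse_activity_count: number
}

// Reduce a batch of snapshot events for one session into a single summary row.
function summarise(sessionId: string, teamId: number, distinctId: string, events: any[]): SummaryEvent {
    // ISO-8601 timestamps sort correctly as strings
    const timestamps = events.map((e) => e.timestamp).sort()
    return {
        session_id: sessionId,
        team_id: teamId,
        distinct_id: distinctId,
        first_timestamp: timestamps[0],
        last_timestamp: timestamps[timestamps.length - 1],
        first_url: events.find((e) => e.data?.href)?.data.href ?? null,
        click_count: events.filter((e) => e.data?.type === 'click').length,
        keypress_count: events.filter((e) => e.data?.type === 'keypress').length,
        mouse_activity_count: events.filter((e) => e.data?.type === 'mousemove').length,
    }
}
```

Because the downstream table aggregates with min/max/sum, rows like this can be emitted per batch and still collapse to (roughly) one row per session.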

How did you test this code?

some developer tests and running it locally


benjackwhite (Contributor) left a comment

Whoop, that was fast! Lots of changes needed, but hopefully these are useful comments after the first draft.

plugin-server/src/config/kafka-topics.ts (resolved)
keypressCount += 1
}
}
if (!!eventSummary.data.href?.trim().length && url === undefined) {
benjackwhite (Contributor):

I never know if this is the case, but doesn't the missing ? after trim mean that it could throw if there is no href?

Also why not use truthiness and keep it simple :D

Suggested change:
- if (!!eventSummary.data.href?.trim().length && url === undefined) {
+ if (!url && eventSummary.data.href?.trim()) {

pauldambra (Member, Author):
The elvis operator says: if the thing is there, carry on; if not, return undefined.


I prefer the explicit !! to remind myself I'm using truthiness, but no strong opinion. I'll see what the tests think of the suggestion and go from there :)

plugin-server/src/worker/ingestion/process-event.ts (resolved)
plugin-server/tests/main/process-event.test.ts (resolved)
posthog/models/session_replay_event/sql.py (resolved)
pauldambra (Member, Author) commented Apr 26, 2023

Can investigate aggregations with:

drop table if exists testing_things;

CREATE TABLE IF NOT EXISTS testing_things
(
    session_id VARCHAR,
    team_id Int64,
    distinct_id VARCHAR,
    timestamp DateTime64(6, 'UTC'),
    first_timestamp AggregateFunction(min, DateTime64(6, 'UTC')),
    last_timestamp AggregateFunction(max, DateTime64(6, 'UTC')),
    first_url Nullable(VARCHAR),
    click_count SimpleAggregateFunction(sum, Int64),
    keypress_count SimpleAggregateFunction(sum, Int64),
    mouse_activity_count SimpleAggregateFunction(sum, Int64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (team_id, session_id, toStartOfHour(timestamp));

insert into testing_things
select
    'wat',
    1,
    'wat',
    '2021-12-05T00:10:00',
    minState(toDateTime64('2021-12-05T00:10:00', 6, 'UTC')),
    maxState(toDateTime64('2021-12-05T00:10:00', 6, 'UTC')),
    anyState(null),
    4,
    8,
    12;

select
    session_id,
    team_id,
    any(distinct_id),
    minMerge(first_timestamp),
    maxMerge(last_timestamp),
    sum(click_count),
    sum(keypress_count),
    sum(mouse_activity_count)
from testing_things
group by session_id, team_id;
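The merge semantics the SQL above relies on can be sketched in plain code. This is a hypothetical model of what the aggregation does for these columns — min over first timestamps, max over last, sums over the counters — not ClickHouse itself:

```typescript
// Hypothetical model of merging multiple partial rows for one session,
// mirroring minMerge/maxMerge and the SimpleAggregateFunction(sum, ...)
// columns in the table above. Illustration only.
interface SessionRow {
    first_timestamp: string // ISO-8601, so string comparison orders correctly
    last_timestamp: string
    click_count: number
    keypress_count: number
    mouse_activity_count: number
}

function mergeRows(rows: SessionRow[]): SessionRow {
    return rows.reduce((acc, row) => ({
        first_timestamp: acc.first_timestamp < row.first_timestamp ? acc.first_timestamp : row.first_timestamp,
        last_timestamp: acc.last_timestamp > row.last_timestamp ? acc.last_timestamp : row.last_timestamp,
        click_count: acc.click_count + row.click_count,
        keypress_count: acc.keypress_count + row.keypress_count,
        mouse_activity_count: acc.mouse_activity_count + row.mouse_activity_count,
    }))
}
```

This is why several summary events per session are fine: however many partial rows land in the table, merging them yields the same single summary.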

benjackwhite (Contributor) left a comment

Looking better!

plugin-server/src/worker/ingestion/process-event.ts (resolved)
plugin-server/tests/main/process-event.test.ts (resolved)
pauldambra (Member, Author)

Hmmm, so close

pauldambra (Member, Author)

I was using the same columns for Kafka and the aggregating table because I'm a fool.

@pauldambra pauldambra marked this pull request as ready for review April 27, 2023 20:13
benjackwhite (Contributor) left a comment

To me this looks solid now 👍

pauldambra (Member, Author)

@hazzadous I don't think the failing plugin-server functional tests here are due to my change: https://github.com/PostHog/posthog/actions/runs/4884989857/jobs/8719065927

Should I ignore them?

@pauldambra pauldambra mentioned this pull request May 8, 2023
@pauldambra pauldambra enabled auto-merge (squash) May 9, 2023 14:38
@pauldambra pauldambra merged commit 067d73c into master May 9, 2023
@pauldambra pauldambra deleted the feat/write-recording-summary branch May 9, 2023 14:41
thmsobrmlr pushed a commit that referenced this pull request May 10, 2023