Dev/incremental media model #2
Conversation
Just a few suggestions around simplifying and keeping things more consistent, but it looks good! I also have a lingering question about the incremental functionality of the `media_stats` table, but for the rest it's looking great!
Hi Emiel! I have left comments above for all the recommended changes and questions you or Will had, and made modifications accordingly. I also added documentation within
Hey Agnes, thanks for this and great job on everything so far!! I have added some small QOL suggestions with regards to the documentation, as well as directory conventions that we seem to be following in the other packages. I also pointed out what I think is an issue with an average calculation in one of the custom models, and I'm still a bit confused by the
@agnessnowplow
I have run the model within dev1 so that I can replace the docs for the site (previously it was all within one schema). I have also added more fields to `interactions_this_run` that are not used in the model but that users might want to take advantage of in their custom models. I also made a correction within `base_this_run` because the `retention_rate` was not taking the correct value in case the `percent_progress` was calculated from the ended event.
On two `dbt run` commands and a `dbt test` I get two failed `unique` tests for `event_id`, in `snowplow_media_player_interactions_this_run` and `snowplow_web_base_events_this_run`. Is this something you encountered? If this is resolved I'm happy to give the go-ahead!
The errors in my previous comment were due to duplicate source data; the model seems to work well!! Great job on this!
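For reference, a common way to guard against duplicated source data like this in a dbt model is to deduplicate before the unique test runs. The sketch below is illustrative and not part of the package; table and column names are assumptions:

```sql
-- Illustrative dedup pattern (not from the package): keep one row per
-- event_id, preferring the earliest collector timestamp, so a downstream
-- unique test on event_id passes even when the source contains duplicates.
select *
from (
    select
        e.*,
        row_number() over (
            partition by e.event_id
            order by e.collector_tstamp
        ) as dedup_rn
    from events as e
) as deduped
where dedup_rn = 1
```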
Thanks for your review! I fixed the bugs I came across: as discussed separately, I changed the ephemeral models to tables (including the docs) so that the manifest table gets updated and the incremental runs don't get stuck. I also changed the full outer join back, to account for the scenario where there is no data for the incremental run, which should preserve any pre-existing percent_progress values from previous runs.
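As a sketch of that full-outer-join pattern (the table and column names here are illustrative, not the package's actual ones):

```sql
-- Illustrative: merge this run's data with the existing derived table so
-- that, when the current batch has no rows for a given play, the
-- pre-existing percent_progress value from previous runs is preserved.
select
    coalesce(cur.play_id, prev.play_id) as play_id,
    greatest(
        coalesce(cur.percent_progress, 0),
        coalesce(prev.percent_progress, 0)
    ) as percent_progress
from plays_this_run as cur
full outer join plays_derived as prev
    on cur.play_id = prev.play_id
```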
Just add the sort and dist keys and then it should be okay!
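For reference, sort and dist keys can be set in the dbt model config; the key columns below are assumptions for illustration, not necessarily the ones the package should use:

```yml
# Illustrative Redshift config -- the actual key columns are assumptions.
models:
  snowplow_media_player:
    snowplow_media_player_media_stats:
      +dist: media_id
      +sort: last_play
```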
```yml
custom:
  +schema: "scratch"
  +tags: "snowplow_web_incremental"
  +enabled: false
```
This `enabled` should be triggered by a variable, which should be defined in your vars and in the documentation, so that users can easily enable/disable these custom tables, right? Or how else would we expect users to build off of them?
This is what I always do for testing: you just overwrite it in your own dbt project's `dbt_project.yml` file, and it's explained in the docs, so they should be OK I think (?):

> By default these are disabled, but you can enable them in the project's `dbt_project.yml`, if needed.

```yml
# dbt_project.yml
...
models:
  snowplow_media_player:
    custom:
      enabled: true
```
Fair enough, although I do think we should change it to being behind a variable once this dbt bug is fixed (dbt-labs/dbt-core#3698), to align with the web and mobile packages and their optional modules. But it's good for now!
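Once that bug is fixed, the variable-driven version might look something like this; the variable name is an assumption for illustration:

```yml
# Hypothetical variable-driven enable/disable; the var name is illustrative.
models:
  snowplow_media_player:
    custom:
      +enabled: "{{ var('snowplow__enable_custom', false) }}"
```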
LGTM
LGTM!
78c0622 to 21d785b
This is the Redshift-only incremental model, currently working together with the web package.

The `_interactions_this_run` table follows the web model's incremental logic and takes `snowplow_web_base_events_this_run` as an input, then adds the various contexts to enrich the base table with the additional media-related fields.

The `_plays_by_pageview_this_run` table aggregates the `_interactions_this_run` table and serves as a basis for the incrementalised derived table `_plays_by_pageview`.

The main `_media_stats` derived table will also be updated incrementally based on the `_plays_by_pageview` derived table, on a play-by-pageview basis after a set time window. This is to prevent complex and expensive queries caused by metrics which need to take all of a page view's events into the calculation. This way the metrics will only be calculated once per pageview/media, after no new events are expected.

The additional `_pivot_base` table is there to calculate the percent_progress boundaries and weights that are used to calculate the total play time and other related media fields.

Once we are happy with this way of incrementalisation, I will update the schema names (only `_media_stats` and `_plays_by_pageview` should be in the derived schema; the rest could go to scratch, to follow the conventions).

For testing, I ran `dbt test`, which passed, and I also compared the results with the drop-and-recompute model by taking 18-19-20 Jan data and running the incremental model with 1-day batches. I adjusted some of the metrics calculations during this exercise, so the two models diverge in parts at the moment.
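The pivot-base weighting described above can be sketched roughly like this; the boundaries, weights, and table/column names are illustrative assumptions, not the package's actual values:

```sql
-- Illustrative: each percent_progress boundary reached is weighted by the
-- share of the media it covers (the distance to the next boundary), and the
-- weighted duration is summed to approximate total play time.
with pivot_base as (
    select 0 as percent_progress, 10 as weight union all
    select 10, 15 union all
    select 25, 25 union all
    select 50, 25 union all
    select 75, 25
)
select
    i.play_id,
    sum(p.weight / 100.0 * i.duration_sec) as play_time_sec
from interactions_this_run as i
join pivot_base as p
    on i.percent_progress = p.percent_progress
group by 1
```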