Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dev/incremental media model #2

Merged

Conversation

agnessnowplow
Copy link
Contributor

This is the Redshift only incremental model currently working together with the web package.

The _interactions_this_run table follows the web model`s incremental logic and takes snowplow_web_base_events_this_run as an input to then add the various contexts to enrich the base table with the additional media related fields.

The _plays_by_pageview_this_run table aggregates the _interactions_this_run'table and serves as a basis for the incrementalised derived table _plays_by_pageview.

The main _media_stats derived table will also be updated incrementally based on the _plays_by_pageview derived table, on a play by pageview basis after a set time window. This is to prevent complex and expensive queries due to metrics which need to take the whole page_view events into calculation. This way the metrics will only be calculated once per pageview/media, after no new events are expected.

The additional _pivot_base table is there to calculate the percent_progress boundaries and weights that are used to calculate the total play time and other related media fields.

Once we are happy about this way of incrementalisation, I will update the schema names (only the media_stats and plays_by_pageview should be in the derived schema, the rest could go to scratch, to follow the conventions)

For testing, I ran dbt test which passed and I also compared the results with the drop and recompute model by taking 18-19-20 Jan data and running the incremental model with 1 day batches. I have adjusted some of the metrics calculations during this exercise so the two models diverged in parts at the moment.

Copy link
Contributor

@emielver emielver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few suggestions around simplifying and keeping things more consistent but it looks good! I also have a lingering question about the incremental functionality of the media_stats table but for the rest it's looking great!

packages.yml Outdated Show resolved Hide resolved
models/web/snowplow_media_plays_by_pageview_this_run.sql Outdated Show resolved Hide resolved
models/web/snowplow_media_pivot_base.sql Outdated Show resolved Hide resolved
models/web/snowplow_media_media_stats.sql Outdated Show resolved Hide resolved
models/web/snowplow_media_media_stats.sql Outdated Show resolved Hide resolved
models/web/snowplow_media_interactions_this_run.sql Outdated Show resolved Hide resolved
models/custom/snowplow_media_user_stats.sql Outdated Show resolved Hide resolved
dbt_project.yml Show resolved Hide resolved
dbt_project.yml Outdated Show resolved Hide resolved
@agnessnowplow
Copy link
Contributor Author

Hi Emiel! I have left comments above for all the recommended changes and questions you or Will had and made modifications accordingly. I also added documentation within snowplow_media_player_overview.md and README.md. As I was going through the guide I also realised I need some slight changes, for instance I added a selectors.yml file. I always use the prod db for testing purposes because that has correct data with enough for sampling. I checked 18-20 Jan period when I have data in consecutive days for testing incremental runs with daily limits but once we are happy with the calculations I will do a full testing with all the data just in case. When testing I use it as intended, in a new dbt project importing it as a package and running it together with the web model. Just noticed I left scratch_agnes as schema name, will make another commit shortly to correct this.

@emielver
Copy link
Contributor

emielver commented May 3, 2022

Hey Agnes, thanks for this and great job on everything so far!! I have added some small QOL suggestions with regards to the documentation, as well as directory conventions that we seem to be following in the other packages. I also pointed out what I think is an issue with an average calculation in one of the custom models, and I'm still a bit confused by the snowplow_media_player_media_stats model's construction. Maybe we could have a chat about that before I finish the review but everything else should be actionable! And wrt schema stuff, you can also change the value in your dbt profiles.yml of the schema variable which will prefix scratch and derived with whatever you put, e.g. dbt_agnes so then you won't need to change the dbt_project.yml when testing locally!

docs/markdown/snowplow_media_player_overview.md Outdated Show resolved Hide resolved
docs/markdown/snowplow_media_player_overview.md Outdated Show resolved Hide resolved
docs/markdown/snowplow_media_player_overview.md Outdated Show resolved Hide resolved
docs/markdown/snowplow_media_player_overview.md Outdated Show resolved Hide resolved
docs/markdown/snowplow_media_player_overview.md Outdated Show resolved Hide resolved
models/web/scratch/snowplow_media_player_base_this_run.sql Outdated Show resolved Hide resolved
models/custom/snowplow_media_player_session_stats.sql Outdated Show resolved Hide resolved
models/custom/snowplow_media_player_user_stats.sql Outdated Show resolved Hide resolved
models/custom/snowplow_media_player_user_stats.sql Outdated Show resolved Hide resolved
models/web/snowplow_media_player_media_stats.sql Outdated Show resolved Hide resolved
@emielver
Copy link
Contributor

emielver commented May 4, 2022

@agnessnowplow
There's also some issues with building the pages. Can you add a empty .nojekyll file to the docs folder, and can you do a docs generate and copy over the ['catalog.json', 'index.html', 'manifest.json', 'run_results.json'] files from your target folder to your docs folder after the docs have been generated, in the same vein as https://github.com/snowplow/dbt-snowplow-mobile/tree/main/docs
Also you should be admin now so you can check/build the docs yourself as well!

models/web/sources.yml Outdated Show resolved Hide resolved
models/web/schema.yml Outdated Show resolved Hide resolved
@github-pages github-pages bot temporarily deployed to github-pages May 9, 2022 18:40 Inactive
@github-pages github-pages bot temporarily deployed to github-pages May 10, 2022 10:17 Inactive
@agnessnowplow
Copy link
Contributor Author

I have run the model within dev1 so that I can replace the docs for the site (previously it was all within one schema). I have also added more fields for interactions_this_run that are not used in the model but users might want to take advantage of them in their custom models. I also made a correction within base_this_run because the retention_rate was not taking the correct value in case the percent_progress was calculated from the ended event.

Copy link
Contributor

@emielver emielver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On two dbt run commands and a dbt test I get two failed unique tests for event_id in snowplow_media_player_interactions_this_run and snowplow_web_base_events_this_run. Is this something you encountered? If this is resolved I'm happy to give the go ahead!

Copy link
Contributor

@emielver emielver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errors in previous comment were due to duplicate source data, model seems to work well!! Great job on this!

@github-pages github-pages bot temporarily deployed to github-pages May 12, 2022 14:49 Inactive
@agnessnowplow
Copy link
Contributor Author

Thanks for your review! I fixed the bugs I came across: as discussed separately, I changed the ephemeral models to tables, including the docs so that the manifest table gets updated and the incremental runs don't get stuck. I also changed the full outer join back to account for the scenario when there is no data for the incremental run, which should preserve any pre-existing percent progress values from previous runs.

@github-pages github-pages bot temporarily deployed to github-pages May 12, 2022 15:52 Inactive
Copy link
Contributor

@emielver emielver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just add the sort and dist keys and then it should be okay!

models/custom/snowplow_media_player_session_stats.sql Outdated Show resolved Hide resolved
models/custom/snowplow_media_player_user_stats.sql Outdated Show resolved Hide resolved
@github-pages github-pages bot temporarily deployed to github-pages May 12, 2022 17:04 Inactive
custom:
+schema: "scratch"
+tags: "snowplow_web_incremental"
+enabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enabled should be triggered by a variable, which should be defined in your vars and in the documentation so that users can easily enable/disable these custom tables, right? Or how else would we expect users to build off of them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I always do for testing, you just overwrite it in your own dbt project's project.yml file and explained in the docs so they should be ok I think (?):

By default these are disabled, but you can enable them in the project's profiles.yml, if needed.

yml
# dbt_project.yml
...
  models:
    snowplow_media_player:
      custom:
        enabled: true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, although I do think we should change it to being behind a variable once this dbt bug is fixed dbt-labs/dbt-core#3698 to align with the web and mobile packages and their optional modules, but it's good for now!

Copy link
Contributor

@emielver emielver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-pages github-pages bot temporarily deployed to github-pages May 13, 2022 09:21 Inactive
@github-pages github-pages bot temporarily deployed to github-pages May 13, 2022 09:28 Inactive
Copy link
Contributor

@emielver emielver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@agnessnowplow agnessnowplow force-pushed the dev/incremental_media_model branch from 78c0622 to 21d785b Compare May 13, 2022 10:09
@github-pages github-pages bot temporarily deployed to github-pages May 13, 2022 10:09 Inactive
@agnessnowplow agnessnowplow merged commit 2c21ebf into feature/incremental_media_model May 13, 2022
@emielver emielver deleted the dev/incremental_media_model branch August 26, 2022 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants