Filter data sources based on partitions and clustering where possible #1827

scholtzan · 2023-07-10T15:59:38Z

A significant cost saving can be achieved by querying data sources only for the relevant partitions or clusters. For example, an experiment that runs only on nightly should query main_v4 only for nightly data. Even more can be saved when querying events by making sure that only the event_category values with relevant data get read.

Currently, these optimizations have to be made manually in custom configs. Most users are not familiar with this, so it would be good to provide an automated or more guided way of applying them.

The cost savings are quite significant here (cost can often be cut by up to 10x).
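
To make the idea concrete, here is a minimal sketch of what an automated filter could look like. It is not how jetstream or metric-hub work today: the helper `filtered_from_expression`, the bare table name `main_v4` and the column name `normalized_channel` are all assumptions made for illustration.

```python
def filtered_from_expression(table: str, filters: dict[str, list[str]]) -> str:
    """Wrap `table` in a subquery that only keeps rows matching `filters`.

    `filters` maps a partitioning/clustering column (hypothetical names)
    to the values that are relevant for the experiment.
    """
    if not filters:
        # Nothing to restrict on: fall back to the unfiltered table.
        return table

    clauses = []
    for column, values in filters.items():
        # Quote each value as a SQL string literal, escaping single quotes.
        quoted = ", ".join("'" + v.replace("'", "\\'") + "'" for v in values)
        clauses.append(f"{column} IN ({quoted})")

    where = " AND ".join(clauses)
    return f"(SELECT * FROM {table} WHERE {where})"


# Hypothetical usage: an experiment that only runs on nightly.
nightly_main = filtered_from_expression(
    "main_v4", {"normalized_channel": ["nightly"]}
)
```

The resulting string could then be used wherever a custom config currently hand-writes the filtered subquery, with the filter values derived from the experiment's own targeting.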

┆Issue is synchronized with this Jira Task


ncalexan · 2023-07-13T17:40:24Z

I will emphasize this for events: a custom data source with the event_category filter is orders of magnitude more efficient than the existing events data source. See, e.g., mozilla/metric-hub@a4f3625.

What I'd like to see is some thinking about how the data source TOML can accommodate this. We're hitting a BigQuery pessimization where the order of the filters matters. Can we "parameterize" data sources, so that using the events data source requires something like events('event_category') or similar? Can we deprecate events entirely, so that it's clearer to custom analysis writers that the custom data source pattern is the way to go? (When doing this myself, I discovered the pattern "by hand"; the existing code base is actively misleading because it doesn't do this.)
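
A rough sketch of the "parameterized" idea, written in Python rather than TOML only to show the shape. Everything here is hypothetical: the `DataSource` class, the `events()` factory and the `events_table` placeholder do not exist in the current code base, and the sketch does not try to settle which filter order BigQuery prefers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """Hypothetical stand-in for a data source definition."""
    name: str
    from_expression: str


def events(*event_categories: str, table: str = "events_table") -> DataSource:
    """Build an events data source that is always filtered by event_category.

    Calling it without a category is an error, so an unfiltered events
    query cannot be created by accident.
    """
    if not event_categories:
        raise ValueError("events() requires at least one event_category")

    for category in event_categories:
        # Only allow simple identifiers so the value is safe to inline in SQL.
        if not category.replace("_", "").replace(".", "").isalnum():
            raise ValueError(f"unexpected event_category: {category!r}")

    quoted = ", ".join(f"'{c}'" for c in event_categories)
    name = "events_" + "_".join(c.replace(".", "_") for c in event_categories)
    # The filter is generated in one place, so its position in the query
    # can be adjusted once for every caller.
    expression = f"(SELECT * FROM {table} WHERE event_category IN ({quoted}))"
    return DataSource(name=name, from_expression=expression)


# Hypothetical usage with a made-up category name.
source = events("my_category")
```

Because the category is a required argument, the efficient pattern becomes the only way to use events, instead of something each analysis writer has to discover by hand.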

scholtzan mentioned this issue Feb 12, 2024

Update macos-background-tab-power-savings.toml mozilla/metric-hub#358

Merged
