Create shredder_targets_joined table with additional table metadata #6331

Merged
merged 1 commit into from
Oct 11, 2024

@BenWu BenWu commented Oct 10, 2024

Description

This adds some table metadata to monitoring_derived.shredder_targets_v1, which was created in #6289. I decided to do this in an additional derived table so there isn't a bunch of SQL embedded in the Python script.

Reviewer, please follow this checklist


@dataops-ci-bot

Integration report for "Create shredder_targets_joined table with additional table metadata"

sql.diff

diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/dags/bqetl_monitoring.py /tmp/workspace/generated-sql/dags/bqetl_monitoring.py
--- /tmp/workspace/main-generated-sql/dags/bqetl_monitoring.py	2024-10-10 20:42:19.000000000 +0000
+++ /tmp/workspace/generated-sql/dags/bqetl_monitoring.py	2024-10-10 20:44:32.000000000 +0000
@@ -112,6 +112,20 @@
         email=["ascholtz@mozilla.com", "wichan@mozilla.com"],
     )
 
+    with TaskGroup(
+        "monitoring_derived__bigquery_table_storage__v1_external",
+    ) as monitoring_derived__bigquery_table_storage__v1_external:
+        ExternalTaskMarker(
+            task_id="bqetl_shredder_monitoring__wait_for_monitoring_derived__bigquery_table_storage__v1",
+            external_dag_id="bqetl_shredder_monitoring",
+            external_task_id="wait_for_monitoring_derived__bigquery_table_storage__v1",
+            execution_date="{{ (execution_date - macros.timedelta(days=-1, seconds=50400)).isoformat() }}",
+        )
+
+        monitoring_derived__bigquery_table_storage__v1_external.set_upstream(
+            monitoring_derived__bigquery_table_storage__v1
+        )
+
     monitoring_derived__bigquery_table_storage_timeline_daily__v1 = GKEPodOperator(
         task_id="monitoring_derived__bigquery_table_storage_timeline_daily__v1",
         arguments=[
@@ -136,6 +150,20 @@
         email=["ascholtz@mozilla.com", "wichan@mozilla.com"],
     )
 
+    with TaskGroup(
+        "monitoring_derived__bigquery_tables_inventory__v1_external",
+    ) as monitoring_derived__bigquery_tables_inventory__v1_external:
+        ExternalTaskMarker(
+            task_id="bqetl_shredder_monitoring__wait_for_monitoring_derived__bigquery_tables_inventory__v1",
+            external_dag_id="bqetl_shredder_monitoring",
+            external_task_id="wait_for_monitoring_derived__bigquery_tables_inventory__v1",
+            execution_date="{{ (execution_date - macros.timedelta(days=-1, seconds=50400)).isoformat() }}",
+        )
+
+        monitoring_derived__bigquery_tables_inventory__v1_external.set_upstream(
+            monitoring_derived__bigquery_tables_inventory__v1
+        )
+
     monitoring_derived__bigquery_usage__v1 = GKEPodOperator(
         task_id="monitoring_derived__bigquery_usage__v1",
         arguments=[
@@ -218,6 +246,20 @@
         email=["ascholtz@mozilla.com", "mhirose@mozilla.com"],
     )
 
+    with TaskGroup(
+        "monitoring_derived__jobs_by_organization__v1_external",
+    ) as monitoring_derived__jobs_by_organization__v1_external:
+        ExternalTaskMarker(
+            task_id="bqetl_shredder_monitoring__wait_for_monitoring_derived__jobs_by_organization__v1",
+            external_dag_id="bqetl_shredder_monitoring",
+            external_task_id="wait_for_monitoring_derived__jobs_by_organization__v1",
+            execution_date="{{ (execution_date - macros.timedelta(days=-1, seconds=50400)).isoformat() }}",
+        )
+
+        monitoring_derived__jobs_by_organization__v1_external.set_upstream(
+            monitoring_derived__jobs_by_organization__v1
+        )
+
     monitoring_derived__schema_error_counts__v2 = bigquery_etl_query(
         task_id="monitoring_derived__schema_error_counts__v2",
         destination_table="schema_error_counts_v2",
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/dags/bqetl_shredder_monitoring.py /tmp/workspace/generated-sql/dags/bqetl_shredder_monitoring.py
--- /tmp/workspace/main-generated-sql/dags/bqetl_shredder_monitoring.py	2024-10-10 20:42:19.000000000 +0000
+++ /tmp/workspace/generated-sql/dags/bqetl_shredder_monitoring.py	2024-10-10 20:44:35.000000000 +0000
@@ -52,6 +52,42 @@
     tags=tags,
 ) as dag:
 
+    wait_for_monitoring_derived__bigquery_table_storage__v1 = ExternalTaskSensor(
+        task_id="wait_for_monitoring_derived__bigquery_table_storage__v1",
+        external_dag_id="bqetl_monitoring",
+        external_task_id="monitoring_derived__bigquery_table_storage__v1",
+        execution_delta=datetime.timedelta(seconds=36000),
+        check_existence=True,
+        mode="reschedule",
+        allowed_states=ALLOWED_STATES,
+        failed_states=FAILED_STATES,
+        pool="DATA_ENG_EXTERNALTASKSENSOR",
+    )
+
+    wait_for_monitoring_derived__bigquery_tables_inventory__v1 = ExternalTaskSensor(
+        task_id="wait_for_monitoring_derived__bigquery_tables_inventory__v1",
+        external_dag_id="bqetl_monitoring",
+        external_task_id="monitoring_derived__bigquery_tables_inventory__v1",
+        execution_delta=datetime.timedelta(seconds=36000),
+        check_existence=True,
+        mode="reschedule",
+        allowed_states=ALLOWED_STATES,
+        failed_states=FAILED_STATES,
+        pool="DATA_ENG_EXTERNALTASKSENSOR",
+    )
+
+    wait_for_monitoring_derived__jobs_by_organization__v1 = ExternalTaskSensor(
+        task_id="wait_for_monitoring_derived__jobs_by_organization__v1",
+        external_dag_id="bqetl_monitoring",
+        external_task_id="monitoring_derived__jobs_by_organization__v1",
+        execution_delta=datetime.timedelta(seconds=36000),
+        check_existence=True,
+        mode="reschedule",
+        allowed_states=ALLOWED_STATES,
+        failed_states=FAILED_STATES,
+        pool="DATA_ENG_EXTERNALTASKSENSOR",
+    )
+
     monitoring_derived__shredder_targets__v1 = GKEPodOperator(
         task_id="monitoring_derived__shredder_targets__v1",
         arguments=[
@@ -68,3 +104,30 @@
         owner="bewu@mozilla.com",
         email=["bewu@mozilla.com"],
     )
+
+    monitoring_derived__shredder_targets_joined__v1 = bigquery_etl_query(
+        task_id="monitoring_derived__shredder_targets_joined__v1",
+        destination_table="shredder_targets_joined_v1",
+        dataset_id="monitoring_derived",
+        project_id="moz-fx-data-shared-prod",
+        owner="bewu@mozilla.com",
+        email=["bewu@mozilla.com"],
+        date_partition_parameter="submission_date",
+        depends_on_past=False,
+    )
+
+    monitoring_derived__shredder_targets_joined__v1.set_upstream(
+        wait_for_monitoring_derived__bigquery_table_storage__v1
+    )
+
+    monitoring_derived__shredder_targets_joined__v1.set_upstream(
+        wait_for_monitoring_derived__bigquery_tables_inventory__v1
+    )
+
+    monitoring_derived__shredder_targets_joined__v1.set_upstream(
+        wait_for_monitoring_derived__jobs_by_organization__v1
+    )
+
+    monitoring_derived__shredder_targets_joined__v1.set_upstream(
+        monitoring_derived__shredder_targets__v1
+    )
Only in /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/monitoring: shredder_targets
Only in /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/monitoring_derived: shredder_targets_joined_v1
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/monitoring/shredder_targets/metadata.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/monitoring/shredder_targets/metadata.yaml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/monitoring/shredder_targets/metadata.yaml	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/monitoring/shredder_targets/metadata.yaml	2024-10-10 20:39:31.000000000 +0000
@@ -0,0 +1,16 @@
+friendly_name: Shredder Targets
+description: |-
+  Daily list of shredder deletion targets comparing the configured targets with
+  the lineage of found id tables in bigquery.
+owners:
+- bewu@mozilla.com
+labels:
+  owner1: bewu
+bigquery: null
+workgroup_access:
+- role: roles/bigquery.dataViewer
+  members:
+  - workgroup:mozilla-confidential
+references:
+  view.sql:
+  - moz-fx-data-shared-prod.monitoring_derived.shredder_targets_joined_v1
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/monitoring/shredder_targets/view.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/monitoring/shredder_targets/view.sql
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/monitoring/shredder_targets/view.sql	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/monitoring/shredder_targets/view.sql	2024-10-10 20:36:54.000000000 +0000
@@ -0,0 +1,7 @@
+CREATE OR REPLACE VIEW
+  `moz-fx-data-shared-prod.monitoring.shredder_targets`
+AS
+SELECT
+  *
+FROM
+  `moz-fx-data-shared-prod.monitoring_derived.shredder_targets_joined_v1`
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/monitoring_derived/shredder_targets_joined_v1/metadata.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/monitoring_derived/shredder_targets_joined_v1/metadata.yaml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/monitoring_derived/shredder_targets_joined_v1/metadata.yaml	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/monitoring_derived/shredder_targets_joined_v1/metadata.yaml	2024-10-10 20:39:24.000000000 +0000
@@ -0,0 +1,32 @@
+friendly_name: Shredder Targets Joined
+description: |-
+  Daily list of shredder deletion targets comparing the configured targets with
+  the lineage of found id tables in bigquery.
+  Augmented version of monitoring_derived.shredder_targets_v1 with additional table metadata.
+owners:
+- bewu@mozilla.com
+labels:
+  incremental: true
+  owner1: bewu
+  schedule: daily
+  dag: bqetl_shredder_monitoring
+scheduling:
+  dag_name: bqetl_shredder_monitoring
+bigquery:
+  time_partitioning:
+    type: day
+    field: run_date
+    require_partition_filter: true
+    expiration_days: null
+  range_partitioning: null
+  clustering: null
+workgroup_access:
+- role: roles/bigquery.dataViewer
+  members:
+  - workgroup:mozilla-confidential
+references:
+  query.sql:
+  - moz-fx-data-shared-prod.monitoring_derived.bigquery_table_storage_v1
+  - moz-fx-data-shared-prod.monitoring_derived.bigquery_tables_inventory_v1
+  - moz-fx-data-shared-prod.monitoring_derived.jobs_by_organization_v1
+  - moz-fx-data-shared-prod.monitoring_derived.shredder_targets_v1
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/monitoring_derived/shredder_targets_joined_v1/query.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/monitoring_derived/shredder_targets_joined_v1/query.sql
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/monitoring_derived/shredder_targets_joined_v1/query.sql	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/monitoring_derived/shredder_targets_joined_v1/query.sql	2024-10-10 20:36:54.000000000 +0000
@@ -0,0 +1,83 @@
+WITH query_counts AS (
+  SELECT
+    COUNT(*) AS query_count,
+    targets.project_id,
+    targets.dataset_id,
+    targets.table_id,
+  FROM
+    `moz-fx-data-shared-prod.monitoring_derived.jobs_by_organization_v1`
+  CROSS JOIN
+    UNNEST(referenced_tables) AS ref_table
+  RIGHT JOIN
+    `moz-fx-data-shared-prod.monitoring_derived.shredder_targets_v1` AS targets
+    ON targets.project_id = ref_table.project_id
+    AND targets.dataset_id = ref_table.dataset_id
+    AND STARTS_WITH(ref_table.table_id, targets.table_id)
+    -- filter out shredder reads
+    AND (
+      NOT ENDS_WITH(reservation_id, '.shredder-main-v4')
+      OR NOT ENDS_WITH(reservation_id, '.batch')
+    )
+  WHERE
+    DATE(creation_time)
+    BETWEEN DATE_SUB(@submission_date, INTERVAL 29 DAY)
+    AND DATE(@submission_date)
+    AND targets.run_date = @submission_date
+  GROUP BY
+    targets.project_id,
+    targets.dataset_id,
+    targets.table_id
+),
+write_counts AS (
+  SELECT
+    COUNT(*) AS write_count,
+    targets.project_id,
+    targets.dataset_id,
+    targets.table_id,
+  FROM
+    `moz-fx-data-shared-prod.monitoring_derived.shredder_targets_v1` AS targets
+  LEFT JOIN
+    `moz-fx-data-shared-prod.monitoring_derived.jobs_by_organization_v1`
+    ON targets.project_id = destination_table.project_id
+    AND targets.dataset_id = destination_table.dataset_id
+    AND STARTS_WITH(destination_table.table_id, targets.table_id)
+    -- filter out shredder writes
+    AND (
+      NOT ENDS_WITH(reservation_id, '.shredder-main-v4')
+      OR NOT ENDS_WITH(reservation_id, '.batch')
+    )
+  WHERE
+    DATE(creation_time)
+    BETWEEN DATE_SUB(@submission_date, INTERVAL 29 DAY)
+    AND DATE(@submission_date)
+    AND targets.run_date = @submission_date
+  GROUP BY
+    targets.project_id,
+    targets.dataset_id,
+    targets.table_id
+)
+SELECT
+  targets.*,
+  owners,
+  COALESCE(query_count, 0) AS query_count_last_30d,
+  COALESCE(write_count, 0) AS write_count_last_30d,
+  ROUND(total_logical_bytes / 1024 / 1024 / 1024 / 1024, 2) AS table_size_tib,
+  table_inventory.creation_date AS table_creation_date,
+  IFNULL(table_inventory.deprecated, FALSE) AS deprecated,
+FROM
+  `moz-fx-data-shared-prod.monitoring_derived.shredder_targets_v1` AS targets
+LEFT JOIN
+  `moz-fx-data-shared-prod.monitoring_derived.bigquery_tables_inventory_v1` AS table_inventory
+  USING (project_id, dataset_id, table_id)
+LEFT JOIN
+  `moz-fx-data-shared-prod.monitoring_derived.bigquery_table_storage_v1` AS table_storage
+  USING (project_id, dataset_id, table_id)
+LEFT JOIN
+  query_counts
+  USING (project_id, dataset_id, table_id)
+LEFT JOIN
+  write_counts
+  USING (project_id, dataset_id, table_id)
+WHERE
+  targets.run_date = @submission_date
+  AND table_inventory.submission_date = @submission_date
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/monitoring_derived/shredder_targets_joined_v1/schema.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/monitoring_derived/shredder_targets_joined_v1/schema.yaml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/monitoring_derived/shredder_targets_joined_v1/schema.yaml	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/monitoring_derived/shredder_targets_joined_v1/schema.yaml	2024-10-10 20:36:54.000000000 +0000
@@ -0,0 +1,60 @@
+fields:
+- name: run_date
+  type: DATE
+  mode: NULLABLE
+- name: project_id
+  type: STRING
+  mode: NULLABLE
+- name: dataset_id
+  type: STRING
+  mode: NULLABLE
+- name: table_id
+  type: STRING
+  mode: NULLABLE
+- name: current_sources
+  type: RECORD
+  mode: REPEATED
+  fields:
+  - name: table
+    type: STRING
+    mode: NULLABLE
+  - name: field
+    type: STRING
+    mode: NULLABLE
+  - name: project
+    type: STRING
+    mode: NULLABLE
+- name: detected_sources
+  type: RECORD
+  mode: REPEATED
+  fields:
+  - name: table
+    type: STRING
+    mode: NULLABLE
+  - name: field
+    type: STRING
+    mode: NULLABLE
+  - name: project
+    type: STRING
+    mode: NULLABLE
+- name: matching_sources
+  type: BOOLEAN
+  mode: NULLABLE
+- name: owners
+  type: STRING
+  mode: REPEATED
+- name: query_count_last_30d
+  type: INTEGER
+  mode: NULLABLE
+- name: write_count_last_30d
+  type: INTEGER
+  mode: NULLABLE
+- name: table_size_tib
+  type: FLOAT
+  mode: NULLABLE
+- name: table_creation_date
+  type: DATE
+  mode: NULLABLE
+- name: deprecated
+  type: BOOLEAN
+  mode: NULLABLE
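
The two time offsets in the generated DAGs above look different (`timedelta(days=-1, seconds=50400)` on the `ExternalTaskMarker`s, `timedelta(seconds=36000)` on the `ExternalTaskSensor`s), but they describe the same 10-hour gap seen from opposite sides. The following is a minimal sketch of that arithmetic using only the standard library. The concrete run timestamp is hypothetical and chosen only for illustration; the actual DAG schedules are not shown in this report.

```python
import datetime

# Offset subtracted from execution_date in the ExternalTaskMarker template.
marker_offset = datetime.timedelta(days=-1, seconds=50400)
# execution_delta passed to the ExternalTaskSensor.
sensor_delta = datetime.timedelta(seconds=36000)

# -1 day + 14 hours is -10 hours, i.e. the negated sensor delta.
offsets_match = marker_offset == -sensor_delta

# Hypothetical logical date of a bqetl_shredder_monitoring run (assumption).
downstream_run = datetime.datetime(2024, 10, 10, 12, 0)

# The sensor looks for the bqetl_monitoring run that is execution_delta earlier.
upstream_run = downstream_run - sensor_delta

# The marker in bqetl_monitoring computes execution_date - marker_offset,
# which should point back at the downstream run.
marker_target = upstream_run - marker_offset

print(offsets_match, upstream_run.isoformat(), marker_target.isoformat())
```

Under these assumptions the marker and the sensor point at each other, which is what lets clearing an upstream task in bqetl_monitoring propagate to the matching bqetl_shredder_monitoring run.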

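
One detail in `query.sql` may be worth a second look: the "filter out shredder reads/writes" condition is written as `NOT ENDS_WITH(..., '.shredder-main-v4') OR NOT ENDS_WITH(..., '.batch')`. A string cannot end with both suffixes at once, so for any non-null `reservation_id` this condition is always true and filters nothing. The sketch below mirrors the condition in Python to show this, alongside an `AND` variant. Whether `AND` was the intended operator is an assumption, not something the PR states; the sample reservation IDs are hypothetical, and SQL NULL handling is not modeled here.

```python
SHREDDER_SUFFIXES = (".shredder-main-v4", ".batch")


def passes_as_written(reservation_id: str) -> bool:
    """Mirror of the diff's condition: NOT ends_with(a) OR NOT ends_with(b)."""
    return (not reservation_id.endswith(SHREDDER_SUFFIXES[0])) or (
        not reservation_id.endswith(SHREDDER_SUFFIXES[1])
    )


def passes_with_and(reservation_id: str) -> bool:
    """Assumed alternative: NOT ends_with(a) AND NOT ends_with(b)."""
    return (not reservation_id.endswith(SHREDDER_SUFFIXES[0])) and (
        not reservation_id.endswith(SHREDDER_SUFFIXES[1])
    )


# Hypothetical reservation IDs, for illustration only.
samples = [
    "some-project:US.shredder-main-v4",
    "some-project:US.batch",
    "some-project:US.default",
]

as_written = [passes_as_written(s) for s in samples]
with_and = [passes_with_and(s) for s in samples]

print(as_written)  # every sample passes, so nothing is filtered out
print(with_and)    # only the non-shredder reservation passes
```

If the counts are meant to exclude shredder jobs, the `AND` form (equivalently, `NOT (a OR b)`) is the one that removes them in this sketch.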

@BenWu BenWu requested a review from akkomar October 10, 2024 20:49
@BenWu BenWu added this pull request to the merge queue Oct 11, 2024
Merged via the queue into main with commit 14c2f18 Oct 11, 2024
21 checks passed
@BenWu BenWu deleted the benwu/shredder-targets-augment branch October 11, 2024 18:50