Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for adding BigQuery policy tags #2586

Closed
azhard opened this issue Jun 23, 2020 · 9 comments · Fixed by #2589
Closed

Add support for adding BigQuery policy tags #2586

azhard opened this issue Jun 23, 2020 · 9 comments · Fixed by #2589
Labels
bigquery enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!

Comments

@azhard
Copy link
Contributor

azhard commented Jun 23, 2020

Describe the feature

Policy tags are a BigQuery feature that allows users to set column level access control (docs). After a tag has been created, it can be added programatically via the BigQuery python library to columns as needed.

This proposed feature would allow users to add (already created) policy tags to their existing dbt models through their schema.yml files.

Considerations:

  • Policy tag integration in BigQuery has been added to the google-cloud-bigquery in version 1.25.0 (latest version). Dbt currently is using version 1.24.0 so this will need to be bumped up
  • Adding a policy tag to a column requires a few pieces of information related to the policy tag. On the dbt side, we could either take each piece of information and construct the policy tag "name" for the user or have them input the fully qualified name themselves. Information required includes:
    • Project
    • Location
    • Taxonomy ID
    • Policy tag ID
  • I think intuitively this step should take place within the persist_docs step and would be handled if columns is set to true for persist_docs.
  • If having this happen during the persist_docs function makes sense, the easiest solution to get this off the ground would be to utilize the meta field. Alternatively a new BigQuery specific schema field could be used
  • Functionally this can reside in the BigQuery update_column_description adapter function that is. called during persist_docs. The advantage here is that we're already iterating over the columns so we wouldn't have to do it again and the function name can be changed to something more generic eg. update_column

From my understanding, this is something that currently can only be supported for models in the schema.yml files as I don't think docs can be updated for sources but that would be an ideal situation to support this as well

Additional context

This feature would be specific to BigQuery.

Who will this benefit?

BigQuery users who currently use (or want to start using) policy tags and manage them through their dbt setup.

@azhard azhard added enhancement New feature or request triage labels Jun 23, 2020
@jtcohen6 jtcohen6 added bigquery good_first_issue Straightforward + self-contained changes, good for new contributors! and removed triage labels Jun 24, 2020
@jtcohen6
Copy link
Contributor

Thanks for the detailed writeup, @azhard! Since dbt adds labels and descriptions to tables and views when initially creating them, it makes sense enough to me that dbt could add these policy tags as well.

Is this a feature you'd be interested in contributing?

On the dbt side, we could either take each piece of information and construct the policy tag "name" for the user or have them input the fully qualified name themselves.

I'm inclined to require users to fully qualify tag names, unless you have a strong feeling for us systematizing these in a highly opinionated way.

Related issues

  • Along the lines of #2497: do we expect dbt to update policy tags on incremental runs? dbt does not currently update descriptions or labels on incremental models during partial refresh runs. The stakes on that feel low when those are metadata; less so when it comes to access policy for sensitive information. In the V1 of this, we should explicitly call out that keeping access policies up to date requires a --full-refresh run.
  • There's an open issue that proposes persisting docs to sources (Persist source descriptions to database #2540). I don't think this is something we'd want dbt to do natively, but I could totally see benefit to coding up some macros that users could explicitly call as run-operation, both to persist column-level descriptions and apply column-level access policies.

@azhard
Copy link
Contributor Author

azhard commented Jun 24, 2020

Yeah more than happy to work on this @jtcohen6 ! With respect to the tag names, I think having the user fully qualify them makes sense as well so will incorporate that into the PR.

What're your thoughts around the meta field that I mentioned? Do you think it makes sense to have users define the tags in meta or have a new field specific to bigquery schema files?

Also re: the related issues. Both make sense to me and would love to get this working with the future persist_source macro!

@azhard
Copy link
Contributor Author

azhard commented Jun 25, 2020

@jtcohen6 ended up sticking the policy tags in the tags field that already exists in my PR. Happy to change that as needed

@jtcohen6
Copy link
Contributor

jtcohen6 commented Jun 25, 2020

I'm supportive of that. tags is a field that we understand to have some functional purpose. As I see it, meta really just exists to persist an arbitrary dict of information into manifest.json and the documentation site. What we may want to do is differentiate between tags that are dbt-only and tags that should be persisted to BigQuery as access policy tags. @drewbanin let me know if you disagree.

Quick turnaround on the PR! I'll take a look a bit later today

@drewbanin
Copy link
Contributor

hmmm, I think we may not want to repurpose tags as policyTags! It's a good and sensible idea, but I think we'll be better off long-term by making them totally separate. tags is a dbt-wide config, and I think that our future selves will be grateful if we implement BigQuery policy tags with a BigQuery-specific config. Maybe we can call it something like policy_tags instead?

That might be a little bit tricky to implement today as I don't think we have any notion of database-specific schema.yml configs. I truthfully am not sure of the best way to implement something like this, but I did want to raise this point sooner rather than later.

What do y'all think?

@azhard
Copy link
Contributor Author

azhard commented Jun 29, 2020

Yeah I think that makes sense actually, don't want to override any dbt specific functionality that leverages the tags feature in the future... With the current lack of database-specific schema.yml files is this something that will have to be put on hold or is there another way you can think of to implement this (and maybe the database-specific .ymls too)?

@drewbanin
Copy link
Contributor

do we specifically want this config to live in the schema.yml file? If we instead allow it to be a model-level config (configured in the models: dict in dbt_project.yml or in a {{ config() }} call), then the path to implementation should be pretty straightforward! The tricky thing is of course that it's a column-level config. I could imagine an API like:

{{ config(
    policyTags: {
      "col_1": ['tag1', 'tag2'],
      "col_2": ['tag3', 'tag4']
     }
) }}

I totally get that it's not ideal to write out the column list twice, but this is definitely one option!

If we really really want to, we could add this config as a dbt Core-level config that's only used on BigQuery. That's definitely not my preference.... but in looking at the config above, I do think schema.yml is the right place for this config long-term

@azhard
Copy link
Contributor Author

azhard commented Jun 29, 2020

Aside from defining the columns twice, my main concern with the first option is that I don't believe that method can be leveraged for sources. Can't speak to other orgs but for our case I was hoping to add policy tags to both models and sources (using a method similar to the one described in #2540).

I'm also hesitant on adding this as a core-level config, don't think it makes sense to expose it to other adapters. Not sure where that would leave implementation for this feature for now though

@drewbanin
Copy link
Contributor

Ah, ok, thanks for that context @azhard! I gave this some more thought, and I'm increasingly happy with the idea of adding a BQ specific config in the columns section of dbt_project.yml. The big caveat I will add is that at some point in the future, we might decide we need to change up the interface to support per-adapter configs more generally. I can't say with any certainty what that might look like, but I did want to call it out!

I'll take a look at #2589 and follow up over there with a review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants