make dbt-bigquery python model easier to use #4174
Conversation
Thanks @wazi55! These docs have needed some TLC and your contributions are certainly a step in the right direction. That said:
- I think there's some polishing needed, and
- I'm not confident as to exactly what "good enough" is here, and would defer to the dbt-bigquery community.

Here are direct links to preview pages that I'd recommend having your colleagues (and the community) review for content and readability:
- Python models: Supported Data Platforms (BigQuery)
- BigQuery Configuration: Submitting a Python model

Once you get sign-off, I can ask our product docs team to do a copy review.
Any user or service account that runs dbt Python models will need the following permissions (in addition to the required BigQuery permissions) ([docs](https://cloud.google.com/dataproc/docs/concepts/iam/iam)):
```
dataproc.batches.create
dataproc.clusters.use
dataproc.jobs.create
dataproc.jobs.get
dataproc.operations.get
dataproc.operations.list
storage.buckets.get
storage.objects.create
storage.objects.delete
```
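As a sketch of how these might be granted (my own assumption, not from the docs under review): one option is to bundle the permissions into a custom IAM role and bind it to the service account dbt runs as. The project ID, role ID, and service account below are all placeholders.

```shell
# Sketch only: bundle the permissions above into a custom role and grant it.
# Project, role, and service account names are placeholders.
gcloud iam roles create dbtPythonModelRunner \
  --project=my-gcp-project \
  --title="dbt Python model runner" \
  --permissions=dataproc.batches.create,dataproc.clusters.use,dataproc.jobs.create,dataproc.jobs.get,dataproc.operations.get,dataproc.operations.list,storage.buckets.get,storage.objects.create,storage.objects.delete

gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:dbt-sa@my-gcp-project.iam.gserviceaccount.com" \
  --role="projects/my-gcp-project/roles/dbtPythonModelRunner"
```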
I assumed that this detailed information would live somewhere else, like the new section you created within bigquery-configs?
- Cluster Submission Method: Create or use an existing Dataproc Cluster. [See example](/reference/resource-configs/bigquery-configs.md#submitting-a-python-model) within `dbt_project.yml` or a `.yml` file within the `models/` directory.
- Serverless Submission Method: Dataproc Serverless does not require a ready cluster, but it can also mean the cluster is slower to start. [See example](/reference/resource-configs/bigquery-configs.md#submitting-a-python-model) of submitting a job to a serverless cluster in the `.py` file.

**Installing packages:** If you are using a Dataproc Cluster (as opposed to Dataproc Serverless), you can add third-party packages while creating the cluster with the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors). If you are using Dataproc Serverless, you can build your own [custom container image](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers#python_packages) with the packages you need.

Google recommends installing Python packages on Dataproc clusters via initialization actions:
- [How initialization actions are used](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)
- [Actions for installing via `pip` or `conda`](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python)

You can also install packages at cluster creation time by [defining cluster properties](https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20): `dataproc:pip.packages` or `dataproc:conda.packages`.
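To make the "in the `.py` file" route concrete, here is a minimal sketch of choosing a submission method from inside the model itself. The `model(dbt, session)` signature and `dbt.config()` call follow dbt's Python model interface; the model name `my_source_model` and the fake context used to demonstrate the call shape are placeholders of mine, not part of the docs under review.

```python
# Sketch of a dbt Python model for BigQuery that chooses the Dataproc
# Serverless submission method from inside the model file. The config keys
# are the dbt-bigquery ones discussed above; ref/model names are made up.
def model(dbt, session):
    dbt.config(
        materialized="table",
        submission_method="serverless",  # or "cluster" plus a cluster name
    )
    # In a real run, dbt.ref() returns a DataFrame for the upstream model.
    return dbt.ref("my_source_model")


# Tiny stand-in for the real dbt context, just to show the call shape:
class _FakeDbt:
    def __init__(self):
        self.config_kwargs = {}

    def config(self, **kwargs):
        self.config_kwargs.update(kwargs)

    def ref(self, name):
        return f"dataframe-for-{name}"


fake = _FakeDbt()
result = model(fake, session=None)
```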
perhaps something like this table is neater than what you have? and involves less repeated text?

| submission method | cluster | serverless |
|---|---|---|
| notes | can be a pre-existing cluster. can also be used to create clusters | Dataproc Serverless does not require a ready cluster, but it can also mean the cluster is slower to start. |
| packages | you can add third-party packages while creating the cluster with the Spark BigQuery connector initialization action | build your own custom container image with the packages you need |
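For the `dataproc:pip.packages` / `dataproc:conda.packages` route mentioned earlier, the creation-time invocation looks roughly like this (cluster name, region, and package pin are placeholders; check the linked Google tutorial for the exact property syntax, especially when listing multiple packages):

```shell
# Sketch only: install a pinned pip package at cluster creation time.
gcloud dataproc clusters create my-dbt-cluster \
  --region=us-central1 \
  --properties='dataproc:pip.packages=scikit-learn==1.3.0'
```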
Co-authored-by: Anders <anders.swanson@dbtlabs.com>
hey @wazi55 - thanks for opening this pr! Anders provided some feedback, wanted to check if you had any add'l questions?

Been over 8 months since this had any movement from the author. Closing due to age, but we can re-open if this is something we need to visit again.
[Re:] #4150
What are you changing in this pull request and why?
Checklist

Adding new pages (delete if not applicable):
- `website/sidebars.js`

Removing or renaming existing pages (delete if not applicable):
- `website/sidebars.js`
- `website/static/_redirects`