Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(BigQuery): pass table_id as str type #23141

Merged
merged 1 commit into from
Jul 4, 2022

Conversation

bryan824
Copy link
Contributor

Recently during migration from 1.10.14 to 2.2.3, I noticed an issue in the BigQueryDeleteTableOperator. For the context of this, there are two ways to specify a table in GCP BigQuery, one with the project_id, like my-project.mydataset.mytable, and the other one without project_id, like mydataset.mytable.

In 1.10.14, I was using the version without project_id, because the table can be recognized by BigQueryHook, using bigquery_conn_id to fetch project_id in configuration.

The path to pass this info is: gcp_api_base_hook#L131 -> gcp_api_base_hook#L200 -> bigquery_hook#L71 -> bigquery_hook#L1498.

But after upgrading to 2.2.3, a full table_id is required. This is unexpected because bigquery_conn_id/gcp_conn_id is still a valid parameter, BigQueryDeleteTableOperator should still be able to get project_id automatically from the connection configuration. It seems like in this line of code bigquery#L1195, it forces users to use full table_id to create a Table instance, which is the root cause.

Method delete_table accepts 4 types of tables, such as Table, TableReference, TableListItem and str as shown in client#L1754. Then in client#L1784, it converts these 4 types to 1 type, which is TableReference as shown in table#L2689.

So back to the possible improvement of this issue, I wonder if it will help migration get smoother if instead of using Table.from_string to get a Table type, a str type parameter is passed directly. And this str parameter can be just mydataset.mytable, with project_id set by the Client as shown in bigquery#L1194. I believe due to the plan of GCP, companies are slowly migrating to Airflow 2.0 for better support. This improvement will avoid having them add the project_id to table_id for hundreds of DAGs since it is already included in the connection configuration.

Below are two scenarios based on the two formats of specifying a BigQuery table:

  1. table_id like mydataset.mytable is passed in bigquery#L1797 and the corresponding project_id is configured by the connection. This will work as expected, if no project_id is found, error will be captured in _helpers#L825.
  2. table_id like my-project.mydataset.mytable is passed. In this case, whether or not the project_id is configured or configured correspondingly, it will use the project_id defined in the table_id regardless as shown in _helpers#L836.

This is my first attempt at submitting a PR to an open-sourced repo. Please let me know how I can improve. It is also fine if it is not worth merging such a change. I enjoyed the time when looking into this.

@kaxil @eladkal @potiuk

@boring-cyborg boring-cyborg bot added area:providers provider:google Google (including GCP) related issues labels Apr 21, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Apr 21, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@potiuk
Copy link
Member

potiuk commented May 8, 2022

Thanks for that - and sorry for delay, it's been a bit busy period for us all (and it's going to last for a while). I am not sure if this one is good or not - I am not a bq expert but maybe @turbaszek @mik-laj., @TobKed or maybe @lwyszomi or @bhirsz can chime in here. In any way that does not seem like something that needs to be merged quickly.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jun 23, 2022
@bhirsz
Copy link
Contributor

bhirsz commented Jun 24, 2022

I apologize, it somehow slipped my notice - I'm taking a look.

@bhirsz
Copy link
Contributor

bhirsz commented Jun 24, 2022

So if I got it correctly the hook delete_table accepts 4 types of parameters but we're trying to feed it only with Table.from_str() which is forcing us to use project_id in table_id. But since delete_table can accept table_id (with or without project_id - it will be resolved anyway) it's safe to pass table_id directly hence making it more flexible for the users.

I think the change is OK.

@bryan824
Copy link
Contributor Author

Thanks for checking. Yes, what you said is exactly why I submitted this PR. Adding project_id should be optional since it is part of the connection that is already configured.

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Jun 25, 2022
@potiuk potiuk merged commit fe13eae into apache:main Jul 4, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented Jul 4, 2022

Awesome work, congrats on your first merged pull request!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants