Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review an improvement plan for Bigquery operators #231

Closed
mik-laj opened this issue Aug 23, 2019 · 7 comments
Closed

Review an improvement plan for Bigquery operators #231

mik-laj opened this issue Aug 23, 2019 · 7 comments
Assignees

Comments

@mik-laj
Copy link
Member

mik-laj commented Aug 23, 2019

No description provided.

@potiuk potiuk assigned turbaszek and potiuk and unassigned turbaszek Sep 3, 2019
@potiuk potiuk changed the title Prepare an improvement plan for Bigquery operators Review an improvement plan for Bigquery operators Sep 16, 2019
@mschickensoup mschickensoup assigned TobKed and unassigned potiuk Sep 16, 2019
@TobKed TobKed added the WIP label Sep 23, 2019
@TobKed
Copy link
Member

TobKed commented Sep 23, 2019

Sensors:

BigQueryTableSensor
There is just one sensor and its name follows naming convention. The sensor accepts reasonable parameters and uses BigQueryHook in proper way.
Uses deprecated bigquery_conn_id instead of gcp_conn_id.
Proposition: adding gcp_conn_id parameter support, keeping support for bigquery_conn_id parameters, adding deprecation warning (keeping backward compatibility).

Operators:

Helper classes:
BigQueryConsoleLink
BigQueryConsoleIndexableLink

Helper classes for operator_extra_links property of BigQueryOperator.
Proposition: Nothing to be changed.

Check operators:
BigQueryCheckOperator
BigQueryValueCheckOperator
BigQueryIntervalCheckOperator

They seem to follow naming convention.
Use deprecated bigquery_conn_id, support for gcp_conn_id provided, deprecation warning is present.
Due to inheritance from *CheckOperator they are dependent on conn->cursor PEP-249 BigQueryHook implementation.
Proposition: Look Hooks

Query operators:
BigQueryOperator
BigQueryCreateEmptyTableOperator
BigQueryCreateExternalTableOperator
BigQueryDeleteDatasetOperator
BigQueryCreateEmptyDatasetOperator
BigQueryGetDatasetOperator
BigQueryPatchDatasetOperator
BigQueryUpdateDatasetOperator
BigQueryTableDeleteOperator

Only BigQueryOperator does not follow naming convention.
Mostly they use gcp_conn_id and raise deprecation warning for bigquery_conn_id.
In some places there is still bigquery_conn_id (when also other conn_id is used like google_cloud_storage_conn_id).
They use Hook().get_conn().cursor().method* which seems to be unnecessary. More comments in Hooks.

Proposition:

  • adding gcp_conn_id where deprecated is still present, preserve backward compatibility and add deprecation warning
  • renaming BigQueryOperator to BigQueryRunQueryOperator
  • using directly Hook in execute.

Hooks:

PEP 249
BigQueryHook(GoogleCloudBaseHook, DbApiHook)
BigQueryBaseCursor(LoggingMixin)
BigQueryCursor(BigQueryBaseCursor)
BigQueryConnection

Currently BigQueryHook faciliates concept of PEP-249 where cursor is needed (Hook().get_conn().cursor().method*). The BigQueryBaseCursor contains most of the logic which do not require cursor.

Proposition:

  •   BigQueryBaseHook(GoogleCloudBaseHook)  # all methods which do not require cursor
      BigQueryConnection(google.cloud.bigquery.dbapi.connection.Connection)
      BigQueryCursor(google.cloud.bigquery.dbapi.cursor.Cursor, LoggingMixin)
      # or BigQueryCursor(BigQueryBaseHook, LoggingMixin)
      BigQueryHook(BigQueryBaseHook)
      BigQueryDbAPiHook(BigQueryHook, DbApiHook)  # override get_conn to provide PEP 249 compatibility(`Hook().get_conn().cursor().method*`)
    
  • BigQueryHook is the main Hook and do not require using cursor.
  • BigQueryDbAPiHook is the secondary hook and use cursor.
  • BigQueryCursor and BigQueryConnection inherits from Google Library to provide additional methods for backward compatiblity.
  • Moreover google.cloud.bigquery.dbapi.cursor.Cursor itself have limited functionaly, e.g method execute does not allow to provide custom QueryJobConfig which is neccessary to provide support for legacy_sql: bigquery/google/cloud/bigquery/dbapi/cursor.py#L163
  • Another drawbackgoogle.cloud.bigquery.dbapi.cursor.Cursor.exeute method return None where BigQueryCursor.execute method return JobId
  • This could be problematic for other method as well and It could be case that many methods must be rewriten to keep backward compatibilty.

Other:

BigQueryPandasConnector
Iherits from GbqConnector (from Pandas).
Nothing to be changed.

General questions:

  • Operator BigQueryOperator takes a lot of arguments which are passed to BigQueryHook. In the Google Library they create google.cloud.bigquery.job.QueryJobConfig object. When Google Library will be used these arguments will be used to instantiate config objects. Question is: are key-word arguments for hook could be deprecated and require config object as a parameter?
    Similar situation situation is for *TableOperators and google.cloud.bigquery.table.Table object. Or maybe it could be done in hooks instead of the operators? Should it be done at all?
  • What about incompatible google.cloud.bigquery.dbapi.cursor.Cursor methods? Maybe BigQueryCursor(google.cloud.bigquery.dbapi.cursor.Cursor, LoggingMixin) could inherit from BigQueryBaseHook and use own implementetions of cursor method for backward compatibilty?

General thoughts:

  • Using all Google Libary objects and methods (especially cursor) and keeping backward compatiblity seems to be very challanging and little bit overengineering. It will introduce unnessary error prone complexity. If we decide to use Google Library classes Cursor and Connection backward compatibilty should be dropped.
  • Backward compatibilty is possible when cursor object from Google Library will be ommited however code will be not simplified in significant way.

@TobKed
Copy link
Member

TobKed commented Sep 24, 2019

@TobKed TobKed removed the WIP label Sep 24, 2019
@TobKed
Copy link
Member

TobKed commented Sep 26, 2019

cc @mik-laj @nuclearpinguin @potiuk

@turbaszek
Copy link
Member

General questions:

  • Operator BigQueryOperator takes a lot of arguments which are passed to BigQueryHook. In the Google Library they create google.cloud.bigquery.job.QueryJobConfig object. When Google Library will be used these arguments will be used to instantiate config objects. Question is: are key-word arguments for hook could be deprecated and require config object as a parameter?
    Similar situation situation is for *TableOperators and google.cloud.bigquery.table.Table object. Or maybe it could be done in hooks instead of the operators? Should it be done at all?

I think BQ belongs to most used operators and we should preserve backward compatibility. There are two ways to approach this: build object from kwargs or keep key words arguments for time being and rise an error when "object" was not passed. Personally I prefer the second option as it's easier to remove something in future than add a "builder" now (that will be remove - time wasting). I would suggest also adding "create_my_object" method for easier transfer between old and new approach.

Regarding Cursor - I think we should have two hook / operators. One for average users (implementing what python BQ clients has) and one for those who know what a cursor is. This should simplify the refactor.

@potiuk
Copy link

potiuk commented Sep 30, 2019

I agree that backwards compatibility is really important for BQ.

My proposal is to only leave the QueryJobConfig object as parameter and if it is missing, use builder that can be applied to kwargs (raising deprecation warning).

We should be able to pass both the google.cloud.bigquery.job.QueryJobConfig class as well as dict representation (I think it should be possible). The latter will allow to use templates in backwards-compatible way.

We could also think about implementing template handling for QueryJobConfig-like object (if it does not work like that for now).

We can vote on those approaches when everyone chimes in :). I can be convinced otherwise.

For the cursor - I think indeed it should be separated out.

@turbaszek
Copy link
Member

Is it done? @TobKed

@TobKed
Copy link
Member

TobKed commented Dec 18, 2019

@nuclearpinguin, not yet, to do this I am working on #126 with with @mik-laj to be sure that nothing is broken when any changes will be introduced.

@mschickensoup mschickensoup assigned mik-laj and unassigned TobKed Jan 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants