Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: refactor all get_sqla_engine to use contextmanager in codebase #21943

Merged
merged 23 commits into from
Nov 15, 2022

Conversation

hughhhh
Copy link
Member

@hughhhh hughhhh commented Oct 26, 2022

SUMMARY

In this PR, we moving get_sqla_engine to be leveraged as a contextmanager through out the entire codebase. What that means is any time we want to instantiate an engine we'll have to use a with ___ as ___ block. For example:

with get_sqla_engine() as engine: 
     # use engine ...

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@hughhhh hughhhh marked this pull request as ready for review October 28, 2022 17:50
@codecov
Copy link

codecov bot commented Oct 28, 2022

Codecov Report

Merging #21943 (89020b5) into master (736b534) will increase coverage by 0.96%.
The diff coverage is 38.54%.

@@            Coverage Diff             @@
##           master   #21943      +/-   ##
==========================================
+ Coverage   67.09%   68.06%   +0.96%     
==========================================
  Files        1827     1827              
  Lines       69866    72646    +2780     
  Branches     7548     7548              
==========================================
+ Hits        46875    49444    +2569     
- Misses      21035    21246     +211     
  Partials     1956     1956              
Flag Coverage Δ
hive 53.46% <19.79%> (+0.62%) ⬆️
javascript 53.72% <ø> (ø)
mysql 79.38% <38.54%> (+0.97%) ⬆️
postgres 79.43% <38.54%> (+0.96%) ⬆️
presto 53.37% <19.79%> (+0.62%) ⬆️
python 82.40% <38.54%> (+0.82%) ⬆️
sqlite 77.87% <38.54%> (+0.93%) ⬆️
unit 51.40% <5.20%> (+0.19%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/datasets/commands/importers/v1/utils.py 57.64% <0.00%> (-0.69%) ⬇️
superset/db_engine_specs/gsheets.py 75.91% <0.00%> (ø)
superset/examples/bart_lines.py 0.00% <0.00%> (ø)
superset/examples/birth_names.py 71.42% <0.00%> (ø)
superset/examples/country_map.py 0.00% <0.00%> (ø)
superset/examples/energy.py 0.00% <0.00%> (ø)
superset/examples/flights.py 0.00% <0.00%> (ø)
superset/examples/long_lat.py 0.00% <0.00%> (ø)
superset/examples/multiformat_time_series.py 0.00% <0.00%> (ø)
superset/examples/paris.py 0.00% <0.00%> (ø)
... and 38 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@hughhhh
Copy link
Member Author

hughhhh commented Nov 2, 2022

/testenv up

@github-actions
Copy link
Contributor

github-actions bot commented Nov 2, 2022

@hughhhh Ephemeral environment spinning up at http://34.222.64.144:8080. Credentials are admin/admin. Please allow several minutes for bootstrapping and startup.

with database.get_sqla_engine_with_context(
schema=schema, source=source
) as engine:
return engine
Copy link
Member

@betodealmeida betodealmeida Nov 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work; the context manager will be exited before the functions returns. It's equivalent to:

with database.get_sqla_engine_with_context(...) as engine:
    a = engine
return a

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so should i just return database._ get_sqla_engine()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so should i just return database._ get_sqla_engine()

If you do that, then the SSH tunneling wouldn't be setup in places that call DBEngineSpec.get_engine(), defeating the purpose of the changes in this PR. You need to convert this into a context manager as well (and update everything downstream), so that the setup/teardown works as expected.

You can play with a simple Python script to understand how this works:

from contextlib import contextmanager

@contextmanager
def get_sqla_engine_with_context():
    print("enabling ssh tunnel")
    yield 42
    print("disabling ssh tunnel")

def get_engine():
    with get_sqla_engine_with_context() as engine:
        return engine

print("start")
engine = get_engine()
print("I have the engine, I can now work on it")
print("end")

Running the code above prints:

start
enabling ssh tunnel
disabling ssh tunnel
I have the engine, I can now work on it
end

Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a get_raw_connection context manager (that uses closing), it should simplify the code a lot. Could be done in this PR or a separate one.

We should also make the use engine.raw_connection().execute() vs. engine.execute() more consistent, standardizing on the latter, but we can do that in a separate PR.

with dataset.database.get_sqla_engine_with_context(
schema=dataset.schema
) as engine:
with closing(engine.raw_connection()) as conn:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to do this now, but eventually we could have a database.get_raw_connection context manager, to simplify things a little bit. This way we could rewrite this as:

with dataset.database.get_raw_connection(schema=dataset.schema) as conn:
    cursor = conn.cursor()
    ...

And the implementation of get_raw_connection() would take care of closing the connection:

@contextmanager
def get_raw_connection(...):
    with get_sqla_engine_with_context(...) as engine:
        with closing(engine.raw_connection()) as conn:
             yield conn

Comment on lines 105 to 109
int(
app.config[
"TEST_DATABASE_CONNECTION_TIMEOUT"
].total_seconds()
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, I'd compute this outside the with block just to improve readability:

timeout = app.config["TEST_DATABASE_CONNECTION_TIMEOUT"].total_seconds()

with database.get_sqla_engine_with_context() as engine:
    try:
        alive = func_timeout(int(timeout), ping, args=(engine,))

Also, func_timeout should take floats. :-P

Comment on lines 120 to 127
if not alive:
raise DatabaseOfflineError(
SupersetError(
message=__("Database is offline."),
error_type=SupersetErrorType.GENERIC_DB_ENGINE_ERROR,
level=ErrorLevel.ERROR,
),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can dedent this block, no?

superset/datasets/commands/importers/v1/utils.py Outdated Show resolved Hide resolved
to_sql_kwargs["method"] = "multi"
with cls.get_engine(database) as engine:
if engine.dialect.supports_multivalues_insert:
to_sql_kwargs["method"] = "multi"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to improve this a little bit... here we're building an engine just to check an attribute in the dialect, which means we're setting up and tearing down an SSH connection just to read an attribute. :-(

Maybe we should add a get_dialect method to the DB engine spec, that builds the engine without the context manager:

@classmethod
def get_dialect(database, schema, source):
     engine = database.get_sqla_engine(schema=schema, source=source)
     return engine.dialect

Then when we only need the dialect we can call this method, which is cheaper.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case we still need the engine though, so it makes more sense to just use get_engine instead of just the dialect

https://github.com/apache/superset/pull/21943/files/7ce583678e1770d472527abb8270dd22e666b9c0#diff-2e62d64ef1113e48efdfeb2acbaa522fca13e49e6a00c2cfd4f74efc4ae1b45cR916

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point, I missed that.

with cls.get_engine(database) as engine:
to_gbq_kwargs = {
"destination_table": str(table),
"project_id": engine.url.host,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should add an attribute to DB engine specs annotating if they support SSH tunnel or not? BigQuery, eg, will probably never support it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created a ticket for this

@@ -205,7 +203,8 @@ def df_to_sql(
if table_exists:
raise SupersetException("Table already exists")
elif to_sql_kwargs["if_exists"] == "replace":
engine.execute(f"DROP TABLE IF EXISTS {str(table)}")
with cls.get_engine(database) as engine:
engine.execute(f"DROP TABLE IF EXISTS {str(table)}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's interesting that sometimes we use engine.raw_connection().execute, and others we use engine.execute. Ideally we should standardize in the latter wherever possible, since it's more concise.

engine = database.get_sqla_engine()
schema = inspect(engine).default_schema_name
table_exists = database.has_table_by_name(tbl_name)
with database.get_sqla_engine_with_context() as engine:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really should convert these old examples into the new format (YAML + CSV)...

Comment on lines 373 to 375
yield self._get_sqla_engine(schema=schema, nullpool=nullpool, source=source)
except Exception as ex:
raise self.db_engine_spec.get_dbapi_mapped_exception(ex)
raise ex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need try/except here if you're raising all exceptions.

@@ -733,7 +733,7 @@ def test_execute_sql_statements(self, mock_execute_sql_statement, mock_get_query
mock_query = mock.MagicMock()
mock_query.database.allow_run_async = False
mock_cursor = mock.MagicMock()
mock_query.database.get_sqla_engine.return_value.raw_connection.return_value.cursor.return_value = (
mock_query.database.get_sqla_engine_with_context.return_value.__enter__.return_value.raw_connection.return_value.cursor.return_value = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can replace all intermediary .return_value with () (but not the last):

Suggested change
mock_query.database.get_sqla_engine_with_context.return_value.__enter__.return_value.raw_connection.return_value.cursor.return_value = (
mock_query.database.get_sqla_engine_with_context().__enter__().raw_connection().cursor.return_value = (

Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, there's just one place that needs to be fixed.

with database.get_sqla_engine_with_context() as engine:
connection = engine

df.to_sql(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will only run in the else block, but it needs to run in the if database.sqlalchemy_uri == current_app.config.get("SQLALCHEMY_DATABASE_URI"): block as well.

Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hughhhh hughhhh merged commit e23efef into master Nov 15, 2022
@github-actions
Copy link
Contributor

Ephemeral environment shutdown and build artifacts deleted.

diegomedina248 pushed a commit to preset-io/superset that referenced this pull request Dec 3, 2022
john-bodley added a commit that referenced this pull request Dec 15, 2022
john-bodley added a commit that referenced this pull request Dec 15, 2022
john-bodley added a commit to airbnb/superset-fork that referenced this pull request Dec 15, 2022
john-bodley added a commit to airbnb/superset-fork that referenced this pull request Jan 17, 2023
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.1.0 and removed 🚢 2.1.3 labels Mar 13, 2024
@mistercrunch mistercrunch deleted the ref-get-sqla-engine-2 branch March 26, 2024 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/XXL 🚢 2.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants