-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use native prepared statements from trino-python-client #61
Conversation
@@ -64,7 +62,7 @@ def __init__(self, handle): | |||
self._fetch_result = None | |||
|
|||
def cursor(self): | |||
self._cursor = self.handle.cursor() | |||
self._cursor = self.handle.cursor(experimental_python_types=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should give the ability to the user to opt in/out of the "experimental" feature.
Should we introduce a profile configuration setting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO the conversions in ConnectionWrapper._escape_value(cls, value)
are all implemented by the trino-python-client
, which is the right place to implement them. The covered data types are not really that 'experimental', in the sense that the same Python data type constraints apply as in the earlier code. So I consider this some kind of technical debt that we should remove.
Not setting this parameter consistently would also prevent us from consistently fixing #28.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This going to be an impactful change for the users, but I agree that we need to introduce it since some Python types were mapped correctly "by accident".
Perhaps enable experimental_python_types
by default and introduce a profile configuration setting in case someone would like to use the old approach (to not impact current pipelines and migrate them later on). But I doubt it is a common situation.
Could you please point out the code in trino-python-client which converts for instance:
if value is None:
return "NULL"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the code that handles the prepared statements parameters
As you can see it handles following cases:
None -> NULL (same)
string -> quoted (same)
pyhon datetime -> trino TIMESTAMP (agate doesn't do anything with timezones, also currently the python-trino-client only supports up to millisecond precision, not completely sure why we strip out the precision within dbt-trino) (same)
python date -> trino DATE (same)
NUMBER -> specific handling for integer, decimal (slightly different)
The number handling seems slightly different but without any impact. Basically every number becomes a Decimal through the agate type detection. See agate docs.
Trino can handle conversions from decimal to other numeric types. Any loss in precision is the result of conversion to the chosen target column type.
CREATE TABLE test_delta.test.test_table1 (a integer);
insert into test_delta.test.test_table1 (a) values (decimal '1.2')
select * from test_delta.test.test_table1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add fallback logic to the "classic" binding manner and eventually remove it (add an extra issue for follow up).
Even if the changes are mostly about dbt seed
we should however ensure that the users have the option to fallback to the classic functionality with the newer version of dbt-trino in case they will deal with unforeseen exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I thought a little more about it. Actually I think it should be fixed in trino-python-client by not creating lines bigger http.client._MAXLINE
. Then no configuration should be needed.
Still need to validate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree, let's set up it in trino-python-client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the end I decided to put it in dbt-trino as patching Python's http.client may impact other usages, while in the context of dbt command executions this impact is limited. Currently we patch the http.client by default. It can be disabled by setting patch_http_client_header_limit
to false
in your dbt profile. This is also documented in the README.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I played a bit with a 15K records CSV file
Note that dbt seed
is more thought for test purposes.
Is this a problem that we should be solving?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the use cases of dbt seed
are definitely much broader then test purposes only.
The issue is that some users may have some seed files that may fail if we remove the http client patch. It won't be obvious for them to find out what they need to do. That was my motivation to add this as a default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
@@ -173,3 +173,7 @@ | |||
{% macro trino__current_timestamp() -%} | |||
CURRENT_TIMESTAMP | |||
{%- endmacro %} | |||
|
|||
{% macro trino__get_binding_char() %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why trino__get_binding_char()
is needed and where is used? Is default macro from dbt-core not working with trino?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So before we converted bindings using standard Python string interpolation:
dbt-trino/dbt/adapters/trino/connections.py
Lines 103 to 104 in 338475a
bindings = tuple(self._escape_value(b) for b in bindings) | |
sql = sql % bindings |
Now we are using Trino prepared statements. As we can then depend on the type support added in trino-python-client, which covers more data types and a better unit test coverage.
To make the prepared statements work, the standard %s
dbt binding char need to be replaced with the Trino binding char ?
, similar like many other dbt adapters, and we pass the bindings down to the execute function of Trino's dbapi.
result = self._cursor.execute(sql, params=bindings)
@@ -64,7 +62,7 @@ def __init__(self, handle): | |||
self._fetch_result = None | |||
|
|||
def cursor(self): | |||
self._cursor = self.handle.cursor() | |||
self._cursor = self.handle.cursor(experimental_python_types=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This going to be an impactful change for the users, but I agree that we need to introduce it since some Python types were mapped correctly "by accident".
Perhaps enable experimental_python_types
by default and introduce a profile configuration setting in case someone would like to use the old approach (to not impact current pipelines and migrate them later on). But I doubt it is a common situation.
Could you please point out the code in trino-python-client which converts for instance:
if value is None:
return "NULL"
nit: Please use a capital letter in the first word of the commit message:
|
@@ -64,7 +62,7 @@ def __init__(self, handle): | |||
self._fetch_result = None | |||
|
|||
def cursor(self): | |||
self._cursor = self.handle.cursor() | |||
self._cursor = self.handle.cursor(experimental_python_types=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the main functionalities of dbt-trino affected by this change? Is it only dbt seed
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would also return python types instead of strings. So there might be more places then only seeds, #28 for example.
Also any dbt packages expecting eg a python datetime or a number instead a string to do some calculations.
Note that most of the other dbapi return proper Python data types and therefore most of the libraries took that as an assumption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tested if the PR solves #28 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet, but will do by next week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Source freshness has been succesfully tested
I've also tested the source freshness feature
|
README.md
Outdated
| host | The hostname to connect to | Required | `127.0.0.1` | | ||
| port | The port to connect to the host on | Required | `8080` | | ||
| threads | How many threads dbt should use | Optional (default is `1`) | `8` | | ||
| patch_http_client_header_limit | [Patch python's http client to work around LineTooLong limit](#prepared-statements) | Optional (default is `true`) | `true` or `false` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
patch_http_client_header_limit
please make sure that the tabs are aligned.
dbt/adapters/trino/connections.py
Outdated
headers.append(line) | ||
if line in (b'\r\n', b'\n', b''): | ||
break | ||
header_string = b''.join(headers).decode('iso-8859-1') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why Latin-1 encoding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the encoding according the http spec.
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.
@mdesmet please consider shrinking this PR / spin-off part of it in another PR to enable the source freshness functionality and leave the controversial discussion around the line too long for |
As discussed I've added a new flag, also documented in the README. |
dbt/adapters/trino/connections.py
Outdated
if TrinoCredentials._HTTP_CLIENT_PATCHED: | ||
return | ||
if (self.patch_http_client_header_limit): | ||
TrinoCredentials.patch_http_client() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you use TrinoCredentials
and not self.
?
dbt/adapters/trino/connections.py
Outdated
def parse_headers(fp, _class=HTTPMessage): | ||
headers = [] | ||
while True: | ||
line = fp.readline(_MAXLINE + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this code snippet taken from?
I'm not keen into adding this workaround in the current form in the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has been taken from Stack overflow.
I also had my doubts about the encoding and investigated that at that time and definitely don't like patching standard libraries as they also may impact any other code executed after this patch. So I tend to agree with you, however dbt code is in general executed through a dbt
command, so this patch will probably not impact any other applications outside of very exotic usage of dbt as a library.
@@ -64,7 +62,7 @@ def __init__(self, handle): | |||
self._fetch_result = None | |||
|
|||
def cursor(self): | |||
self._cursor = self.handle.cursor() | |||
self._cursor = self.handle.cursor(experimental_python_types=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I played a bit with a 15K records CSV file
Note that dbt seed
is more thought for test purposes.
Is this a problem that we should be solving?
dbt/adapters/trino/connections.py
Outdated
@@ -21,6 +21,8 @@ | |||
|
|||
|
|||
logger = AdapterLogger("Trino") | |||
PATCH_HTTP_CLIENT_DEFAULT = True | |||
DISABLE_PREPARED_STATEMENTS_DEFAULT = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about flipping around the name -> PREPARED_STATEMENTS_ENABLED
?
dbt/adapters/trino/connections.py
Outdated
_ALIASES = {"catalog": "database"} | ||
_HTTP_CLIENT_PATCHED = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http client patching is not in any way critical for this PR.
Can we get this change in a different PR to concentrate more thoroughly on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we added the flag for the prepared statements, that's true. It may however slightly increase the odds of the defaults failing.
I had discussed this issue earlier with @hovaesco, in the end I think Trino should play nicely with the Python ecosystem, so Trino should abide to any limits of Python (such as http.client._LINE
). So ideally we get this fixed in Trino itself.
It's just that compared to Python scripting, the interface of dbt is not Python but the dbt
command, so it's not easy for users to add this Python snippet and override the Python environment. So I thought it would be nice to ensure smooth operations of any existing large seed files and/or bumped up batch size that worked before but not anymore with the prepared statements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not against adding this kind of patch.
I would rather discuss it more thoroughly (in a separate PR) before actually making it generally available.
I made use of the new test framework to add some integration tests. I also added a |
# trino doesn't actually pass bindings along so we have to do the | ||
# escaping and formatting ourselves | ||
if not self._prepared_statements_enabled and bindings is not None: | ||
# DEPRECATED: by default prepared statements are used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please do create a follow-up request to remove this safety feature after 1,2 releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % comments
{{ return('?') }} | ||
{%- else -%} | ||
{{ return('%s') }} | ||
{%- endif -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{%- if target.prepared_statements_enabled|as_bool -%}
{{ return('?') }}
{%- else -%}
{{ return('%s') }}
{%- endif -%}
tests/conftest.py
Outdated
@@ -1,4 +1,5 @@ | |||
import pytest | |||
import trino | |||
|
|||
# Import the fuctional fixtures as a plugin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pls add a Fix typo
commit for changing "fuctional" to "functional"
|
||
# The actual sequence of dbt commands and assertions | ||
# pytest will take care of all "setup" + "teardown" | ||
def test_run_seed_with_prepared_statements_disabled(self, project, trino_connection): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both methods:
test_run_seed_with_prepared_statements_disabled
test_run_seed_with_prepared_statements_enabled
look roughly the same. Extract the common logic to the base class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, only one small comment.
…659d827fb85fb65601b36a [Snyk] Security upgrade openjdk from 8-jre to 16-ea-17
Use native prepared statements from trino-python-client
Use native prepared statements from trino-python-client
Overview
Remove prepared statements emulation from dbt-trino and rely on the prepared statements support of native python types from trino-python-client
Checklist
README.md
updated and added information about my changeCHANGELOG.md
updated and added information about my change